Code quality, metrics and mindset for students

Teaching testing in university courses is difficult because students lack the practical context: small, manageable exercises make the benefits of tests barely tangible. More complex projects with real infrastructure, techniques such as test-driven development and behavior-driven development, and a deliberate approach to AI-generated code create the necessary understanding.

Key Takeaways

Students only understand the point of testing when the tasks are complex enough: simple exercises do not generate real motivation because the benefits are not tangible.
Test-Driven Development is particularly suitable for teaching because the red-green scheme provides a clearly verifiable structure that beginners can apply directly.
Poor software architecture makes good testing structurally impossible: if you mix domain logic and database logic, you cannot test business logic without a running database.
Code coverage metrics such as the 75% quality gate in SonarQube tempt students to write tests just to meet the number instead of for quality assurance.
Those who use AI-generated code without understanding it lose control over what is running in the system, which becomes a direct problem in safety-critical domains such as banking or air traffic control.

Why testing is so difficult to teach in the classroom

Teaching students about testing often fails because of a simple problem: the point is not tangible as long as the tasks remain trivial. If you write a simple function that you have thought about for a long time and that you know works, you see no reason to test it.

Kai Renz, Professor of Software Engineering at Darmstadt University of Applied Sciences for eight years, describes this hurdle as twofold. At the beginning, many students cannot yet program properly. Writing tests then seems doubly abstract because the foundation is missing. Once this first hurdle has been overcome, the next one comes: small, straightforward examples do not provide a credible reason for testing.

The classic argument does not work in this phase. The claim that tests will later protect against regressions and uncertainty during ongoing development in the real software industry remains anecdotal. It is a story told about a future that does not yet exist for learners.

This is exacerbated by generative AI. If ChatGPT spits out the correct solution anyway, the question of the point of testing becomes all the more urgent.

Complex projects create the occasion that trivial examples do not provide

The lever against the lack of motivation is complexity. As soon as a project becomes realistic enough, the need for testing arises by itself.

In Darmstadt, a software engineering practical course simulates a pizza store. Each group operates its own Kubernetes cluster with real infrastructure, which also makes integration testing necessary. The students are deliberately thrown in at the deep end.

The result is mixed. For some, the idea clicks: they understand why they are testing. Others mainly ask what they have to do to get the certificate. For them, the spark is missing. This division is not a failure of the method, but a reality in a heterogeneous student body.

Conceptual understanding beats tool knowledge

The most important learning outcome lies not in mastering a specific tool, but in conceptual understanding. The industry is so fast-moving that any specific tool quickly becomes obsolete.

Concrete techniques and tools are nevertheless taught, from unit testing through integration and end-to-end testing to UI testing. The focus is on the structure of a test: consistently applying the AAA pattern (Arrange, Act, Assert) and clearly deciding what the system under test actually is. A single class? A complete feature? Which level am I working at right now?
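The AAA structure can be illustrated with a minimal sketch. The Order class, its methods and the prices are hypothetical and chosen only to match the pizza scenario; the system under test here is a single class.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    """Hypothetical pizza order; prices are in cents to avoid float rounding."""
    items: list[int] = field(default_factory=list)

    def add_pizza(self, price_cents: int) -> None:
        self.items.append(price_cents)

    def total(self) -> int:
        return sum(self.items)


def test_total_sums_all_pizzas() -> None:
    # Arrange: build the system under test and bring it into a known state
    order = Order()
    order.add_pizza(850)
    order.add_pizza(990)

    # Act: trigger exactly one behavior
    total = order.total()

    # Assert: check the observable result
    assert total == 1840


test_total_sums_all_pizzas()
```

Keeping the three phases visually separate makes it easier to see at a glance what is being set up, what is being exercised and what is being checked.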

Experienced developers make these decisions almost automatically. Do I need a test database? Do I mock a framework or not? Those who know how the components interact make conscious decisions. Those who lack this familiarity need clear instructions to get started.

Test-Driven Development is particularly suitable for teaching

Test-driven development has a simple, clearly understandable scheme that is easy to follow in the lecture hall. First the test, then the functionality. The test starts out red, and may not even compile at first; then comes the code that turns it green.

The most common question from students is: How do I know what I’m testing if the function doesn’t even exist yet? This leads to semi-philosophical discussions about what you need to know in order to do something.

The answer is pragmatic: if you know what you would program next, you also know what the result should be. And so you know what the test looks like. TDD helps here because it narrows the focus to a single small step towards functionality.
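A single red-green step might look like the following sketch. The delivery fee rule, the function name delivery_fee and the amounts are assumptions made up for illustration. The test is written first and run once before the function exists, which in Python shows up as a NameError rather than a compile error.

```python
# Step 1 (red): the test states the expected result before any code exists.
def test_orders_from_20_euros_ship_free() -> None:
    assert delivery_fee(order_total_cents=2000) == 0
    assert delivery_fee(order_total_cents=1999) == 250


try:
    test_orders_from_20_euros_ship_free()
    state_before_implementation = "green"
except NameError:
    # delivery_fee is not defined yet, so the test cannot pass
    state_before_implementation = "red"


# Step 2 (green): the smallest implementation that satisfies the test.
def delivery_fee(order_total_cents: int) -> int:
    return 0 if order_total_cents >= 2000 else 250


test_orders_from_20_euros_ship_free()  # passes now
```

The test answers the students' question directly: it only encodes what the next small step should return, nothing about how the function will be built.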

This is reinforced by pair programming with Driver and Navigator as well as coding sessions in the lecture hall based on the principle of mob programming. One person programs, everyone watches. This works, but remains challenging because very good and very inexperienced students sit in the same room.

Behavior Driven Development brings the user perspective into the test

Behavior Driven Development shifts the focus away from the implementation to the application and user side. Instead of asking how something is implemented, BDD asks what is to be tested and how a test case can be described.
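The shift in perspective can be sketched in plain Python using the common Given/When/Then wording. Dedicated BDD tools usually express scenarios in a separate specification language; this sketch deliberately uses no such tool, and the Cart class and its messages are hypothetical.

```python
class Cart:
    """Hypothetical shopping cart of the pizza store."""

    def __init__(self) -> None:
        self.pizzas: list[str] = []

    def add(self, name: str) -> None:
        self.pizzas.append(name)

    def checkout(self) -> str:
        if not self.pizzas:
            raise ValueError("cart is empty")
        return f"Order confirmed: {len(self.pizzas)} pizza(s)"


def test_customer_receives_confirmation_after_ordering() -> None:
    """Scenario: a customer orders a single pizza."""
    # Given a customer with one Margherita in the cart
    cart = Cart()
    cart.add("Margherita")

    # When the customer checks out
    message = cart.checkout()

    # Then the order is confirmed
    assert message == "Order confirmed: 1 pizza(s)"


test_customer_receives_confirmation_after_ordering()
```

The test name and the comments describe what a user does and sees; nothing in them depends on how the cart is implemented.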

This technique is used in an elective course called “Professional Testing”. Over fourteen weeks, participants are confronted with the entire spectrum, from backend tests to various programming languages and tools to UI testing with Playwright. The learning curve is steep, the knowledge gained broad.

The effect shows in professional practice. Former students report that they were able to apply what they had learned directly. Some join companies where testing is not yet a matter of course and bring more testing knowledge than their new colleagues. New hires are often the only way to bring quality thinking into a company that has managed without it for twenty years.

Architecture determines whether a system is testable

Good testability depends largely on the software architecture. Systems that are difficult to test usually mix things that belong apart.

The clearest example is the separation of domain logic and controller logic. Approaches such as hexagonal or functional architecture only show their strength when they are implemented consistently. The price is visible overhead: many auxiliary classes that copy objects from the database logic into the business logic.

This is exactly where students often disengage. They already have a class that is linked to the database and ask why this should be a problem. The answer becomes clear as soon as you want to test the business logic: if you have to set up a database first in order to test the business logic, it is not cleanly separated.
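One way to picture the separation is the following sketch in the spirit of a ports-and-adapters design. All names (OrderRepository, DiscountService, InMemoryOrderRepository) and the discount rule are hypothetical. The business logic depends only on a small interface, so the test runs without any database.

```python
from typing import Protocol


class OrderRepository(Protocol):
    """Port: the only thing the business logic needs from storage."""

    def count_orders(self, customer_id: str) -> int: ...


class DiscountService:
    """Business logic; it knows the port, not the database."""

    def __init__(self, orders: OrderRepository) -> None:
        self._orders = orders

    def discount_percent(self, customer_id: str) -> int:
        # Hypothetical rule: 10 percent off once a customer has 10 orders
        return 10 if self._orders.count_orders(customer_id) >= 10 else 0


class InMemoryOrderRepository:
    """Test double; a database-backed adapter would offer the same method."""

    def __init__(self, counts: dict[str, int]) -> None:
        self._counts = counts

    def count_orders(self, customer_id: str) -> int:
        return self._counts.get(customer_id, 0)


def test_loyal_customers_get_discount() -> None:
    service = DiscountService(InMemoryOrderRepository({"anna": 10, "ben": 9}))
    assert service.discount_percent("anna") == 10
    assert service.discount_percent("ben") == 0
    assert service.discount_percent("unknown") == 0


test_loyal_customers_get_discount()
```

The extra interface and the in-memory class are exactly the kind of overhead students notice first; the payoff is that the discount rule can be tested in isolation.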

What to do if an existing system cannot be tested

The typical trap is paralysis. The team is faced with a system that is difficult to test and sees no starting point. The system needs refactoring, but refactoring is hardly possible without tests as a safety net, and the tests are hard to write because of the architecture. A genuine vicious circle.

The pragmatic approach is to start with the new. New functionality is consistently built and tested cleanly. This allows the team to experience the benefits of good architecture: easier refactoring, testing at all levels, static analysis without open findings.

This understanding seeps into the legacy system over time. With very large, old systems, this can never be fully achieved. And even in new developments, there is a risk of repeating the same mistakes. That’s why the team needs a shared way of thinking about quality.

Metrics are a means, not an end

A code coverage of 75 percent is not an end in itself. Teams that write tests only to turn a red quality gate green miss precisely this point.

In the practical course, SonarQube is used for static code analysis, with a quality gate that requires coverage of over 75 percent. If teams miss this at first, some simply write tests until the number is right. This pushes the mindset in the wrong direction.

The principle is: you don’t write a test so that a number turns green, but so that the quality increases. The findings on code quality, such as complexity, are more informative than raw coverage anyway.
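The difference between covering code and checking it can be shown with a small sketch. The discount functions and both test helpers are hypothetical. A line-coverage tool only records which lines were executed, so it would count the function as covered in both cases, but only the second test notices the bug.

```python
def buggy_price_with_discount(price_cents: int, percent: int) -> int:
    # Deliberately wrong: the discount is added instead of subtracted
    return price_cents + price_cents * percent // 100


def price_with_discount(price_cents: int, percent: int) -> int:
    return price_cents - price_cents * percent // 100


def coverage_only_test(fn) -> None:
    # Executes every line of fn but asserts nothing
    fn(1000, 10)


def meaningful_test(fn) -> None:
    assert fn(1000, 10) == 900
    assert fn(1000, 0) == 1000


# The coverage-only test passes even for the buggy version.
coverage_only_test(buggy_price_with_discount)

# The meaningful test rejects the buggy version ...
try:
    meaningful_test(buggy_price_with_discount)
    bug_detected = False
except AssertionError:
    bug_detected = True

# ... and accepts the correct one.
meaningful_test(price_with_discount)
```

Both tests move the coverage number by the same amount; only one of them increases quality.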

Non-functional testing is deliberately outsourced

Security, usability and performance can hardly be accommodated in the software engineering curriculum because the canon of topics there is already full to the brim. Darmstadt solves this through dedicated courses.

IT security has its own course and a strong research group behind it, working at both the technical and the usability level. For usability, there is also a separate course on human-computer interaction, which asks how functionality can be implemented in a meaningful way.

Closer integration of these modules would be desirable, but there are two obstacles. Lecturers like to work autonomously, so coordination is difficult. And students would face a significantly greater workload if one module required another.

Performance tests are merely placed within the agile testing quadrants, nothing more. Load testing with 500 pizza orders has nothing to do with a real system and would remain artificial. Exercises should have real-world relevance instead of simulating tasks that remain hollow in an educational context.

AI in studies: responsibility remains with humans

Generative AI is a matter of course for the current generation of students. Teaching responds to this with clear boundaries where it counts and with openness everywhere else.

The basic programming course ends with a practical exam on a computer without internet access. Anyone who has only had the solutions generated beforehand will not be able to use the basic programming techniques in the exam. In the practical courses, on the other hand, the following applies: AI may be used, but the responsibility for what has been submitted remains with the students. They must be able to explain what they have done.

One incident illustrates the risk in concrete terms. When a new feature was to be developed for the pizza project, the required functionality was already in the code. The commit history contained a commit that replaced almost the entire project. A student under deadline pressure had given the code to ChatGPT with the request to find errors. The AI added the functionality without being asked.

“If you worked at a bank or in air traffic control or for a brake manufacturer, would you be able to sleep well if you had committed code that you didn’t even know was there?”
Kai Renz

The student openly admitted what had happened. That’s what made the conversation so instructive. The learning effect: AI helps, but you have to know what you’re doing yourself.

Where the lasting skill lies

If you can only use ChatGPT, you don’t have the skills to convince an employer. This message is clearly conveyed to students who openly solve their assignments purely via AI.

A degree can theoretically be achieved with AI without understanding much, as long as you somehow get through the exams. But prompting alone is a short-lived skill: it will probably get easier the more naturally you can talk to AI.

There are two long-term skills. The first is communication with people: really understanding what the other person wants. This is difficult to delegate. The second is control over what the AI produces. Is it plausible? Do I understand it? Do I have an overview?

However, you only have an overview with experience. If you don’t have experience, you don’t have an overview. This is the real dilemma of a generation that starts with AI before it has built up the judgment to check its results.

AI does not understand the context by itself

Some constraints must be explicitly given to the AI because it does not take them into account on its own. Logging is a good example.

Logging frameworks, including well-structured messages, can be taught quickly and the benefits are obvious. Then comes the counter-question: Are you allowed to log everything? If someone requests the deletion of their data, but the system has logged which pages this person has viewed, can you ever get this out of the log again?
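One possible technical response is to keep personal data out of the log in the first place. The following sketch uses Python's standard logging module with a filter that redacts a field before the record is written. The field name user_id, the logger name and the log format are assumptions made for illustration.

```python
import io
import logging


class RedactUserFilter(logging.Filter):
    """Replace the hypothetical user_id field before the record is written."""

    def filter(self, record: logging.LogRecord) -> bool:
        if hasattr(record, "user_id"):
            record.user_id = "[redacted]"
        return True  # keep the record; only the personal field changes


# An in-memory stream stands in for a log file in this sketch.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s user=%(user_id)s %(message)s")
)
handler.addFilter(RedactUserFilter())

logger = logging.getLogger("pizza.shop")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("viewed page %s", "/menu", extra={"user_id": "anna@example.org"})
output = stream.getvalue()
print(output, end="")
```

Whether redaction at write time is sufficient is not something the code can decide; that depends on the rules that apply to the system.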

Such limits arise from data protection and copyright law. There is also the question of where a company’s source code is sent and which server processes it. An AI does not automatically understand these implications. You have to spell them out for it.