Test pyramid - a critical look

The test pyramid is not a rigid step-by-step model, but an idea: tests at different levels should build on and complement each other instead of covering requirements twice. Its core idea is to use development artifacts for more efficient testing. Whether this results in a pyramid shape depends on the project and must be decided on a project-specific basis.

Key Takeaways

The test pyramid was not invented to require as many unit tests as possible, but to build up test layers in such a way that they complement each other and no requirement is covered twice.
If you write unit tests based on the wrong abstractions, you have to adapt not only the production code but also the entire test suite when refactoring, which negates the supposed efficiency gain.
In microservice architectures, many tests written with JUnit are in fact integration tests, because they exercise controllers and service layers together rather than individual units.
High code coverage is not proof of quality: errors often occur in the interaction of components, not within a single unit that a test covers in isolation.
Architecture decisions are too often made in a test-free space; a complete architecture justification should always explain how the system remains testable.

The test pyramid is a test idea, not a blueprint

The test pyramid solves a concrete problem: large applications could no longer be tested in a reasonable amount of time. Its real core lies neither in the shape nor in the rule of thumb “many unit tests at the bottom, few UI tests at the top”. It lies in two ideas that are often overlooked in day-to-day work.

First idea: testing is not a separate activity. It is not bolted on at the end, once the product is finished, and it is not carried out downstream by a separate team. Tests become more efficient when they use artifacts that developers have designed anyway. This is exactly what unit tests do: they work at a lower level than a black box test and achieve a comparable result with less effort.

Second idea: the layers build on each other. It is less about the pyramid shape and more about the test levels complementing each other instead of duplicating each other. Testing again at the top level what has already been tested at the bottom is useless. Leaving things out at the bottom and then making up for them at the top is just as pointless.

If you take these ideas seriously and apply them to your own project, you will rarely end up with a clean pyramid. The layers shift for good reasons. Sometimes more integration testing is needed, sometimes black box testing is enough. The form follows the project, not the other way around.

Why a memorable image replaces thinking

The test pyramid has a convenient feature: everyone knows the image, everyone recognizes it, everyone has an idea of what to do. What exactly is behind it remains unclear for most people.

This becomes a reflex. Many presentations end with the remark that the test pyramid hovers above everything and that everything said before applies only in its context. The image becomes a compulsory exercise, and nobody checks whether it fits their own architecture.

Whether the pyramid still fits current architectures is an open question. Microservice landscapes, agile approaches and DevOps differ from the large UI applications for which the model was originally intended.

What unit testing means and what JUnit does with it

Much of what runs as a unit test is actually an integration test. Anyone writing tests against a REST API, or against a controller and the service layers below it, is not testing an isolated unit. They are just using a unit testing framework to write integration tests.

This is where the confusion lies. Tools such as JUnit or NUnit carry “unit” in their names, but have long been used to test other layers as well. Because the name says “unit”, many people no longer think about what they are actually testing.

Actually, we only use a unit testing framework to write these integration tests. The crucial point is exactly this transition: when do I write an integration test, when do I write a unit test, and how do I avoid testing the same requirement twice?
Ronald Brill

The practical benchmark is duplication. Do not test the same requirement at two levels. As soon as it is unclear in a project what counts as a unit test and what counts as an integration test, you test things several times without realizing it.
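A minimal sketch of what "no requirement twice" can look like in code. It is written in plain Python for brevity rather than in JUnit, and every name in it (`discounted_total`, `handle_order`, the 10 percent member discount) is a hypothetical example, not something taken from a real project. The unit test owns the pricing rule; the integration-style test only checks parsing, validation and wiring, with one representative price.

```python
import json


def discounted_total(amount_cents: int, is_member: bool) -> int:
    """Business rule (the 'unit'): members get 10 percent off. Hypothetical rule."""
    return amount_cents * 90 // 100 if is_member else amount_cents


def handle_order(raw_body: str) -> dict:
    """Controller-like entry point: parses, validates, then delegates to the rule."""
    try:
        body = json.loads(raw_body)
        amount = int(body["amount_cents"])
        is_member = bool(body.get("member", False))
    except (ValueError, KeyError, TypeError, AttributeError):
        return {"status": 400, "error": "invalid order"}
    if amount < 0:
        return {"status": 400, "error": "invalid order"}
    return {"status": 200, "total_cents": discounted_total(amount, is_member)}


def test_unit_discount_rule():
    # The pricing requirement is covered here, and only here.
    assert discounted_total(1000, True) == 900
    assert discounted_total(1000, False) == 1000
    assert discounted_total(0, True) == 0


def test_integration_wiring():
    # One representative price proves the wiring; the remaining cases
    # concern what only this level can see: parsing and validation.
    ok = handle_order('{"amount_cents": 1000, "member": true}')
    assert ok == {"status": 200, "total_cents": 900}
    assert handle_order("not json")["status"] == 400
    assert handle_order('{"amount_cents": -5}')["status"] == 400


if __name__ == "__main__":
    test_unit_discount_rule()
    test_integration_wiring()
```

If the discount rule changes, only the unit test should need new cases. If it turns out that the integration test has to enumerate the pricing cases as well, that is a sign the two levels are duplicating each other.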

The lighter the services, the less there is to unit test

With lightweight services, the meaningful proportion of unit tests shrinks. The more powerful the underlying frameworks are and the more thoroughly they are tested themselves, the less remains that can be tested in isolation.

Unit tests can always be written. The question is what they still demonstrate. A typical pattern: test coverage is high, but there are still bugs in the software. If you take a closer look, a unit test could not have found the error at all, because it lies not in the individual unit but in the interaction of the parts.

This is exacerbated in a microservice landscape. The interaction no longer takes place within a module, but between the services. This is exactly where errors occur that no unit test of the individual component can uncover.
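The following sketch illustrates the pattern with an invented example: two small units, each fully covered and each correct against its own contract, joined by glue code that mixes up seconds and milliseconds. The function names and the timeout scenario are assumptions made purely for illustration.

```python
def read_timeout(config: dict) -> int:
    """Unit A: returns the configured timeout in SECONDS (default 30)."""
    return int(config.get("timeout", 30))


def has_timed_out(elapsed_ms: int, timeout_ms: int) -> bool:
    """Unit B: expects both arguments in MILLISECONDS."""
    return elapsed_ms >= timeout_ms


def request_expired_buggy(config: dict, elapsed_ms: int) -> bool:
    """Glue code with the defect: passes seconds where milliseconds are expected."""
    return has_timed_out(elapsed_ms, read_timeout(config))


def request_expired_fixed(config: dict, elapsed_ms: int) -> bool:
    """Glue code with the unit conversion in place."""
    return has_timed_out(elapsed_ms, read_timeout(config) * 1000)


def unit_tests_pass() -> bool:
    # Every line of A and B is executed, and every check holds.
    return (
        read_timeout({"timeout": "5"}) == 5
        and read_timeout({}) == 30
        and has_timed_out(6000, 5000)
        and not has_timed_out(100, 5000)
    )


def interaction_test_passes(request_expired) -> bool:
    # 100 ms elapsed with a 5 s timeout must not count as expired.
    return request_expired({"timeout": 5}, 100) is False


if __name__ == "__main__":
    print("unit tests pass:", unit_tests_pass())
    print("buggy glue passes interaction test:",
          interaction_test_passes(request_expired_buggy))
    print("fixed glue passes interaction test:",
          interaction_test_passes(request_expired_fixed))
```

The unit checks are green in both variants. Only the test that runs the two parts together distinguishes the buggy glue from the fixed one, and in a microservice landscape that glue sits on the network between services.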

Added to this is the wasted effort. Many null checks and edge cases at unit level test situations that never occur in complex software, because validation has already taken place two levels above.
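One way to picture this, as a sketch under assumptions: input is validated once at the boundary, and the inner logic only ever receives values that have passed that check. The `Order` type and both functions are hypothetical. The rejection cases belong to the boundary test; repeating them against `order_total_cents` would test a situation that cannot arise as long as all callers go through `parse_order`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    quantity: int
    unit_price_cents: int


def _require_int(payload: dict, key: str, minimum: int) -> int:
    """Rejects missing, non-integer (including bool) and too-small values."""
    value = payload.get(key)
    if isinstance(value, bool) or not isinstance(value, int) or value < minimum:
        raise ValueError(f"{key} must be an integer >= {minimum}")
    return value


def parse_order(payload: dict) -> Order:
    """Boundary: the only place where malformed input is rejected."""
    return Order(
        quantity=_require_int(payload, "quantity", 1),
        unit_price_cents=_require_int(payload, "unit_price_cents", 0),
    )


def order_total_cents(order: Order) -> int:
    """Inner logic: relies on the boundary and carries no defensive checks."""
    return order.quantity * order.unit_price_cents


def rejects(payload: dict) -> bool:
    """Test helper: True if the boundary refuses the payload."""
    try:
        parse_order(payload)
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    # Boundary test: missing and null values are covered here.
    assert rejects({})
    assert rejects({"quantity": None, "unit_price_cents": 100})
    # Inner test: valid orders only.
    assert order_total_cents(Order(3, 250)) == 750
```

Whether the inner function may really drop its checks depends on who else can call it. That is exactly the kind of assumption the test levels need to agree on.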

Unit tests have an addiction factor

Unit testing feels like security, especially for people who are not yet sure of themselves. There are metrics, you can see if you’ve done everything, you feel like you’ve built quality. You can come up with another test and show yourself that you’ve thought about it.

It seems to pay off in the code review. “I have a lot of unit tests, the coverage is good” seems like proof of quality. The effect: you spend less time on the actual task.

The actual task is a different question. Do I understand the business? Have I understood the process that my service or UI is supposed to support? These questions are harder to answer than counting up a coverage figure.

Reporting reinforces the reflex. Coverage is easy to report, and an 80 percent mark is attractive to project managers and test managers. However, a high figure says nothing about whether the tests are checking anything useful.

High abstraction takes its revenge when refactoring

Too many unit tests can slow down refactoring. If a new requirement is introduced and the system has to be restructured, all the unit tests are often dragged along with it. The rework takes significantly longer because every single unit test has to be touched.

This is often due to an incorrect abstraction. If the abstraction of the software is not correct, it is also not correct for the unit tests because they are at the same low level. The supposed advantage of testing close to the implementation then falls flat on its face.
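A small sketch of the effect, using an invented shopping cart. `CartV1` and `CartV2` are hypothetical classes with identical public behaviour and different internals. The check that looks at the internal list breaks on the refactored version even though nothing observable changed; the check that goes through the public methods survives.

```python
class CartV1:
    """Original implementation: keeps every added item in a list."""

    def __init__(self):
        self._items = []  # list of (name, price_cents)

    def add(self, name: str, price_cents: int) -> None:
        self._items.append((name, price_cents))

    def total(self) -> int:
        return sum(price for _, price in self._items)


class CartV2:
    """Refactored implementation: same public behaviour, sums kept per name."""

    def __init__(self):
        self._totals = {}  # name -> accumulated price in cents

    def add(self, name: str, price_cents: int) -> None:
        self._totals[name] = self._totals.get(name, 0) + price_cents

    def total(self) -> int:
        return sum(self._totals.values())


def behaviour_test(cart_cls) -> bool:
    """Tests the requirement: the total is the sum of all added prices."""
    cart = cart_cls()
    cart.add("tea", 300)
    cart.add("tea", 300)
    cart.add("cup", 500)
    return cart.total() == 1100


def structure_coupled_test(cart_cls) -> bool:
    """Tests the current data layout instead of the requirement."""
    cart = cart_cls()
    cart.add("tea", 300)
    return getattr(cart, "_items", None) == [("tea", 300)]


if __name__ == "__main__":
    for cls in (CartV1, CartV2):
        print(cls.__name__, behaviour_test(cls), structure_coupled_test(cls))
```

The structure-coupled check is not wrong about `CartV1`. It simply pins down a decision that the refactoring was meant to change, so it has to be rewritten along with the production code.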

The consequence is a trade-off. Putting a lot into unit testing is not automatically efficient, cost-effective or expedient. Sometimes it is the opposite.

Integration testing reverses the cost argument

The classic argument “write unit tests because they are cheaper” no longer applies everywhere. It comes from a time of large UI applications, when black box testing via the UI was a real pain. Anyone working in old or small UI frameworks today still knows this: black box testing a WPF project is a disaster.

The situation is different with a REST service. Such interfaces can be tested easily, efficiently and cost-effectively. Frameworks such as Spring have testability built in from the outset.

This turns the argument around. Integration testing may not be quite as cheap as unit testing, but its effect is greater. You are testing a broader spectrum and not testing cases that never occur in practice. In the microservice environment, many therefore consider integration testing to be the more useful lever.
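To give a rough sense of how little scaffolding such a test can need, here is a sketch that uses only the Python standard library instead of Spring. The `/greeting` endpoint, the handler and the helper `call_service` are all made up for this illustration. The test starts the service on a free local port, sends a real HTTP request and checks status and body.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class GreetingHandler(BaseHTTPRequestHandler):
    """Hypothetical service with a single JSON endpoint."""

    def do_GET(self):
        if self.path == "/greeting":
            body = json.dumps({"message": "hello"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def call_service(path: str):
    """Starts the service on a free port, performs one GET, shuts it down."""
    server = HTTPServer(("127.0.0.1", 0), GreetingHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    # Bypass any proxy configured in the environment for this local call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        url = f"http://127.0.0.1:{server.server_port}{path}"
        with opener.open(url, timeout=5) as response:
            return response.status, json.loads(response.read())
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, payload = call_service("/greeting")
    assert status == 200
    assert payload == {"message": "hello"}
```

Routing, serialization and the HTTP layer are all exercised in one go, which is the broader spectrum the argument refers to. In a real project the framework's own test support would normally replace the hand-written server start-up shown here.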

Tests build on trust between teams

Behind every test layer is an assumption of trust. Anyone testing at a higher level relies on the level below testing its part more cheaply and more reliably. When one person covers both levels, this is settled tacitly. In larger teams it is not.

Then explicit rules and communication are needed. If you change or remove a unit test, you change the foundation on which the integration tests above it rest. This should be coordinated so that the test suite remains complete from the perspective of the upper level.

It is the same mechanism between services. One service calls another service. Instead of covering every special case of the underlying service again via black box testing, you can build a chain of responsibility based on rules and trust. Each part covers a share of the quality requirements, and the sum yields the delivered quality.

This is work because different teams, technologies and backgrounds come together. The further back a team is in the chain, the more difficult it is to recognize that it has a responsibility towards the teams upstream. Metrics say nothing about this. If there is a lack of trust, you fall back and test everything again, and end up with the unwieldy, far too slow test suite that the pyramid originally wanted to avoid.

How to tackle an inefficient test suite

There is no universal recipe, but there is a practical way to start: learn from every mistake instead of repeating it. Take a conscious risk. Complete coverage is not achievable anyway; some mistakes will always remain.

The most effective reflection takes place after a specific mistake. Why did this happen? Could a real unit test have found this error at all? This analysis happens far too rarely in practice, yet it is precisely what makes the test suite better.

The implicit assumptions become visible in this specific example. “It was clear to me that you were testing this.” “I would never have thought of that.” Such gaps do not become apparent in theory, but only when you look at a real case together.

Rigid role definitions have to give way for this. Testers and developers belong at the same table, ideally together with architects, designers and requirements engineers. Some tests are easier for a developer to write; the closer a test gets to the black box, the easier it is for a tester. Both sides should learn to read each other’s test cases. Those who can read a test case understand what is being tested and know what contribution they themselves have to make.

Testability belongs in the architecture rationale. An architectural design should include how the result can be tested. In practice, testability almost always comes later, if at all. Many technical and architectural decisions are made in a test-free environment.

And sometimes it takes courage to throw things away. If a test suite takes great effort and still doesn’t deliver, it’s right to say: we’ll do it differently.