Mutation testing is a method for checking the quality of a test suite: a tool automatically modifies the production code by introducing so-called mutants, for example by replacing a greater-than with a greater-than-or-equal-to, and then checks whether at least one test fails. If no test fails, the mutant survives, which indicates a gap in the tests.
Key Takeaways
- Mutation testing does not check the production code, but the quality of the test suite: A deliberately introduced error in the code is considered killed as soon as at least one test fails.
- The Java framework PIT optimizes runtimes through incremental analysis: It compares hash codes of code and tests and only re-executes mutations for places that have actually changed.
- Mutation testing should not be applied to the entire code base, but rather to the core business logic, because including UI code or database access increases the effort without adding much value.
- If you use mutation testing regularly, you will write better tests and better code over time, because the knowledge about typical mutants is already incorporated when writing the code.
What is mutation testing?
Mutation testing is a process you use to test your tests. Instead of just checking whether the production code works, the method checks whether your test suite is able to find errors at all.
The principle is over 50 years old. Richard Lipton described it in a paper in 1971. The basic idea has hardly changed since then.
The process is simple. You have a test suite that is green and you assume that everything is in order. Now you deliberately introduce an error into your code and see whether the test suite finds it. If it does, the suite is at least adequate at that spot. If it doesn’t, you have a gap.
These deliberately introduced changes are called mutants rather than bugs, hence the name mutation testing. A mutant is a targeted change to the production code.
Which mutants are used?
The type of mutant depends in part on the programming language. In the Java world, several categories can be distinguished, some of which function independently of the language.
Conditional boundaries are a common category. A ‘greater than’ becomes a ‘greater than or equal to’, a ‘less than’ becomes a ‘less than or equal to’. The boundary of the condition shifts by one. Other mutators negate entire conditions, for example by changing ‘equal to’ to ‘not equal to’.
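A boundary mutant is easiest to understand next to the code it is derived from. The following is only a minimal sketch: the class, the method names and the threshold of 100 are assumptions made up for illustration, and the mutant is written out by hand instead of being generated by a tool.

```java
// Hypothetical example class; names and threshold are assumptions for illustration.
final class ShippingRules {
    static final int FREE_SHIPPING_THRESHOLD = 100;

    // Original code: free shipping only for totals strictly above the threshold.
    static boolean qualifiesForFreeShipping(int orderTotal) {
        return orderTotal > FREE_SHIPPING_THRESHOLD;
    }

    // Hand-written copy of a conditional-boundary mutant: '>' became '>='.
    static boolean qualifiesForFreeShippingMutant(int orderTotal) {
        return orderTotal >= FREE_SHIPPING_THRESHOLD;
    }

    public static void main(String[] args) {
        // Far away from the boundary, original and mutant agree.
        System.out.println(qualifiesForFreeShipping(150) == qualifiesForFreeShippingMutant(150)); // true
        // Only the boundary value itself tells them apart.
        System.out.println(qualifiesForFreeShipping(100) == qualifiesForFreeShippingMutant(100)); // false
    }
}
```

A test that only uses totals such as 50 and 150 would stay green for the mutant. Only a test with the value 100 kills it.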
Increment mutators swap ++ for --. Arithmetic mutators replace addition with subtraction or multiplication with division.
The behavior of entire methods can also be changed. For a void method, which does something but returns nothing, the mutator simply removes the call. For methods with a return value, null is returned instead of an object; for primitive types, a zero is returned, and for strings, an empty string.
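Removing a void call is a mutation that tests checking only return values tend to miss. The sketch below illustrates this with a hypothetical Account class; all names are assumptions, and the mutant is again a hand-written copy, not the output of a tool.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example; class and method names are assumptions for illustration.
final class Account {
    private int balance;
    private final List<String> auditLog = new ArrayList<>();

    // A void method: it has a side effect but returns nothing.
    private void audit(String entry) {
        auditLog.add(entry);
    }

    // Original code.
    int deposit(int amount) {
        balance += amount;
        audit("deposit " + amount);
        return balance;
    }

    // Hand-written copy of a mutant in which the void call has been removed.
    int depositMutant(int amount) {
        balance += amount;
        return balance;
    }

    List<String> auditLog() {
        return auditLog;
    }
}

final class AccountDemo {
    public static void main(String[] args) {
        Account original = new Account();
        Account mutated = new Account();

        // A check on the return value alone cannot tell the two apart.
        System.out.println(original.deposit(50) == mutated.depositMutant(50)); // true
        // Only a check on the side effect does.
        System.out.println(original.auditLog().size()); // 1
        System.out.println(mutated.auditLog().size());  // 0
    }
}
```

If such a mutant survives, the missing piece is usually an assertion on the side effect, here the audit entry.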
There are hardly any limits to the imagination. If you wanted to apply all conceivable mutators to an entire code base by hand, you would be busy for a long time. This is precisely why frameworks take over this work.
How does a mutation testing tool work?
A mutation testing tool creates a mutant, runs all tests and checks whether at least one test fails. If a test fails, this is a good sign: The test suite has killed the mutant.
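The loop described above can be sketched in a few lines. This is a conceptual illustration under simplifying assumptions, not how PIT is implemented: the mutants are hand-written lambdas, a test suite is modelled as a predicate that returns true when all of its checks pass, and every name in the block is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntPredicate;
import java.util.function.Predicate;

// Conceptual sketch only; all names are assumptions for illustration.
final class MutationLoopSketch {

    // Code under test (hypothetical rule): adults are 18 or older.
    static final IntPredicate ORIGINAL = age -> age >= 18;

    // Hand-written stand-ins for the mutants a tool would generate.
    static Map<String, IntPredicate> mutants() {
        Map<String, IntPredicate> mutants = new LinkedHashMap<>();
        mutants.put("boundary", age -> age > 18); // '>=' became '>'
        mutants.put("negate", age -> age < 18);   // condition negated
        return mutants;
    }

    // A suite returns true when all its checks pass. This one never probes age 18.
    static final Predicate<IntPredicate> WEAK_SUITE =
            isAdult -> isAdult.test(30) && !isAdult.test(5);

    // A stronger suite that also checks both sides of the boundary.
    static final Predicate<IntPredicate> STRONG_SUITE =
            isAdult -> WEAK_SUITE.test(isAdult) && isAdult.test(18) && !isAdult.test(17);

    // Runs the suite against every mutant; a value of true means "killed".
    static Map<String, Boolean> analyse(Predicate<IntPredicate> suite) {
        if (!suite.test(ORIGINAL)) {
            throw new IllegalStateException("The suite must be green on the original code");
        }
        Map<String, Boolean> killed = new LinkedHashMap<>();
        mutants().forEach((name, mutant) -> killed.put(name, !suite.test(mutant)));
        return killed;
    }

    public static void main(String[] args) {
        System.out.println("weak suite:   " + analyse(WEAK_SUITE));
        System.out.println("strong suite: " + analyse(STRONG_SUITE));
    }
}
```

With the weak suite, the negated condition is killed but the boundary mutant survives. The strong suite kills both, because it tests the value 18 itself.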
The language around mutation testing is martial. A mutant survives or is killed. Killing the mutant is the desired result in this context.
In the Java world, PIT is a widely used framework. It can be integrated into the build process and applies the mutators to your compiled code; PIT works on the bytecode rather than editing the source files. It then records for each mutant whether it was killed or survived, and you evaluate the resulting report.
PIT ships with predefined groups of mutators, from a basic default set up to a group that enables every mutator it provides. You can also switch individual mutators on and off selectively, because not every mutator is useful for your code and more mutators mean longer runtimes.
Mutation tests take a long time if you don’t limit them
Mutation tests can run for a very long time. For each mutant created, the entire relevant test suite is executed. If you apply this to the entire code base without any restriction, you will block your build process.
PIT is highly optimized at this point. It writes a history with hash codes of code and tests. On the next run, the tool checks which code and which tests are affected by a change and only re-runs the analysis for these. If nothing has changed, the result cannot change either, so nothing has to be repeated.
This incremental approach is what makes integration feasible. If every run took hours, mutation testing could hardly be part of a daily build. Because it only runs incrementally, you have more opportunities to integrate it into ongoing builds.
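The idea behind such a history can be illustrated with a small sketch. It is not PIT's actual implementation or file format: the Unit class, the in-memory map and the use of Objects.hash as a fingerprint are assumptions chosen to keep the example short. A real tool would persist the history and use a more robust hash.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Conceptual sketch of hash-based incremental analysis; all names are assumptions.
final class IncrementalAnalysisSketch {

    // One analysed unit: a class together with the tests that cover it.
    static final class Unit {
        final String name;
        final String code;
        final String tests;

        Unit(String name, String code, String tests) {
            this.name = name;
            this.code = code;
            this.tests = tests;
        }

        // Changes whenever the code or its tests change (ignoring rare collisions).
        int fingerprint() {
            return Objects.hash(code, tests);
        }
    }

    // Fingerprints recorded during the previous run.
    private final Map<String, Integer> history = new HashMap<>();

    // Returns the units that are new or changed and records their fingerprints.
    List<String> selectForMutation(List<Unit> units) {
        List<String> changed = new ArrayList<>();
        for (Unit unit : units) {
            int current = unit.fingerprint();
            Integer previous = history.put(unit.name, current);
            if (previous == null || previous != current) {
                changed.add(unit.name);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        IncrementalAnalysisSketch analysis = new IncrementalAnalysisSketch();
        Unit price = new Unit("PriceCalculator", "code v1", "tests v1");
        Unit tax = new Unit("TaxRules", "code v1", "tests v1");

        // First run: no history yet, so everything is analysed.
        System.out.println(analysis.selectForMutation(Arrays.asList(price, tax)));
        // Second run: only the tests of TaxRules changed.
        Unit taxChanged = new Unit("TaxRules", "code v1", "tests v2");
        System.out.println(analysis.selectForMutation(Arrays.asList(price, taxChanged)));
        // Third run: nothing changed, nothing to do.
        System.out.println(analysis.selectForMutation(Arrays.asList(price, taxChanged)));
    }
}
```

The first call selects both units, the second only TaxRules, and the third none, which is the saving that makes repeated runs cheap.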
In the build process, mutation testing always comes after compilation and the normal test run. The mutation analysis only follows when all tests are green.
Start small, not with the entire code base
If you introduce mutation testing, you should not immediately unleash it on the entire source code. Start with the core business logic.
Checking UI code or database access via mutation testing is overkill at first. Frameworks such as PIT allow you to limit the analysis to specific packages or individual classes. This allows you to decide where mutation testing actually delivers value and avoids excessively long runtimes.
It is also worth starting cautiously with the number of mutators. Only switch on a few at first and work your way forward while analyzing the results in detail and adapting tests or code.
What the results tell you
Mutation testing uncovers three types of weaknesses. It shows whether your test data is good and whether it covers the boundary values. It shows what you haven’t tested yet. And it brings logic problems to light.
If a mutation in the code does not cause a test to fail, you check both sides: the tests and the code. Sometimes it turns out that the tests are green, but they check the wrong logic or miss the point.
This is where the added value comes in. You not only improve your tests, but also your code.
Equivalent mutants and their pitfalls
Equivalent mutants are a well-known problem with mutation testing. These are mutants that do not really change the logic of the code. The code behaves exactly the same after the mutation as before, and accordingly no test fails.
PIT tries to recognize and avoid equivalent mutants in advance, but they cannot be prevented completely. You have to identify the remaining ones yourself.
The symptom is the same as for any surviving mutant: a mutant has been created, but all tests remain green. Whenever a mutant survives, there is therefore the possibility that it is an equivalent mutant rather than a gap in the tests.
However, depending on the code, equivalent mutants are often rare. Sometimes the code can be rewritten so that the problem disappears. Often they are also an indication that something is wrong with the code itself: If another construct produces the same result, something may have gone wrong with the implementation.
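A classic case is a hand-written maximum function. The sketch below uses hypothetical names and a hand-written mutant. The boundary mutation only changes which branch is taken when both arguments are equal, and in that case both branches return the same value, so no test can ever kill it.

```java
// Hypothetical example of an equivalent mutant; all names are assumptions.
final class EquivalentMutantSketch {

    // Original code.
    static int max(int a, int b) {
        return a > b ? a : b;
    }

    // Boundary mutant: '>' became '>='. For a == b both branches yield the same value.
    static int maxMutant(int a, int b) {
        return a >= b ? a : b;
    }

    // One possible rewrite: without a comparison of our own,
    // there is no boundary left in this method to mutate.
    static int maxRewritten(int a, int b) {
        return Math.max(a, b);
    }

    public static void main(String[] args) {
        int differences = 0;
        for (int a = -50; a <= 50; a++) {
            for (int b = -50; b <= 50; b++) {
                if (max(a, b) != maxMutant(a, b)) {
                    differences++;
                }
            }
        }
        // Original and mutant never disagree, so the mutant cannot be killed.
        System.out.println(differences); // 0
    }
}
```

Whether a rewrite like maxRewritten is worthwhile depends on the code. Here it removes the equivalent mutant and is arguably the clearer implementation anyway.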
The tool becomes a trainer for better code
The effect of mutation testing shifts with increasing use. Initially, it mainly provides new test ideas. The more often you use it, the stronger the learning effect.
After a while, you know the mutants that are used and anticipate them when writing the tests. Birgit Kratz describes how difficult it was for her to deliberately create sample code for a presentation in which a mutant survives. If you know the tool well, you almost automatically write tests and code that let hardly anything slip through.
This learning effect can be significantly greater in a team. In Birgit’s opinion, a joint session in which the findings of the mutation tests are reviewed is very useful.
“It’s not only good for better testing, but also for better code.” (Birgit Kratz)
In practice, however, this team approach meets with resistance. Many people react to the idea of testing their own tests by asking where it should stop. As a result, it is often only used by individual developers.
The two prerequisites for mutation testing
Before you can use mutation testing, you need two things: tests and green tests.
That sounds obvious, but it’s not. It is precisely these two prerequisites that are often missing in practice. Either there are too few tests in the first place, or the existing tests are not green.
Only when both conditions are met can mutation testing be used sensibly, either locally or as a stage in the build pipeline after the normal test run.


