Test Intelligence

Test intelligence refers to analysis techniques that evaluate data from your own development and test process in order to find more errors in less time. Specific methods include test gap analysis (making untested changes visible before the release), test impact analysis (only executing relevant tests after a change) and test suite minimization (calculating the optimal subset from large test suites).

Key Takeaways

Test gap analysis combines version control data with coverage profiling to make untested changes visible before a release and enable informed decisions.
Coverage profilers record manual tests as seamlessly as automated tests without testers having to change their workflow.
In around one percent of the total runtime, test impact analysis finds 80 percent of the errors that the full test suite would find, providing significantly faster feedback.
Historically grown test suites test too much and too little at the same time: many tests overlap heavily, while real gaps go unnoticed.
Even a well-staffed, experienced development team cannot reliably cover all changes in testing without dedicated tools, as an internal study at CQSE has shown.

What test intelligence means

Test Intelligence is a collective term for analysis techniques that evaluate data from your own development and testing process in order to find more errors in less time. The approach taps into sources that arise in every project anyway: the version control system with all changes to the code and the test coverage, measured using coverage profilers.

Elmar Jürgens groups several specific analyses under this term, and they complement each other. None of them is an end in itself. The aim in each case is to provide a better basis for decision-making: what needs to be tested, what is worthwhile, what can be left out.

The common denominator is the combination of a static and dynamic view. Where a static analysis reads the code without executing it, a coverage profiler records what is actually run during a test run. Only the combination of both data sources makes the statements reliable.

How test gap analysis makes untested changes visible

Before a release, the test gap analysis shows you which code changes have not yet been tested. The basic assumption: in long-lived software, most errors are found in places that have changed since the last release. In a complex team, however, it is difficult to track whether every change has really been exercised by a test.

This is why two data sources are superimposed. The changes come from the version control system and show what has changed since a reference point, for example since the last release or the last test run. The test coverage comes from coverage profilers, which record all test phases without gaps.
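The overlay of the two data sources can be sketched as a set difference. This is a minimal illustration, not the actual implementation of any tool: the `Method` record, the file paths and the method names are hypothetical, and the sketch assumes that coverage was recorded after the changes were made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    """Hypothetical identifier for a method: file path plus qualified name."""
    file: str
    name: str


def find_test_gaps(changed: set[Method], covered: set[Method]) -> set[Method]:
    """Return methods that changed since the reference point but ran in no test."""
    return changed - covered


# Changes since the reference point, e.g. derived from version control history.
changed_since_release = {
    Method("billing/invoice.py", "Invoice.total"),
    Method("billing/invoice.py", "Invoice.apply_discount"),
    Method("shipping/rates.py", "lookup_rate"),
}

# Methods executed in any test phase, automated or manual, as a profiler reports them.
covered_by_tests = {
    Method("billing/invoice.py", "Invoice.total"),
    Method("shipping/rates.py", "lookup_rate"),
    Method("shipping/rates.py", "format_label"),
}

gaps = find_test_gaps(changed_since_release, covered_by_tests)
for gap in sorted(gaps, key=lambda m: (m.file, m.name)):
    print(f"untested change: {gap.file}::{gap.name}")
```

In this made-up data set, only `Invoice.apply_discount` was changed without being executed in a test, so it is the one reported gap.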

It is important to note that not only automated unit and end-to-end tests are recorded, but also manual tests. The coverage profiler does not care whether it is observing an automated or a manual test. Testers do not have to change their process; the recording runs in the background.

The goal is not to test everything. The goal is to make a conscious decision. Sometimes an untested change is allowed into production because the feature in question will not be needed for months. Often this is not the case, and then the analysis shows in good time where there is a gap.

Even if, for some reason, coverage profilers have mainly been used for automated tests in recent decades, we have also been doing this with customers for manual tests for over ten years. It works very well.
Elmar Jürgens

Coverage profilers work differently depending on the technology

There is a suitable profiler for every common technology, whether commercial or open source. There are roughly three categories.

Profilers on the virtual machine: used for languages such as Java, C# and Python, and for SAP ABAP. They dock onto the VM and monitor what is executed from there.
Instrumenting profilers: modify the code itself by inserting their own instructions between statements to report that the code at that point has been executed. This method is used for C and C++, where there is no virtual machine to dock to.
Hardware profilers: for embedded systems. Some processors have dedicated pins to which a hardware profiler docks. This gives a hardware guarantee that profiling does not affect the timing and that the processor behaves exactly as it would without profiling.

The choice depends on what works best for the technology in question. Some languages have their own profilers because there was no viable alternative.
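For Python, the idea of docking onto the runtime can be illustrated with the interpreter's own `sys.setprofile` hook. This is only a toy sketch of the first category: production profilers are considerably more sophisticated, and the two example functions are hypothetical.

```python
import sys

# (file name, function name) pairs for every Python function that was entered.
entered: set[tuple[str, str]] = set()


def _record_call(frame, event, arg):
    # The interpreter calls this hook for call and return events.
    # Only calls of Python-level functions ("call") are recorded here.
    if event == "call":
        code = frame.f_code
        entered.add((code.co_filename, code.co_name))


def apply_discount(price: float) -> float:
    return price * 0.9


def unused_helper() -> None:
    pass


sys.setprofile(_record_call)  # dock onto the interpreter
try:
    apply_discount(100.0)     # stands in for a test, automated or manual
finally:
    sys.setprofile(None)      # detach again

entered_names = {name for _, name in entered}
print(sorted(entered_names))
```

The code under test is not modified at all; the recording happens entirely in the hook, which is why the person running the test does not need to change anything.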

Why the performance impact remains small in practice

Very fine-grained instrumentation can significantly slow down the runtime, but this level of detail is not necessary for test gap analysis. It is not necessary to determine whether each individual path has been traversed in a long method. It is sufficient to know whether a method has been entered at all.

If all you need to know is whether a method has been entered at all, a single bit per method is sufficient. This not only generates less data, but also significantly less performance load. Benchmarks with various profilers often show a slowdown of around one percent, sometimes five percent.
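How little data method-level recording needs can be shown with a small bitmap. This is a sketch under assumptions: the class `MethodBitmap` and the method names are invented for illustration and do not describe the storage format of any particular profiler.

```python
class MethodBitmap:
    """One bit per method: set once the method has been entered."""

    def __init__(self, method_names: list[str]) -> None:
        self._index = {name: i for i, name in enumerate(method_names)}
        # Eight methods fit into one byte; round up to whole bytes.
        self._bits = bytearray((len(method_names) + 7) // 8)

    def mark_entered(self, name: str) -> None:
        i = self._index[name]
        self._bits[i // 8] |= 1 << (i % 8)

    def was_entered(self, name: str) -> bool:
        i = self._index[name]
        return bool(self._bits[i // 8] & (1 << (i % 8)))

    def size_in_bytes(self) -> int:
        return len(self._bits)


methods = [f"method_{n}" for n in range(10_000)]
bitmap = MethodBitmap(methods)
bitmap.mark_entered("method_42")
bitmap.mark_entered("method_42")  # entering again changes nothing

print(bitmap.was_entered("method_42"))  # True
print(bitmap.was_entered("method_43"))  # False
print(bitmap.size_in_bytes())           # 1250 bytes for 10,000 methods
```

Recording each executed path or statement would need far more state and far more writes per call than setting one bit on method entry.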

This is not noticeable in manual testing. There, the system is usually waiting for the network or database anyway, and profiling does not slow things down at these points.

Even good teams do not fully cover their changes

Even disciplined development teams do not reliably manage to fully cover their changes in the test. This is not due to a lack of will, but because it is simply too complicated without dedicated tools.

A study in our own company confirms this. The initial situation was favorable: a small, stable group of trained IT specialists, plus a comprehensive code peer review in which every change is checked by a different person before it is allowed into the release. The study investigated whether code written by someone else, which had just been reviewed, could then be correctly covered with tests. Even under these conditions, this was not successful.

If gaps remain in a manageable, well-organized team, then they are the rule in large, historically grown systems. This is precisely where the analysis comes in, instead of relying on gut feeling.

How test impact analysis brings back quick feedback

The test impact analysis selects exactly those tests from a large test suite that are affected by your latest changes. This drastically shortens the feedback loop.

Many teams are familiar with the problem behind this: over time, more and more automated tests are added and the overall runtime increases. One customer has 80,000 automated end-to-end tests that take around 400 hours to run one after the other. Even highly parallelized tests often take several days to produce results.

Late feedback loses its value. If something breaks today and you find out immediately, you know what the problem was. If the feedback only comes after a month, you have forgotten what you did, and there are many other changes in between that could also be to blame.

The solution: the coverage of each test case is measured once individually. After a change, it is then possible to say specifically which tests need to be run now. The results are clear: in around one percent of the total runtime, the selection finds 80 percent of the errors that the entire suite would find. In two percent of the time, it finds 90 percent.
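The selection step can be sketched as a lookup in the per-test coverage. The test names and method names below are hypothetical, and the sketch leaves out things a real tool has to handle, such as newly added or modified tests and the ordering of the selected tests.

```python
def select_impacted_tests(
    coverage_per_test: dict[str, set[str]],
    changed_methods: set[str],
) -> list[str]:
    """Return the tests whose recorded coverage touches at least one changed method."""
    return sorted(
        test
        for test, covered in coverage_per_test.items()
        if covered & changed_methods
    )


# Coverage recorded once per test case (hypothetical data).
coverage_per_test = {
    "test_checkout": {"Cart.add", "Cart.total", "Invoice.total"},
    "test_login": {"Session.start", "User.verify_password"},
    "test_invoice_pdf": {"Invoice.total", "Pdf.render"},
    "test_search": {"Index.query"},
}

changed = {"Invoice.total"}
impacted = select_impacted_tests(coverage_per_test, changed)
print(impacted)  # ['test_checkout', 'test_invoice_pdf']
```

Only two of the four tests run after this change; the other two never executed the changed method when their coverage was recorded.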

A small proportion of errors slips through, which is why the full test run is still occasionally necessary. However, the majority of new errors appear very quickly. Test impact analysis is incorporated into the continuous integration pipeline without having to rewrite existing tests. This is particularly helpful for teams that do not have a clean test pyramid and no easy way to get there.

Historically grown test suites do too much and too little at the same time

Test suites grow like the software itself, and this leads to a paradox: a lot remains untested, while at the same time many tests check almost the same thing as others. Redundancy often results from copy-paste with small changes, especially in end-to-end testing.

Pareto optimization addresses this by extracting a small test suite from a large one, which finds a large proportion of the errors in a fraction of the time. This small suite is suitable as a quality gate before starting a more expensive test run.

The benefits become concrete where testing is expensive. Hardware-in-the-loop tests with machine tools cannot be parallelized at will because the expensive machine must be physically present. The software must first be good enough to be allowed to run on it. A bug that causes 1,000 out of 5,000 tests to fail will otherwise cover up all the others.

Instead of compiling an acceptance suite by hand, the optimization calculates the suite that finds the most errors in a given time window. In studies with customers, the calculated set found twice as many errors in the same time as the hand-picked selection.
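One way to approximate such a calculation is a greedy heuristic: repeatedly pick the test that adds the most not-yet-covered methods per second until the time window is used up. This is an assumption for illustration, not necessarily the algorithm used in the studies mentioned above. Covered methods serve here as a stand-in for errors found, and `SuiteEntry` and all test data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteEntry:
    """Hypothetical record for one test: its runtime and the methods it covers."""
    name: str
    seconds: float  # must be greater than zero
    covers: frozenset[str]


def pick_suite(tests: list[SuiteEntry], budget_seconds: float) -> list[SuiteEntry]:
    """Greedily choose tests with the best additional coverage per second."""
    remaining = budget_seconds
    covered: set[str] = set()
    chosen: list[SuiteEntry] = []
    candidates = list(tests)
    while candidates:
        # Keep only tests that still fit into the budget and still add coverage.
        candidates = [
            t for t in candidates if t.seconds <= remaining and t.covers - covered
        ]
        if not candidates:
            break
        best = max(candidates, key=lambda t: len(t.covers - covered) / t.seconds)
        chosen.append(best)
        covered |= best.covers
        remaining -= best.seconds
        candidates.remove(best)
    return chosen


order_flow = frozenset({"Cart.add", "Cart.total", "Invoice.total", "Mail.send"})
tests = [
    SuiteEntry("e2e_full_order", 600, order_flow),
    SuiteEntry("e2e_full_order_copy", 620, order_flow),  # copy-paste redundancy
    SuiteEntry("api_discount", 60, frozenset({"Invoice.total", "Invoice.apply_discount"})),
    SuiteEntry("api_login", 20, frozenset({"Session.start"})),
    SuiteEntry("ui_report", 300, frozenset({"Report.build", "Report.export"})),
]

suite = pick_suite(tests, budget_seconds=400)
print([t.name for t in suite])  # ['api_login', 'api_discount', 'ui_report']
```

With a budget of 400 seconds, the heuristic fills 380 seconds with three short tests and leaves out both long, redundant end-to-end tests, which would not fit into the window anyway.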

This also pays off when it comes to saving resources. Anyone running tests in the cloud, for example on AWS, pays for every execution, so costs rise as more tests are added and as they are run more frequently. A smaller, targeted suite reduces these costs directly.

Acceptance is the deciding factor, not the tool

An analysis tool is only effective if the team uses it, and that requires change management. This applies equally to static, dynamic and hybrid analyses. Hybrid analyses combine static and dynamic views.

Measuring alone does not improve anything. Every team needs a loop in which it is clear what the tool can and cannot do. This includes proactively addressing concerns about personal performance monitoring before anything is measured at all. This fear often lingers in the background when analyses are introduced.

Seamless embedding is just as important. The information should appear in the tools where work is being done anyway: in the test management tool for testers, in the IDE and in the pull request for developers. A tool that is thrown over the fence falls flat.

Providing tools, implementation and support from a single source also shortens the game of telephone between users and the development team. If the people in support help develop the tool themselves and use it themselves, feedback comes back more directly and quality problems are noticed first in their own product.

Where Test Intelligence is heading

The same technology that measures in testing can also be used in production to see what users are actually using. This often uncovers functionality in historically grown software that no one has needed for years.

Deleting such dead code pays off twice: you save effort in testing and in development. If a static analysis finds a security vulnerability in an area that is no longer needed, deleting it is cheaper than laboriously repairing it.

Further data sources are on the roadmap. Requirement and test-smell analyses examine natural-language requirements the way a static analysis examines code, for example looking for passive constructions or vague words. Such formulations lead to underspecified test cases that are interpreted differently by different people, resulting in non-deterministic testing. An ambiguous requirement leaves open how it will be implemented in the end.
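A very rough impression of such a check can be given with a word list and a regular expression. Both are assumptions made for this sketch: the vague-word list is invented, and the passive heuristic (a form of "to be" followed by a word ending in "-ed") is deliberately crude and will miss or misjudge many sentences.

```python
import re

# Hypothetical list of vague words; a real analysis would use a curated catalog.
VAGUE_WORDS = {"quickly", "fast", "appropriate", "user-friendly", "easy", "flexible"}

# Crude passive heuristic: a form of "to be" directly followed by a word ending in -ed.
PASSIVE = re.compile(r"\b(?:is|are|was|were|be|been|being)\s+\w+ed\b", re.IGNORECASE)


def find_smells(requirement: str) -> list[str]:
    """Return human-readable findings for one requirement sentence."""
    findings = []
    for word in re.findall(r"[a-z]+(?:-[a-z]+)*", requirement.lower()):
        if word in VAGUE_WORDS:
            findings.append(f"vague word: {word}")
    for match in PASSIVE.finditer(requirement):
        findings.append(f"passive construction: {match.group(0)}")
    return findings


vague = "The report is generated quickly and displayed in an appropriate format."
precise = "The system stores each invoice for ten years."

print(find_smells(vague))
print(find_smells(precise))
```

The first sentence leaves open who generates the report, how quickly, and in which format, which is exactly the kind of gap that different testers fill differently. The second sentence produces no findings.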

Another field is requirements tracing. A tool that sits in all systems anyway can generate verification matrices from requirements, manual and automated test cases and code. Each pull request shows when a requirement has changed and what this means for code and tests. In this way, the matrix is not created retrospectively for an auditor, but the artifacts remain continuously consistent.