Automatic test case selection for regression testing

Regression test selection means automatically running, at each code check-in, only the test cases that the change actually affects. It is based on a mapping recorded nightly: which files, classes or functions does each test case use? If one of these changes, only the tests that depend on it are run. This often reduces runs of several hours to 10 to 15 minutes.

Key Takeaways

The combination of nightly test phase analysis and compile-time C++ dependency tracking enables fully automated test case selection across language boundaries without manual intervention.
A test suite with 25,000 test cases and originally 6.5 hours runtime can be reduced to 10 to 15 minutes feedback time by selection in more than 50 percent of the commits.
The granularity of the selection determines the efficiency: selecting at DLL level is not enough. Only analysis at function level brought the same breakthrough for C++ that the class level had previously brought for Java.
Validation in parallel with full runs over several weeks showed that the selection did not overlook a single real test failure. All deviations were due to infrastructure problems.
The team’s skepticism at the start of the project only disappeared when the Java part was finished and the concrete benefits were measurable.

Why long regression testing blocks development

A complete test execution of six and a half to seven hours slows down any pipeline. IVU has around 25,000 test cases that take this long to run completely. Developers who check something in do not want to wait for hours until it is clear whether the commit was clean.

The problem scales with the number of branches maintained in parallel. At IVU, there are up to ten of them: several releases are maintained for customers, and a bug fix may have to be carried through four or five branches. A single check-in can therefore trigger six and a half hours of testing on each of several branches.

Even with ten large machines in continuous operation, there was not enough capacity. Some of the test cases had to be deliberately capped because otherwise a backlog would build up. The alternative, testing everything once a night, has a price: if a test fails in the morning, the team first has to find the commit that caused the failure, which costs additional analysis time.

What test case selection means for regression testing

Test case selection means selecting exactly those test cases from a large test suite that are affected by a specific code change. Instead of running all 25,000 test cases, the system only executes those that are actually affected by the check-in in question.

The selection process is fully automated at IVU. Manual intervention is not necessary. Based on the files checked in, the system decides which test cases are relevant and only triggers these.

The approach was developed in collaboration with the Technical University of Munich. Doctoral student Daniel Elsner worked on the project for three years at the chair of Professor Alexander Pretschner and investigated the optimization of automatic regression testing. The current procedure was gradually developed from several experiments.

The challenge: two languages and one database

IVU’s software mixes two major technologies. Large parts are implemented in C++, others in Java. The components build on each other: Journeys generate services, services generate the specific work instructions for employees, and individual stages of this chain are sometimes in C++, sometimes in Java.

In addition, there is a strong dependency on data in the database. Without data, meaningful testing is hardly possible, and mocking everything is too complicated, especially for the edge cases. For this reason, the system imports data into a database at the start of a test run, usually with the help of the C++ components.

Which binaries are required differs for each test case. A Java program first calls a C++ binary to generate its test data and then executes actions on it, such as generating a printout or calling an interface to a peripheral system. Not every Java test case needs the same C++ binaries, and some need none at all.

How the selection works technically

The core of the process is a link between checked-in files and the test cases that actually use these files. The first step was to record on the Java side which files are loaded for each test run.

Initially, this was not about the code level in detail, but about loaded artifacts. Which Java class is actually loaded from a JAR? Which C++ library is loaded to generate data? And even without any code: which XML, CSV or YAML file is opened, for example to define a test oracle or to configure settings? Such files also influence test execution.

There are two pipelines for this. The first builds and tests continuously. The second runs regularly, usually at night, executes all test cases and logs which files are opened in the process. From this collected data, the system can immediately deduce at the next check-in which test cases are affected by a changed file.

C++ libraries are a special case, as it is not the DLLs that are checked in, but the source and header files. The C++ build itself helps here: during compilation, information is generated about which files go into which DLL. This link is recorded as well, so that the affected test cases can be derived from each source change.

The data collected nightly forms a kind of index. During the actual selection, the system only consults this index and performs no new analysis. This keeps the selection fast at check-in time.
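One way to picture such an index is as an inverted mapping from file to test cases. The sketch below is illustrative only: `build_index`, `select_tests`, the file paths and the test names are hypothetical. The fallback to the full suite for files the index has never seen is an assumption of this sketch, not something the project describes.

```python
from collections import defaultdict


def build_index(trace):
    """Invert a nightly trace (test -> files opened) into file -> tests."""
    index = defaultdict(set)
    for test, files in trace.items():
        for path in files:
            index[path].add(test)
    return index


def select_tests(index, changed_files, all_tests):
    """Return the tests to run for a check-in, using only the prebuilt index."""
    selected = set()
    for path in changed_files:
        if path not in index:
            # Assumption: an unknown file gives no evidence, so run everything.
            return set(all_tests)
        selected |= index[path]
    return selected


# Hypothetical nightly trace: which files each test opened.
nightly_trace = {
    "TripPlanningTest": {"planning.jar!/Trip.class", "oracle/trips.xml"},
    "DutyPrintTest": {"duty.jar!/Duty.class", "datagen.dll"},
    "InterfaceTest": {"duty.jar!/Duty.class", "settings/interface.yaml"},
}
index = build_index(nightly_trace)

print(sorted(select_tests(index, ["oracle/trips.xml"], nightly_trace)))
# ['TripPlanningTest']
```

Because the expensive tracing happens at night, the check-in path reduces to a few dictionary lookups.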

The results: from hours to minutes

The selection skips 50 to 60 percent of all test cases. This was measured on a development branch and on an already released branch that receives only bug fixes.

On the Java side, the runtime dropped significantly. Previously it was two and a half hours, now it is one hour on average. In many cases, developers receive feedback after 10 to 15 minutes as to whether their commit is OK.

The jump on the C++ side was even greater after a second expansion stage. Instead of selecting only at DLL level, the system now checks within the DLLs at function level which C++ function is actually called. If a function has changed, only the test cases that go through this function are run. The C++ side thus fell from around four to four and a half hours previously to usually 10 to 15 minutes.
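The difference between the two granularities can be illustrated with a small sketch. All names and data here are hypothetical, and the sketch assumes the nightly run has recorded which functions each test calls.

```python
def select_dll_level(changed, function_to_dll, test_to_functions):
    """Select every test that calls anything in a DLL containing a changed function."""
    changed_dlls = {function_to_dll[f] for f in changed}
    return {
        test
        for test, funcs in test_to_functions.items()
        if any(function_to_dll[f] in changed_dlls for f in funcs)
    }


def select_function_level(changed, test_to_functions):
    """Select only the tests that actually call a changed function."""
    changed = set(changed)
    return {test for test, funcs in test_to_functions.items() if funcs & changed}


# Hypothetical data: three functions in one DLL, three tests.
function_to_dll = {
    "import_trips": "datagen.dll",
    "import_duties": "datagen.dll",
    "parse_date": "datagen.dll",
}
test_to_functions = {
    "TripPlanningTest": {"import_trips", "parse_date"},
    "DutyPrintTest": {"import_duties", "parse_date"},
    "InterfaceTest": {"import_duties"},
}

changed = ["import_trips"]
print(sorted(select_dll_level(changed, function_to_dll, test_to_functions)))
# ['DutyPrintTest', 'InterfaceTest', 'TripPlanningTest']
print(sorted(select_function_level(changed, test_to_functions)))
# ['TripPlanningTest']
```

When many tests share one large DLL, DLL-level selection degenerates to running nearly everything, which is why the finer level pays off.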

The long full runs do not disappear completely; they just become less frequent. In the case of major changes to the core, all necessary test cases continue to run, and then the two and a half hours are acceptable again. Especially in released branches whose bug fixes are sent to customers shortly afterwards, the team wants the assurance of a full run.

	Before	After (on average)
Java side	2.5 hours	1 hour, often 10-15 minutes
C++ side (function level)	4-4.5 hours	10-15 minutes
Saved test cases		50-60 percent

Trust comes from validation, not from promises

For a selection procedure to be accepted, it must prove that it does not overlook any real errors. The entire suite was run in parallel for four to six weeks, while a record was kept of what the selection would have chosen.

The result: not a single missed error. The few errors that only the full execution found were exclusively due to the infrastructure, such as a temporarily unavailable database due to network problems. Such interruptions do not count as overlooked test failures.
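The comparison behind such a validation can be sketched as a set difference per check-in. The records below are invented purely for illustration and are not the project's data. The function name and the record layout are hypothetical as well.

```python
def missed_failures(full_run_failures, selected, infrastructure=()):
    """Failures of the full run that the selection would have skipped,
    ignoring failures already attributed to infrastructure problems."""
    return set(full_run_failures) - set(selected) - set(infrastructure)


# Hypothetical shadow-run records, one per check-in: the full suite ran,
# and the selection was only logged, not enforced.
shadow_runs = [
    {
        "commit": "c1",
        "failed": {"DutyPrintTest"},
        "selected": {"DutyPrintTest", "InterfaceTest"},
        "infrastructure": set(),
    },
    {
        "commit": "c2",
        "failed": {"TripPlanningTest"},
        "selected": {"InterfaceTest"},
        # Classified by hand, e.g. the database was unreachable.
        "infrastructure": {"TripPlanningTest"},
    },
]

report = {
    run["commit"]: missed_failures(run["failed"], run["selected"], run["infrastructure"])
    for run in shadow_runs
}
print(all(not missed for missed in report.values()))
# True
```

Any non-empty entry in `report` would be a real failure that the selection had missed and would need investigation before the selection could be trusted.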

This validation is the basis of confidence in the selection. Without the reliable comparison against the full run, there would always be doubt as to whether the system is filtering out the right test cases.

Faster feedback changes the behavior of developers

Small commits pay off when the selection takes effect. If you check in fewer changes at once, you touch less code and thus trigger a smaller number of tests. The feedback comes back faster. On the other hand, if you collect changes over days and check everything in at once, you trigger a large test run again.

The most noticeable reaction was the disappearance of complaints. Previously, there was a lot of criticism about the slow, non-functioning pipeline. These complaints have disappeared, even if exuberant praise rarely arrives directly.

What is more remarkable is how the mood in the project changed. Before the launch, some doubted whether it was worth the effort because previous attempts to analyze the source code in more detail had failed due to its size and complexity.

“When we had finished the Java part, there were suddenly a lot of appreciative voices. I would never have thought that we would get so much benefit from this project.” (Silke Reimer)

What can be optimized next

The biggest open lever is the prioritization of the remaining long runs. Today, the system executes all selected test cases, in no particular order. For long-running parts, it would be possible to control which test cases run first, so that probable failures become visible early on.

There are two ways to prioritize. Test coverage can be used to select each additional test case in such a way that it provides as much new coverage as possible. Alternatively, you can use historical data and execute first where something failed particularly frequently.
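Both orderings can be sketched in a few lines. This is an illustrative sketch with hypothetical names and invented data. The coverage variant is the common greedy "additional coverage" heuristic, which may or may not be what the team would choose.

```python
def prioritize_by_coverage(coverage):
    """Greedy order: always pick the test that adds the most uncovered items."""
    remaining = dict(coverage)
    covered = set()
    order = []
    while remaining:
        # sorted() makes ties deterministic (alphabetical).
        best = max(sorted(remaining), key=lambda t: len(remaining[t] - covered))
        order.append(best)
        covered |= remaining.pop(best)
    return order


def prioritize_by_history(failures, runs):
    """Order tests by historical failure rate, highest first."""
    def rate(test):
        return failures.get(test, 0) / runs[test] if runs[test] else 0.0
    return sorted(runs, key=lambda t: (-rate(t), t))


# Hypothetical coverage data: test -> covered functions.
coverage = {
    "A": {"f1", "f2", "f3"},
    "B": {"f3", "f4"},
    "C": {"f1", "f2"},
}
print(prioritize_by_coverage(coverage))
# ['A', 'B', 'C']

# Hypothetical history: failures and total runs per test.
print(prioritize_by_history({"A": 1, "B": 5, "C": 0}, {"A": 100, "B": 100, "C": 100}))
# ['B', 'A', 'C']
```

The two criteria can disagree, as they do here for the first position, so a real pipeline would have to pick one or combine them.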

Technically, the Jenkins pipeline must also play its part. The run should not stop at the first error because further failures remain interesting. At the same time, the developer should receive an early warning: preliminary report immediately, full report later.
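The intended control flow, continuing after the first failure while warning early, can be sketched in plain Python. This does not use any Jenkins API. The runner, the `notify` callback and the example tests are hypothetical stand-ins.

```python
def run_with_early_warning(tests, notify):
    """Run every test; notify once on the first failure, return all failures."""
    failures = []
    for name, test in tests:
        try:
            test()
        except AssertionError as exc:
            failures.append((name, str(exc)))
            if len(failures) == 1:
                # Preliminary report: sent immediately, the run continues.
                notify(f"preliminary: {name} failed")
    return failures  # Full report: available once all tests have run.


def passing():
    assert True


def failing():
    assert 1 + 1 == 3, "arithmetic is broken"


warnings = []
full_report = run_with_early_warning(
    [("first", passing), ("second", failing), ("third", failing)],
    warnings.append,
)
print(warnings)
# ['preliminary: second failed']
print([name for name, _ in full_report])
# ['second', 'third']
```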

On the Java side, the same refinement step is pending as for C++. The system currently selects at Java class level. The next step would be to go down to the individual method and only execute the test cases that exactly hit the changed method.

The benchmark: millions of lines of code, nightly index

The method proves itself on a large code base. The C++ part comprises almost 10 million lines of code, the Java part around 4 million. One of the reasons for the initial skepticism about the project was the doubt that such a quantity of code could be analyzed in any meaningful way.

The nightly index run, which rebuilds the links between files and test cases, takes around five to six hours. IVU has now reduced the number of pipeline machines from ten to five because there are few check-ins at night and this free capacity can be used for the index runs.

With ten branches and five machines, each machine manages approximately one index run per night. On average, the index of each branch is therefore rebuilt about every two days, or sooner if a machine becomes free earlier. The Java part has been running productively for around one and a half to two years, and the more precise C++ analysis at function level for three to four months.
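The two-day figure follows from simple arithmetic, assuming each machine completes about one index run per night (a five-to-six-hour run leaves little room for a second one). The function name below is hypothetical.

```python
def nights_between_rebuilds(branches, machines, runs_per_machine_per_night=1):
    """Average number of nights until the same branch's index is rebuilt."""
    return branches / (machines * runs_per_machine_per_night)


print(nights_between_rebuilds(branches=10, machines=5))
# 2.0
```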