Keep CI small

Test selection in the CI pipeline means automatically executing, for each commit, only those tests that actually cover the changed source code. The tool used for this analyzes code coverage at the level of individual tests and prioritizes the selection so that the execution time stays under five minutes. The goal is a total pipeline runtime of no more than ten minutes, even though the full suite has around 6,000 test cases.

Key Takeaways

The CI pipeline at Dolby is hard-limited to 10 minutes, of which a maximum of 5 minutes is allocated to test execution and the rest to build, environment setup and unit testing.
TeamScale automatically prioritizes the returned tests according to proximity to the changed source code, failure frequency and recency, so that the most relevant tests are executed first.
PyTest parameterization quickly generates 50 to 100 test cases from a single test function, which is why not all 6,000 test cases fit into the CI and a time-based selection becomes necessary.
The integration of PyTest and TeamScale took two to three weeks because there was initially no native support for Python and a separate plugin had to be written for test-accurate code coverage capture.
The code coverage upload runs nightly, so TeamScale is at most one day behind the current state. For individual branches, the upload can be triggered manually if necessary.

Why fast CI pipelines fail with growing test volumes

A CI pipeline should provide feedback in minutes, not hours. This is exactly what becomes difficult as soon as the number of test cases grows. Many teams know the end state: 6,000 tests, an overnight run, and during the day only the hope that the most important tests are being run.

Lars Kempe, QA Lead at Dolby, describes the core of the problem using the company’s own test landscape. The team automates 100 percent of its tests with PyTest and builds libraries that are delivered to customers. The full suite of around 6,000 test cases runs at night under Linux and takes around one and a half hours. That is too long for a CI pipeline.

The real driver of the test count is parameterization. Different configurations quickly turn a single test function into many test cases. For audio, the range extends from mono through stereo and 5.1 to Dolby Atmos with additional height channels. Combine this with different bit rates and frame rates, and one test function quickly results in 50 to 100 test cases.
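The multiplication is easy to see in a small sketch. The value lists below are illustrative assumptions, not the team's actual test matrix; they only show how three short lists already produce a count in the range mentioned above.

```python
from itertools import product

# Illustrative values only; the real configurations are not given in the text.
CHANNEL_LAYOUTS = ["mono", "stereo", "5.1", "atmos"]
BIT_RATES_KBPS = [64, 128, 256, 448]
FRAME_RATES_FPS = [24, 25, 30, 48, 60]


def build_test_cases():
    """Return every (layout, bit rate, frame rate) combination."""
    return list(product(CHANNEL_LAYOUTS, BIT_RATES_KBPS, FRAME_RATES_FPS))


if __name__ == "__main__":
    cases = build_test_cases()
    print(len(cases))  # 4 * 4 * 5 = 80 test cases from one test function
```

In PyTest, stacking one `@pytest.mark.parametrize` decorator per axis on a test function produces the same cross product, so every value added to one list multiplies the number of collected test cases.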

Manual test selection goes stale as soon as no one maintains it

If you select tests manually using markers, you have to constantly update this selection. This is the inconvenient truth behind every fast pipeline that wants to get by without tools.

In practice, maintenance falls by the wayside. You would have to sit down every two weeks and adapt the markers to the newly added code. Instead, a basic selection is run, and nobody knows for sure whether it really covers the new changes.

This leads to a false sense of security. The pipeline is green because “at least the most important things” are running. Whether the newly changed code is included remains open. It was precisely at this point that the team decided against manual maintenance and in favor of tool-supported selection.

How coverage-based test selection works with TeamScale

The approach stands or falls on one piece of information: which test covers which source code? TeamScale needs this connection in order to select the appropriate tests for a code change.

The basics are quickly set up. Git integration is completed in around ten minutes. TeamScale then knows the branches, the commits and the code base. The time-consuming part is the link between tests and covered code.

This link is created via code coverage at the level of individual tests. The team instruments the binaries with GCOV and measures the coverage. The problem with overall coverage is that it does not show which individual test touched which lines. To find out by hand, you would have to execute each test individually, save its coverage, reset everything and start the next test.
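The manual loop described above can be sketched as follows. The sketch assumes that the instrumented code writes its counters to `.gcda` files, which is standard gcov behavior, and that deleting those files resets the counters. `run_test` and `save_coverage` are hypothetical callbacks, not part of any tool named in the text.

```python
from pathlib import Path


def reset_counters(build_dir):
    """Delete gcov counter files so the next test starts from zero."""
    gcda_files = list(Path(build_dir).rglob("*.gcda"))
    for gcda in gcda_files:
        gcda.unlink()
    return len(gcda_files)


def collect_per_test(test_ids, build_dir, run_test, save_coverage):
    """Run tests one by one and store the coverage of each separately.

    run_test and save_coverage are hypothetical callbacks supplied by the caller.
    """
    for test_id in test_ids:
        reset_counters(build_dir)          # start from clean counters
        run_test(test_id)                  # instrumented code writes .gcda files
        save_coverage(test_id, build_dir)  # attribute the counters to this test
```

Doing this by hand for thousands of test cases is not realistic, which is why the bookkeeping has to move into the test runner itself.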

The PyTest plugin as a bridge to coverage

The solution was a separate PyTest plugin that uses the PyTest hooks. A hook is triggered after every single test case execution, so once per parameterized case and not just once per test function. At this point, PyTest already provides the test name, the status (pass, fail, skip) and the runtime.
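A minimal sketch of such a hook, not the team's actual plugin: `pytest_runtest_logreport` is a standard PyTest hook that receives a report for the setup, call and teardown phase of each test case. The `RESULTS` list and the dictionary keys are assumptions made for illustration.

```python
from types import SimpleNamespace

RESULTS = []  # one entry per executed test case


def pytest_runtest_logreport(report):
    """PyTest hook, called for the setup, call and teardown phase of each test case."""
    ran = report.when == "call"
    # Tests that are skipped or break during setup never reach the "call" phase.
    not_run = report.when == "setup" and report.outcome != "passed"
    if ran or not_run:
        RESULTS.append(
            {"test": report.nodeid, "status": report.outcome, "seconds": report.duration}
        )


if __name__ == "__main__":
    # Stand-in for a PyTest report object, to show the shape of the recorded data.
    fake = SimpleNamespace(
        nodeid="test_decode.py::test_decode[stereo-128]",
        when="call",
        outcome="passed",
        duration=0.42,
    )
    pytest_runtest_logreport(fake)
    print(RESULTS)
```

Placed in a `conftest.py` or a plugin module, the function is picked up by PyTest by name. The node ID includes the parameter suffix in square brackets, which is what makes the data specific to a single test case.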

The missing information is the covered code. To obtain it, the plugin evaluates the log files from GCOV and extracts the covered files together with their line numbers. This data is written to a file in the format that TeamScale expects and is then uploaded.
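How the covered lines might be extracted can be sketched as below. The sketch assumes the plain-text `.gcov` output, in which each row has the form `count:line number:source`. The record layout returned by `build_record` is hypothetical and stands in for whatever format TeamScale actually expects.

```python
def covered_lines(gcov_text):
    """Return the sorted line numbers that were executed at least once."""
    lines = []
    for row in gcov_text.splitlines():
        parts = row.split(":", 2)
        if len(parts) < 3:
            continue  # not a 'count:line:source' row
        count = parts[0].strip().rstrip("*")
        lineno = parts[1].strip()
        # Non-numeric counts ('-', '#####', '=====') mean the line was not executed.
        if count.isdigit() and int(count) > 0 and lineno.isdigit() and int(lineno) > 0:
            lines.append(int(lineno))
    return sorted(lines)


def build_record(test_id, status, seconds, gcov_by_file):
    """Combine test metadata with per-file covered lines (hypothetical layout)."""
    return {
        "test": test_id,
        "status": status,
        "seconds": seconds,
        "coverage": {path: covered_lines(text) for path, text in gcov_by_file.items()},
    }


SAMPLE = """\
        -:    0:Source:decoder.c
        -:    1:#include <stdio.h>
        3:    2:int decode(int x) {
        3:    3:    if (x < 0)
    #####:    4:        return -1;
        3:    5:    return x;
        -:    6:}
"""

if __name__ == "__main__":
    print(build_record("test_decode[stereo-128]", "passed", 0.42, {"decoder.c": SAMPLE}))
```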

This fine-grained integration took two to three weeks. TeamScale was initially heavily focused on Java and did not directly support the PyTest environment. The connection was therefore made via the API, in close coordination with the provider.

The coverage data is updated nightly, the code continuously

Source code and new tests arrive automatically in TeamScale, while the coverage data is updated once a night. This separation is a conscious decision with a clear consequence.

Thanks to the Git integration, TeamScale always knows which code and which tests exist. The test-to-code mapping, on the other hand, is only created during the nightly run. This results in an offset of a maximum of one day: a new test that is created during the day in parallel with the code is only included in the selection on the following day.

For long-lived or larger branches, the update can be triggered manually. A single update for the branch is sufficient, after which the test-to-code mapping is correct again. In everyday use, the one-day delay is not critical because the crucial mechanism is automatic: if someone changes code, the corresponding tests are selected by the next day at the latest.

Time is the hard selection criterion, not the test quantity

The pipeline selects tests based on the available time rather than on a fixed number of tests. The limit is five minutes of pure test execution time, embedded in a total CI runtime of ten minutes.

When a code change is made, TeamScale returns the relevant tests already prioritized. The prioritization takes into account how close a test is to the changed code, how often it has previously failed and how new it is. Because of parameterization, a single change can still return 800 tests.

Nobody wants to run all 800 of these tests. Initially, the team added up the runtimes of the prioritized tests and cut off after five minutes. Later, an API function was added on request that takes the available time directly as input. TeamScale then returns exactly the tests that fit into this time window.
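The team's initial approach, summing runtimes in priority order and cutting off at the budget, can be sketched in a few lines. The function name and the tuple layout are assumptions. The sketch also treats the budget as summed runtime and ignores any speed-up from parallel execution.

```python
def select_within_budget(prioritized_tests, budget_seconds=300.0):
    """Keep tests in priority order until the next one would exceed the budget.

    prioritized_tests: list of (test_id, expected_seconds), most relevant first.
    """
    selected, total = [], 0.0
    for test_id, seconds in prioritized_tests:
        if total + seconds > budget_seconds:
            break  # cut off here; everything after this has lower priority
        selected.append(test_id)
        total += seconds
    return selected, total


if __name__ == "__main__":
    tests = [("a", 120.0), ("b", 150.0), ("c", 60.0), ("d", 10.0)]
    print(select_within_budget(tests))  # (['a', 'b'], 270.0)
```

A strict cut-off keeps the priority order intact. A variant could skip a test that does not fit and continue with shorter ones, which fills the window better but runs lower-priority tests ahead of a more relevant one.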

The time limit prevents the system from reverting to the initial state. Without it, a major change could, in the worst case, drag thousands of tests back into the pipeline.

This is what the process in the pipeline looks like

Build and test preparation run in parallel, saving valuable minutes. While the build is running, the testing environment is installed and the tests are queried from TeamScale.

The query itself is an API request based on the commit ID. A small hurdle: in the API version used, TeamScale does not expect the commit ID itself, but the timestamp of the commit. The conversion is done by a Git command as a one-liner.
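One way to do this conversion from Python is sketched below. `git show -s --format=%ct` prints the committer date of a commit as Unix seconds. Whether the API expects seconds or milliseconds, and the committer rather than the author date, is not stated in the text, so the millisecond conversion here is an assumption.

```python
import subprocess


def to_millis(git_output):
    """Convert the output of `git show -s --format=%ct` (Unix seconds) to milliseconds."""
    return int(git_output.strip()) * 1000


def commit_timestamp_ms(commit_id, repo_dir="."):
    """Ask Git for the committer timestamp of a commit."""
    result = subprocess.run(
        ["git", "show", "-s", "--format=%ct", commit_id],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if Git does not know the commit
    )
    return to_millis(result.stdout)
```

In a pipeline, `commit_id` would typically come from a CI variable holding the current commit hash.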

TeamScale responds with a JSON file that can be converted directly into a Python dictionary and contains the test runtimes. The team uses a PyTest plugin to generate a text file from the selected test cases, which is passed via the command line. PyTest then executes exactly these tests, parallelized via corresponding plugins.
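The step from JSON response to test list can be sketched as follows. The response shape and the field names `tests`, `testName` and `durationInMs` are hypothetical, since the text does not give the real format. How the resulting file is handed to PyTest depends on the plugin used and is not shown.

```python
import json
from pathlib import Path

# Hypothetical response shape; the real field names are not given in the text.
RESPONSE = """
{
  "tests": [
    {"testName": "tests/test_decode.py::test_decode[stereo-128]", "durationInMs": 1200},
    {"testName": "tests/test_decode.py::test_decode[5.1-256]", "durationInMs": 3400}
  ]
}
"""


def write_test_list(response_text, target):
    """Parse the JSON response and write one test ID per line."""
    data = json.loads(response_text)  # JSON object becomes a Python dict
    ids = [test["testName"] for test in data["tests"]]
    Path(target).write_text("\n".join(ids) + "\n", encoding="utf-8")
    total_seconds = sum(test["durationInMs"] for test in data["tests"]) / 1000
    return ids, total_seconds


if __name__ == "__main__":
    ids, seconds = write_test_list(RESPONSE, "selected_tests.txt")
    print(len(ids), "tests, expected runtime", seconds, "seconds")
```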

Once the build is complete, the tests run in a maximum of five minutes. With the surrounding tools, the entire CI takes around ten minutes. If it fails, the triggering developer sees the status in GitLab and receives an email. With a team of five to six people, the cause is usually found quickly; otherwise the team looks into it together.

A small team can shape its tool instead of just operating it

The fact that the integration was completed in just a few weeks has to do with the size of the team. Those who keep architecture and implementation in one pair of hands make faster progress.

Lars describes his dual role: as QA Lead, he also does the programming and the QA work himself. In large organizations, test architecture and implementation are often split into separate roles, which has its own advantages. Here, the proximity to the system helped with the fast integration: sometimes two hours of pair programming were enough.

The open tool chain was also helpful. PyTest and Python come with many plugins, and there was already a suitable solution for generating the test list from a text file. Where TeamScale did not cover the PyTest world, it could be extended via the API, in some cases directly with new functions from the provider.

When I do this, I want to do it properly. We start with a fast CI from the beginning, even if we don’t have that many tests or code yet.
Lars Kempe

What’s next on the list

The CI setup is considered solved and has been running stably for around a year and a half without any major changes. The next lever is the merge requests, where more tests should be allowed to run.

At the same time, the team wants to roll out TeamScale in other projects that are not yet using the tool. The goal there is the same as in the initial project: to optimize the test selection and keep the feedback fast.

Test data is not a bottleneck in this environment. The input files are available and are fetched live by the tests; in around 80 percent of cases they are audio files. What you have to pay attention to is the length: decoding a four-minute PCM file also takes about four minutes. With a five-minute budget, every runtime counts.