Continuous Everything - Do we need it?

Shift left and continuous everything refer to the practice of measurably improving software quality through automation, continuous integration and early error detection. Two metrics are crucial: how quickly a change reaches the customer and how many error reports are received. Both have been proven to decrease when teams gradually drive autonomy and cultural change instead of implementing centralized guidelines.

Key Takeaways

Trunk-based development measurably smooths out defect peaks around releases: Instead of large defect packages at every release, a constant, lower defect level is created during operation.
Cultural change in software development takes years, not quarters: even with targeted craftsmanship programs, it takes two to three years for new ways of working to become anchored in teams.
Autonomy beats regulation: Teams that are allowed to decide for themselves how many open defects they tolerate accept quality rules much better than teams that are forced to follow a zero-defect policy.
Two metrics are sufficient to measure software quality in a meaningful way: Lead time to customer and number of incoming defect reports reliably show whether process changes are working.
Value stream mapping before every process optimization protects against the wrong starting point: what feels like a problem is rarely the actual problem; often the lost time is spent waiting for other teams.

Continuous Everything is only useful if you measure the benefits

Anyone who introduces continuous integration, delivery and deployment will sooner or later be faced with a simple management question: does it really make a difference? Thousands of automated builds, constant test runs, continuous delivery. All of this costs money and time before an effect becomes visible.

Continuous Everything means automating as many steps of software development as possible and bringing them into a continuous flow. The conferences are full of tool manufacturers and experience reports on how this can be implemented technically. The more difficult question usually remains unanswered: Can the quality gain be proven in figures?

Marco Achtziger takes a clear position on this. Anyone who claims that a changeover makes the software better owes the proof. The good news is that the necessary data is generated anyway. Source control systems, test execution logs and ticket systems provide time-stamped information that can be evaluated over years without any additional collection effort.

Why technical changeovers are primarily a cultural problem

Technology is the least of the problems with such changeovers. The real obstacle is habits. Developers who have worked for years with a heavyweight version control system do not voluntarily change their approach just because a new tool is available.

Switching to Continuous Everything is therefore primarily a change of mindset, not of tools. Marco Achtziger relies on Dan Pink’s motivation model with its three levers: Mastery, Purpose and Autonomy. Addressing these three points encourages people to work on their own initiative.

In this specific case, this happened through so-called craftsmanship programs. Participation was completely voluntary, which covers autonomy. The programs were structured in levels at which skills could be visibly improved, which covers mastery. And the reason behind each measure was explained, which covers purpose.

How voluntariness works when not everyone joins in

There is no such thing as a silver bullet. In every change process, the last ten percent who cannot be convinced are left behind. Marco Achtziger says so openly: There are still colleagues today who think the old version control system is better. The aim is not to get everyone on board, but to reach the majority of people.

The effective lever is the opinion leaders. In every development community, there are a few strong personalities whose voice carries weight. If you convince them and turn them into pilots, you create multipliers. They tell others what they are doing and arouse their curiosity. Curiosity turns into participation.

This path takes time. It took two to three years for a craftsmanship program to become established in the culture. In the end, almost all teams were involved, about half of them actively. Anyone expecting a cultural change that takes effect overnight will be disappointed.

How levels address specific pain points

The level structure of a craftsmanship program fulfils two tasks at the same time. It defines the expected skillset of developers and testers, and it addresses current pain points that are hurting anyway.

One example is continuous integration. It was introduced quickly, but most builds were red and nobody cared. A known effect. The levels addressed exactly that, in successive stages:

Level	Build requirement
Basic	At least one test runs in the build
Medium	The build runs the tests, but does not yet have to be green
Higher	The build is green most of the time, the team shows how it does it

Flaky tests were a pain point of their own: tests that behave differently depending on the phase of the moon or the day of the week. An unstable test isn’t bad in and of itself. What is bad is ignoring it. A quarantine build served as the central suggestion: If a flaky test cannot be made green within a few hours, it moves from the normal build to quarantine. This keeps the regular build green.
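
The quarantine rule can be sketched in a few lines. This is only an illustration under stated assumptions: the four-hour limit, the `FlakyTest` record and the `split_builds` function are hypothetical names and values, since the text only speaks of "a few hours".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed limit; the source only says "a few hours".
QUARANTINE_AFTER = timedelta(hours=4)


@dataclass
class FlakyTest:
    name: str
    first_seen_flaky: datetime  # when the instability was first noticed


def split_builds(all_tests, flaky, now):
    """Return (regular, quarantine) lists of test names.

    A flaky test stays in the regular build while the team still has
    time to fix it and moves to quarantine once the limit has passed.
    """
    overdue = {
        t.name for t in flaky
        if now - t.first_seen_flaky > QUARANTINE_AFTER
    }
    regular = [name for name in all_tests if name not in overdue]
    quarantine = [name for name in all_tests if name in overdue]
    return regular, quarantine
```

Quarantined tests still run, only in a separate build, so the instability stays visible without turning the regular build red.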

Self-selected metrics have a stronger effect than predefined ones

In many levels, the programs did not specify a concrete solution, but asked the teams to develop their own suggestions. One level, for example, required metrics that the teams could use to optimize their code. The first question was predictably: Which metrics should we use?

The answer was: Tell us yourselves. Think about which metrics will help you. In the discussions, many ended up with established measures such as code coverage or McCabe complexity, but were able to explain why themselves. That was exactly the goal. If you understand the meaning of a metric, you use it sensibly instead of fulfilling a number that a central department has specified.

Flaky tests don’t go away, you learn to deal with them

A stable build is not an end in itself, and a one hundred percent green build is even suspect. Marco Achtziger puts it in a nutshell:

If a build is one hundred percent green, then congratulations, you have stable testing. The bad news is: apparently nobody is working anymore.
Marco Achtziger

Wherever people work, errors occur, and this results in a certain amount of instability in the system. Insisting on stabilizing every test at all costs does not lead anywhere. What matters is not to ignore instability and to find a way of dealing with it so that the majority of the builds remain green.

This learning process can be seen directly in the data. Shortly after the launch of the craftsmanship programs, the percentage of flaky tests increased as more people started writing tests. Then, as awareness of how to deal with unstable tests grew, the percentage dropped again and leveled off at a low value.

Two metrics are enough to assess software quality

When it comes to assessing software quality, only two metrics are really meaningful: how long it takes for a change to reach the customer and how often the customer complains about the software.

Translated, this means lead time and customer feedback, usually in the form of defects or tickets. Both variables can be reconstructed from existing data pools and tracked over years. The effect of a changeover does not become apparent immediately, but over the course of several versions.

The effect on lead time was clear. After switching from a branch-based approach to trunk-based development, the time it took for a change to reach the customer dropped from days to hours.
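
Lead time can be derived from two timestamps per change. The following is a minimal sketch, assuming that the commit time comes from the source control system and the delivery time from a release or deployment log; the function names and the choice of the median are illustrative, not taken from the source.

```python
from datetime import datetime
from statistics import median


def lead_times_hours(changes):
    """changes: iterable of (committed_at, delivered_at) datetime pairs."""
    return [
        (delivered - committed).total_seconds() / 3600
        for committed, delivered in changes
    ]


def median_lead_time_hours(changes):
    """The median is less sensitive to single stuck changes than the mean."""
    return median(lead_times_hours(changes))
```

Tracked per release or per month, this value shows whether a process change actually shortens the path to the customer.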

How trunk-based development smooths the defect curve

Trunk-based development changes not only the speed, but also the shape of the defect curve. In the branch-based approach, customers were supplied from different version branches. A defect had to be fixed across multiple branches, and it was often guesswork at which branch point the problem had been introduced.

In this old model, the defects occurred in peaks, always around a release. For the developers, this meant recurring shock loads that massively disrupted the normal workflow. With trunk-based development, where all customers have been supplied from the same mainline for two years, the peak curve became a constant, flat curve at a low level.

If you superimpose the curves, you not only see the smoothed shape, but also a lower level overall. The continuous load in the trunk-based model is below what remained at the end of the large release defect packages.
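
How peaked or flat such a curve is can be expressed as a number. The sketch below is one possible approach and not the analysis used in the talk: it counts defect reports per ISO calendar week and compares the busiest week with the average week. The function names are hypothetical.

```python
from collections import Counter
from datetime import date


def defects_per_week(report_dates):
    """Count defect reports per ISO (year, week).

    Weeks without any report do not appear in the result; fill them
    with zero before comparing two curves over the same period.
    """
    counts = Counter()
    for d in report_dates:
        iso = d.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


def peak_to_mean(weekly_counts):
    """Busiest week divided by the average week; 1.0 is perfectly flat."""
    values = list(weekly_counts.values())
    if not values:
        raise ValueError("no defect reports given")
    return max(values) / (sum(values) / len(values))
```

A ratio well above one indicates release peaks, while a value close to one indicates the constant load described for the trunk-based model. The overall level has to be compared separately, for example through the weekly average.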

Why linkable data is a prerequisite for any analysis

The most important insight from the data work sounds banal: Data must be captured and linkable in the first place. Small adjustments to the tooling are often enough to merge separate data pools.

A concrete example shows the principle. A database logged which tests had been run in which build. However, a field for the version number was missing for linking to the changes in the source control system. This triviality was retrofitted so that the analysis could even be reconstructed historically.

Only this link made the detection of flaky tests possible. A build as a whole is coarse-grained: it is either red or green. By linking down to commit and test case level, on the other hand, it is possible to check exactly which test delivers different results for identical software versions.
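
Once test results carry the version they ran against, the check itself is short. The following sketch assumes that each record has been reduced to a tuple of commit id, test name and pass/fail flag; the tuple layout and the function name are assumptions for illustration.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """runs: iterable of (commit_id, test_name, passed) tuples.

    A test is reported as flaky if it both passed and failed
    on the same commit, i.e. on identical software.
    """
    outcomes = defaultdict(set)
    for commit_id, test_name, passed in runs:
        outcomes[(commit_id, test_name)].add(bool(passed))
    return sorted({
        test_name
        for (_, test_name), seen in outcomes.items()
        if len(seen) == 2
    })
```

A test that passes on one commit and fails on another is not reported, because the software itself has changed between the two runs.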

How a changed mindset manifests itself in everyday life

The most visible change is not in the numbers, but in the attitude of the developers. Where there used to be a central test team and a classic divide between development and testing, the recent switch of the version control system to Git came at the request of the developers themselves. Their condition was: If continuous integration and the tests work, you are welcome to roll it out.

On the customer side, the change is reflected in the decreasing number of tickets. As a platform supplier, the team receives feedback that the platform is easier to integrate and that each new version requires less customization. The number of error reports is measurably decreasing, while the platform is demonstrably still being used.

Why the zero-defect policy failed and bug jail worked

Not every experiment worked. A zero-defect policy, which stipulated that everything else had to be abandoned immediately in the event of a defect, was not well received by the teams. Marco Achtziger would never introduce it again.

The next iteration reversed the principle. Under the name Bug Jail, the teams were allowed to decide for themselves how many defects they would allow before stopping development. More conservative teams specified a maximum of two, others five. Despite this heterogeneity, the approach worked much better.

The difference lay in the autonomy. As soon as the teams thought about the threshold themselves, they took care of incoming defects earlier. The lesson from this can be applied to any change: state the goal and the problem, but leave the way to get there to the team.
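
The Bug Jail rule is simple enough to write down as a sketch. It assumes that "a maximum of two" means that up to two open defects are tolerated and development stops when that number is exceeded; the `Team` record and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Team:
    name: str
    max_open_defects: int  # threshold chosen by the team itself


def in_bug_jail(team, open_defects):
    """True when the team exceeds its own limit and pauses feature work."""
    return open_defects > team.max_open_defects
```

The mechanism is identical for every team. Only the threshold differs, and that threshold is the part the teams own.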

Where the next lever lies: Optimize test execution

The current bottleneck is the sheer volume of test execution. A meaningful metric for this is the test execution time per real month, that is, how many months of pure test execution accumulate within one calendar month. Ideally, this value would be one, which corresponds to a single machine running all tests around the clock. In reality, the value approaches the maximum that the available machine capacity allows because many tests are mandatory in a regulated environment.
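
One way to read this metric is as total test execution time divided by the length of the observed period. The sketch below follows that reading, which is an interpretation and not a definition given in the source; the function name is hypothetical.

```python
from datetime import timedelta


def machine_equivalents(test_durations, period):
    """Total test execution time divided by the observed period.

    A result of 1.0 corresponds to one machine running tests around
    the clock; 1.5 means one and a half machines are permanently busy.
    """
    total = sum(test_durations, timedelta())
    return total / period
```

Fed with the durations from the test execution logs, the result shows how many machines the test suite keeps permanently busy.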

The solution is a machine learning system that predicts the tests that are most likely to fail for a given source code change. This could speed up a gated check-in, for example: The run can abort at the first failing test or at a set threshold of failing tests. This reduces the number of machines required to an acceptable level.
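
The principle of failure-first ordering with early abort can be shown without a trained model. The following sketch deliberately replaces the machine learning system with a simple stand-in: the historical failure rate of each test after changes to the same files. All names, the shape of the history records and the `max_failures` parameter are assumptions for illustration.

```python
from collections import defaultdict


def failure_scores(history, changed_files):
    """history: iterable of (changed_file, test_name, failed) observations.

    Score = share of past runs in which the test failed after one of
    the currently changed files was touched. Stand-in for a real model.
    """
    changed = set(changed_files)
    runs = defaultdict(int)
    fails = defaultdict(int)
    for path, test, failed in history:
        if path in changed:
            runs[test] += 1
            fails[test] += int(failed)
    return {test: fails[test] / runs[test] for test in runs}


def prioritize(tests, scores):
    """Most failure-prone tests first; unknown tests count as score 0."""
    return sorted(tests, key=lambda t: scores.get(t, 0.0), reverse=True)


def gated_run(ordered_tests, run_test, max_failures=1):
    """Run tests in order and abort once max_failures tests have failed.

    run_test is a callable that returns True if the test passed.
    Returns the names of the failed tests.
    """
    failed = []
    for test in ordered_tests:
        if not run_test(test):
            failed.append(test)
            if len(failed) >= max_failures:
                break
    return failed
```

With `max_failures=1` the gated check-in stops at the first failing test. A higher value corresponds to the threshold variant mentioned above.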

Where you should start: value stream mapping

Every changeover should start with a value stream analysis, a technique from lean management. Before you change anything, write down the figures and see where your time is actually being lost. Experience shows: What you think is the problem at the beginning is almost never the actual problem.

It is often trivialities, such as waiting for another team, that eat up most of your time. If you have the value stream in front of you, you first address the biggest time waster that can actually be influenced.
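
Written down as data, a value stream makes this selection almost mechanical. The sketch below assumes that each step is recorded with its working time, its waiting time and a flag for whether the team can influence it; the `Step` record and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    work_hours: float
    wait_hours: float
    influenceable: bool  # can the team itself change this step?


def biggest_time_waster(steps):
    """Return the influenceable step with the most waiting time, or None."""
    candidates = [s for s in steps if s.influenceable and s.wait_hours > 0]
    return max(candidates, key=lambda s: s.wait_hours, default=None)


def flow_efficiency(steps):
    """Share of the total elapsed time that is spent actually working."""
    work = sum(s.work_hours for s in steps)
    total = work + sum(s.wait_hours for s in steps)
    return work / total if total else 0.0
```

The step with the longest wait is not always the one to start with: a step the team cannot influence is skipped, even if it dominates the elapsed time.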

Plan for patience and iterate. If you do things more often and faster, you will change the problem situation itself, and after a while it is worth taking another look at the old problem list. Some steps even become slower at first. Trunk-based development requires the individual developer to think more, for example about backwards compatibility.

Therefore, always consider the entire chain, not the individual steps. Just because a sub-step becomes slower for a short time does not mean that the entire chain becomes slower. On the contrary, it speeds up.