Machine learning in software testing refers to the use of algorithms to take over recurring analysis tasks in test automation. Failed tests are grouped automatically by the cause of the error, and large test portfolios are visualized as a state graph and prioritized based on risk. This frees up time for teams to write new tests instead of only maintaining existing ones.
Key Takeaways
- Failed automated tests can be condensed to a few common causes using machine learning, so that testers no longer have to analyze hundreds of individual cases every day, but only seven to eight error classes.
- A graph model of the application is automatically created from existing log files and test execution data, making redundancies, error hotspots and areas that have not been tested in an integrated manner visible at a glance.
- The selection of relevant tests for a limited time box works via weighted risk factors such as criticality, code changes, traceability to bugs and time of last execution.
- Visual regression testing via AI recognizes unexpected state changes in the interface purely on the basis of screenshots, without the need for explicit validations to be defined in the test case.
- In safety-critical areas, a human guardian of quality remains necessary because machines only understand the context that is explicitly passed to them.
Why every second test automator is stuck in a maintenance quagmire
Anyone who automates consistently runs up against a scaling limit at some point. The testers are there, the framework is in place, everything is done by the book, and yet every working day begins with a pile of failed test cases that need to be reviewed.
The pattern is always the same. A developer changes an attribute of a UI element, the next morning there are 50 red tests in the report, and the automation engineer spends the day reworking existing tests instead of writing new ones.
At some point, the ratio tips completely: maintenance eats up all the capacity. New tests are no longer created, the portfolio is just kept alive and green. It is precisely at this point that test automation becomes a burden instead of a lever.
The largest single item in this maintenance effort is working through changed and failed tests. This is where most machine learning approaches to testing come in, because this is where the pain is most tangible.
How machine learning reduces failed tests to just a few causes
The core idea: instead of looking through 200 individual red tests, you get seven or eight causes of failure. You work through the causes, not each individual test case.
The basis for this is the log files that already exist. The approach deliberately builds on existing data instead of generating tens of thousands of new data records or laboriously labeling them. It is essentially big data on test results.
Which tool the log files come from is of secondary importance for the algorithm. A simple adapter converts commercial tools and open source frameworks into a common format. The only decisive factor is that it is possible to read out where a test starts, where a test step begins and ends, where the error is and which error message and stack trace are attached to it.
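A common format of this kind can be sketched as follows. This is only an illustration: the class names, the field names and the keys of the raw input dictionary (`id`, `steps`, `keyword`, `start`, `end`, `error`, `trace`) are assumptions, not the format of any particular tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StepRecord:
    """One test step in the tool-independent format (field names are assumed)."""
    name: str
    started_at: float
    ended_at: float
    error_message: Optional[str] = None
    stack_trace: Optional[str] = None


@dataclass
class ExecutedTest:
    """One executed test with its ordered steps."""
    test_id: str
    steps: list[StepRecord] = field(default_factory=list)

    def failed_step(self) -> Optional[StepRecord]:
        # The first step that carries an error message marks where the test failed.
        return next((s for s in self.steps if s.error_message), None)


def adapt_raw_result(raw: dict) -> ExecutedTest:
    """Hypothetical adapter: maps one tool-specific dict to the common format."""
    steps = [
        StepRecord(
            name=s["keyword"],
            started_at=s["start"],
            ended_at=s["end"],
            error_message=s.get("error"),
            stack_trace=s.get("trace"),
        )
        for s in raw["steps"]
    ]
    return ExecutedTest(test_id=raw["id"], steps=steps)
```

Each tool would get its own small adapter function, while everything downstream only works with `ExecutedTest`.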
From this technical information, the machine classifies the cause: a database problem, a UI problem, a problem with the test automation framework, a network problem. Because log files are technical and not natural language, simple procedures are often sufficient. A random forest is usually enough, and a neural network is needed only occasionally.
How to start the classification without much lead time
You start manually and on a small scale. For each failure, you enter what it is. This is good practice for analyzing weak points anyway and hardly costs any additional effort.
Larger quantities can be covered using regular expressions. If the log file contains “database error”, a regular expression automatically classifies all corresponding cases as database problems. This will not be correct in every single case, but it generates enough training data quickly enough.
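A minimal sketch of such rule-based pre-labeling is shown here. The patterns and the label names are hypothetical examples and would have to be adapted to the messages that actually occur in a project.

```python
import re

# Hypothetical rules: (pattern, cause label). The first matching rule wins.
LABEL_RULES = [
    (re.compile(r"database error|deadlock", re.IGNORECASE), "database"),
    (re.compile(r"element not found|not clickable", re.IGNORECASE), "ui"),
    (re.compile(r"timed out|connection refused", re.IGNORECASE), "network"),
]


def label_failure(error_text):
    """Return a coarse cause label for an error message, or None if no rule matches."""
    for pattern, label in LABEL_RULES:
        if pattern.search(error_text):
            return label
    return None


def build_training_data(error_texts):
    """Keep only the messages a rule could label; the rest stays unlabeled."""
    labeled = []
    for text in error_texts:
        label = label_failure(text)
        if label is not None:
            labeled.append((text, label))
    return labeled
```

The labels produced this way are noisy on purpose. They serve as a quick starting set, not as ground truth.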
The algorithm trains on these classifications and goes into test mode. During the next run, it suggests a cause for known patterns, with a reliability of around 95 percent. In uncertain cases, it holds back and leaves the assessment to the human to confirm or refute.
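The hold-back behavior can be illustrated with a small sketch. To keep it dependency-free, a tiny naive Bayes text classifier stands in for the random forest mentioned above. It is a substitute for illustration, not the method described in the source. The threshold of 0.9 and all names are assumptions.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class TinyNaiveBayes:
    """Stand-in classifier: multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.label_counts = Counter()
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.token_counts[label].update(tokens)
            self.label_counts[label] += 1
            self.vocabulary.update(tokens)
        return self

    def predict_proba(self, text):
        tokens = tokenize(text)
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            score = math.log(count / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocabulary)
            for tok in tokens:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        # Normalize the log scores into probabilities that sum to 1.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(exp.values())
        return {label: v / norm for label, v in exp.items()}


def suggest_cause(model, text, threshold=0.9):
    """Return a cause only if the model is confident; otherwise defer to a human."""
    probabilities = model.predict_proba(text)
    label, p = max(probabilities.items(), key=lambda kv: kv[1])
    return label if p >= threshold else None
```

A return value of `None` is the signal to put the case in front of a person, whose answer can then be added to the training data.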
Clusters also give structure to the rest
Unsupervised learning helps with the failures that cannot yet be sorted. Instead of guessing a cause, the process groups similar cases purely on the basis of their technical information.
What connects the cases in a cluster is initially unknown. But the cases are similar, and the finished cluster only needs to be given a name in order to classify these failures as well. In this way, the set of known causes keeps growing without having to be defined up front.
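The grouping idea can be sketched with a very simple procedure. The source does not name a clustering algorithm, so the greedy grouping by Jaccard similarity of message tokens used here is an assumption chosen for brevity, as is the similarity threshold of 0.5.

```python
import re


def token_set(message):
    # Only letters are kept, so ids and timestamps do not split similar messages.
    return set(re.findall(r"[a-z]+", message.lower()))


def jaccard(a, b):
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_failures(messages, min_similarity=0.5):
    """Greedy clustering: each message joins the most similar existing cluster,
    compared against that cluster's first member, or starts a new one."""
    clusters = []  # each entry: {"tokens": set, "members": list}
    for message in messages:
        tokens = token_set(message)
        best, best_similarity = None, min_similarity
        for cluster in clusters:
            similarity = jaccard(tokens, cluster["tokens"])
            if similarity >= best_similarity:
                best, best_similarity = cluster, similarity
        if best is None:
            clusters.append({"tokens": tokens, "members": [message]})
        else:
            best["members"].append(message)
    return [cluster["members"] for cluster in clusters]
```

Once a person has named a cluster, its members can be fed back into the supervised classification as labeled examples.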
A side effect turned out to be just as valuable as the classification itself. Because all log files, including those of the tested systems, are stored in a central platform, a complete overview of what happened during a test run is created. This metadata makes it much easier to analyze the few remaining failures.
Test-based modeling: A model of the application is created from test data
Large test portfolios become comprehensible when they are represented as a graph. With 10,000 automated tests, no coverage metric answers the question of what these tests actually do.
The trick lies in the structure that test automation has anyway. Whether behavior-driven development with Cucumber, keyword-based approaches or other classic structures: every test case is a sequence of technical test steps.
This sequence becomes a graph. Each state of the application is a node, each test step is an edge. Each executed test case results in a path, and when superimposed, a graph of functional actions is created. These paths can be drawn directly from the log files or the test tool.
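The superimposition of paths can be sketched as follows. The input shape is an assumption: each executed test is a list of `(step_name, resulting_state, passed)` tuples, and every test starts in a common state called `START`.

```python
from collections import Counter


def build_graph(executions, start="START"):
    """Overlay executed test paths into node and edge counters.

    Returns (node_visits, edge_counts, failed_edges), where an edge is the
    tuple (from_state, step_name, to_state).
    """
    node_visits = Counter()
    edge_counts = Counter()
    failed_edges = Counter()
    for steps in executions:
        state = start
        node_visits[state] += 1
        for step_name, next_state, passed in steps:
            edge = (state, step_name, next_state)
            edge_counts[edge] += 1
            if not passed:
                # A failed step ends the path; the target state was not reached.
                failed_edges[edge] += 1
                break
            node_visits[next_state] += 1
            state = next_state
    return node_visits, edge_counts, failed_edges
```

The visit counts can drive node size in a visualization, and `failed_edges` marks the steps to color red.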
The result is a reversal of the classic approach. Instead of model-based testing from a pre-built model, test-based modeling is created here: a model of the application is implicitly created from the real tests.
The human eye quickly recognizes patterns in this graph. Frequently visited nodes are displayed larger, failed steps are colored red. This makes concentrations, redundancies and error hotspots visible. It is also immediately noticeable when the graph falls apart into two separate components, which indicates areas that have never been tested together in an integrated manner.
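Separate components can also be detected programmatically. The following sketch treats the graph as undirected and collects its connected components with a depth-first search. The edge format `(state_a, state_b)` is an assumption.

```python
def connected_components(edges):
    """Return the connected components of an undirected graph as a list of sets."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = set()
    components = []
    for node in adjacency:
        if node in seen:
            continue
        # Depth-first search from an unvisited node collects one component.
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        components.append(component)
    return components
```

More than one component means that no executed test ever moved between those parts of the application.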
How new tests and test data are created from the model
The model can not only be viewed, but also queried. You can ask whether there is a test case that connects two specific nodes and derive new tests from the graph.
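One way to query the model is a breadth-first search over the recorded edges. This sketch assumes edges of the form `(from_state, step_name, to_state)` and returns the shortest step sequence between two states, which can then serve as the basis for a new test.

```python
from collections import deque


def find_path(edges, source, target):
    """Return the shortest list of step names from source to target, or None."""
    outgoing = {}
    for from_state, step, to_state in edges:
        outgoing.setdefault(from_state, []).append((step, to_state))

    queue = deque([(source, [])])
    visited = {source}
    while queue:
        state, steps = queue.popleft()
        if state == target:
            return steps
        for step, next_state in outgoing.get(state, []):
            if next_state not in visited:
                visited.add(next_state)
                queue.append((next_state, steps + [step]))
    return None
```

A result of `None` means that the recorded tests offer no route between the two states at all.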
Because the individual test steps are already automated, boilerplate code for a new test can at least be generated from a node sequence. Validations and test data still need to be added, but the framework is in place.
Large language models come into play for the test data. A model such as a GPT provides plausible values: a matching user name or an invalid login that a real user would also encounter.
The reason why this works: These models have learned their assumptions from natural language text. They make assumptions that humans would probably also make. The approach thus moves away from purely specification-based testing and towards semi-automated, exploratory testing.
Not every problem needs machine learning
The goal was to solve problems, not to use machine learning. This distinction characterizes several of the most useful building blocks.
The visualization of large test portfolios works surprisingly well without machine learning. Risk-based test selection, too, is essentially an optimization task, not a learning problem.
There are also building blocks such as self-healing of the identifiers used to recognize UI elements. Strictly speaking, clean and stable UI identifiers belong in the development process, and their absence is itself a quality defect. However, self-healing is definitely helpful for standard products or wherever the identifiers cannot be influenced.
How risk-based test selection works in a time box
Instead of a fixed hand-picked smoke test set, the approach dynamically selects the right tests. The question is no longer “which tests are in the smoke set”, but “you have 15 minutes, pick the most valuable tests”.
The basis is traceability. A test is linked to a user story, the user story is linked to a bug, the bug has been fixed. This chain can be used to derive which tests are now relevant.
The code delta makes it even more precise. If certain files have been changed since the last run and these files belong to a specific user story, the tests for this user story should be run.
On top of this sits a weighting, which is different for each company because it reflects the respective definition of risk. The weighting includes, among other things:
- Criticality of the test, bug or associated story
- How long the test has not been run
- Code quality and unit testing coverage of the affected component
- Changes to code sections classified as critical
A business-critical component with good unit test coverage can end up carrying less risk than a less critical component without this coverage. The architecture design is deliberately kept open so that such parameters can be freely added.
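A weighted score of this kind might look like the following sketch. The factor names mirror the list above, but the weights, the value ranges and the 30-day cap are invented for illustration, since each company defines them differently.

```python
from dataclasses import dataclass


@dataclass
class RiskFactors:
    criticality: float           # 0.0 (low) to 1.0 (business critical)
    days_since_last_run: int
    unit_test_coverage: float    # 0.0 to 1.0 for the affected component
    critical_code_changed: bool


# Hypothetical weights; they encode one company's definition of risk.
WEIGHTS = {
    "criticality": 5.0,
    "staleness": 2.0,
    "coverage_gap": 3.0,
    "critical_change": 4.0,
}


def risk_score(factors, weights=WEIGHTS, staleness_cap_days=30):
    """Combine the factors into one number; higher means more worth running."""
    staleness = min(factors.days_since_last_run, staleness_cap_days) / staleness_cap_days
    return (
        weights["criticality"] * factors.criticality
        + weights["staleness"] * staleness
        + weights["coverage_gap"] * (1.0 - factors.unit_test_coverage)
        + weights["critical_change"] * (1.0 if factors.critical_code_changed else 0.0)
    )
```

With these example weights, a component with criticality 1.0 and 90 percent unit test coverage scores lower than one with criticality 0.6 and no coverage, which matches the effect described above.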
With a time box, this becomes an optimization problem: Which test combination accommodates the maximum risk points in the available minutes? This is precisely the approach chosen.
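Formulated this way, the selection is a 0/1 knapsack problem. The sketch below solves it with dynamic programming and assumes that test durations are whole minutes of at least 1. The source does not say which solver is used in practice.

```python
def select_tests(tests, budget_minutes):
    """Pick the test combination with the highest total risk within the time box.

    tests: list of (name, duration_minutes, risk_points) with integer durations.
    Returns (total_risk, list_of_selected_names).
    """
    # best[t] holds the best (risk, names) achievable in at most t minutes.
    best = [(0.0, [])] * (budget_minutes + 1)
    for name, duration, risk in tests:
        # Iterate downwards so that each test is used at most once.
        for t in range(budget_minutes, duration - 1, -1):
            candidate = best[t - duration][0] + risk
            if candidate > best[t][0]:
                best[t] = (candidate, best[t - duration][1] + [name])
    return best[budget_minutes]
```

For very large portfolios, a greedy heuristic by risk per minute may be a faster approximation, at the cost of optimality.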
Visual testing that recognizes the state of the application
A classic automated test only fails if an explicit action or validation fails. A human tests differently. They notice an inverted logo, even if this is not specified in any test case.
Implicit visual regression testing replicates this view without relying on pure pixel comparison. A trained model draws conclusions from the screenshot of an application as to what state the application is in.
The process compares this recognized state with what the test is currently expecting. If the test expects the state “logged in, search page with filled results”, but the application visually shows something else, the deviation is noticeable. The exact location of the change can be highlighted, similar to a heat map. This works surprisingly well.
Augmented testing: the machine supports instead of replacing
The guiding principle is augmented testing, not replacing people. It’s about removing the repetitive part of the work so that there is time for new tests, experiments and higher quality.
Based on the implicitly created state transition graph, it is possible to automatically check whether an application is moving within an expected behavior corridor. This does not replace professionally designed tests with a business context, but it does help to fill gaps or attack an application in a targeted manner in the sense of monkey testing.
The model is particularly practical in the test design itself. A recommendation engine indicates that a path that has just been designed already exists or that a certain combination of steps is not yet covered. For manual testing, this works like auto-completion.
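Both hints can be derived from the recorded step sequences. In this sketch, a test is a list of step names, and suggestions are ranked by how often a step followed the given prefix in existing tests. The function names and this ranking rule are assumptions.

```python
from collections import Counter


def suggest_next_steps(known_paths, prefix, limit=3):
    """Auto-completion: the most frequent steps that followed this prefix so far."""
    prefix = list(prefix)
    n = len(prefix)
    counts = Counter()
    for path in known_paths:
        if len(path) > n and list(path[:n]) == prefix:
            counts[path[n]] += 1
    return [step for step, _ in counts.most_common(limit)]


def is_already_covered(known_paths, candidate):
    """True if an existing test consists of exactly this step sequence."""
    candidate = list(candidate)
    return any(list(path) == candidate for path in known_paths)
```

An empty suggestion list is itself informative, because it means the designer has left the territory covered by existing tests.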
Do we still need testers in the age of machine learning?
Yes, and this can be illustrated by a simple question. Imagine a reliable language model in five years’ time and the requirement to replace either all quality engineers or all developers with this machine. The machine can do both. Who are you replacing?
Well, I would still like to have a human guardian of quality. Machines only ever understand the context that you give them. And with this translation, the problem usually happens.
- Thomas Steirer
This answer is particularly clear as soon as life and limb are involved. What is needed then is someone who understands the context instead of just being presented with it.
The reason lies deeper than in the coding. The majority of errors are due to the requirements, not the implementation. As long as a human has to tell the machine what to do at the beginning, the human view of quality remains the point at which it is decided whether the right thing is being built.


