Quality assurance of AI

Testing AI means evaluating quality statistically instead of with a clear yes/no result. Classic quality characteristics such as functionality, performance or portability still apply, but are supplemented by AI-specific criteria: functional performance metrics, autonomy, transparency and ethical aspects. Methods such as metamorphic testing and pairwise testing are gaining new importance due to the lack of a clear test oracle.

Key Takeaways

Classic quality characteristics such as functionality are no longer a yes/no statement with AI, but statistical variables measured via metrics such as accuracy, precision and sensitivity.
Metamorphic testing addresses the test oracle problem: if you don’t know the expected output, you vary the input in a way that must leave the output unchanged, thereby gaining verifiable confidence in the system.
AI systems learn from the data, not from the intention behind it: an image classifier trained on photos that all carry a timestamp may learn the timestamp instead of the actual image content.
Reproducibility in AI systems is structurally difficult because training parameters are often set randomly and not every state can be recorded exactly.
Well-known test methods such as pairwise testing and A/B testing are gaining new relevance with AI because many parameter combinations and the lack of test oracles demand precisely the strengths of these methods.

Why quality in AI is no longer a yes/no question

With classic software, a test case provides a clear result: pass or fail. With AI, this clarity no longer applies. Functionality becomes a statistical variable, measured by values such as accuracy, precision and sensitivity.
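The three metrics named above can be computed from the four outcome counts of a binary classifier. The following is a minimal sketch in Python; the class name `ConfusionCounts` and the example counts are assumptions made for illustration and do not come from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for a binary classifier on a labeled evaluation set."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def accuracy(c: ConfusionCounts) -> float:
    """Share of all predictions that were correct."""
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def precision(c: ConfusionCounts) -> float:
    """Share of positive predictions that were actually positive."""
    predicted_positive = c.tp + c.fp
    return c.tp / predicted_positive if predicted_positive else 0.0


def sensitivity(c: ConfusionCounts) -> float:
    """Share of actual positives that were found (also called recall)."""
    actual_positive = c.tp + c.fn
    return c.tp / actual_positive if actual_positive else 0.0


# Illustrative counts only, not measurements from a real system.
counts = ConfusionCounts(tp=80, fp=10, tn=95, fn=15)
print(f"accuracy={accuracy(counts):.3f}")
print(f"precision={precision(counts):.3f}")
print(f"sensitivity={sensitivity(counts):.3f}")
```

Which of the three values matters most depends on the application: a system that must not miss positives would be judged mainly on sensitivity, one that must not raise false alarms mainly on precision.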

Testers are trained to rate each test step as “right” or “wrong”. With AI, this intermediate evaluation is often completely absent. At the end of a run over many test steps, it may be possible to say “that was good”, but there is no hard statement for the individual steps in between.

The core of the problem has a name: the test oracle is missing. With traditional software, there is a specification that dictates how the system must work. With AI, this clear standard no longer exists. There is only how the system should normally work. Sometimes it doesn’t, and that can still be okay.

An image classifier usually recognizes an apple as an apple. It is permissible that it fails to recognize one now and then. That is exactly why a test case cannot read: “Put this one image in, apple must come out, checked once, fits.” Instead, the test needs many images, and the question becomes how many are enough to build trust.
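One way to approach the question of how many images are enough is a confidence interval around the observed hit rate. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not a method named in the text; the function name `wilson_interval` and the sample sizes are assumptions.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z=1.96 is roughly 95 % confidence."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Same observed hit rate of 90 %, different numbers of test images.
for n in (10, 100, 1000):
    low, high = wilson_interval(correct=9 * n // 10, total=n)
    print(f"n={n:5d}  observed=0.90  interval=({low:.3f}, {high:.3f})")
```

The observed rate is identical in all three cases, but the interval narrows as the number of images grows. A requirement could then be phrased against the lower bound instead of against a single image.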

AI is also just software, but with additional quality characteristics

The classic quality characteristics still apply. Functionality, performance, portability: these classics from quality management remain relevant because an AI is embedded in a larger overall system. Performance testing, usability testing, regression testing and requirements analysis do not fall away.

What is new are features that did not exist in traditional software. Autonomy describes how independently a system can act, how long it does so under which conditions and when it returns control. An automated car does not drive itself indefinitely; at some point, the lane assistant will signal and ask you to put your hands back on the steering wheel.

Ethics becomes a test subject. In an accident situation, a decision has to be made, and testers will increasingly have to question the requirements behind it: What should the AI really do at this point? Should it decide for itself or hand control back to the human?

Transparency is particularly important for testers. Without insight into why an AI makes the decisions it does, quality can hardly be measured. The associated field of research is called Explainable AI, or XAI for short.

How an AI learns the wrong thing

An AI learns from the data it receives, and sometimes it learns the wrong thing. Garbage in, garbage out: if you feed a network with misleading data, you end up with a problem that is difficult to recognize.

A specific case illustrates this. An image recognition model was supposed to read the setting on a smart heating controller. The accuracy stubbornly remained at 40 to 45 percent, in one case 68 percent, which is too low for use.

A so-called heat map showed what the problem was. All the photos carried a timestamp from the camera. The AI had learned the timestamp and nothing else. Hidden correlations such as these are precisely the reason why testers need to take a close look at the input data before interpreting the results.
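How a heat map can expose such a shortcut is sketched below with occlusion sensitivity: one patch of the image is blanked at a time, and the change in the model score is recorded. This is one common way to build a heat map and not necessarily the technique used in the case described; `biased_model` is a deliberately flawed, hypothetical stand-in, and the 6x6 image is made up.

```python
from typing import Callable, List

Image = List[List[float]]


def biased_model(image: Image) -> float:
    """Hypothetical flawed model: its score depends only on the bottom-right
    2x2 corner, where the camera prints its timestamp."""
    corner = [image[r][c] for r in (-2, -1) for c in (-2, -1)]
    return sum(corner) / len(corner)


def occlusion_heatmap(model: Callable[[Image], float], image: Image, patch: int = 2) -> Image:
    """Blank out one patch at a time and record how much the score changes."""
    baseline = model(image)
    rows, cols = len(image), len(image[0])
    heat = [[0.0] * (cols // patch) for _ in range(rows // patch)]
    for pr in range(rows // patch):
        for pc in range(cols // patch):
            occluded = [row[:] for row in image]  # copy, keep the original intact
            for r in range(pr * patch, (pr + 1) * patch):
                for c in range(pc * patch, (pc + 1) * patch):
                    occluded[r][c] = 0.0
            heat[pr][pc] = abs(baseline - model(occluded))
    return heat


# 6x6 image: display content in the middle, bright timestamp in the corner.
image = [[0.2] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        image[r][c] = 0.9  # the actual display reading
for r in (4, 5):
    for c in (4, 5):
        image[r][c] = 1.0  # the timestamp overlay

for row in occlusion_heatmap(biased_model, image):
    print(" ".join(f"{v:.2f}" for v in row))
```

In this toy setup, only the patch covering the timestamp changes the score, while blanking the display reading in the middle changes nothing. That pattern is the warning sign a tester would look for.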

There is also a linguistic trap lurking here. In the AI workflow, “test data” is a held-out part of the dataset, kept apart from the training and validation data and used to evaluate the trained network. For traditional testers, test data is whatever you test the system with. Two different things under one word, and keeping these terms separate is part of the familiarization process.

Metamorphic testing: trust without a known result

Metamorphic testing addresses the oracle problem by deliberately changing inputs in ways that must not change the expected output. You don’t know what exactly should come out, but you know that the output must remain the same.

The principle can be demonstrated using a triangle. If you extend all sides by five centimeters, it is still a triangle. This so-called metamorphic relation describes a change that must not change the result. If the AI reacts correctly to this, you gain a degree of confidence, even without knowing the absolute target result.
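The triangle relation can be written down as an executable check. In this minimal sketch, `is_triangle` is only a stand-in for the system under test, and the list of source inputs is made up.

```python
def is_triangle(a: float, b: float, c: float) -> bool:
    """Stand-in for the system under test: do three side lengths form a triangle?"""
    return a > 0 and b > 0 and c > 0 and a + b > c and a + c > b and b + c > a


def extend_sides(sides: tuple[float, float, float], delta: float) -> tuple[float, float, float]:
    """Metamorphic transformation: lengthen every side by the same amount."""
    a, b, c = sides
    return a + delta, b + delta, c + delta


def relation_holds(sides: tuple[float, float, float]) -> bool:
    """If the source input is a triangle, the follow-up input must be one too."""
    if not is_triangle(*sides):
        return True  # the relation only constrains valid triangles
    return is_triangle(*extend_sides(sides, 5.0))


source_inputs = [(3, 4, 5), (2, 2, 3), (10, 10, 1), (7, 8, 14)]
violations = [s for s in source_inputs if not relation_holds(s)]
print("violations:", violations)
```

The relation only runs in one direction. Side lengths that do not form a triangle, such as 1, 1 and 5, can become a valid triangle once every side is extended, so the check skips those inputs.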

The method is particularly useful for uncovering hidden misprioritizations. If an image classifier learns shapes that are each shown in a different color, such as red circle, blue rectangle, yellow triangle, it can happen that it no longer learns the shape, but the color.

Then you vary the color in the test. If the circle was red in training, you show the AI a red rectangle. If it classifies the rectangle as a circle, it is clear that it uses the color as a feature, although color should not play a role. The metamorphic relation thus varies irrelevant properties in order to expose precisely such errors.

The more metamorphic relations are applied, the more confidence is gained. The procedure is not new, but is gaining in importance where the test oracle is missing.
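The color example can be sketched as a test that never needs the correct label. It only compares the prediction for a source input with the predictions for recolored follow-up inputs. `color_biased_classifier` is a hypothetical model built to show the failure, and the `Sample` type stands in for a real image.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Sample:
    shape: str  # what is actually drawn
    color: str  # property that should be irrelevant for the prediction


def color_biased_classifier(sample: Sample) -> str:
    """Hypothetical flawed model that has learned color instead of shape."""
    learned = {"red": "circle", "blue": "rectangle", "yellow": "triangle"}
    return learned.get(sample.color, "unknown")


def check_color_invariance(classify, sample: Sample, colors: list) -> list:
    """Return the colors for which the prediction differs from the source prediction."""
    source_prediction = classify(sample)
    return [c for c in colors if classify(replace(sample, color=c)) != source_prediction]


colors = ["red", "blue", "yellow"]
for sample in (Sample("circle", "red"), Sample("rectangle", "blue")):
    broken = check_color_invariance(color_biased_classifier, sample, colors)
    print(f"{sample.shape}: prediction changes for colors {broken}")
```

A model that really classifies by shape would return an empty list for every sample. Any non-empty list is a violated relation and therefore a finding, even though no expected label was ever specified.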

Proven methods with a new meaning

Not everything has to be reinvented. Several established test methods can be applied directly to AI systems because their properties fit the new problems particularly well.

Pairwise testing: AI systems have very many parameters and therefore countless parameter combinations. Testing all of them is impossible, which is why pairwise testing helps to keep the number of test cases manageable.
A/B testing: Where the target result is unclear, two AIs can be developed with the same goal and evaluated by two user groups. This makes up for part of the missing oracle.
Metamorphic testing: Change inputs without changing the expected output and test the behavior.

These methods were already known years ago. In the AI context, they take on new weight because the missing oracle and the high number of parameters are exactly the gaps they fill.
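How pairwise testing keeps the number of cases manageable can be shown with a small greedy generator. This is a simple sketch and not a minimal or production-grade algorithm: it enumerates the full Cartesian product as candidates, so it only suits small parameter spaces. The parameter names and values are hypothetical.

```python
from itertools import combinations, product


def pairs_of(case: tuple, n_params: int) -> set:
    """All pairs of (parameter index, value) that a single test case covers."""
    return {((i, case[i]), (j, case[j])) for i, j in combinations(range(n_params), 2)}


def pairwise_cases(parameters: dict) -> list:
    """Greedy all-pairs selection: repeatedly pick the candidate that covers
    the most value pairs not yet covered by an earlier case."""
    names = list(parameters)
    candidates = list(product(*(parameters[n] for n in names)))
    uncovered = set().union(*(pairs_of(c, len(names)) for c in candidates))
    chosen = []
    while uncovered:
        best = max(candidates, key=lambda c: len(pairs_of(c, len(names)) & uncovered))
        chosen.append(dict(zip(names, best)))
        uncovered -= pairs_of(best, len(names))
    return chosen


# Hypothetical parameters of a vision model under test.
parameters = {
    "lighting": ["day", "dusk", "night"],
    "weather": ["clear", "rain", "fog"],
    "camera": ["front", "rear"],
    "resolution": ["low", "high"],
}
cases = pairwise_cases(parameters)
print(f"{len(cases)} pairwise cases instead of {3 * 3 * 2 * 2} exhaustive ones")
for case in cases:
    print(case)
```

Every combination of two parameter values appears in at least one selected case, while combinations of three or more values are not guaranteed. That is the trade-off pairwise testing accepts.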

Why test environments for AI are incomparably more complex

With traditional software, inputs and outputs are usually structured: databases, tables, sensor data from physical systems. For AI applications, the range of inputs grows enormously and a small set of test data is no longer sufficient.

Autonomous driving requires an entire environment to be simulated: cities, streets, buildings, pedestrians, other vehicles. This simulation applies not only to classic components, but especially to AI components with their multimodal inputs from cameras, lidar and radar sensors.

Testing in the real world is hardly affordable. Staging complex scenarios with extras in reality is not feasible, which is why there is a strong focus on virtualization. Computing power, data provision, reusability and anonymization of the data add to the effort.

Reproducibility is not a given with AI

It is extremely difficult to reproduce results exactly with AI because much of the training relies on random number generators. The initial parameters of neural networks are often set at random, and a run can only be repeated exactly if this randomness can be repeated too.

Frameworks can store random seeds, which helps, but not always. Not every internal state is recorded, and some results simply cannot be reproduced.

Scenario-based testing shifts the goal. Instead of repeating every run bit for bit, the scenario is described as clearly as possible so that the basic principle is reproducible. Since statistics determine the result, not every single run is reproducible, but the statistics as a whole are.
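Both levels of reproducibility can be illustrated with a toy experiment: a fixed seed repeats a single run exactly, while runs with different seeds differ individually but agree in aggregate. `train_and_evaluate` is a hypothetical stand-in for a training run, and the accuracy level of 0.90 and the noise of 0.01 are assumed values.

```python
import random
import statistics


def train_and_evaluate(seed: int) -> float:
    """Stand-in for a training run: returns a noisy accuracy that depends on
    the random seed. A real run would initialize weights and shuffle data."""
    rng = random.Random(seed)           # local generator, no hidden global state
    return 0.90 + rng.gauss(0.0, 0.01)  # assumed level 0.90, run-to-run noise 0.01


# 1) Same seed, same result: exact repetition of a single run.
assert train_and_evaluate(seed=42) == train_and_evaluate(seed=42)

# 2) Different seeds: individual runs differ, the aggregate is stable.
batch_a = [train_and_evaluate(seed) for seed in range(0, 30)]
batch_b = [train_and_evaluate(seed) for seed in range(100, 130)]
print(f"batch A: mean={statistics.mean(batch_a):.3f} stdev={statistics.stdev(batch_a):.3f}")
print(f"batch B: mean={statistics.mean(batch_b):.3f} stdev={statistics.stdev(batch_b):.3f}")
```

A test criterion in this spirit would compare the mean and spread of a batch of runs against a tolerance instead of comparing one run with a stored reference value. In real frameworks a seed alone may not be enough, as noted above.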

A driving scenario consists of many parameters at the same time: the car is approaching a red light, a truck is coming from the right, a bus is coming from the front, a woman is standing on the left, a child is playing on the right, plus the weather and sunlight. What used to be a short list of constraints becomes a long list of real parameters that describe the scenario.
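Such a scenario becomes easier to repeat once it is written down as structured data. The sketch below describes the scenario from the paragraph above; the class and field names are hypothetical, and the weather and sunlight values are assumptions because the text leaves them open.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass(frozen=True)
class Actor:
    kind: str      # e.g. "truck", "bus", "woman", "child"
    position: str  # relative to the car under test
    behavior: str


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario description; field names are illustrative only."""
    name: str
    traffic_light: str
    weather: str
    sun: str
    actors: List[Actor] = field(default_factory=list)


scenario = Scenario(
    name="red light with crossing traffic",
    traffic_light="red",
    weather="dry",              # assumed value
    sun="low, from the front",  # assumed value
    actors=[
        Actor("truck", "right", "approaching"),
        Actor("bus", "front", "approaching"),
        Actor("woman", "left", "standing"),
        Actor("child", "right", "playing"),
    ],
)

# A serialized description can be stored, versioned and handed to a simulator.
print(json.dumps(asdict(scenario), indent=2))
```

The description fixes what the scenario is, not how a single run turns out. That matches the goal of making the principle reproducible instead of every bit.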

Tools today are more likely to test with AI than to test AI itself

The tool market is currently more focused on integrating AI into testing tools than supporting the testing of AI. The appeal of playing with new technology is great and human, and so testing with AI even came before testing AI.

Generative models are useful for testing AI itself when too little test data is available. They generate text for chatbots, images for person recognition or synthetic data for finance, for example. Whether they are explicitly built for this purpose or not, they can be used.

A practical way to get started is to have a language model write a test. Good testers who are quickly disappointed should not give up, but play with the prompts.

AI will not replace us, it will make our work easier. It will not become less.
Gerhard Runze

Standardization and standards are still in their infancy

Standards for testing AI are only just beginning to emerge. The first step was an A4Q syllabus, which was replaced by the ISTQB’s CT-AI syllabus. That syllabus was published a few years ago, and a German edition followed later.

The standardization roadmap, initiated by the German government and presented at the Digital Summit, addresses the topic in parallel. Now in its second edition and written by several hundred authors, it shows where standardization is needed. It covers domains such as finance, medicine and transport as well as overarching topics such as the impact of the EU AI Act.

Reproducibility remains an open issue. How to standardize something that cannot be reliably reproduced has not yet been fully thought through. For large language models, a group has already been formed to structure requirements and initiate standardization.

What testers need to change now

Testers need a new mindset for AI, not a completely new craft. Statistical thinking is replacing the old yes/no logic: probabilities, inaccuracies and confidence levels are replacing the clear expected result.

The focus shifts to the data. Testers need to understand what data a network has been trained on, what hidden features it might have picked up and whether the training data actually reflects what the AI is supposed to learn.

At the same time, the traditional foundation remains strong. Anyone who has mastered CTFL has the basis. The AI-specific methods build on this, from metamorphic testing to pairwise and scenario-based approaches. What is actually new is not the tool, but the willingness to think about quality without a fixed oracle.