AI testing - a checklist

AI testing means testing an AI system as a component in the overall system, not the model itself. The central question is determinism: if the system always behaves the same with the same input, classic test methods apply unchanged. If it is non-deterministic, you need statistical metrics and test data sets that testers and data scientists have agreed on together.

Key Takeaways

The overall test strategy depends on whether an AI system is deterministic or non-deterministic: with deterministic behavior, AI components change practically nothing in the test approach.
Testers and data scientists have overlapping activities when creating test data, which is why close exchange between the two roles can directly improve the quality of the training data.
Basic statistical knowledge is no longer a nice-to-have for testers because, by design, AI systems deliver the expected output only with a certain probability, so classic pass/fail logic alone is not sufficient.
Model updates require separate, smaller test data sets to ensure that a system does not drift in an undesirable direction after retraining.

The most important question first: Is the system deterministic?

Before you start thinking about statistical methods, model internals or new testing tools, clarify one thing: Is your system deterministic? This question determines your entire test strategy.

If the system is deterministic, almost nothing has changed for you as a tester. You have functional requirements and a subcomponent that does its job. Whether this is based on AI, machine learning or linear regression is then of secondary importance. The component must always behave the same with the same input, so you can define fixed input-output pairs and use them for testing.
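As a minimal sketch of such fixed input-output pairs: the function `classify_ticket`, its labels and the example texts are hypothetical stand-ins for the AI component and are not taken from any real system.

```python
# Hypothetical stand-in for the deterministic AI component under test.
# In a real project this would call the deployed model instead.
def classify_ticket(text: str) -> str:
    lowered = text.lower()
    if "invoice" in lowered or "refund" in lowered:
        return "billing"
    if "password" in lowered or "login" in lowered:
        return "account"
    return "other"


# Fixed input-output pairs: same input, same expected output, on every run.
EXPECTED = [
    ("I need a refund for my last invoice", "billing"),
    ("I forgot my password", "account"),
    ("Where is your office?", "other"),
]


def run_fixed_pair_tests() -> list[str]:
    """Return a description of every pair that did not match."""
    failures = []
    for text, expected in EXPECTED:
        actual = classify_ticket(text)
        if actual != expected:
            failures.append(f"{text!r}: expected {expected}, got {actual}")
    return failures


if __name__ == "__main__":
    print(run_fixed_pair_tests() or "all fixed pairs passed")
```

Nothing in this test knows or cares that the component is AI-based, which is the point of the deterministic case.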

Sometimes a system is only deterministic to a limited extent. Perhaps the output for the same input only changes when the model is retrained. If this happens once every six months, you can work with fixed test data and only have to ask yourself in good time before the next model release whether your assumptions still fit.

Only when the answer is clearly “not deterministic” does it become more complex. Then you end up with the question of how often you need to run a test case to be statistically sure that your assumptions are met. Essentially, however, this is a repetition of the familiar, not a completely new game.
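One simple way to approach the "how often" question is a zero-failure plan: if the true pass rate were below the required rate p, then n passes in a row would occur with probability below p^n, so you pick the smallest n with p^n <= alpha. The sketch below assumes independent runs; the names `runs_needed`, `all_runs_pass` and `flaky_case` are illustrative, and a plan that tolerates some failures needs a different calculation.

```python
import math
import random


def runs_needed(required_rate: float, alpha: float = 0.05) -> int:
    """Number of consecutive passing runs needed so that passing them all
    would happen with probability <= alpha if the true pass rate were
    below required_rate (assumes independent runs)."""
    return math.ceil(math.log(alpha) / math.log(required_rate))


def all_runs_pass(run_once, n: int) -> bool:
    """Repeat the same test case n times; accept only if every run passes."""
    return all(run_once() for _ in range(n))


# Hypothetical flaky component: passes about 90 percent of the time.
rng = random.Random(42)  # seeded so the sketch itself is reproducible


def flaky_case() -> bool:
    return rng.random() < 0.90


n = runs_needed(required_rate=0.95)  # 59 runs for 95 percent at alpha = 0.05
print(n, all_runs_pass(flaky_case, n))
```

Raising the required rate to 99 percent at the same alpha already needs 299 runs, which is worth knowing before you promise a runtime for the test suite.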

Why nondeterminism can also occur without changing the model

A fixed model does not guarantee deterministic behavior. The same model can deliver different results on different computer architectures, even if nothing has been changed in the model itself.

In such a case, you need to take a closer look: Where is the system running? Is the hardware constant? Which algorithms are running on the GPU during matrix operations? At this point, testing becomes unpleasant because race conditions can creep in between computing units, so the order of operations, and with it the rounding of the result, can vary from run to run.
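One reason the order of operations matters is that floating-point addition is not associative. The following plain-Python sketch only illustrates that effect with hand-picked values; whether this is the cause of nondeterminism in your system is something you have to check.

```python
# Three values chosen so that the rounding effect is easy to see.
values = [1e16, 1.0, -1e16]

# Order 1: the 1.0 is absorbed when it is added to the huge value first.
left_to_right = (values[0] + values[1]) + values[2]

# Order 2: the huge values cancel first, so the 1.0 survives.
reordered = (values[0] + values[2]) + values[1]

print(left_to_right)  # 0.0
print(reordered)      # 1.0
```

If parallel computing units finish in a different order on each run, a sum over many values can be accumulated in a different order each time, and the result can differ in the last digits even though the model is unchanged.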

However, you can often fall back to a deterministic case. If the system runs on-premises on fixed hardware, you may not need GPU acceleration at all and can work with a single-core CPU without race conditions. Reproducibility is then restored. This is precisely why it is worth clarifying early on where and on which hardware the model will actually run later.

Testers and data scientists work on the same thing, from two directions

The greatest leverage lies in the conversation between testers and data scientists. Both roles have overlapping activities, but see different gaps.

A data scientist is given a problem and develops a model that solves precisely this problem. They evaluate it using mathematical methods and determine that the model works. What can get lost in the process: the model is only one part of a larger system that is supposed to fulfill a specific function. Tunnel vision quickly sets in, along the lines of “model built, evaluated, therefore done”.

This is exactly where the tester comes into play. They bring a view of the overall system and check whether the assumptions made during training also hold in the later application. They can identify boundary conditions in the data that were not considered during training.

These boundary conditions can make a data scientist overjoyed or heartbroken. Overjoyed, because new ideas lead to a better training data set. Heartbroken, when it becomes clear that the previous training data set was only fit for the bin. Both are valuable because both improve quality.

What testers should also learn

Statistics is becoming more important. If a model only delivers the expected output a certain percentage of the time, you need to understand what this means for your test. “Stats 101” is often enough to know where you need to ask questions.

If a model is reported to have 99.9 percent accuracy, the relevant question is: What exactly does this mean and is it sufficient for the intended purpose? You don’t have to invent terms like accuracy and precision, but you do need to be able to interpret them.
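A small sketch of why a single figure needs interpreting: with made-up, heavily imbalanced data, a model that never detects anything still reaches 99.9 percent accuracy. The function name and the data are illustrative; the definitions of accuracy, precision and recall are the standard ones for binary classification.

```python
def classification_metrics(actual: list[bool], predicted: list[bool]) -> dict:
    """Accuracy, precision and recall for a binary classifier."""
    pairs = list(zip(actual, predicted))
    tp = sum(a and p for a, p in pairs)          # true positives
    tn = sum(not a and not p for a, p in pairs)  # true negatives
    fp = sum(not a and p for a, p in pairs)      # false positives
    fn = sum(a and not p for a, p in pairs)      # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Undefined (None) when the model never predicts the positive class.
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
    }


# Made-up data: 10 defective parts among 10,000, model always says "fine".
actual = [True] * 10 + [False] * 9990
predicted = [False] * 10000

metrics = classification_metrics(actual, predicted)
print(metrics)  # accuracy 0.999, precision None, recall 0.0
```

Whether 99.9 percent is good therefore depends on which errors matter for the intended purpose, which is exactly the question to put to the data scientist.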

These methods will help you specifically:

Statistical and stochastic methods to evaluate probability outputs
Combinatorial testing, which is becoming increasingly important in the AI environment
Exploratory testing, which is also becoming more important, not less important

You don’t need in-depth knowledge of the work of a data scientist for this. These are two separate roles with two skill sets. If you fill both roles, for example at a small company, learning both is worthwhile. Otherwise, it is enough to understand the output of the models, because it is precisely this output that influences your test design and test strategy.

The difference between machine learning, deep learning and other methods is nice to know. It is not a mandatory skill for testers, even if many want to learn it because the topic is currently hot.

How much of AI do you need to understand anyway?

You can treat most of it as a black box. Fully understanding what happens in a model is an open area of research. Explainable AI does not come for free and is not a solved problem.

So you don’t always have the chance to understand everything. But you do have the chance to understand it better than you did at the beginning. To do this, ask yourself the question: Which parts of my system use AI at all, and where do I need to take a closer look?

The readability of the model determines how deep you can get into it manually. If it is a decision tree, you can still walk through it by hand, depending on its size. If it’s a deep neural network with formulas that would fill hundreds of pages, it will be difficult for everyone.

Nevertheless, a basic grounding is useful. It gives you a sense of where you need to ask questions and helps you to recognize your own blind spots.

Why the question “Do we even need testers?” has been answered

Yes, we need testers. The question came up in the working group, and some tool manufacturers may claim the opposite. The answer remains: We need people who approach a system with a test focus.

This focus is not an optional extra. A data scientist evaluates their model with the metrics that seem sensible to them. The tester checks whether this makes sense in the overall system and in the real use case. These two perspectives do not replace each other.

Test data: Do you need a golden dataset?

When it comes to test data, it’s worth talking to the data scientist again. They have already generated training data and you need to clarify what you can reuse and what you still need.

Ask yourself whether you always need to test your system on the same data set. Maybe that is not relevant to your use case at all, or maybe it does not matter. Maybe you also need to validate the training of the model in the test. Then agree on what the golden dataset is for testers and what it is for the data scientist.

Data scientists already create test data anyway because they need data for model training. This produces data sets that you can use directly for testing, or at least gives you an intuition that will help you later when generating your own test data.

Model updates need their own tests

An often forgotten point: If the system continues to learn after release through some mechanism, you need smaller datasets to simulate updates. This lets you check that the development is moving in the right direction and matches reality.

The examples of chatbots that are given free rein and develop into interesting personalities in a short space of time show what happens if this case is ignored. How does a self-learning model behave under update conditions? If it is not tested, this is exactly what can go wrong.

There’s no silver bullet to tell you if you need this. But ask yourself the question. And remember that it’s not just about the initial test setup, but also about smaller update testing datasets for the ongoing model changes.
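A minimal sketch of such an update check, assuming models can be called like functions and that a small, curated update data set exists. The names `update_is_acceptable`, `UPDATE_SET` and the three toy models are hypothetical, and the tolerance of 0.02 is an arbitrary placeholder you would have to agree on with the data scientist.

```python
def accuracy(model, dataset) -> float:
    """Share of examples for which the model returns the expected label."""
    return sum(model(x) == expected for x, expected in dataset) / len(dataset)


def update_is_acceptable(old_model, new_model, update_set, max_drop=0.02) -> bool:
    """Accept the update only if accuracy on the small update data set
    does not fall by more than max_drop compared with the old model."""
    return accuracy(new_model, update_set) >= accuracy(old_model, update_set) - max_drop


# Small, curated update data set (made-up examples).
UPDATE_SET = [
    ("hello", "greeting"),
    ("hi there", "greeting"),
    ("bye", "farewell"),
    ("see you", "farewell"),
    ("good morning", "greeting"),
]


def old_model(text):       # hypothetical model before retraining: 4 of 5 right
    return "farewell" if text in ("bye", "see you", "good morning") else "greeting"


def improved_model(text):  # hypothetical model after a good update: 5 of 5 right
    return "farewell" if text in ("bye", "see you") else "greeting"


def drifted_model(text):   # hypothetical model after a bad update: 2 of 5 right
    return "farewell"


print(update_is_acceptable(old_model, improved_model, UPDATE_SET))  # True
print(update_is_acceptable(old_model, drifted_model, UPDATE_SET))   # False
```

Run as a gate before each model release, a check like this catches an update that drifts away from previously correct behavior.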

Much is old wine in new bottles

Non-deterministic systems existed before AI. “Eventually consistent” systems from the microservice environment have long forced you to tailor test cases in such a way that they can cope with non-deterministic behavior.

Statistics in testing is not new either. A good non-functional requirement for time behavior sounds something like this: In 95 percent of cases, the website should be delivered in under 300 milliseconds. So anyone who takes performance testing seriously has long since dealt with statistics.
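The 95 percent requirement translates directly into a check on measured values. The sketch below uses made-up response times and hypothetical function names; it evaluates one sample only and says nothing about how many measurements you need.

```python
def share_within(durations_ms: list[float], limit_ms: float) -> float:
    """Fraction of measurements that are strictly under the limit."""
    return sum(d < limit_ms for d in durations_ms) / len(durations_ms)


def meets_requirement(durations_ms, limit_ms=300.0, required_share=0.95) -> bool:
    """True if at least required_share of the requests stayed under limit_ms."""
    return share_within(durations_ms, limit_ms) >= required_share


# Made-up measurements in milliseconds.
good_run = [120.0] * 19 + [450.0]      # 19 of 20 under 300 ms
bad_run = [120.0] * 18 + [450.0] * 2   # 18 of 20 under 300 ms

print(meets_requirement(good_run))  # True
print(meets_requirement(bad_run))   # False
```

The same pattern applies to an AI component: replace "under 300 milliseconds" with "expected output delivered" and the structure of the check stays the same.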

Just because you now have an AI system does not mean you suddenly have no requirements. Take a look at your requirements and go back to your normal testing methods.

Marco Achtziger

What changes is the output of the AI component. The interface no longer simply returns B for input A, but B with a probability of x percent, plus metrics such as accuracy and precision. You take the rest of your toolbox with you.

Classification problems can be tested with known methods

Many AI tasks are classification problems, and it is precisely these that you can cover using established test methods. You feed in a picture and want to know whether the model classifies it as a cat or a dog with, say, 90 percent confidence.

You apply your existing methods to such cases. What are the functional testing methods? Are there boundary values to consider? Can equivalence partitions be identified? These questions will lead you directly to a clear test design.
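As a sketch of boundary values and equivalence partitions applied to such a case, assume the system accepts a classification only at 90 percent confidence or more and otherwise routes the picture to manual review. The `decide` function, the threshold and the `manual_review` label are assumptions made for illustration.

```python
def decide(label: str, confidence: float, threshold: float = 0.90) -> str:
    """Accept the model's label only if it is confident enough."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return label if confidence >= threshold else "manual_review"


# Two valid partitions (below / at or above the threshold), their boundary
# values, and the edges of the valid range.
CASES = [
    ("cat", 0.00, "manual_review"),  # lower edge of the valid range
    ("cat", 0.89, "manual_review"),  # just below the threshold
    ("cat", 0.90, "cat"),            # exactly on the threshold
    ("dog", 0.91, "dog"),            # just above the threshold
    ("dog", 1.00, "dog"),            # upper edge of the valid range
]


def failed_cases() -> list[tuple]:
    """Return every case whose decision differs from the expectation."""
    return [case for case in CASES if decide(case[0], case[1]) != case[2]]


print(failed_cases())  # []
```

Values outside the valid range, such as a confidence of 1.01, form a third, invalid partition that the sketch rejects with an error.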

The most common blocker is not a lack of technology, but the shock and fear that the buzzword triggers. Anyone who hears the big word “AI” quickly thinks of human-like intelligence and believes they need a psychologist instead of a tester. It is precisely this huge construct that needs to be broken down into something tangible: your requirements, your components, your known methods and the one central switch between determinism and nondeterminism.