Software testing by and with AI 

18 February 2021

Artificial intelligence (AI) and software testing are two important topics in today's software and system development. Applying each to the other holds the potential for enormous synergies.

Although artificial intelligence has been a research topic for decades, it has enjoyed a highly publicized triumphal march in recent years. All tasks seem solvable, all human intelligence dispensable, the possible consequences controllable. There are many convincing demonstrations. One example is AlphaGo, an AI-controlled computer player that beat Lee Sedol, one of the world's best Go players. Go, by the way, is significantly more complex than chess, so its gameplay is far harder to predict. Other AI applications recognize the contents of images with striking accuracy. This opens up a wide variety of applications, from the early diagnosis of dangerous diseases to the surveillance of public spaces. But is it really that simple?


Of course, the results presented seem very convincing. But is the path the AI takes to reach its result always as intuitive as it appears? Research has shown that some images were classified as horse pictures not on the basis of the horses actually depicted, but on the basis of the patch of forest that also appears in many horse photos. Others were classified by the signature of a photographer who often takes pictures of horses. Thus the miracle of AI was disenchanted by some prematurely lauded examples. Memories were awakened of the horse Clever Hans ("Kluger Hans"), who only appeared to be able to count.


In addition, failures such as the fatal accident involving an Uber test vehicle were exploited in the media, so that autonomous vehicles were soon considered a danger. It is easy to overlook the fact that in Brandenburg alone, an average of two to three people are killed in traffic accidents every week. Here, even a not-quite-perfect AI could well offer advantages. But other issues lie behind this as well. As a result, the technology is sometimes praised to the skies and sometimes condemned before the connections are clear.

For better or worse, I see exaggeration on both sides. Hype aside, AI has a lot of potential, including in safety-critical applications. The prerequisite, of course, is that the technology can be properly assured.

A number of questions arise in this context. In the following, I will touch on a few of them across various sub-topics and thus offer an introduction. As already mentioned, science has been grappling with the deeper questions around this topic for decades.

Evaluation of the AI

First of all, statistics plays a major role here and is used for the internal evaluation of situations, images, and so on. Using the confusion matrix, prediction and reality can be compared for binary classifiers: What is predicted correctly? Where and how does the AI err?
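A confusion matrix for a binary classifier can be sketched in a few lines. This is a minimal illustration; the example labels are invented:

```python
# Minimal sketch: confusion-matrix counts for a binary classifier.
# actual = ground truth, predicted = classifier output (both 0/1).

def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn}

actual    = [1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
print(confusion_matrix(actual, predicted))
# → {'tp': 3, 'tn': 3, 'fp': 1, 'fn': 1}
```

The four counts answer exactly the two questions above: the diagonal (tp, tn) is what the classifier gets right, the off-diagonal (fp, fn) shows where and how it errs.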

There are various means of evaluating these results, for example the harmonic mean of precision and recall, also called the F1 score. In any case, it is clear that the weight attached to each kind of error varies by domain. A false positive, diagnosing a tumor that is not actually present, is comparatively harmless; failing to recognize an actually existing tumor, however, can drastically shorten the patient's life expectancy.
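The F1 score follows directly from the confusion-matrix counts. A minimal sketch, with invented counts:

```python
# Minimal sketch: precision, recall, and their harmonic mean (F1 score),
# computed from confusion-matrix counts. The counts are invented.

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)  # share of positive predictions that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=3, fp=1, fn=1))  # → 0.75
```

Note that the true negatives do not enter into the F1 score at all, which is one reason a single summary number can never replace a domain-specific look at the individual error types.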

Test know-how

Furthermore, the experienced tester naturally has another question: which of the quality assurance tools they have relied on for many years are applicable here?

  • Are white-box testing procedures useful at all, or is this more akin to the still controversial attempts to measure human intelligence?
  • Does it make sense to divide testing into different test stages, as we know it from the V-model? For complex systems that contain one or more AI-based algorithms, this makes perfect sense. But does it also make sense for machine-learning models with many hidden layers? This leads in the direction of explainability.
  • What do we actually look for in the test? Is it just a matter of the algorithm producing better results than its predecessor, or do we subdivide more precisely, into functional and non-functional tests? What about IT security? Even minimal changes to the design of traffic signs can have an impact: say, an autonomous vehicle interprets the "30" on a km/h speed-limit sign as "80" and sets off through the city with corresponding momentum. Equally disastrous can be the effects of inconsistent situations, such as a stop sign on the highway.
  • Furthermore, when is a self-learning system actually allowed to learn? Permanently, while in use? If so, a commuter's self-driving vehicle could very soon be trained only on the peculiarities of the daily route, while everything else is "forgotten." Or should the AI only be allowed to learn during servicing or development? And what limits apply depending on the application domain?

AI for the test

On the other hand, we testers are naturally tempted by another idea: using the seemingly unlimited possibilities of artificial intelligence for software testing itself. There are interesting developments in this area as well. One application that stands out is performance testing. Here, an AI can detect anomalies in system behavior and system load as a function of the input data. These observations could then be used to push the system ever closer to its load limit, or beyond.
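To make the idea concrete, here is a deliberately simple stand-in for such an anomaly detector: a plain z-score test over response times from a load run. A real AI-based monitor would learn a far richer model of normal behavior; the function name and the measurements are invented for illustration:

```python
# Minimal sketch: flag response-time measurements that deviate strongly
# from the rest of a load-test run. A z-score threshold stands in here
# for what a learned anomaly detector might do.
import statistics

def find_anomalies(response_times_ms, threshold=3.0):
    """Return measurements more than `threshold` standard deviations
    from the mean of the run."""
    mean = statistics.mean(response_times_ms)
    stdev = statistics.stdev(response_times_ms)
    return [t for t in response_times_ms if abs(t - mean) > threshold * stdev]

# 19 unremarkable requests and one extreme outlier:
times = [100] * 19 + [1000]
print(find_anomalies(times))  # → [1000]
```

In a test loop, flagged inputs could be fed back as new load scenarios, steering the system toward its limit as the paragraph above suggests.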

Finding similarities and commonalities can be applied in many other areas as well: to error messages, test specifications, and log files of the test object, or to generating test data from data-format descriptions and test sequences from code analysis. Another exciting topic is the use of an AI as a test oracle. Here a further question arises: can an AI that serves as a test oracle not also be used as the system under test itself? And could it do the job even better than the original? The question of limits also arises immediately: which decisions can, and do we want to, leave to an AI? Some people are reminded of the trolley problem, which is already unsolvable for humans, or at least usually difficult to justify: if a fatal accident is unavoidable and one can still influence the outcome, who may live and who must die?
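The similarity search over error messages mentioned above can be sketched with an off-the-shelf string-similarity measure. This greedy grouping is only a stand-in for a learned notion of similarity, and the messages and the 0.8 threshold are invented:

```python
# Minimal sketch: group near-duplicate error messages by string similarity,
# using difflib's ratio() instead of a learned similarity model.
import difflib

def group_similar(messages, threshold=0.8):
    """Greedy grouping: each message joins the first group whose
    representative (first member) it resembles closely enough."""
    groups = []
    for msg in messages:
        for group in groups:
            if difflib.SequenceMatcher(None, group[0], msg).ratio() >= threshold:
                group.append(msg)
                break
        else:
            groups.append([msg])
    return groups

messages = [
    "NullPointerException in OrderService.submit",
    "NullPointerException in OrderService.cancel",
    "Timeout while connecting to payment gateway",
]
for group in group_similar(messages):
    print(group)
```

The two NullPointerException reports land in one group and the timeout in another, so a tester triaging hundreds of failures would see two distinct problems instead of three raw messages.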

These and other thoughts serve as an introduction to this highly interesting topic, which is also economically significant and has many exciting years ahead of it.

The article was published in the 01/2020 issue of German Testing Magazin.