What is the importance of appraisal and certification of AI?

AI systems need to be tested to ensure that they are safe and perform as intended. Testing and certification validate the quality of the technology and build trust among users.

Why should AI systems be systematically appraised?

A systematic testing approach is necessary to capture the complexity of AI systems. By introducing a two-dimensional AI assessment matrix, test dimensions and test areas can be clearly defined.

What is the AI assessment matrix and what does it contain?

The AI Assessment Matrix consists of two axes: the X-axis represents the test dimensions, while the Y-axis represents the test areas along the data and model lifecycle that matter for a comprehensive assessment of AI systems.

What is the AI Risk Navigator and what is it for?

The AI Risk Navigator is a free risk classification tool that helps companies identify potential risks in AI and acts as a mediator between companies and regulators.

What are the technical challenges of testing AI systems?

One of the biggest technical challenges is robustness testing. This requires a thorough analysis of the system architecture as well as extensive testing to ensure that the AI system functions reliably under different conditions.

AI testing and certification

The AI Assessment Matrix is a framework for systematically structuring AI testing and certification. It arranges test dimensions on an X-axis, ranging from technical performance through robustness and fairness to environmental impact, and maps these against the data and model lifecycle on the Y-axis. The aim is to provide a complete overview from which targeted testing decisions can be derived.

Key Takeaways

The TÜV AI Lab’s AI Assessment Matrix organizes AI test criteria along two axes: test dimensions (from technical performance to global ecological impact) against the phases of the data and model lifecycle.
AI testing is divided into three forms: direct product testing on the system, evaluation of vendor documentation, and process/people testing, all of which require a sound understanding of testing.
The AI Act does not regulate all test dimensions equally: areas such as explainability are only hinted at in the Act because reliable test methods do not yet fully exist in technical terms.
Fairness and non-discrimination are legally and conceptually different requirements that can contradict each other and must therefore be defined and tested separately.
The energy consumption of hardware and software as well as working conditions in data labeling are explicitly included in the test matrix as ecological and social criteria, not just technical functional features.

Why AI systems need their own testing system

AI is a powerful technology, and power works both ways. What can do a lot of good can also do harm. This is precisely where the question of testing and certification comes in: How do you bring innovation and safety together without sacrificing either side?

Christoph Poetsch from TÜV AI Lab describes this requirement as a mission for Europe. Trustworthy AI needs support from society. This support can only be created if it is clear what is being tested in an AI system, how it should be designed and what things it is not allowed to do.

The comparison with the classic TÜV mandate is more relevant here than it initially seems. The TÜV used to inspect steam boilers. Today, the AI system is the steam boiler of the 21st century, an object whose effects are to be controlled without stifling its benefits.

For testers, this means a break with the usual logic. Traditional testing is based on clear steps and an expected result. An AI system, on the other hand, produces outputs that are not known in advance. This gap between expected clarity and actual behavior is the reason why AI testing needs its own structure.

AI is not just a technical system, but a quasi-actor

The key conceptual step is to treat AI as a quasi-actor. As long as a system only fulfills functional tasks, the framework of functional safety, the classic case for testing, is sufficient. However, as soon as a system takes on tasks that would otherwise require human judgment, the need for testing shifts.

An AI system in an HR process decides on applications. It does something that was previously done by a human. At this point, a purely technical view is no longer sufficient because decisions with social implications come into play.

Nevertheless, the assessment remains tied to a technical reality. No one can let a person talk to an AI system for two years and then make a gut judgment. In the end, the assessment must be measurable, reproducible and technically feasible.

The AI Assessment Matrix organizes the test field according to dimensions and lifecycle

TÜV AI Lab has developed an AI Assessment Matrix, a framework that organizes test methods, metrics and benchmark data. The matrix is structured as a two- to three-dimensional system.

The X-axis carries the test dimensions, i.e. the properties that are measured on the AI system. Poetsch describes them as the sensors that are held against the system, each sensitive to a different aspect such as performance, robustness or fairness.

The Y-axis carries the test areas along the software lifecycle, from the inception phase to retirement. This lifecycle is doubled because in AI, the data lifecycle stands alongside the model lifecycle. The role of data in the development process is a clear difference from traditional software development.

There is one important detail that is easily misunderstood: the Y-axis does not denote the time of testing, but its focus. If you want to test robustness based on design decisions, you need documentation from the design phase, even though the test itself takes place later. If you want to evaluate a training data set, it must still be available at the time of testing.

Combining both axes yields a maximum field. It is expressly not meant to be filled in completely by spreading test resources evenly over every cell, the so-called watering can principle. The purpose is to provide a complete overview from which you can consciously decide on the relevant fields.

The idea is not to fill up this maximum field with test resources using the watering can principle, but to know consciously: Can we get something like a complete overview, in which we then say, now let’s concentrate on certain aspects.
Christoph Poetsch
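
The structure described above can be sketched as a small data model. This is only an illustration, not the TÜV AI Lab's implementation: the function names are hypothetical, the list of dimensions is a selection from those named in this article, and the lifecycle phases between "inception" and "retirement" are assumed.

```python
from itertools import product

# Test dimensions (X-axis): a selection of those named in the article.
DIMENSIONS = [
    "performance", "safety", "robustness", "cybersecurity",
    "explainability", "transparency", "privacy", "fairness",
    "non-discrimination", "accountability", "resource consumption",
]

# Lifecycle phases (Y-axis). Only "inception" and "retirement" are named in
# the article; the phases in between are assumed for illustration.
PHASES = ["inception", "design", "development", "deployment", "retirement"]

# The lifecycle is doubled: each phase exists for the data and for the model.
LIFECYCLES = ["data", "model"]


def maximum_field():
    """Return every cell of the matrix as (dimension, lifecycle, phase)."""
    return list(product(DIMENSIONS, LIFECYCLES, PHASES))


def select_cells(dimensions, lifecycles=LIFECYCLES, phases=PHASES):
    """Pick a deliberate subset of cells instead of covering all of them."""
    wanted = set(dimensions)
    return [
        cell for cell in maximum_field()
        if cell[0] in wanted and cell[1] in lifecycles and cell[2] in phases
    ]


all_cells = maximum_field()
focus = select_cells(["robustness"], lifecycles=["model"], phases=["design"])
print(len(all_cells), "cells in the maximum field")
print(focus)
```

The point of the sketch is the second function: the full grid exists only as an overview, while actual test effort goes to an explicitly chosen subset, here robustness with a focus on the model's design phase.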

How zooming out systematically organizes the inspection dimensions

The real innovation lies on the X-axis. Discussions about trustworthy AI often end in a bouquet of demands: robust, fair, performant, sustainable. What is usually missing is the question of how these criteria relate to each other and whether the list is complete.

The matrix organizes these criteria by zooming out. The starting point is the individual AI system. From there, the view is gradually widened until it arrives at a global scale. Each zoom level brings its own inspection dimensions into view.

At the core are questions that are deliberately excluded, such as autonomy or a conscious inner life. One step further, when the system has an external impact, performance and safety come into focus. If something acts on the system from the outside, the focus is on robustness against random influences such as bad weather or dirty signs, and cybersecurity against targeted attackers.
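
As a rough illustration of what testing robustness against random influences can look like, the following sketch compares the accuracy of a toy classifier on clean and on randomly perturbed inputs. Everything in it is assumed for illustration: the threshold classifier stands in for a real AI system, and Gaussian noise stands in for influences such as dirt on a sign.

```python
import random


def classify(brightness):
    """Toy stand-in for an AI system: 'sign detected' if brightness > 0.5."""
    return brightness > 0.5


def accuracy(inputs, labels):
    """Share of inputs for which the toy classifier matches the label."""
    hits = sum(classify(x) == y for x, y in zip(inputs, labels))
    return hits / len(inputs)


def perturb(inputs, noise_level, rng):
    """Add Gaussian noise to simulate random influences such as dirt."""
    return [x + rng.gauss(0.0, noise_level) for x in inputs]


rng = random.Random(0)  # fixed seed so the run is reproducible
inputs = [rng.random() for _ in range(1000)]
labels = [x > 0.5 for x in inputs]  # ground truth matches the toy rule

clean_acc = accuracy(inputs, labels)
noisy_acc = accuracy(perturb(inputs, 0.1, rng), labels)
print(f"clean accuracy: {clean_acc:.3f}")
print(f"noisy accuracy: {noisy_acc:.3f}")
print(f"accuracy drop:  {clean_acc - noisy_acc:.3f}")
```

A cybersecurity test would differ in one respect: instead of random noise, the perturbation would be chosen by an attacker to cause the largest possible error.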

If you add a human individual, the epistemic area opens up: explainability and transparency, differentiated according to what laypeople and experts can understand. In the opposite direction, when the system influences people, privacy and nudging come into view.

When there are several individuals, ethical questions arise: fairness, non-discrimination, bias. Here, the AI system acts as an authority that differentiates between two people and decides who gets a job. At a societal level, legal issues of accountability follow, and at a global level, supply chain responsibility, working conditions in data labeling and the energy and resource consumption of the hardware.

The following overview summarizes the logic of the zoom levels:

Zoom level	Direction of view	Example test dimensions
AI system to the outside	Effect of the system	Performance, safety
Influence on the system	From outside on AI	Robustness, cybersecurity
System and an individual	Human understands AI	Explainability, transparency
AI affects individual	AI influences human	Privacy, nudging
Multiple individuals	AI differentiates	Fairness, non-discrimination, bias
Society	Responsibility	Accountability
Global scale	People and ecosystem	Supply chain, resource consumption

Not everything is regulated, and that is intentional

One finding from working with the matrix contradicts a common assumption. The AI Act does not regulate every aspect. If you map the requirements that are directly addressed to the AI system into the matrix, fields remain empty.

On explainability there are only hints, although much more could be required in terms of technology and content. This reticence is deliberate: the Act does not demand anything whose technical feasibility is still unclear.

This honesty is not a shortcoming. AI is developing at a speed that would quickly overtake regulation. Anyone who lays down requirements today that nobody can meet is damaging both safety and innovation.

Three forms of testing: Product, documentation and process

The third dimension of the matrix distinguishes how testing is carried out. The first form is concrete product testing: the AI system is put on the test bench like a car. The question here is how the measuring instrument is applied and which limit values apply.

The second form is testing based on documentation. The AI Act stipulates in many places that technical documentation is evaluated. The provider carries out its own accuracy tests, for example, and the assessor then checks the plausibility of the results.

This second form still requires a full understanding of testing. You must be able to judge whether the correct test has been used, whether the values are plausible and whether the interpretation is correct. Without an understanding of the content, documentation cannot be seriously evaluated.
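
One simple plausibility check of this kind can be sketched as follows: if the documentation reports both an accuracy figure and the underlying confusion matrix, the two must agree. The function names, the tolerance and all figures are hypothetical and serve only as an illustration.

```python
def accuracy_from_confusion(tp, fp, fn, tn):
    """Recompute accuracy from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


def is_plausible(reported_accuracy, tp, fp, fn, tn, tolerance=0.005):
    """Check that a reported accuracy agrees with the reported counts."""
    recomputed = accuracy_from_confusion(tp, fp, fn, tn)
    return abs(recomputed - reported_accuracy) <= tolerance


# Hypothetical figures from a provider's technical documentation.
counts = dict(tp=420, fp=30, fn=50, tn=500)
print(accuracy_from_confusion(**counts))  # 0.92
print(is_plausible(0.92, **counts))       # True
print(is_plausible(0.97, **counts))       # False
```

Such a check only covers internal consistency. Whether accuracy was the right metric for the use case, and whether the test data were representative, still has to be judged by someone who understands the content.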

The third form concerns processes and people. Risk management and quality management play a major role here in any case. Added to this is human competence, anchored in AI literacy under Article 4 and in human oversight. Article 26 requires deployers to ensure that the people assigned to human oversight have the necessary competence, which raises the question of what criteria are used to assess human competence.

Fairness is not the same as non-discrimination

A precise set of definitions is the basis of any robust assessment. Buzzwords alone are not enough. Each test dimension needs a definition, aligned with international standardization and the AI Act, so that the overall set remains consistent.

The difference between fairness and non-discrimination shows why this is necessary. In its articles, the AI Act mentions non-discrimination only once, in Article 10. Fairness appears only in the recitals.

Non-discrimination means what is required by law, for example under the German General Equal Treatment Act (AGG) or the EU Charter of Fundamental Rights. Fairness, on the other hand, encompasses different concepts for individual people and groups that go beyond the law and may even be in tension with it.

As soon as two concepts of fairness are contradictory, it is not possible to build a system that satisfies both at the same time. Therefore, before any testing, it must be clear which concept is meant. You can define what you mean by fairness. Which concept of fairness is the right one remains another, very old question.
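
A small worked example shows how the verdict depends on the chosen concept. It compares two common group fairness measures, demographic parity (equal share of positive decisions per group) and equal opportunity (equal share of positive decisions among the qualified). The choice of these two measures and all data are assumptions for illustration; the article itself does not name specific metrics.

```python
def selection_rate(predictions):
    """Share of positive decisions (used for demographic parity)."""
    return sum(predictions) / len(predictions)


def true_positive_rate(predictions, labels):
    """Share of truly qualified people who get a positive decision."""
    positives = [p for p, y in zip(predictions, labels) if y == 1]
    return sum(positives) / len(positives)


# Hypothetical applicants: 1 = qualified, 0 = not qualified.
labels_a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # 60% qualified
labels_b = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # 30% qualified

# A classifier that decides exactly according to qualification.
preds_a, preds_b = list(labels_a), list(labels_b)

tpr_gap = abs(true_positive_rate(preds_a, labels_a)
              - true_positive_rate(preds_b, labels_b))
rate_gap = abs(selection_rate(preds_a) - selection_rate(preds_b))

print(f"equal opportunity gap:  {tpr_gap:.2f}")   # 0.00
print(f"demographic parity gap: {rate_gap:.2f}")  # 0.30
```

The same decisions are fair under one definition and unfair under the other. A test report therefore has to state which definition was applied before any number can be interpreted.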

Why philosophy helps AI testing

AI acts like a magnifying glass for questions that have preoccupied mankind for thousands of years. Because an AI system develops something like cognitive capacities, old questions about justice, understanding and responsibility can be looked at anew in a sharpened form.

Both sides benefit from this dual perspective. The technical view is sometimes too quick to use the black box label. Strictly speaking, however, a neural network is not a black box because all the information about the network is available. The only unknown is why this information generates this particular behavior, not the information itself.

The philosophical perspective, on the other hand, brings centuries of research into concepts such as justice and fairness, which are used in the AI debate. Bringing the two disciplines together allows us to organize a testing field with the necessary depth instead of covering it with buzzwords.

The next step for the matrix is to work from the top down. The top-down design must be made concrete, because you don’t test robustness in the same way for every system. The challenge is to find the right altitude: formulated as generally as possible, but as concretely as necessary so that testing can really take place at the end.