Test description for AI capabilities

Making AI systems testable: How capabilities, quality criteria and structured test descriptions turn abstract standards into concrete testing approaches.


An AI test description is the structured documentation of test cases for AI systems, organized by test objective, test steps and acceptance criteria. It is based on two dimensions: the capabilities of an AI system (perception, processing, action, communication) and method-specific quality criteria such as correctness, robustness and protection against bias.

Key Takeaways

  • AI systems can be tested using five processing skills: Identification, Classification, Extraction, Selection and Generation. These capabilities apply uniformly to image, sound and speech processing.
  • The EU AI Act has been in force since August 2024 and requires manufacturers of high-risk AI systems to meet standards on ten topics, including correctness, robustness, cybersecurity and risk management.
  • General purpose AI systems such as GPT-4 are subject to stricter documentation and safety obligations if their training compute exceeds the threshold of 10^25 floating-point operations (FLOPs).
  • Notified bodies for AI conformity assessment must be designated in around eight months according to European law. The European standards that are to form the basis for this are already past their April 2025 deadline.

The Federal Network Agency regulates AI via standards, not bans

The Federal Network Agency is the regulatory authority responsible for telecommunications, postal services, railroads, electricity and gas. In the telecommunications sector, it monitors the market for radio equipment such as Bluetooth radios and cell phones and works on the conformity assessment of products for the European single market.

In AI regulation, the task is shifting towards standardization. The EU AI Act has been in force since August 2024, and every market entrant must comply with it. Taras Holoyad, who works in telecommunications regulation at the German Federal Network Agency, is involved in drafting the standards that put this law into practice.

The focus is on so-called high-risk systems and general purpose AI systems. Standards are to define which requirements a product must meet before it is allowed onto the market. The European Commission has issued a standardization mandate with ten topics on which the authorities are working together with industry, consulting companies and certification bodies.

Why AI standardization is so difficult

Artificial intelligence is difficult to standardize because the technology moves faster than the committees can work. Many describe it as old wine in new bottles. NASA was already relying on neural networks for space programs in the 1990s.

According to Taras, little of this is groundbreakingly new. The Transformer method is more accurate than older approaches. Even so, AI is not comparable to human intelligence; it is a very complex algorithmic system.

It is precisely this gap that makes standardization tricky. If too much detail is specified, it may no longer be possible to test new systems in a meaningful way. If the standard remains too abstract, it does not help the tester. From Taras’ point of view, the decisive research breakthrough is still missing, and this influences every decision on how specific a standard can be.

Testing AI means testing its capabilities

The functional spectrum of an AI system can be tested by looking at its capabilities, not by asking whether it is intelligent. This idea forms the core of a standardization approach that describes AI along two dimensions.

The first dimension is the methods that are implemented in algorithms: classic AI with optimization and planning procedures, symbolic AI with knowledge representation, machine learning and hybrid procedures that combine rule-based and data-driven approaches.

The second dimension is the capabilities that these algorithms implement. These include the perception of images or smells, the processing of knowledge, action (robotic or software-based) and communication, as performed by a ChatGPT-type system.

Five basic skills can be identified for processing within AI models: identification, classification, extraction, selection and generation. Metrics can be formulated for each of these capabilities, uniformly for sound, images or natural language. If you want to test a Hugging Face model, you can do so along these five capabilities in a repeatable and scalable way.
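The idea of one metric per capability, applied uniformly across modalities, can be sketched in a few lines of Python. This is only an illustration: the `Capability` enum mirrors the five skills named above, while the `METRICS` registry and the choice of plain accuracy for classification are assumptions, not something the standard prescribes.

```python
from enum import Enum


class Capability(Enum):
    """The five processing capabilities named in the article."""
    IDENTIFICATION = "identification"
    CLASSIFICATION = "classification"
    EXTRACTION = "extraction"
    SELECTION = "selection"
    GENERATION = "generation"


def accuracy(predicted, expected):
    """Share of predictions that match the expected labels."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


# Assumed pairing: the article does not name a metric per capability.
METRICS = {Capability.CLASSIFICATION: accuracy}

# The same metric is applied to labels from different modalities.
image_score = METRICS[Capability.CLASSIFICATION](["car", "bus"], ["car", "car"])
audio_score = METRICS[Capability.CLASSIFICATION](["siren", "horn"], ["siren", "horn"])
```

Because the metric only sees labels, it does not matter whether they come from an image, audio or text model, which is what makes the comparison repeatable.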

This approach is part of the international standard ISO/IEC 42102, which Taras is leading. It is being developed together with colleagues from France, the USA and Germany. Through the Vienna Agreement between ISO and CEN, an international standard and a European standard are being developed in parallel. In terms of content, this standard fits in with the topic of transparency from the standardization mandate.

Quality criteria make AI testing tangible for testers

Quality criteria provide testers with a familiar lever for making abstract AI requirements measurable. A second standard describes which criteria apply to individual methods as soon as the algorithms are implemented.

Five quality criteria can be formulated for supervised, unsupervised and reinforcement learning:

  • Correctness
  • Robustness
  • Avoidance of unwanted bias
  • Protection against adversarial attacks
  • Information security

Each criterion includes method-dependent metrics. For example, the confidence score can be used for the correctness of supervised learning and image recognition. This creates a kind of matrix that you can place over an AI system to check it specifically.
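One way to picture this matrix is as a lookup table of learning methods against quality criteria. The sketch below is hypothetical: the key names are shorthand for the five criteria listed above, and only the one pairing mentioned in the text (confidence score for the correctness of supervised learning) is filled in. All other cells are deliberately left empty rather than guessed.

```python
# Hypothetical sketch: learning methods x quality criteria as a lookup matrix.
CRITERIA = [
    "correctness",
    "robustness",
    "bias",
    "attack_protection",
    "information_security",
]
METHODS = ["supervised", "unsupervised", "reinforcement"]

# Every cell starts empty; None means "no metric named in this article".
matrix = {method: {criterion: None for criterion in CRITERIA} for method in METHODS}

# The one pairing the article mentions explicitly.
matrix["supervised"]["correctness"] = "confidence score"


def open_cells(matrix):
    """Return (method, criterion) pairs that still lack a metric."""
    return [
        (method, criterion)
        for method, row in matrix.items()
        for criterion, metric in row.items()
        if metric is None
    ]
```

Calling `open_cells(matrix)` lists the method and criterion combinations that a tester would still have to back with a concrete metric before the matrix can be laid over a system.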

The two standards are intertwined. One describes what artificial intelligence actually is, i.e. methods and capabilities. The other specifies which quality criteria can be verified. There is a practical reason why the level of detail is spread across several documents: for political reasons, the various stakeholders do not always agree to put every level of detail into a single document.

A test description language structures AI tests like code

A test case for AI can be described in structured text, similar to a function in program code. Taras is working on this as a new proposal, while the other two standards are already at an advanced stage. The inspiration comes from the Test Description Language from the ETSI committee MTS (Methods for Testing and Specification), where corresponding descriptions for protocol tests in mobile communications and the automotive industry have been developed.

The idea: you can see at a glance what it is all about. Instead of a function definition with def as in Python, you write the specified syntax directly: a test case in curly brackets, named for example Vehicle Recognition, that contains the Test Objective, the Test Activities and the Test Steps.

An example from image recognition: A model reads every frame of a video, inferring and classifying objects. A threshold value can be set in the structured text, such as a confidence score of 0.5. An object is not classified until this value is reached. At the bottom are the acceptance criteria that determine when the model is assumed to be correct.
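The structure of such a test case can be approximated in Python. This is not the test description language itself, whose syntax the article does not show. The `TestCase` and `Detection` classes, the field names and the `min_accuracy` acceptance criterion are hypothetical stand-ins; only the name Vehicle Recognition and the confidence threshold of 0.5 come from the example above.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One object inferred by the model in a single video frame."""
    frame: int
    label: str
    confidence: float


@dataclass
class TestCase:
    name: str
    objective: str
    steps: list[str]
    confidence_threshold: float
    min_accuracy: float  # hypothetical acceptance criterion

    def classify(self, detections):
        """Keep only detections whose confidence reaches the threshold."""
        return [d for d in detections if d.confidence >= self.confidence_threshold]

    def passes(self, detections, ground_truth):
        """Check the accepted detections against the acceptance criterion."""
        accepted = self.classify(detections)
        if not accepted:
            return False
        correct = sum(1 for d in accepted if ground_truth.get(d.frame) == d.label)
        return correct / len(accepted) >= self.min_accuracy


vehicle_recognition = TestCase(
    name="Vehicle Recognition",
    objective="Classify vehicles in every frame of a video",
    steps=["read frame", "infer objects", "classify objects"],
    confidence_threshold=0.5,
    min_accuracy=0.9,
)

detections = [
    Detection(frame=0, label="car", confidence=0.9),
    Detection(frame=1, label="truck", confidence=0.4),  # below threshold, dropped
    Detection(frame=2, label="car", confidence=0.6),
]
ground_truth = {0: "car", 1: "car", 2: "car"}
result = vehicle_recognition.passes(detections, ground_truth)
```

The low-confidence detection in frame 1 is never classified, so it does not count against the model; whether that is the desired behavior is exactly the kind of decision a structured test description makes visible.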

If you commission me to carry out a test, I take my two standards and use them to create a structured text with the test description.

Taras Holoyad

Taras intends to initiate this test description as a new document at ETSI MTS in the coming months, probably as a technical specification. The current ETSI report on this bears the number 103 910.

When the AI law affects you

Whether the AI Act affects you depends on which of two categories your product falls into. The law has been in force since August 2024 and is being phased in through staggered deadlines. Enforcement is carried out by market surveillance authorities, each of which is responsible for segments such as medical devices, toys or radio equipment. A total of around 13 segments are to be covered.

The first category is high-risk AI systems. These are systems that serve as a safety component of a product. Taras makes a clear distinction between safety and security: safety is the protection of people from the machine, security is the protection of the machine itself. Anyone operating a high-risk system must either meet the standards from the ten topics of the mandate or go to a certification body.

The ten topics of the standardization mandate include correctness, robustness, cybersecurity, quality management, conformity assessment, transparency and risk management. Such a mandate is normally sent as a standardization request to the organizations CEN, CENELEC and ETSI; in this case it went only to CEN and CENELEC.

The second category is general purpose AI systems, i.e. systems with a particularly broad range of functions such as ChatGPT. An additional threshold applies here: if the cumulative compute used for training exceeds 10^25 floating-point operations (FLOPs), the system is considered a General Purpose AI System with Systemic Risk. This figure is estimated from an algorithm-specific constant, the number of tokens in the training data and the number of parameters. According to Taras’ assessment, GPT-4 has exceeded this value. The result is more extensive documentation requirements and additional preventive cybersecurity measures.
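The estimate from constant, token count and parameter count can be written as a short calculation. The sketch below makes one assumption: it uses a constant of 6, a common rule of thumb for dense Transformer training, whereas the article only speaks of an algorithm-specific constant. The function names and the two example models are hypothetical and do not describe any real system.

```python
THRESHOLD_FLOPS = 1e25  # threshold named in the article


def training_flops(n_parameters, n_tokens, constant=6.0):
    """Estimate training compute as constant * parameters * tokens.

    constant=6.0 is an assumed rule of thumb for dense Transformers,
    not a value taken from the regulation.
    """
    return constant * n_parameters * n_tokens


def exceeds_threshold(n_parameters, n_tokens, constant=6.0):
    """True if the estimated training compute is above the threshold."""
    return training_flops(n_parameters, n_tokens, constant) > THRESHOLD_FLOPS


# Hypothetical models, chosen only to land on either side of the threshold.
small_model = exceeds_threshold(n_parameters=70e9, n_tokens=2e12)
large_model = exceeds_threshold(n_parameters=1e12, n_tokens=1e13)
```

With these made-up numbers the smaller model stays roughly an order of magnitude below the threshold and the larger one lands above it, which shows how strongly the classification depends on both model size and the amount of training data.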

Everyone is under time pressure

The tight schedule is putting pressure on authorities, manufacturers and certification bodies alike. Where no standards are available or their application is not sufficient, manufacturers must go to a certification body. This body checks the test results and issues a CE mark for market access.

However, a certification body may only do this once it has been notified. In each European member state, a responsible authority must evaluate the body together with an independent expert. Only after this assessment does the body become a notified body that can contribute to market access.

The deadlines are tight. The notified bodies must be designated in around eight months, i.e. the assessment by the authorities must be completed. Within two years, these bodies should have built up enough expertise to test high-risk systems.

European standardization itself is also behind schedule. The deadline was set for April 2025, but the official website of CEN and CENELEC shows a time window of 2026. Attempts are being made to implement some standards in greatly accelerated procedures in order to submit something suitable to the Commission in good time.
