AI-generated test cases in a regulated environment

AI-supported test case generation in regulated medical technology works when the system does not interpret data but only retrieves passages from the organization's own documentation. A RAG system retrieves text chunks by similarity analysis, and an LLM uses them to generate structured test cases. A human reviews and formally releases each one. This creates traceability without the need for tool validation.

Key Takeaways

The HyDE principle solves the tagging problem for hundreds of thousands of document pages: Instead of asking questions, the system formulates assertions and uses similarity analysis to check which document chunks support them.
AI-generated test cases in a regulated environment do not require tool validation as long as a human formally reviews each output and approves it via e-signature.
Automatic prompt engineering, where one AI subsystem generates optimized prompts for another, enables one-click test case generation without training testers in prompting.
Modularity beats model selection: If you build the AI architecture so that individual models are replaceable, you do not depend on whichever model is currently in use.

Why AI-supported testing looks different in medical technology

In the regulated environment of medical technology, an AI system must not add anything. This is precisely the central condition under which Alexander Frenzel from Fresenius Medical Care has built a test case generator: an assistance system that supports testers without undermining regulatory traceability.

Fresenius develops hemodialysis machines such as those found in dialysis clinics. Patients’ blood circulation depends on such machines. Mistakes are not an option, and innovation must be subordinate to this reality. “Good is not good enough for us,” is how Alexander describes the standard that frames every technical decision.

The testing of these systems is multi-layered. It involves not only software, but also hardware, electronics, materials and biocompatible components, all of which interact with each other. The proof of concept deliberately started with the software, with the aim of rolling it out more broadly later on.

What distinguishes the test case generator from a chatbot

The generator is not a prompt tool for end users, but a one-click approach. The tester enters a requirement, presses a button and a test case comes out at the end.

This decision is based on usability. In a large test organization, you would otherwise have to enable many people to prompt properly. Instead, an AI subsystem takes over the prompt creation itself: It generates several prompts from the input, which are optimized for a downstream AI system.
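The one-click idea can be sketched roughly as follows. In the system described here an AI subsystem writes the prompts; this sketch substitutes fixed templates for that subsystem so that it stays runnable. The names `TEMPLATES`, `GeneratedPrompt` and `build_prompts`, as well as the template wording, are assumptions made for illustration and are not taken from the Fresenius implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneratedPrompt:
    technique: str  # test design technique the prompt targets
    text: str       # prompt handed to the downstream model


# Hypothetical templates standing in for the prompt-writing AI subsystem.
TEMPLATES = {
    "boundary_value_analysis": (
        "Derive boundary values for the requirement below. "
        "Cite the source chunk for every value.\n\nRequirement: {requirement}"
    ),
    "equivalence_partitioning": (
        "Derive equivalence partitions for the requirement below. "
        "Cite the source chunk for every partition.\n\nRequirement: {requirement}"
    ),
}


def build_prompts(requirement: str) -> list[GeneratedPrompt]:
    """Turn one requirement into several prompts; the tester never writes one."""
    requirement = requirement.strip()
    if not requirement:
        raise ValueError("requirement must not be empty")
    return [
        GeneratedPrompt(technique=name, text=template.format(requirement=requirement))
        for name, template in TEMPLATES.items()
    ]
```

The tester only supplies the requirement text; everything that would otherwise require prompting skills sits behind `build_prompts`.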

Behind this is not a single AI, but an architecture consisting of several systems, some of which use different LLMs. This complexity is precisely the reason why the question “How do I set something like this up?” determines success, not the choice of a single tool.

Why traceability must be built in from the start

There must be no room for hallucinations in the regulated sector. This is why reasoning was a mandatory component for Fresenius even before the broad AI wave, at a time when common chatbots did not yet offer this.

Testers need to know where a statement comes from. How does the system get from a text passage to the boundary values of a boundary value analysis? How are the equivalence partitions created? It must be possible to map these thought processes, otherwise the result is worthless in a regulatory sense.
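To make the traceability question concrete, here is a minimal sketch of two-value boundary value analysis and equivalence partitioning in which every derived value carries its rationale and its source. It assumes a requirement that states a closed integer range; the class names, the parameter and the source label in the usage note are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeRequirement:
    source: str     # document and section the range was taken from
    parameter: str
    minimum: int
    maximum: int


@dataclass(frozen=True)
class TracedValue:
    value: int
    expected_valid: bool
    rationale: str  # why this value was chosen
    source: str     # where the underlying limit is documented


def boundary_values(req: RangeRequirement) -> list[TracedValue]:
    """Two-value boundary analysis: each bound plus its nearest invalid neighbour."""
    candidates = [
        (req.minimum - 1, False, "just below lower bound"),
        (req.minimum, True, "lower bound"),
        (req.maximum, True, "upper bound"),
        (req.maximum + 1, False, "just above upper bound"),
    ]
    return [TracedValue(v, ok, why, req.source) for v, ok, why in candidates]


def equivalence_partitions(req: RangeRequirement) -> dict[str, str]:
    """Three partitions for a closed range: below, inside, above."""
    return {
        "invalid_low": f"{req.parameter} < {req.minimum}",
        "valid": f"{req.minimum} <= {req.parameter} <= {req.maximum}",
        "invalid_high": f"{req.parameter} > {req.maximum}",
    }
```

For a made-up requirement such as `RangeRequirement("SRS-EXAMPLE 4.2", "flow_rate", 100, 500)`, the boundary values are 99, 100, 500 and 501, and each one points back to the same source entry.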

Fresenius did not bring AI expertise to the table, but rather domain knowledge. The AI expertise came from external partners. Fresenius' own contribution was the business concept: defining what the system must look like in order to create real added value.

How the HyDE principle makes distributed documentation usable

The system works with its own documents via a RAG system without interpreting them. This means that nothing new is added and the testers retain their original documentation.

The basic problem is that the data is stored in distributed systems and repositories, some as Word files from 2002 that have never been transferred to an up-to-date ALM system. With hundreds of thousands of pages, nothing can be tagged manually.

The solution came from one of the architects: the HyDE principle (Hypothetical Document Embeddings). Instead of asking a question, the system makes an assertion and looks for data in the RAG system that supports it. This works because an AI system finds similar statements more easily than direct answers to a question.

The system uses a similarity analysis to score, as a percentage, how closely the vector of each text chunk matches the vector of the assertion. In this way, it finds the relevant chunks without the AI changing their content. Another advantage: Fresenius would not have had enough test data of its own to train an LLM directly for a single project. The RAG approach removes this dependency.
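A minimal sketch of this retrieval step, under clearly stated assumptions: the `embed` function below is a toy word-count vector standing in for a real embedding model, the assertion is passed in as a fixed string (in the system described, an AI component would formulate it), and the example sentences in the usage note are invented. Only the scoring idea, cosine similarity between the assertion vector and each chunk vector, is meant to carry over.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors, 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(assertion: str, chunks: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Rank chunks by similarity to the assertion; chunks are returned verbatim."""
    query = embed(assertion)
    scored = [(cosine(query, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Given an assertion like "The blood pump stops when the venous pressure alarm limit is exceeded." and a handful of chunks, `retrieve` returns the best-matching chunks together with their scores. The chunk text itself is never rewritten, which is the property the regulatory argument relies on.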

The human being remains in every process step

A human in the loop was a given right from the start. The tester can intervene at any point, make adjustments and contribute their own experience without having to start from scratch.

This makes the output a draft, not a finished result. Testers decide for themselves when to use the generated output as a basis. This is followed by testing on the machine, review and release as a prerequisite for formal test execution.

Development ran for around nine months in 2024, with a small internal team and a larger external one, around ten people on average, who did the work alongside their regular duties. With a dedicated team, it could have been done in two to three months, and it would probably go faster today.

Modularity beats choosing the perfect model

The architecture is built in such a way that models are replaceable. This was more important to Fresenius than choosing a specific model, because these models are constantly evolving.

A lot was tried out in the course of the project: first Llama, then Claude, then Llama again. In addition, there were technical hurdles relating to operation in a separate cloud, as intellectual property cannot be moved to public services. The system has to run separately and remain manageable.

If a new model is released, it can be swapped in for the existing one and checked to see whether it produces better output. This adaptability keeps the solution viable across model generations.
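One way to keep models replaceable is to let the rest of the system depend only on a narrow interface. The sketch below illustrates that idea and does not describe the actual architecture; `TextModel`, `EchoModel` and `DraftGenerator` are hypothetical names, and `EchoModel` is a stub that calls no real LLM.

```python
from typing import Protocol


class TextModel(Protocol):
    """Anything that can turn a prompt into text can be plugged in."""

    name: str

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stub model for demonstration; a real adapter would call an LLM here."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] draft for: {prompt}"


class DraftGenerator:
    """Depends only on the TextModel interface, never on a concrete model."""

    def __init__(self, model: TextModel) -> None:
        self._model = model

    def swap_model(self, model: TextModel) -> None:
        self._model = model

    def generate(self, requirement: str) -> str:
        return self._model.complete(requirement)
```

Swapping a model then means writing one new adapter and calling `swap_model`; the same set of requirements can be run through both models to compare the output.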

Where AI-supported testing reaches its limits

The generator works very well with software. It becomes difficult as soon as physical dependencies come into play.

One example: a tubing system that is being filled. The pressure is not static, but highly dynamic. Such effects cannot be easily modeled. You need a physical model behind it, and this model would have to be updated with every new feature, every new motor, every replaced component. This effort places strict limits on AI support in the physical domain.

Why tool validation is not necessary here

A generative, non-deterministic system can hardly be classically validated. After a long discussion, the decisive question arose: Is it even necessary? The answer was no, and with good reason.

Several conditions support this assessment:

The output serves as a draft, not as a final result.
A human reviews the output and formally approves it via e-signature.
The data is stored unlabeled in the RAG system; the system does not change it.
Documentation is created for each process step, which can be accessed and tweaked at any time.

“Then it’s just a tool that empowers you, but not one that does the work for you. And that puts you back on a good regulatory footing.”
Alexander Frenzel

With good logging, what comes out remains traceable. How exactly the system arrives at a result cannot be reconstructed one hundred percent, and linguistically, sometimes one formulation is better, sometimes another. However, the underlying information remains the same because it is not changed.
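The conditions above amount to a gate between draft and formal execution. The following sketch shows one possible shape for such a gate with an audit log. All names are hypothetical, and the signature is a plain string here, whereas a real e-signature would come from a compliant signing service.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DraftTestCase:
    requirement_id: str
    steps: list[str]
    status: str = "draft"
    audit_log: list[str] = field(default_factory=list)

    def _log(self, event: str) -> None:
        # Every change is recorded with a UTC timestamp.
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit_log.append(f"{stamp} {event}")

    def edit_step(self, index: int, text: str, editor: str) -> None:
        """Testers may adjust a draft at any point before release."""
        if self.status == "released":
            raise PermissionError("released test cases are read-only")
        self.steps[index] = text
        self._log(f"{editor} edited step {index + 1}")

    def approve(self, reviewer: str, signature: str) -> None:
        """Release requires a human reviewer and a non-empty signature."""
        if not signature:
            raise ValueError("an e-signature is required for release")
        self.status = "released"
        self._log(f"{reviewer} released with signature {signature}")


def execute_formally(test_case: DraftTestCase) -> str:
    """Formal execution is refused for anything that is still a draft."""
    if test_case.status != "released":
        raise PermissionError("only released test cases may be executed formally")
    return f"executing {len(test_case.steps)} steps for {test_case.requirement_id}"
```

The generated output can never reach formal execution on its own: `execute_formally` only accepts test cases that a person has explicitly released.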

What testers found surprisingly useful

In the internal beta testing, the first reaction was euphoria about the output. However, the testers saw the greatest added value in an area that the team had not expected: knowledge transfer.

Because the system shows where the expectation comes from for each expected test step, testers came across documents that they hadn’t had on their radar. With a lot of distributed documentation, this makes it possible to see where relevant information is actually located.

The second major point was the ability to intervene throughout the entire process. Nobody has to start from scratch. A draft that is not perfect but is workable saves time that can be used for fine-tuning before review and release.

From draft to pipeline: the obvious next steps

The next expansion stages arise almost automatically from the existing system. The first step is to integrate the generator, which has so far run as a web application, directly into the ALM and PLM systems.

The RAG database is to be connected to various document management systems and databases via service interfaces in order to make the data more readily available. Given a well-documented test specification with traces, the system can generate a keyword-driven test once it is passed the organization's own keyword library.
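A keyword library also provides a cheap, deterministic check on whatever the generator produces. The sketch below validates generated keyword steps against a library and insists on a trace for every step; the keywords, parameter names and trace labels are invented for illustration and do not come from any real library.

```python
from dataclasses import dataclass

# Hypothetical keyword library: keyword name -> required argument names.
KEYWORD_LIBRARY = {
    "Set Flow Rate": ["value_ml_min"],
    "Start Pump": [],
    "Verify Alarm": ["alarm_id"],
}


@dataclass(frozen=True)
class KeywordStep:
    keyword: str
    args: dict[str, str]
    trace: str  # reference to the test specification entry behind this step


def validate(steps: list[KeywordStep], library: dict[str, list[str]]) -> list[str]:
    """Return human-readable problems; an empty list means the steps are usable."""
    errors = []
    for number, step in enumerate(steps, start=1):
        if step.keyword not in library:
            errors.append(f"step {number}: unknown keyword '{step.keyword}'")
            continue
        missing = [name for name in library[step.keyword] if name not in step.args]
        if missing:
            errors.append(f"step {number}: missing arguments {missing}")
        if not step.trace:
            errors.append(f"step {number}: missing trace")
    return errors
```

Steps that use an unknown keyword or lack a trace are rejected before anyone reviews them, so the reviewer's time goes into judging test design and not into catching formal slips.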

From there, the path to a test script for a software-in-the-loop (SiL) system is short. The script could be fed directly into the pipeline, tested in the simulation and, if errors occur, revised by the system itself, for example using an LLM-as-a-judge approach.
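Such a revise-and-retest loop might look like the sketch below. It is an assumption about how the idea could be wired up, not a description of an existing pipeline: `run_simulation`, `judge` and `revise` are injected callables, and in practice the judge and the reviser would each be backed by a model call.

```python
from typing import Callable


def revise_until_accepted(
    script: str,
    run_simulation: Callable[[str], str],
    judge: Callable[[str, str], str],
    revise: Callable[[str, str], str],
    max_rounds: int = 3,
) -> tuple[str, bool, list[str]]:
    """Simulate, let a judge rate the result, and revise until it passes.

    Returns the last simulated script, whether the judge accepted it,
    and a per-round history for the audit trail.
    """
    history: list[str] = []
    for round_no in range(1, max_rounds + 1):
        result = run_simulation(script)
        verdict = judge(script, result)
        history.append(f"round {round_no}: {verdict}")
        if verdict == "pass":
            return script, True, history
        if round_no < max_rounds:
            # Only revise when another simulation round will check the change.
            script = revise(script, result)
    return script, False, history
```

The round limit keeps the loop from running indefinitely, and the history records what happened in each round. A script the judge accepts is still only a draft until a human has reviewed and released it.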

One thing remains the same in every expansion stage: the human in the loop for approval and traceability. Humans must understand the system and test design well enough to assess whether the result really tests what is to be tested. Curricula such as the new GenAI curriculum from GTB and ISTQB help testers make this assessment on a sound basis.