AI-supported test case determination

Sherlock is an AI-supported test assistant that helps employees in specialist departments create standard-compliant test cases in accordance with ISO 29119. It runs on MUCGPT, Munich’s own GPT platform. Staff without testing knowledge describe their application, and Sherlock generates structured test cases with preconditions, test steps and expected results that can be imported directly into tools such as TestLink or Jira X-Ray.

Key Takeaways

Employees in specialist departments have in-depth domain knowledge, but hardly any test knowledge: Sherlock starts right there and enables standard-compliant test cases according to ISO 29119 without prior testing training.
Prompt engineering is the decisive success factor: if you assign Sherlock the role of “test analyst”, you get boundary value analyses and equivalence partitions; if you don’t, a question about boundary values is answered as a mathematics problem.
Sherlock exports finished test cases as XML for TestLink or as CSV for X-Ray, so that departmental employees can import the results directly without having to type manually.
Watson, the counterpart to Sherlock, automatically generates an initial test report from entered test data based on an existing template from the City of Munich.
MUCGPT has no internal knowledge of the City of Munich, so users have to manually load functional specifications or system documents into the chat until a RAG connection is implemented.

Why departments fail at test case creation

Specialist departments have in-depth domain knowledge, but hardly any test knowledge. It is precisely this gap that makes writing test cases difficult for them. They know their application in detail, but are faced with a blank sheet as soon as structured test cases are to be created.

The situation is different on the technical side. Anyone who knows the software development process also knows how a test case is structured. However, the closer you get to the specialist departments, the greater the distance from the craft of testing.

The obvious approach of training everyone involved often fails in practice. Training courses cost time and money, and departmental employees have their day-to-day business to handle. Testing happens on the side. Foundation-level training cannot be rolled out across an organization that implements hundreds of projects per year.

An AI test assistant as an entry-level aid instead of a replacement

The state capital of Munich is addressing this gap with Sherlock, a test assistant based on its own GPT platform, MUCGPT. Sherlock creates test cases in accordance with ISO standard 29119 and serves as an entry point into testing for specialist departments.

The process is deliberately kept conversational. Anyone who asks Sherlock who it is receives an answer that includes a sample test case. This lets business users see what a standard-compliant test case actually looks like and then begin to iterate themselves: request a test case for a specific function, refine it, repeat.

Sherlock also provides the format. Precondition, test step, expected result and postcondition are pre-structured according to the standard. The name says it all. Sherlock is the investigator with the magnifying glass who examines and creates test cases.
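The four-part structure can be pictured as a small data model. The following is a minimal sketch, not Sherlock's actual implementation: the class names, field names and the sample parking permit case are all hypothetical and serve only to show how precondition, test step, expected result and postcondition relate to each other.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str           # what the tester does
    expected_result: str  # what the system should do in response


@dataclass
class StructuredTestCase:
    title: str
    preconditions: list[str]
    steps: list[Step]
    postconditions: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the test case as plain text in a fixed section order."""
        lines = [f"Test case: {self.title}", "Preconditions:"]
        lines += [f"  - {p}" for p in self.preconditions]
        lines.append("Steps:")
        for number, step in enumerate(self.steps, start=1):
            lines.append(f"  {number}. {step.action}")
            lines.append(f"     Expected: {step.expected_result}")
        lines.append("Postconditions:")
        lines += [f"  - {p}" for p in self.postconditions]
        return "\n".join(lines)


# Hypothetical sample data, invented for illustration only.
example = StructuredTestCase(
    title="Submit a parking permit application",
    preconditions=["The applicant is logged in"],
    steps=[
        Step("Open the application form", "The empty form is displayed"),
        Step("Fill in all mandatory fields and submit", "A confirmation is displayed"),
    ],
    postconditions=["The application is stored with status 'submitted'"],
)

print(example.render())
```

Each step carries its own expected result, which is what allows a reviewer in the department to accept or refine a case step by step.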

The distribution of roles remains important. The AI is the assistant, the human remains the decision-maker. The department decides whether a generated test case is adopted or refined.

Output is only as good as the input

A test assistant without context delivers generic results. MUCGPT does not yet know anything about the City of Munich, which is why it must first be given the functional or system specification for the business application in question. It can then create suitable test cases within this chat.

If this context is missing, the model makes things up. A question about the organizational structure of the City of Munich yields an answer, but not necessarily the right one. Anyone using Sherlock should take this into account and check the results against their own domain knowledge.

A connection to Retrieval Augmented Generation is planned. This would allow specialist departments to feed in their own data, functional specifications, system specifications and user stories and to connect internal intranet pages instead of manually copying specifications into the chat.

AI is not deterministic, and that can be used

If you enter the same prompt five times, you won’t get the same test case five times. This is in the nature of things, because an AI model varies its output.

Instead of fighting against this, the variation can be used productively. Request several test cases at once, about five, and view them together. This allows you to quickly recognize which variants fit and fine-tune them from there.

Prompt engineering is the real key competence

Prompting comes before the tool. In the training courses offered by the City of Munich, working with Sherlock does not start with the assistant itself, but with prompt engineering.

The assigned role determines the result. If the model is given the test analyst role, it applies boundary value analysis and equivalence partitioning. Without this role, a question about boundary values is treated as a mathematics question instead of a testing question.
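To make the two techniques concrete, here is a minimal sketch for an input field that accepts whole numbers in a closed range. The function names and the range 1 to 10 are assumptions chosen for illustration; the sketch uses two-value boundary analysis and one representative per partition, which is only one of several common variants.

```python
def boundary_values(minimum: int, maximum: int) -> list[int]:
    """Two-value boundary analysis: each boundary plus its invalid neighbor."""
    return [minimum - 1, minimum, maximum, maximum + 1]


def equivalence_partitions(minimum: int, maximum: int) -> dict[str, int]:
    """One representative value per partition: below, inside, above the range."""
    return {
        "invalid_below": minimum - 5,
        "valid": (minimum + maximum) // 2,
        "invalid_above": maximum + 5,
    }


print(boundary_values(1, 10))         # [0, 1, 10, 11]
print(equivalence_partitions(1, 10))  # {'invalid_below': -4, 'valid': 5, 'invalid_above': 15}
```

Each value becomes the input of one test case: the boundary values probe the edges of the valid range, while the partition representatives cover each class of input once.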

This is exactly what makes Sherlock tangible: a specialist for one specific task. In addition to the role, context, background information and formatting specifications are also important. Sherlock already includes formatting according to the ISO standard.
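The ingredients named here (role, context, task and format) can be assembled into a prompt mechanically. The sketch below is purely illustrative: Sherlock's real system prompt is not published in this article, so the function `build_prompt`, its parameters and the sample texts are assumptions.

```python
def build_prompt(role: str, context: str, task: str,
                 output_format: str, count: int = 1) -> str:
    """Assemble a prompt from role, context, task and format sections."""
    sections = [
        f"Role: You are a {role}.",
        f"Context:\n{context.strip()}",
        f"Task: {task.strip()} Produce {count} test case(s).",
        f"Format: {output_format.strip()}",
    ]
    return "\n\n".join(sections)


# Hypothetical inputs; in practice the context would be the pasted specification.
prompt = build_prompt(
    role="test analyst",
    context="The form field 'number of copies' accepts whole numbers from 1 to 10.",
    task="Derive test cases using boundary value analysis.",
    output_format="Precondition, test steps, expected result, postcondition.",
    count=5,
)
print(prompt)
```

Keeping the role in its own section makes it easy to see what changes when it is omitted: the context and task stay the same, but the model no longer has a reason to interpret "boundary value" as a testing term.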

The media break to the test management tool remains a hurdle

The generated test cases have to be transferred to the test management tool, and this is where new friction arises for specialist users. There are three tools used by the City of Munich: TestLink as an open source product, X-Ray from the Jira world and SAP Solution Manager.

Sherlock exports an importable XML file for TestLink and a CSV file for X-Ray. The current limit is a maximum of ten test cases per import. SAP Solution Manager is not yet connected.
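What such an export might look like can be sketched with the Python standard library. This is not Sherlock's exporter. The XML element names follow the TestLink test case import format as commonly documented and should be checked against the TestLink version in use; the CSV column names are placeholders, since columns are mapped during import; the sample case is invented.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; each step is (action, expected result).
CASES = [
    {
        "name": "Login with valid credentials",
        "preconditions": "A user account exists.",
        "steps": [
            ("Open the login page", "The login form is shown"),
            ("Enter valid credentials and submit", "The start page is shown"),
        ],
    },
]


def to_testlink_xml(cases) -> str:
    """Build a <testcases> document; element names are an assumption."""
    root = ET.Element("testcases")
    for case in cases:
        testcase = ET.SubElement(root, "testcase", name=case["name"])
        ET.SubElement(testcase, "preconditions").text = case["preconditions"]
        steps = ET.SubElement(testcase, "steps")
        for number, (action, expected) in enumerate(case["steps"], start=1):
            step = ET.SubElement(steps, "step")
            ET.SubElement(step, "step_number").text = str(number)
            ET.SubElement(step, "actions").text = action
            ET.SubElement(step, "expectedresults").text = expected
    return ET.tostring(root, encoding="unicode")


def to_steps_csv(cases) -> str:
    """Write one row per test step; the header names are placeholders."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Test ID", "Summary", "Precondition", "Action", "Expected Result"])
    for test_id, case in enumerate(cases, start=1):
        for action, expected in case["steps"]:
            writer.writerow([test_id, case["name"], case["preconditions"], action, expected])
    return buffer.getvalue()


print(to_testlink_xml(CASES))
print(to_steps_csv(CASES))
```

Both functions read the same in-memory structure, so the choice of target tool only affects the last step of the pipeline.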

For business users, the tool itself is already a hurdle. Where can I find my test cases? In TestLink, they are under test specifications because that’s what the tool calls them. The automatic import removes this step because the business users no longer have to write or enter the test cases themselves.

The limit of ten test cases has a technical reason. Depending on the model, there is a maximum token output length above which the output is cut off.
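If more than ten test cases are needed, one conceivable workaround is to request and import them in several batches. The article does not describe this as part of Sherlock, so the helper `batches` below is only a sketch of that idea, with an invented list of 23 titles.

```python
def batches(items: list, size: int = 10):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical list of 23 test case titles.
titles = [f"Test case {n}" for n in range(1, 24)]
sizes = [len(batch) for batch in batches(titles)]
print(sizes)  # [10, 10, 3]
```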

Watson writes the test report

Where Sherlock investigates, Watson documents. The second assistant is under development and automatically creates a test report from the key figures entered.

Watson works with a system prompt into which the relevant data is entered: the number of test cases executed, passed and failed, the defects found with their IDs, and the project number. A first draft is created from this based on an existing test report template.
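The step from key figures to a first draft can be illustrated with a plain text template. Watson itself uses a language model and the City of Munich's own report template, neither of which is reproduced here; the template text, the function `draft_report`, the derived pass rate and the sample values below are assumptions for illustration.

```python
from string import Template

# Hypothetical template; the real report template is not shown in the article.
REPORT_TEMPLATE = Template(
    "Test report for project $project\n"
    "Test cases executed: $executed\n"
    "Passed: $passed\n"
    "Failed: $failed\n"
    "Pass rate: $pass_rate\n"
    "Defects found: $defects\n"
)


def draft_report(project: str, passed: int, failed: int, defect_ids: list[str]) -> str:
    """Fill the template; executed is assumed to equal passed plus failed."""
    executed = passed + failed
    pass_rate = f"{passed / executed:.0%}" if executed else "n/a"
    return REPORT_TEMPLATE.substitute(
        project=project,
        executed=executed,
        passed=passed,
        failed=failed,
        pass_rate=pass_rate,
        defects=", ".join(defect_ids) if defect_ids else "none",
    )


print(draft_report("P-0000", passed=18, failed=2, defect_ids=["BUG-1", "BUG-2"]))
```

The draft contains only what can be derived from the entered figures; interpretation and recommendations still have to be added by the person responsible for the test.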

This draft is not final. However, the core data is already included, which noticeably reduces the effort required for the report.

Requirements grow from use

The most useful functions arise from feedback from the specialist departments, not from the drawing board. The first version provided a single standard-compliant test case, but the departments asked for ten. This is how the output of multiple test cases at once came about.

Another pain point is the creation of defect reports. Which ticket type, which fields, how to fill them in? The plan is therefore for users to describe the defect, after which the assistant maps the information to the right fields and imports the ticket into Mantis or X-Ray.

Concrete usage figures can hardly be collected at present. Feedback comes in sporadically, for example via internal Open Space formats. In the future, it should be possible to subscribe to the assistants, so that the number of subscribers can at least be used to determine how widespread they are.

More personal responsibility through a roles and rights system

Until now, adjustments to the assistants had to be made by the developers at the AI Competence Center. With MUCGPT 2.0, this changes thanks to a roles and rights system.

In future, the person responsible will be the owner of their assistant. They can change the system prompt, adapt sample answers and respond to feedback without having to go back to the developers every time. This relieves the burden on the AI Competence Center and shortens the path from idea to function.

“In the end, the AI is the assistant and the human is still the decision-maker.”
Mark Menzel

Innovation depends on people and an enabler

Public administration is quickly written off as outdated, but that image does not hold here. The drivers are specific people, from the AI Competence Center to dual-study students who implement topics such as Watson in their theses.

The real enabler was the early decision. The state capital of Munich launched MUCGPT back in April 2023, jumping on the trend instead of sitting it out. Acceptance is broad: the Social Department, the Building Department and the Department of Education and Sport are at the forefront, not just IT.

The benchmark remains the benefit. Innovation is not justified by itself, but by the added value it brings to the work and the city.