Acceptance test-driven LLM development

Acceptance Test Driven LLM Development is an approach in which failing acceptance tests, derived from real faulty dialogs, also feed the training data for fine-tuning a large language model. New business requirements are specified as dialog-based test cases; the model is then retrained and automatically checked against all previous tests to detect regressions.

Key Takeaways

Acceptance Test Driven LLM Development transfers the principle of failing acceptance tests directly to fine-tuning: A test fails, new training dialogs fix the error, then the test suite proves whether the model has learned without generating regressions.
LLM outputs cannot be checked by string comparison alone because semantically identical answers rarely share the same wording. Verification must therefore differentiate between strict structural comparison and semantic checking, depending on the output type.
Real production dialogs from pilot operation are the primary source of data: anonymized conversations show which requests the model does not yet solve autonomously and directly provide the material for new training and test data.
Template-based dialogs enable combinatorial testing by replacing template variables with many different values and thus automatically creating a large series of measurements from a test case.

What Acceptance Test Driven LLM Development means

Acceptance Test Driven LLM Development applies the principle of acceptance test driven development to the training and validation of language models. Instead of training a model only once and hoping that it fits, the team first writes acceptance tests that the model does not initially pass and then develops specifically against them.

The term comes from David Faragó, who works at Mediform on a telephone assistant for medical practices. His point: LLM development as a discipline is still in its infancy. The models themselves are powerful, but the processes and tools around them are lagging behind.

The appeal of the approach lies in the combination of two worlds. The training set and the test set come from machine learning. The acceptance tests as a safety net come from agile software development. If you derive failing test cases from real examples, you can measure afterwards whether the model has improved.

Why testing language models is so difficult

Verifying an LLM is difficult because it is not deterministic and behaves like a black box. The same input can produce different outputs, and no one can directly see why the model responds the way it does.
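One common way to cope with non-determinism in a test is to sample the same input several times and report a pass rate instead of a single verdict. The following is a minimal sketch of that idea, not something the source describes: `pass_rate` and `flaky_model` are hypothetical names, and the stub model merely imitates varying output.

```python
import random
from typing import Callable


def pass_rate(model: Callable[[str], str],
              prompt: str,
              check: Callable[[str], bool],
              runs: int = 20) -> float:
    """Call the model `runs` times on one prompt; return the share of passing outputs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if check(model(prompt)))
    return passed / runs


# Hypothetical stand-in for a sampled LLM: same input, varying output.
_rng = random.Random(0)


def flaky_model(prompt: str) -> str:
    return _rng.choice([
        "So you want to book an appointment?",
        "Did I understand correctly, you want to book an appointment?",
        "Sorry, I did not catch that.",
    ])


rate = pass_rate(flaky_model, "I need an appointment",
                 lambda out: "appointment" in out.lower())
print(f"pass rate: {rate:.2f}")
```

A threshold on the pass rate (for example, "at least 19 of 20 runs") then turns the measurement into a pass/fail test; the right threshold is a project decision.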

Faragó quotes an apt image: LLMs are technology that was given to us by aliens. They are excellent at dealing with natural language and drawing conclusions, but can hardly be controlled using classical methods.

To make matters worse, natural language applications cannot be checked using simple string comparisons. Whether the bot says “Did I understand correctly, you want to book an appointment?” or “So you want to book an appointment?”, the two strings differ but mean the same thing. It is precisely this semantic check that is the difficult part.

The three directions for quality

Mediform meets these challenges with a solid process, fast validation and customized verification.

The process is based on CPMAI, Cognilytica’s Cognitive Project Management for AI. It combines modern software development with modern machine learning and integrates agility, which older approaches such as CRISP-DM do not achieve.

Validation takes place over short iteration cycles with directly involved pilot customers. Patients use the bot, the team analyses the anonymized dialogues and deduces what needs to be adjusted in the next iteration.

Verification is based on a test tool derived from EleutherAI’s Language Model Evaluation Harness. This harness is also behind the well-known Open LLM Leaderboard at Hugging Face. Mediform has extended it substantially and tailored it to its own business requirements.

Task-oriented dialog is right in the middle

Telephone assistants for medical practices are a special case in testing because they lie between two extremes. Faragó calls this class of applications task-oriented dialog.

At one end are tasks with a single clear answer, where a string comparison for identity is sufficient. At the other end is free creativity, such as a generated poem that a human reads and where small deviations are irrelevant.

A telephone bot that books an appointment sits in between. It has to solve a specific task, but formulates the answer linguistically freely. That’s why a check is needed that distinguishes between literally identical and semantically identical instead of just comparing character strings.

How the tool-former approach controls the check

Mediform uses an agent-based tool-former approach in which the model generates two types of output. One type is natural language text for the patient. The other type is function calls.

Prefabricated messages are triggered via such function calls so that the model does not have to regenerate long blocks of text each time. The model also generates function calls for database queries and similar actions.

The verification tool checks with varying degrees of rigor depending on the situation. In the case of free text, it can be more accommodating; in the case of function calls and database queries, the output must be exactly correct. This distinction determines whether a test is run as a string comparison or as a semantic check.
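The distinction between strict and lenient checking can be sketched as a small dispatcher. This is an illustrative sketch under stated assumptions, not Mediform's tool: the `Output` type, the `kind` labels and the threshold are hypothetical, and `crude_similarity` is only a character-overlap placeholder for a real semantic judge (for instance an embedding model or an LLM-based grader), which it cannot replace.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable


@dataclass(frozen=True)
class Output:
    kind: str   # "function_call" or "text" (hypothetical labels)
    value: str


def crude_similarity(a: str, b: str) -> float:
    """Placeholder for a semantic judge; it only measures character overlap."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def verify(expected: Output, actual: Output,
           similarity: Callable[[str, str], float] = crude_similarity,
           threshold: float = 0.6) -> bool:
    """Strict comparison for function calls, lenient comparison for free text."""
    if expected.kind != actual.kind:
        return False
    if expected.kind == "function_call":
        # Function calls and database queries must match exactly.
        return expected.value == actual.value
    # Free text for the patient is judged by similarity instead of identity.
    return similarity(expected.value, actual.value) >= threshold


print(verify(Output("function_call", "book(day='Mon')"),
             Output("function_call", "book(day='Tue')")))
print(verify(Output("text", "Did I understand correctly, you want to book an appointment?"),
             Output("text", "So you want to book an appointment?")))
```

Passing the similarity function as a parameter keeps the strict path untouched when the semantic judge is swapped out.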

The dialog is at the heart of the entire process

Mediform maintains a single dialog format throughout the entire process. The same structure serves as training data, as test cases and as a format that the test tool also reads.
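What a single dialog structure for both purposes might look like can be sketched as follows. The field names, role labels and JSON layout are assumptions made for illustration; the source does not specify Mediform's actual format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Turn:
    role: str      # "patient" or "assistant" (hypothetical role names)
    content: str


@dataclass
class Dialog:
    dialog_id: str
    turns: list[Turn] = field(default_factory=list)

    def as_training_record(self) -> str:
        """Serialize the whole dialog as one JSON line for a fine-tuning set."""
        return json.dumps({"id": self.dialog_id,
                           "messages": [asdict(t) for t in self.turns]})

    def as_test_case(self) -> tuple[list[Turn], Turn]:
        """Split into the context and the expected final assistant turn."""
        if not self.turns or self.turns[-1].role != "assistant":
            raise ValueError("a test case must end with an assistant turn")
        return self.turns[:-1], self.turns[-1]


dialog = Dialog("d1", [
    Turn("patient", "I need an appointment."),
    Turn("assistant", "So you want to book an appointment?"),
])
print(dialog.as_training_record())
context, expected = dialog.as_test_case()
print(expected.content)
```

Because both views are derived from the same object, a dialog written once can be added to the training set, the test suite, or both.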

The process follows the CPMAI cycle. After deployment, anonymized dialogs that actually took place at the pilot customer are collected. The team analyzes them in Business and Data Understanding, measures, for example, how many dialogs reached their goal completely autonomously, and identifies the suboptimal cases.
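The analysis step can be illustrated with a small calculation. This is a sketch with made-up data: `DialogOutcome` and its fields are hypothetical, and how a dialog is judged to have reached its goal autonomously is left open here.

```python
from typing import NamedTuple


class DialogOutcome(NamedTuple):
    dialog_id: str
    reached_goal_autonomously: bool


def analyze(outcomes: list[DialogOutcome]) -> tuple[float, list[str]]:
    """Return the autonomy rate and the ids of dialogs that need a closer look."""
    if not outcomes:
        raise ValueError("no dialogs to analyze")
    failed = [o.dialog_id for o in outcomes if not o.reached_goal_autonomously]
    rate = 1 - len(failed) / len(outcomes)
    return rate, failed


# Invented example data, not real measurements.
outcomes = [
    DialogOutcome("d1", True),
    DialogOutcome("d2", True),
    DialogOutcome("d3", False),
    DialogOutcome("d4", True),
]
rate, candidates = analyze(outcomes)
print(f"autonomy rate: {rate:.0%}, candidates for new tests: {candidates}")
```

The returned ids are the candidates from which new acceptance tests would be written.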

New acceptance tests are created from these faulty dialogs. Because they originate from real errors, they initially fail: The model cannot yet cope with them. In parallel, the team generates many training dialogs in the same format.

The model is then retrained and tested against both the new and old tests. This makes it possible to simultaneously measure whether the model has learned and whether it shows regressions.
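Measuring learning and regressions at the same time amounts to comparing two test runs. A minimal sketch, assuming each run is recorded as a mapping from a hypothetical test id to pass or fail:

```python
def compare_runs(before: dict[str, bool],
                 after: dict[str, bool]) -> dict[str, list[str]]:
    """Classify tests by how their result changed between two model versions."""
    report: dict[str, list[str]] = {"fixed": [], "regressed": [], "still_failing": []}
    for test_id in sorted(before.keys() & after.keys()):
        if not before[test_id] and after[test_id]:
            report["fixed"].append(test_id)          # the model has learned
        elif before[test_id] and not after[test_id]:
            report["regressed"].append(test_id)      # old behavior broke
        elif not before[test_id]:
            report["still_failing"].append(test_id)  # failed before and after
    return report


# The new test fails on the old model by construction.
before = {"old_booking": True, "old_cancel": True, "new_reschedule": False}
after = {"old_booking": True, "old_cancel": False, "new_reschedule": True}
print(compare_runs(before, after))
```

An iteration is successful in this sketch when `fixed` contains the new tests and `regressed` is empty.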

The six stages of CPMAI at a glance

CPMAI structures the LLM development in six iterative stages, between which you can jump back and forth. Mediform has combined the first and second stages for its own use case.

Stage	Content at Mediform
Business Understanding	Combined with Data Understanding because the same dialogs are analyzed
Data Understanding	Analysis of the collected dialogs, derivation of new business requirements
Data Preparation	Writing acceptance tests, creating training set
Model Development	Fine-tuning the language model
Evaluation	Execute test tool, calculate business-oriented metrics
Deployment	Demo, pilot customer or production to collect new data

The sixth stage is the real innovation compared to CRISP-DM. Only deployment closes the circle because it provides new, business-oriented data for the next iteration.

Fine-tuning decides when prompt engineering is no longer enough

Fine-tuning comes into play when adapting the prompts alone is not enough. With prompt engineering, you only change the input. With fine-tuning, you adjust the weights of a base model to the specific task.

Mediform uses two variants for this. One is a fine-tuned GPT, the other a fine-tuned open-source model such as Mistral. Prompt engineering remains part of the solution in both cases.

One special feature lies in the training data itself. The dialogs come from speech-to-text, so they occasionally contain transcription errors. The testing framework has to deal with the fact that the input does not consist of cleanly typed text.

A language example shows how strongly fine-tuning influences a model. A Mistral model fine-tuned on German dialogs understands French perfectly, but occasionally slips into German when it should answer in French, because the fine-tuning has biased it toward German. French training and test dialogs are therefore needed for a French pilot customer.

Template-based testing opens the door to metamorphic testing

Because acceptance tests and training data are written on a template basis, metamorphic testing is easy to implement. You take a relevant dialog and replace the template variables with many other values.

This results in combinatorial testing for a single aspect. Instead of a single case, the team performs many measurements on the same question and sees whether the model remains stable across variations.
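Expanding one template into many concrete cases can be sketched with the standard library. The template syntax (`$day`, `$time`), the variable values and the `book_appointment` call are hypothetical; the point is that the utterance and the expected function call are filled from the same bindings, so the expected result changes in step with the input.

```python
from itertools import product
from string import Template


def bindings(variables: dict[str, list[str]]) -> list[dict[str, str]]:
    """All combinations of the given variable values."""
    names = list(variables)
    return [dict(zip(names, combo)) for combo in product(*variables.values())]


def expand_case(utterance_tpl: str, expected_call_tpl: str,
                variables: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Fill the patient utterance and the expected call from the same bindings."""
    return [(Template(utterance_tpl).substitute(b),
             Template(expected_call_tpl).substitute(b))
            for b in bindings(variables)]


cases = expand_case(
    "I would like an appointment on $day at $time.",
    'book_appointment(day="$day", time="$time")',
    {"day": ["Monday", "Friday"], "time": ["9 am", "2 pm", "5 pm"]},
)
print(len(cases))  # 2 days x 3 times = 6 cases
for utterance, expected_call in cases[:2]:
    print(utterance, "->", expected_call)
```

Each generated pair is one measurement; the share of passing pairs shows how stable the model is across the variations.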

Stress tests are run before deployment as well. For these, the team automates the patient side with a separate language model and runs the freshly fine-tuned model against this patient model. The resulting dialogs are then evaluated.
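The interplay of the two models can be sketched as a simple turn-taking loop. Everything here is an assumption for illustration: both speakers are plain callables standing in for real models, the patient speaks first, and `<end>` is a hypothetical marker that ends the dialog.

```python
from typing import Callable

# A speaker receives the transcript so far and returns its next utterance.
Speaker = Callable[[list[str]], str]


def simulate(assistant: Speaker, patient: Speaker,
             max_turns: int = 6, end_marker: str = "<end>") -> list[str]:
    """Alternate patient and assistant turns until the marker or the turn limit."""
    transcript: list[str] = []
    for turn in range(max_turns):
        speaker = patient if turn % 2 == 0 else assistant
        utterance = speaker(transcript)
        transcript.append(utterance)
        if end_marker in utterance:
            break
    return transcript


def scripted_patient(transcript: list[str]) -> str:
    """Stub for the patient model: replays a fixed script."""
    lines = ["I need an appointment.", "Tuesday works.", "Thanks, bye. <end>"]
    return lines[min(len(transcript) // 2, len(lines) - 1)]


def stub_assistant(transcript: list[str]) -> str:
    """Stub for the fine-tuned model under test."""
    return f"Understood: {transcript[-1]}"


dialog = simulate(stub_assistant, scripted_patient)
for line in dialog:
    print(line)
```

The turn limit guards against two models that never end the conversation; the collected transcripts can then be evaluated like any other dialog.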

However, the actual safety net remains the acceptance tests. Unit tests continue to run for the code around the model, but the behavior of the LLM itself is primarily ensured by the dialog-based acceptance tests.

“You take precedents or examples, use them to specify the new requirements and then have a measurement option or even a safety net so that you can really measure whether you have improved it afterwards.”
David Faragó

Why iterations cannot be tied to rigid sprints

The iteration cycles at Mediform do not follow a fixed rhythm, even if the team works in two-week sprints. A CPMAI cycle does not necessarily coincide with a sprint.

Sometimes several iterations fit into two weeks, sometimes a single cycle takes longer. Ideally, sprint and cycle would coincide, but in practice something regularly comes up in between.

The driver is demand. If a new pilot customer needs an additional language, the team inserts a quick intermediate iteration and starts with the existing model before generating new training and test data. This flexibility is part of the approach, not a break with it.