Testing audio AI

Testing audio AI systems means checking speech recognition and speech synthesis models for errors without being able to look inside the model’s black box. Because each correction requires several days of costly retraining, so-called golden test sets and AI-generated test variations help to systematically uncover pronunciation and recognition errors.

Key Takeaways

AI audio models cannot be debugged from the inside: if you find an error, such as a misrecognized word, you have to retrain the model, which takes several days and costs several hundred to several thousand euros.
Golden test sets are currently the most important quality assurance method for audio AI, because classic test automation fails due to the black box nature of AI models.
The data problem hits German particularly hard: of the roughly 600,000 annotated audio hours used to train the Whisper model, only around 2,000 to 3,000 are in German, which structurally limits recognition quality for German speakers.
Lack of diversity in training data has a direct impact on model quality: If you train with 90 percent male voices, you get a model that recognizes women poorly and makes synthesized female voices sound masculine.
ChatGPT can be used to generate synthetic test data for audio AI because it produces linguistic variations that humans simply do not come up with when manually creating test cases.

Testing audio AI means testing a black box

The most difficult part of testing audio AI is that no one can really see inside the model. A speech model for transcription or synthesis is a black box whose inner logic cannot be checked line by line like a classic function. You can only see what goes in at the front and what comes out at the back.

With a classic function, you find a rounding error, correct the decimal places and the test is green. With an AI model, there is no such intervention point. If the model pronounces a word incorrectly or does not recognize it, you cannot simply change a rule. You would have to retrain the model and, depending on its size, that would cost several days and several hundred to several thousand euros.

Olaf Thiele has been working with German-language audio AI, i.e. speech-to-text and text-to-speech, for years. His conclusion: the models now work, but when it comes to testing, the industry is still at the very beginning.

Why reproducibility fails in AI training

AI training cannot currently be reliably reproduced because the hardware itself changes the result. With the common libraries PyTorch and TensorFlow, the same data set can deliver slightly different results on different computer architectures, even if all other variables remain the same.

This becomes a problem as soon as someone wants to reproduce a result exactly. Training usually takes place on hyperscalers such as AWS, Google Cloud, Azure or OVH. There, you don’t know whether you will be allocated the same physical hardware in a week, for example a V100 or an A100 GPU. The basic conditions for generating an identical result are therefore not in place.
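One well-known mechanism behind such hardware-dependent differences is that floating-point addition is not associative, so summing the same numbers in a different order can change the last digits. The source does not name this cause; the sketch below is only an illustration of the effect in plain Python, not a description of what PyTorch or TensorFlow do internally.

```python
# Floating-point addition is not associative: the grouping changes the result.
# Parallel hardware may group (reduce) sums differently from run to run
# or from device to device. Illustrative only.

left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right)                   # 0.6000000000000001
print(right_to_left)                   # 0.6
print(left_to_right == right_to_left)  # False
```

Over millions of weight updates, differences in the last digits like this can accumulate into slightly different models.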

This question didn’t even arise for a long time. As long as it was just a matter of a model running at all, nobody asked about exact reproducibility. Only now that customers are becoming larger and demanding governance is it moving to the forefront.

When do you stop training? A gut decision

There is no formula for when to stop training. A model goes through several stages, and with each stage it gets better on its internal test set. This is where the trap lurks: the longer you train, the better the model fits the training data, but at some point its generalization to new, unseen data gets worse.

The learning curve flattens out logarithmically. Whether you stop at step nine, ten, eleven or twelve is often decided by instinct. There is also a distribution problem within the data: While a male speaker is recognized better and better as training progresses, a female voice can get worse at the same time. There is no clear answer as to where you stop.
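A common heuristic for this decision is early stopping with a patience window: stop once the error on held-out data has not improved for a few steps. This is a swapped-in standard technique, not the method described in the source, and the error values below are invented purely for illustration. The sketch shows how the heuristic gives different answers for different speaker groups, which is exactly the dilemma described above.

```python
def pick_stop_step(errors, patience=2):
    """Return the index of the best step seen before the error
    failed to improve for `patience` consecutive steps."""
    best_step, best_error = 0, float("inf")
    for step, error in enumerate(errors):
        if error < best_error:
            best_step, best_error = step, error
        elif step - best_step >= patience:
            break  # no improvement for `patience` steps: stop here
    return best_step

# Hypothetical per-group error rates after each training step.
male_errors = [0.30, 0.25, 0.22, 0.20, 0.19, 0.18]    # keeps improving
female_errors = [0.35, 0.30, 0.28, 0.29, 0.31, 0.33]  # degrades after step 2

print(pick_stop_step(male_errors))    # 5
print(pick_stop_step(female_errors))  # 2
```

The heuristic does not resolve the conflict; it only makes visible that the "right" stopping point depends on which group you measure.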

The term “test set” in training here does not mean testing in the sense of software development. It is a small amount of data for internal self-monitoring of the model, not proof of quality in the classic sense.

The golden test set as an aid

The most common tool in audio testing is a golden test set: a carefully compiled collection of examples that can be used to measure the quality of the model. Ideally, the client creates this set themselves because they know best what they want to achieve. In practice, this rarely happens.

A good golden test set covers gradations. In the case of dialects, for example, this means a very strong Bavarian accent next to a light one. A few thousand samples can then be used to observe how the model changes across training runs.

The test set also helps to mitigate any errors found. If a model does not recognize a certain word, more examples of this word are added. This is not a real correction in the sense of a targeted bug fix, but rather a readjustment via the amount of data.
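One way to observe a golden test set across training runs is to score it per gradation, for example with the word error rate (WER), a standard metric for speech recognition. The source does not say which metric is used; the bucket names and sample sentences below are hypothetical.

```python
from collections import defaultdict

def word_errors(reference, hypothesis):
    """Word-level edit distance and reference length for one sample."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1], len(ref)

def wer_by_bucket(samples):
    """samples: iterable of (bucket, reference_text, recognized_text)."""
    errors, words = defaultdict(int), defaultdict(int)
    for bucket, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[bucket] += e
        words[bucket] += n
    return {bucket: errors[bucket] / words[bucket] for bucket in errors}

# Hypothetical golden samples with two dialect gradations.
golden = [
    ("light_bavarian", "ich möchte eine pizza bestellen",
     "ich möchte eine pizza bestellen"),
    ("strong_bavarian", "ich möchte eine pizza bestellen",
     "ich möchte pizza stellen"),
]
scores = wer_by_bucket(golden)
print(scores)  # {'light_bavarian': 0.0, 'strong_bavarian': 0.4}
```

Running the same scoring after every training run shows whether added examples help one bucket at the expense of another.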

The AI doesn’t know what it doesn’t know

The core problem with finding errors: an AI has no threshold at which it says “I don’t know.” It delivers a result even if it doesn’t actually have an answer. In audio recognition, this means you often don’t know which words a model omits or swallows.

An illustrative example is the pronunciation of loan words. Some people say “budget” with a hard pronunciation, others say it in French. To teach a model to recognize both, you would need thousands of examples of exactly this variant. Where do you get them from?

This gap also applies to testing the output. If you want to test synthesized speech automatically, you run into a circular argument: you would have to send the generated audio through a recognition AI again. If it is trained on the same data, it will miss exactly what it would miss in the original. The basic problem reproduces itself.
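The round trip described here can be sketched as a small harness. The `synthesize` and `transcribe` callables are hypothetical placeholders for whatever text-to-speech and speech-to-text systems are in use; the stubs below only imitate them so the example runs. The limitation from the text applies in full: a recognizer trained on the same data will not flag what it cannot hear.

```python
def round_trip_mismatches(texts, synthesize, transcribe):
    """Return (text, transcript) pairs where the recognizer did not
    return the text that was originally synthesized."""
    mismatches = []
    for text in texts:
        transcript = transcribe(synthesize(text))
        if transcript.strip().lower() != text.strip().lower():
            mismatches.append((text, transcript))
    return mismatches

# Stand-in stubs, NOT real TTS/STT: bytes pretend to be audio,
# and the fake recognizer mishears one word.
def fake_synthesize(text):
    return text.encode("utf-8")

def fake_transcribe(audio):
    return audio.decode("utf-8").replace("Budget", "Budgie")

texts = ["Das Budget ist knapp", "Eine Pizza bitte"]
found = round_trip_mismatches(texts, fake_synthesize, fake_transcribe)
print(found)  # [('Das Budget ist knapp', 'Das Budgie ist knapp')]
```

A harness like this can narrow down which outputs a human should listen to, but it cannot replace listening.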

Gold in, gold out: data quality determines the result

A model can only reproduce what went into the training data. This rule shapes every application. If you want to synthesize a young, female voice but have predominantly old, male audio material as input, you are fighting against the data situation.

This is precisely where the bias lies, and it permeates the results. The freely available Mozilla voice data consists of around 90 percent male voices. As a result, female voices are recognized less reliably, and synthesized female voices sometimes sound male.

The effect is amplified if a model is over-optimized for a small number of speakers. If you train for 300 hours with a single prominent voice, you will get a perfect understanding of this person, but other speakers, such as an older woman, will hardly be recognized.
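Before training, the composition of a data set can be checked with a few lines of code. The sketch below assumes each clip carries metadata with a `gender` label and a duration in `seconds`; both field names and the 30 percent threshold are assumptions for illustration, not values from the source.

```python
from collections import Counter

def group_shares(clips, key):
    """Share of total audio duration per value of `key`."""
    totals = Counter()
    for clip in clips:
        totals[clip[key]] += clip["seconds"]
    total = sum(totals.values())
    return {group: seconds / total for group, seconds in totals.items()}

def underrepresented(shares, minimum=0.3):
    """Groups whose share of the audio falls below `minimum`."""
    return sorted(group for group, share in shares.items() if share < minimum)

# Hypothetical metadata: 90 percent of the audio comes from male speakers.
clips = [
    {"gender": "male", "seconds": 300},
    {"gender": "male", "seconds": 240},
    {"gender": "female", "seconds": 60},
]
shares = group_shares(clips, "gender")
print(underrepresented(shares))  # ['female']
```

The same check works for age group, region or individual speaker, provided the metadata exists at all.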

Depending on what I put into such a model, that’s what I get out. That’s why I’m very careful with applications that are supposed to be diverse.
Olaf Thiele

Why German suffers particularly from a lack of data

There is significantly less usable training data for German than for English. A well-known model was trained on 600,000 annotated hours of audio material, of which around 550,000 hours were in English. Only around 2,000 to 3,000 hours remained for German.

If you want a German model at English level, you need a comparable volume of data, i.e. on the order of 50,000 hours of German. Hyperscalers in particular are currently collecting such quantities, and their models are getting measurably better.

Freely available sources are rare. The Mozilla project to collect speech data has come to an end. Many obvious sources fail due to licensing issues: Bundestag sessions, radio broadcasts, books read aloud from the Gutenberg project. The latter have the additional problem that a voice reading aloud does not sound like natural conversation.

A positive counterexample is the freely licensed German voice of Thorsten, who recorded around 20 hours with a good microphone. More open data sets like this are needed, including from women and from different regions.

Dialects are in demand but hardly feasible

Dialect-capable models almost always fail due to the data situation. Dialects differ greatly from region to region, sometimes from valley to valley. In Switzerland, someone is building a separate model for each valley, supported by public funds.

In Germany, development is more commercially driven, with a clear bias. There tends to be more data from the south and less from the east. As a result, people from the east are less well recognized than people from the south. A Saxon model would require around 1,000 hours of Saxon, which is simply not available.

A voice application contains several AIs, each with its own test problem

A voice skill consists of not one AI but at least three. Using the example of a pizza order, the stages and their respective testability can be clearly separated.

Component	Task	Testability
Speech-to-Text	Converting audio to text	Difficult, because it remains unclear who spoke and how
Natural Language Understanding	Recognizing intent from text	Good: text in, intent out
Text-to-Speech	Outputting text as audio	Difficult, because automated testing requires another AI

The middle component, Natural Language Understanding, can be tested in the classic way. You enter a text and check whether the correct intent is recognized. You write such tests as usual, even if platforms like Alexa don’t offer this out of the box.
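Such a "text in, intent out" test can look like an ordinary table-driven unit test. In the sketch below, `recognize_intent` is a toy keyword matcher standing in for a real NLU component; the function, the intent names and the test sentences are all hypothetical.

```python
def recognize_intent(text):
    """Toy stand-in for an NLU component: maps text to an intent name."""
    lowered = text.lower()
    if "pizza" in lowered and any(w in lowered for w in ("order", "like", "want")):
        return "order_pizza"
    if "cancel" in lowered:
        return "cancel_order"
    return "unknown"

# Each case pairs an input text with the intent we expect.
CASES = [
    ("I would like to order a pizza", "order_pizza"),
    ("I want a large pizza with mushrooms", "order_pizza"),
    ("Please cancel my order", "cancel_order"),
    ("What is the weather like", "unknown"),
]

def failed_cases(cases, recognize):
    """Return (text, expected, actual) for every case that fails."""
    return [(text, expected, recognize(text))
            for text, expected in cases
            if recognize(text) != expected]

print(failed_cases(CASES, recognize_intent))  # []
```

In a real project, `recognize` would wrap the call to the platform's NLU service, and the case table would grow with every misrecognized sentence found in production.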

The two audio ends remain the problem. With speech recognition, you only see the recognized text. It is not possible to reconstruct whether an error was caused by a dialect, mumbling or a bad microphone. On the output side, the question is how to check hundreds of synthesized answers without having a hundred people listen to them.

Generative AI as a tool for generating test data

Language models such as ChatGPT help most directly with testing where many text variations are needed. If you want to test the output of a voice skill, you usually have to formulate all the answer variants by hand, and after a few hours you simply can’t think of anything new.

A language model provides 50 or 100 variants of an answer on demand. The temperature setting controls how freely the result varies. These texts are then synthesized as audio and spot-checked by ear. In this way, errors that previously slipped through can be found, such as an incorrect pronunciation or two words that are spoken incorrectly in combination because the synthesis is based on phonemes.

The same lever works for the input side. If you generate 100 formulations for a pizza order, sentences come back that you would never have thought of yourself. Each additional sentence in the test data set costs almost nothing and significantly expands the tested spectrum.
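The workflow around the language model can be sketched without committing to any particular API. The code below only builds a prompt, cleans up a line-per-variant response and draws a random sample for listening; the actual model call is left out on purpose, and the prompt wording, function names and sample response are assumptions.

```python
import random
import re

def build_prompt(request_description, count):
    """Prompt text to send to a language model of your choice."""
    return (f"Write {count} different ways a customer might phrase "
            f"this request, one per line: {request_description}")

def parse_variants(raw_response):
    """Split a line-per-variant response, dropping list markers,
    blank lines and case-insensitive duplicates."""
    seen, variants = set(), []
    for line in raw_response.splitlines():
        text = re.sub(r"^\s*(?:[-*]|\d+[.)])\s+", "", line).strip()
        if text and text.lower() not in seen:
            seen.add(text.lower())
            variants.append(text)
    return variants

def pick_for_listening(variants, sample_size, seed=0):
    """Draw a reproducible random sample to synthesize and listen to."""
    rng = random.Random(seed)
    return rng.sample(variants, min(sample_size, len(variants)))

# Hypothetical model response, pasted in as plain text.
raw = "1. I'd like a pizza\n2. Can I get a pizza?\n\n- i'd like a pizza\n3. One margherita, please"
variants = parse_variants(raw)
print(variants)  # ["I'd like a pizza", 'Can I get a pizza?', 'One margherita, please']
```

Fixing the seed keeps the listening sample stable between runs, so that two reviewers can be asked to check the same clips.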

One limit remains: For sensitive or dangerous use cases, the error rate of generative models is unacceptable. For internal and preparatory tasks, however, the approach works well.

MLOps and standardization are coming, but audio is lagging behind

Tools for MLOps are gaining momentum, but so far they have been tailored to text, and support for audio lags far behind. The reason lies in the amount of data: ten seconds of audio are split into windows of 20 milliseconds each, the entire frequency band is evaluated for each window, and neighboring windows are included via a sliding window. This results in huge vectors. A typical training data set quickly amounts to 400 to 500 gigabytes.
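The arithmetic behind these sizes can be made concrete. Only the 10 seconds and the 20-millisecond windows come from the text; the 80 frequency bins per window, the context of 11 stacked windows and the 4 bytes per value are assumptions chosen for illustration, and the windows are treated as non-overlapping with padding at the edges.

```python
def frame_count(duration_ms, window_ms):
    """Number of non-overlapping windows in a clip."""
    return duration_ms // window_ms

def stacked_values(frames, bins_per_frame, context_frames):
    """Values per clip when each window is stacked with its neighbors."""
    return frames * bins_per_frame * context_frames

frames = frame_count(10_000, 20)          # 10 s of audio, 20 ms windows
values = stacked_values(frames, 80, 11)   # assumed: 80 bins, 11-window context
size_bytes = values * 4                   # assumed: 4 bytes per value (float32)

print(frames)      # 500
print(values)      # 440000
print(size_bytes)  # 1760000
```

Under these assumptions a single ten-second clip already expands to roughly 1.76 megabytes of features, which makes it plausible that thousands of hours of audio end up in the range of hundreds of gigabytes.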

This size makes it difficult to track data cleanly across training runs. If you vary the composition, for example with a higher proportion of women or fewer fast-talking men, you would need to track exactly which data was contained in which run. It is precisely this tracking that the current tools do not provide.
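A lightweight workaround is to record a fingerprint of the data composition with every run instead of copying the data itself. The sketch below assumes every audio clip has a stable ID; the run names, IDs and record layout are hypothetical and not taken from any existing MLOps tool.

```python
import hashlib

def manifest_fingerprint(clip_ids):
    """Order-independent SHA-256 fingerprint of the clips used in a run."""
    canonical = "\n".join(sorted(set(clip_ids)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def run_record(run_name, clip_ids, notes=""):
    """Small record to store next to the trained model."""
    return {
        "run": run_name,
        "clip_count": len(set(clip_ids)),
        "data_fingerprint": manifest_fingerprint(clip_ids),
        "notes": notes,
    }

def data_diff(old_ids, new_ids):
    """Which clips were added or removed between two runs."""
    old, new = set(old_ids), set(new_ids)
    return {"added": sorted(new - old), "removed": sorted(old - new)}

run_a = ["clip_001", "clip_002", "clip_003"]
run_b = ["clip_002", "clip_003", "clip_104"]  # one male clip swapped for a female one

print(run_record("run_a", run_a)["clip_count"])  # 3
print(data_diff(run_a, run_b))  # {'added': ['clip_104'], 'removed': ['clip_001']}
```

Two runs with the same fingerprint were trained on the same set of clips; if the fingerprints differ, the stored ID lists show exactly what changed.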

The Hugging Face platform works like a GitHub for AI models and hosts tens of thousands of models. Its library allows around 20 percent of a data set to be extracted, but does not include the associated metadata. So far, only technical run data has been specified, i.e. what is necessary for a model to start at all. Which data went into training and which parameters were used remains open.

Standardization would help the field, especially in terms of comparability. Today, customers operate entire collections of models, a so-called model zoo. Quality assurance and governance across many differently built models are hardly feasible as long as everyone comes up with their own conventions. A platform with the reach of Hugging Face would be best placed to set such a standard.