Quality from and with Prompt Engineering

Prompt engineering for test generation means eliciting better test cases step by step from a large language model through targeted prompt patterns. Proven techniques include few-shot examples for robustness, embedded library specifications for correctness and the ReAct pattern for stable conversation sequences. Correctness remains the central unresolved weakness.

Key Takeaways

Prompt engineering is no substitute for human review: Even after iterative improvement with multiple prompt patterns, two out of eleven generated test cases remained incorrect because the model does not reliably guarantee correctness.
Few-shot examples in the prompt measurably increase robustness: They prevent the model from delivering completely different or absurd results with minimal prompt changes.
Anchoring domain-specific knowledge directly in the prompt improves correctness more than general prompt tuning: Inserting a concrete library specification has significantly increased the hit rate of the test cases.
The ReAct pattern (Reasoning plus Action) is the most robust prompt pattern from David’s experiments: It completely prevented hallucinations in the tested scenarios and kept prompts stable against variations.
Testers are better qualified for prompt engineering than many other roles because they are used to checking correctness, specifying requirements, and constructing representative examples.

Prompt engineering is the discipline that makes language models usable for quality work

Prompt engineering refers to the targeted design of instructions to a language model so that it reliably fulfills a specific task. It is the alternative to fine-tuning, in which a model is retrained with collected and labeled data. In prompt engineering, the model remains unchanged and is controlled solely by the text of the query.

David Faragó has been using this approach since the release of GPT-3. His first serious attempt concerned the evaluation of commit messages. Instead of elaborately fine-tuning a model, he formulated a specification based on existing guidelines and supplemented it with examples. The model then assigned scores for the quality of commit messages, with surprisingly good results.

For testers, the issue is closer than it initially sounds. Many prompt patterns revolve around quality and non-functional aspects. This is the bread and butter of testing.

Why you can’t rely on the correctness of generated tests

Language models are not guaranteed to deliver correct results when generating tests. This is the central limitation, and it weighs heavily as soon as something specific rather than something creative is involved.

David describes two moments that dampened his initial euphoria. In the commit message experiment, the model initially delivered a poor score with a convincing justification. A minimal change, a comma to a semicolon, completely overturned the verdict: the same weak commit message suddenly got a top score. The result was simply wrong.

The second setback came from ChatGPT with a statistical problem. The answer sounded plausible throughout, but contained an error of around ten percent in the middle. David used the suggested method and ran into a dead end for one or two days because he hadn’t questioned the result enough.

With creative tasks such as a poem, a deviation in content is tolerable. Not so with test cases. If a generated test fails, you look for the error in the system under test, but it is in the test itself.

The thing can do a lot, but I wouldn’t put my hand in the fire that it always delivers a correct result. David Faragó

How to improve a prompt step by step

A good prompt is created iteratively by applying proven prompt patterns one by one and comparing the result. Just as there are design patterns in software development, there are documented prompt patterns, and new ones are added every week.

David has selected around five to ten patterns that have proven themselves in practice and successively added to them using a single example for test generation. The example was deliberately kept small, a string processing with Python and PyTest.

With each iteration, one aspect got better, while another got worse. Test coverage and structuring of the tests improved noticeably. Correctness remained the sore point. The final result was significantly better than the first attempt, but not flawless.

Two levers determine quality: robustness and correctness

Two levers had the greatest effect on the test quality. Robustness describes whether the model delivers a comparable result when the prompt is changed slightly or simply repeated, instead of producing something completely different.

Concrete examples in the prompt increase robustness. This approach is called Few-Shot Learning. It does not ensure that tests are always identical or always correct, but it does prevent the absurd cases of the model copying the entire system-under-test code into the test file, inserting unnecessary imports or writing long explanatory texts before and after the test code.

The second lever is correctness, and a simple measure helped here. David worked with a dependency called difflib. The model knew the module and described the underlying algorithm when asked, but remained vague. Only when he copied the difflib specification, about four to five paragraphs of text, directly into the prompt did the correctness of the generated tests improve noticeably.

The rule of thumb: don’t rely on the model to retrieve its own knowledge accurately. Provide the relevant specification, even if the model theoretically knows the content.

Conversation beats single prompt: the model as its own tester

A single prompt is not the end. With coverage criteria in the prompt, David generated eleven test cases from the string example. The preliminary stage without the additional coverage requirement had returned five tests, all correct. With the prompt to increase the test coverage, eleven tests came out, some of them more difficult, two of which were incorrect again.

Instead of further optimizing this one prompt, David switched to a conversation. There is a special pattern for this: you let the model judge for itself whether its previous output has fulfilled the task set.

This is intuitively comprehensible. A generative language model produces word by word. Assessing a finished result retrospectively is an easier task than generating it without errors. It’s the same with humans: it’s easier to assess whether something is good or bad than to create something good yourself. In David’s case, the model recognized the two faulty test cases as such.

Which approaches really help against hallucination

Correctness is most likely to be improved via well thought-out prompt sequences, less via specialized models. David tried out the Constitutional AI model Claude from Anthropic, a company run by former OpenAI employees. When asked if it was hallucinating, the model replied that it was a Constitutional AI model and was not hallucinating. The generated test cases then had about the quality of David’s first, weak ChatGPT attempt. When it came to correctness, the result was disappointing.

He was more impressed by the React pattern. React stands for Reasoning and Action and has nothing to do with the front-end framework. The reasoning part is similar to Chain of Thought, where you instruct the model to think step by step before it outputs a result. The action part teaches the model to make external calls or obtain information from outside.

In a chatbot experiment for MediForm, David experienced very high robustness with a cleanly written React pattern. He was able to correct a single aspect in the prompt, everything else remained stable. In the handful of experiments he did, no hallucination occurred.

The React pattern can also be found in common tools. Auto-GPT goes in the direction of GPT-Agent and combines a whole sequence of prompts with plugins and external API calls. The LangChain library, which can be used to program such agents, uses a React-like pattern under the hood.

Testers have an advantage as prompt engineers

Testers have exactly the mindset that prompt engineering requires. Correctness is the central weakness of the models, and quality is the core competence in testing.

Many prompt patterns focus on quality and non-functional aspects. Selecting good few-shot examples for a prompt or formulating a clean requirements specification is the job of testers. The scrutinizing, skeptical view of a result is an advantage in prompt engineering, not an obstacle.

How to start small

It’s worth starting with small, daily tasks. The return on investment is high, the effort low.

David uses ChatGPT on the side for manageable problems: recalling a library that he hasn’t used for a while or interpreting a long, unknown stack trace. The model often names the cause of the error directly. Where you used to switch to Stack Overflow or Google, today you can switch to the language model.

The dialog is the real advantage over a search query. If a topic requires more interaction and you want to clarify sub-aspects in follow-up questions, the model is more useful than a single Google search. In this way, you gain experience that will later help you with systematic prompt engineering.

Where the field is heading: open models and more efficient fine-tuning

Open models and new fine-tuning techniques are key to further development. Meta has published Llama as open source, albeit not for commercial use. In contrast to pure prompt engineering, the model itself is available here and can be fine-tuned to improve properties such as correctness.

Model sizes range from around three or six billion parameters to 13, 30 or 60 billion. It is difficult to fine-tune larger models on the hardware side. On the other hand, new optimizations are created every week: Quantization, fine-tuning only certain parameters or remembering weight differences instead of the weights themselves.

A leaked Google document entitled “We have no moat” argues that neither Google nor OpenAI have a permanent lead because open source development and openly published research are rapidly closing the gap. David considers the document to be one-sided, but sees a kernel of truth in it.

At this disruptive stage, it is difficult to predict how much is still needed to build a system that is so robust and correct that it can be used in a safety-critical area with a clear conscience. Some things with a big wow effect turn out to be unsuitable in practice, while more inconspicuous approaches become really useful.