Why Traditional Testing Fails for AI Systems

Testing chatbots shatters the pass-fail logic of traditional software testing because the same question can yield countless valid answers, and bugs don’t live in the code – they hide in prompts, retrieval strategies, and vector embeddings. Unlike conventional QA where one test case is correct and the rest are negative scenarios, chatbot testing demands evaluating context retention, hallucination control, and semantic relevance across non-deterministic outputs. The real challenge isn’t finding errors in an application, but tracing invisible failures in data retrieval, chunking strategies, and response generation that no magic tool can fully automate yet.

Podcast Episode: Why Traditional Testing Fails for AI Systems

This time I talk with Dušanka Lečić about why testing chatbots breaks everything we know about traditional QA. She explains how chatbot bugs are invisible – they hide in prompts, retrieval logic, and chunks, not in code – and why the same input can produce dozens of valid outputs. Dušanka shares her framework for testing context retention, hallucination control, and accuracy, and reveals why stress testing a chatbot means checking for typos and user frustration, not system load.

“For the same input we have a lot of different outputs, some of them can be similar, but yeah still non-determinism is completely there.” - Dušanka Lečić

Dušanka Lečić is a dynamic leader and technical expert with nearly a decade of experience steering software testing initiatives across international teams. As a Test Lead and Department Manager at Levi9, she specializes in performance testing, agile methodologies, and engineering excellence. Holding a Ph.D. in Technical Sciences, Dušanka blends academic insight with real-world execution, and is a frequent contributor to industry conferences, mentoring programs, and expert communities. Her sessions offer a rich perspective on quality assurance, innovation, and leadership in fast-paced development environments.

Episode Highlights

Chatbot testing must account for many valid answers to the same input, whereas traditional testing usually has a single passing scenario.
Bugs in chatbots are invisible—hidden in prompts, retrieval logic, or generation, not code.
Context retention across conversations matters more than isolated correct answers in chatbot testing.
Stress testing a chatbot means probing it with typos and frustrated wording, not performance load.
Manual testing remains essential; no single tool automates complete chatbot quality verification yet.

Testing Chatbots: Navigating the Challenges of Invisible Bugs

Unlike traditional software, chatbots introduce layers of unpredictability and complexity that throw familiar testing approaches out the window. In a conversation hosted at the TestWarez conference, Dušanka Lečić, a seasoned test lead, shared her journey through the challenges of chatbot testing—a journey marked by invisible bugs, unexpected failure types, and an evolving need for specialized tools and strategies.

Chatbots, especially those that integrate retrieval-augmented generation (RAG) systems, don’t play by the usual rules. While traditional apps might serve up predictable, repeatable outputs tied to specific inputs, chatbots can offer a range of valid responses for the same query. This non-determinism shifts the focus to how well chatbots preserve context, handle ambiguity, minimize frustration, and maintain accuracy—none of which are reliably covered by customary pass/fail scenarios.
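One way to picture a test that tolerates several valid answers is to compare a response against a set of acceptable references rather than one exact string. The sketch below is only an illustration and is not taken from the episode: the function names, the sample references, and the 0.5 threshold are all assumptions, and plain token overlap stands in for the embedding-based semantic similarity a real project would more likely use.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for a semantic embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Token overlap between two strings, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def is_acceptable(response: str, references: list[str],
                  threshold: float = 0.5) -> bool:
    """Pass if the response is close enough to ANY acceptable reference."""
    return any(jaccard(response, ref) >= threshold for ref in references)

# Hypothetical reference answers for a single user query.
REFERENCES = [
    "You can return items within 30 days of purchase.",
    "Returns are accepted up to 30 days after you buy the item.",
]

if __name__ == "__main__":
    print(is_acceptable("Items can be returned within 30 days of purchase.", REFERENCES))
    print(is_acceptable("Our store opens at 9 am.", REFERENCES))
```

The threshold is a judgment call: set it too high and valid paraphrases fail, too low and irrelevant answers pass, which is one reason human review stays in the loop.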

Invisible Bugs: Understanding the New Defects

In classic software, bugs are often embedded in code and traceable through source files and logs. With chatbots, however, Dušanka Lečić emphasized that bugs are often “invisible.” These defects can emerge from various non-obvious sources:

Retrieval logic: Flaws in how the chatbot fetches data from databases or knowledge sources can generate incorrect or irrelevant responses.
Prompt structure: Subtle mistakes or inconsistencies in prompts can mislead even well-trained models.
Response generation: The model may hallucinate or invent answers based on ambiguous queries, poor chunking, or incomplete training data.

These origins make troubleshooting especially demanding, as issues may not be evident in the code but manifest in the interplay between data, queries, and the model’s learning process.
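To make the idea of tracing a defect to its origin more concrete, here is a minimal sketch that assumes each chatbot turn is logged together with the chunks retrieved for it. The `Trace` structure, the `locate_failure` helper, and the substring check are hypothetical simplifications; a real pipeline would need a more forgiving comparison than exact substring matching.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    """One logged chatbot turn (hypothetical structure)."""
    query: str
    retrieved_chunks: list[str]
    answer: str

def locate_failure(trace: Trace, expected_fact: str) -> str:
    """Name the stage that most likely lost the expected fact."""
    fact = expected_fact.lower()
    in_chunks = any(fact in chunk.lower() for chunk in trace.retrieved_chunks)
    in_answer = fact in trace.answer.lower()
    if in_answer:
        return "ok"
    if not in_chunks:
        # The fact never reached the model: suspect retrieval or chunking.
        return "retrieval"
    # The fact was retrieved but dropped or altered: suspect prompt or generation.
    return "prompt_or_generation"

if __name__ == "__main__":
    trace = Trace(
        query="How long do refunds take?",
        retrieved_chunks=["Refunds take 5 business days."],
        answer="Refunds take 2 weeks.",
    )
    print(locate_failure(trace, "5 business days"))
```

Even this rough split tells a tester where to look first: at the knowledge base and chunking when the fact was never retrieved, or at the prompt and the model when it was retrieved and then lost.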

Shifting from Traditional to Hybrid Testing Approaches

Because there is rarely a single valid response to any prompt, Dušanka Lečić described embracing a hybrid approach:

Manual Exploration: Carefully designed manual scenarios reveal issues in context retention, hallucination, and overall user satisfaction. Testers might assess how well a chatbot remembers earlier conversation threads, avoids repeating itself, and produces answers tailored to user needs.

Automated Checks: While automation eases regression and bulk validation, it only partially addresses the intricacies of conversational AI. Automated routines can check tolerance for spelling errors, handling of ambiguous requests, and performance limits, but nuanced failures often still demand human insight.

Stress Testing Redefined: Stress testing for chatbots means confronting them with misspellings, ambiguous phrases, and real user frustration—not just flooding them with traffic. The aim is to see how resilient and forgiving the chatbot is in genuine conversation.
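The kind of noisy input described above can be generated automatically. The following sketch builds misspelled and frustrated variants of one query so that each can be sent to the chatbot and the answers compared. The helper names, the typo rate, and the frustrated wording are illustrative assumptions, not a method described in the episode.

```python
import random

def add_typos(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap adjacent letters at roughly `rate` of positions, reproducibly."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)

def frustrated(text: str) -> str:
    """Wrap a query in wording an annoyed user might use (illustrative)."""
    return f"I already asked this twice!! {text.upper()}"

def variants(query: str, n: int = 3) -> list[str]:
    """Build n misspelled variants plus one frustrated variant of a query."""
    noisy = [add_typos(query, rate=0.15, seed=s) for s in range(n)]
    return noisy + [frustrated(query)]

if __name__ == "__main__":
    for variant in variants("how do I reset my password"):
        print(variant)
```

Seeding the random generator keeps the variants reproducible, so a failure found on one run can be replayed after a prompt or retrieval change.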

Rethinking Test Planning and Documentation

With so many possible “correct” outputs for a single query, keeping track of test cases and documenting new findings is more involved than ever. Dušanka Lečić highlighted how AI can assist here, helping generate test plans or record conversation paths. Nevertheless, documentation remains a challenge, particularly when every tweak to the underlying model or data can alter chatbot behavior in unpredictable ways.

Leveraging test management tools with AI features, or automating repetitive documentation tasks, can ease some burdens, but testers are still left cataloging a web of positive and negative cases that balloons well beyond what’s typical in traditional apps.

Tooling Woes: The Missing “Magic Tool”

Amid these innovations, one thing remains clear: the right toolchain is still missing. Dušanka Lečić noted the absence of a comprehensive, all-in-one tool for chatbot testing. While specialized tools like Ragas exist for testing specific aspects of RAG systems, there’s still a heavy reliance on manual work and cobbled-together solutions.

Testers juggle a mixture of in-house scripts, manual procedures, and partial automation—leaving room for improvement as the ecosystem matures. The hope is that as chatbot technology and quality practices evolve, so too will the suite of dedicated tools.

Continuous Learning and Collaboration

To keep up, teams must commit to ongoing learning and collaboration. Dušanka Lečić and her colleagues invest in research articles, conference talks, and hands-on experimentation, sharing insights and strategies internally.

This openness is crucial not just for spotting new bug types, but for keeping testing relevant as AI-driven systems become more central to the user experience.

Chatbot testing challenges conventional software QA models. Bugs lurk outside the code, responses are unpredictable, and context is king. Success in this evolving space requires both the creativity to invent hybrid testing strategies and the humility to document the unknown. As the industry grows, so will the tools and tactics testers use to ensure these invisible bugs don’t go unnoticed—and users receive reliable, relevant answers every time.