What are AI Agents and how do they impact the future of software testing?

AI Agents are intelligent systems that perform software testing tasks autonomously or with minimal human intervention. They improve the efficiency, accuracy, and scalability of testing processes, which is why they are expected to shape the future of software testing.

What types of AI Agents are commonly used in software testing?

In software testing, common types of AI Agents include rule-based agents that follow predefined rules, machine learning agents that learn from data patterns, and hybrid agents combining both approaches to optimize testing outcomes.

What are the key benefits of using AI Agents in software testing?

AI Agents offer numerous advantages such as faster test execution, improved defect detection, enhanced test coverage, and the ability to handle complex testing scenarios more effectively than traditional methods.

What challenges might organizations face when implementing AI Agents for testing?

Organizations may encounter challenges like ensuring high-quality data for training AI models, addressing the need for skilled personnel to manage AI systems, and overcoming integration complexities within existing testing frameworks.

How can trustworthiness and reliability be ensured when using AI Agents for testing?

Trustworthiness can be supported by implementing rigorous validation methods, maintaining transparency through explainability techniques, continuously monitoring AI performance, and adhering to quality standards so that test results from AI Agents stay reliable.

What is the evolving role of humans in software quality assurance alongside AI Agents?

While AI Agents automate many aspects of software testing, humans remain essential for overseeing complex decision-making, interpreting AI outputs, managing exceptions, and fostering collaboration between human expertise and machine intelligence to ensure overall software quality.

AI Agents & the Future of Testing

AI agents in software quality are autonomous software components that handle tasks inside a pipeline or business process, replacing or augmenting human work. Testing them requires treating AI like a new hire: a selection process, a probation period, and continuous performance evaluation. Because AI output is stochastic and never identical twice, stored prompt logs, automated comparison checks, and human review loops are the core quality controls.

Key Takeaways

AI agents need a hiring and continuous performance evaluation process, just as human employees go through selection, probation, and ongoing review before being trusted with business tasks.
Storing every prompt that generates an outcome in a log enables audit trails and statistical analysis, making it possible to trace why an agent’s output changed over time.
Testing AI output can be done by using additional AI models to evaluate results: if a clear majority agree the output is good, it is likely acceptable, and disagreement signals a problem.
Keeping business process knowledge in-house is a competitive requirement because handing that knowledge to an outside AI provider strips the company of its core differentiator.

AI agents are your new workforce, not just another piece of software

Treat an AI agent the way you would treat a new colleague, not the way you treat a webpage or a billing system. That shift in perspective changes how you build trust, how you evaluate output, and how you decide whether the agent stays.

A human hire goes through selection, interviews, and a probation period. After that, performance gets watched, and a bad fit can be let go. Trust in people is never blind. It rests on entry criteria and continuous evaluation.

Szilard Szell argues that AI deserves the same treatment. An agent needs a hiring process, a selection process, and ongoing performance checks. If an agent stops performing, you evaluate why and you move on to something better.

There is a financial reason to keep evaluating rather than discard the agent outright. AI is expensive enough that throwing it away rarely makes sense, especially when the flaw is fixable. Often the fix is a changed prompt, a different input format, or a review of the guardrails, the memory, or the tools the agent can reach.

Why “input in, expected output” breaks down with AI

The classic tester’s contract is simple: you put something in, you expect a defined result, and you compare. That contract does not hold for AI. With a stochastic system, you cannot rely on getting the same result twice, so exact comparison collapses.

Asked directly whether testers can guarantee an AI works well, Szilard answered in short: we can’t. That honesty is the starting point, not a dead end.

What replaces the fixed expectation is a semantic one. A tester can describe what good looks like in words, and AI is strong at comparing that description against an actual output. You move from exact matching to evaluating fit.

You can also let multiple AIs check the output of one. If eight out of ten judge a result good, it is probably good enough. If many flag a problem, there is a problem worth investigating.
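The panel idea can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `panel_verdict`, the `accept_ratio` default of 0.8 (taken from the "eight out of ten" rule of thumb), and the three stand-in judges are all assumptions made for the example. In a real setup each judge would wrap a call to a separate model and compare the output against the written description of what good looks like.

```python
from typing import Callable, Sequence

# A judge takes (criteria, output) and returns True if the output fits.
# Here judges are plain functions so the sketch runs without any model.
Judge = Callable[[str, str], bool]


def panel_verdict(criteria: str, output: str, judges: Sequence[Judge],
                  accept_ratio: float = 0.8) -> dict:
    """Ask every judge, then accept only if enough of them approve."""
    if not judges:
        raise ValueError("at least one judge is required")
    votes = [judge(criteria, output) for judge in judges]
    approvals = sum(votes)
    ratio = approvals / len(votes)
    return {
        "approvals": approvals,
        "total": len(votes),
        "ratio": ratio,
        "accepted": ratio >= accept_ratio,
    }


# Hypothetical stand-in judges: simple heuristics instead of model calls.
def mentions_refund(criteria: str, output: str) -> bool:
    return "refund" in output.lower()


def is_polite(criteria: str, output: str) -> bool:
    text = output.lower()
    return "sorry" in text or "please" in text


def is_short(criteria: str, output: str) -> bool:
    return len(output.split()) <= 40


if __name__ == "__main__":
    judges = [mentions_refund, is_polite, is_short]
    reply = "We are sorry. Your refund is on its way."
    print(panel_verdict("Apologise and confirm the refund.", reply, judges))
```

A low ratio does not say what is wrong, only that the output deserves a human look.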

How to test something that answers differently every time

Every AI output needs to be checked and evaluated, not assumed correct. When you track results over time and the output shifts, you investigate the cause, whether the new result looks better or worse.
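One simple way to notice a shift is to compare recent evaluation scores with the earlier baseline. The sketch below assumes you already record a numeric score per run (for example, the share of judges that approved); the function name `detect_shift`, the window of 5 runs, and the tolerance of 0.1 are illustrative choices, not recommendations.

```python
from statistics import mean
from typing import Optional, Sequence


def detect_shift(scores: Sequence[float], window: int = 5,
                 tolerance: float = 0.1) -> Optional[dict]:
    """Compare the mean of the latest `window` scores with all earlier ones.

    Returns None when there is too little history to compare.
    Flags movement in either direction, because an unexplained
    improvement deserves investigation as much as a regression.
    """
    if window < 1 or len(scores) < 2 * window:
        return None
    baseline = mean(scores[:-window])
    recent = mean(scores[-window:])
    delta = recent - baseline
    return {
        "baseline": baseline,
        "recent": recent,
        "delta": delta,
        "shifted": abs(delta) > tolerance,
    }


if __name__ == "__main__":
    history = [0.9] * 10 + [0.6] * 5  # scores dropped in the last five runs
    print(detect_shift(history))
```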

A shift can have several sources. The underlying LLM may have been updated in the background. The memory may have been wiped or changed. Without logs from the system, you cannot tell which.

Store the prompts. Szilard makes a firm case that every prompt generating an outcome belongs in a log. That gives you an audit trail to trace back, and it lets you analyze prompts statistically: how they changed, what happened afterward, where quality improved or degraded.
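A prompt log can be as plain as an append-only JSON Lines file. The following is a sketch under assumptions: the field names, the `log_prompt` and `prompt_versions` helpers, and the choice to record a model identifier are all illustrative. Recording the model alongside the prompt is what later lets you tell a prompt change apart from a background model update.

```python
import hashlib
import json
import time
from pathlib import Path


def log_prompt(log_path: Path, prompt: str, output: str,
               model: str, agent: str) -> dict:
    """Append one prompt/output pair to a JSON Lines audit log."""
    record = {
        "ts": time.time(),
        "agent": agent,
        "model": model,  # helps spot a background model update later
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt": prompt,
        "output": output,
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def read_log(log_path: Path) -> list:
    """Load every record from the log, skipping blank lines."""
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def prompt_versions(records: list) -> dict:
    """Count how often each distinct prompt text (by hash) was used."""
    counts: dict = {}
    for rec in records:
        key = rec["prompt_sha256"]
        counts[key] = counts.get(key, 0) + 1
    return counts
```

The hash makes it cheap to group runs by exact prompt text, which is the starting point for asking how a prompt changed and what happened to quality afterward.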

Reuse the quality practices you already have

You do not need to invent quality assurance for AI from scratch. The knowledge testers already hold transfers directly to agents and to agentic workflows where several agents work together.

Keep a human in the loop by applying your review practices to the agent’s output. Run your low-level testing and code analysis tools against generated code. A noticeable share of AI-generated code carries security vulnerabilities, and vulnerability scanning tools catch them.

The feedback loop is the point. When you feed the scan results back to the AI, it gets better at producing code without those vulnerabilities. The CI/CD pipeline and your existing QA practices are exactly the place to plug agents in.
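The loop described above might look like the sketch below. Everything named here is an assumption for illustration: `generate` stands for whatever produces code from a prompt, `scan` stands for your vulnerability scanner returning a list of findings, and the fake implementations only exist so the example runs on its own. It shows feedback within a single task, bounded by a maximum number of rounds so a stubborn finding ends up with a human rather than in an endless retry.

```python
from typing import Callable, List

Generator = Callable[[str], str]      # prompt -> generated code
Scanner = Callable[[str], List[str]]  # code -> list of findings


def generate_with_scan_feedback(task: str, generate: Generator,
                                scan: Scanner, max_rounds: int = 3) -> dict:
    """Generate code, scan it, and feed findings back until clean."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    prompt = task
    history = []
    code = ""
    for round_no in range(1, max_rounds + 1):
        code = generate(prompt)
        findings = scan(code)
        history.append({"round": round_no, "findings": findings})
        if not findings:
            return {"code": code, "clean": True, "history": history}
        # Put the scanner's findings into the next prompt.
        bullet_list = "\n".join(f"- {item}" for item in findings)
        prompt = (f"{task}\n\nFix these findings from the previous "
                  f"attempt:\n{bullet_list}")
    return {"code": code, "clean": False, "history": history}


# Hypothetical stand-ins so the sketch is runnable without real tools.
def fake_generate(prompt: str) -> str:
    if "eval" in prompt:  # the finding was fed back
        return "value = int(user_input)"
    return "value = eval(user_input)"


def fake_scan(code: str) -> List[str]:
    return ["use of eval() on untrusted input"] if "eval(" in code else []


if __name__ == "__main__":
    result = generate_with_scan_feedback(
        "Parse the number typed by the user.", fake_generate, fake_scan)
    print(result["clean"], len(result["history"]))
```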

Design the agent like you onboard a new hire

Building an AI agent comes down to two problems: communication and correctness.

The communication problem is the same one you face with a newcomer. You have to state the task clearly, give the context, describe the expected behaviors and ways of working, and provide the input. You also define the output you expect. Good examples matter, because the agent learns from them, mimics them, and returns more correct answers.
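The parts of that briefing can be captured in a small structure, so a missing piece is visible before the agent ever runs. This is a sketch only: the `AgentBrief` class, its field names, and the rendered section labels are assumptions made for the example, not a standard prompt format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AgentBrief:
    """The briefing you would give a newcomer, written down explicitly."""
    task: str
    context: str
    ways_of_working: List[str]
    output_format: str
    examples: List[Tuple[str, str]] = field(default_factory=list)

    def missing_parts(self) -> List[str]:
        """Name the briefing parts that are still empty."""
        checks = {
            "task": self.task.strip(),
            "context": self.context.strip(),
            "ways_of_working": self.ways_of_working,
            "output_format": self.output_format.strip(),
            "examples": self.examples,
        }
        return [name for name, value in checks.items() if not value]

    def render(self, input_text: str) -> str:
        """Turn the brief plus one concrete input into a prompt string."""
        rules = "\n".join(f"- {rule}" for rule in self.ways_of_working)
        parts = [
            f"Task:\n{self.task}",
            f"Context:\n{self.context}",
            f"Ways of working:\n{rules}",
            f"Expected output:\n{self.output_format}",
        ]
        for given, expected in self.examples:
            parts.append(
                f"Example input:\n{given}\nExample output:\n{expected}")
        parts.append(f"Input:\n{input_text}")
        return "\n\n".join(parts)
```

Calling `missing_parts()` before deployment works like a checklist: a brief without examples still renders, but you know you skipped the step.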

The correctness problem is the test oracle problem. How do you know what good looks like when the system’s output is stochastic? You answer it with semantic descriptions, multiple checkers, and continuous evaluation rather than a single fixed assertion.

Memory is where design gets dangerous. An agent stores information between sessions and updates its memory from feedback. If you do not understand how that works, the agent may pull context from memory that does not belong to the current task.

So you decide deliberately when the agent updates its memory and when it purges it. You also decide what it should never hold, such as passwords, bank data, or account details.
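Those decisions can be made explicit in code. The sketch below is an assumption-laden illustration: the `AgentMemory` class is hypothetical, memory is scoped per task so context cannot leak between tasks, and the two regular expressions are deliberately simple examples of "never hold" rules. A real deployment would need a vetted set of detectors.

```python
import re
from typing import Dict, List

# Illustrative patterns only, not a complete list of sensitive data.
FORBIDDEN = {
    "password": re.compile(r"(?i)\bpassword\s*[:=]\s*\S+"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


class AgentMemory:
    """Task-scoped memory with explicit store, recall, and purge rules."""

    def __init__(self) -> None:
        self._items: Dict[str, List[str]] = {}

    def remember(self, task_id: str, text: str) -> bool:
        """Store text for a task unless it matches a forbidden pattern.

        Returns True if stored, False if rejected.
        """
        if any(pattern.search(text) for pattern in FORBIDDEN.values()):
            return False
        self._items.setdefault(task_id, []).append(text)
        return True

    def recall(self, task_id: str) -> List[str]:
        """Return only what belongs to the given task."""
        return list(self._items.get(task_id, []))

    def purge(self, task_id: str) -> None:
        """Drop everything held for a task, for example when it closes."""
        self._items.pop(task_id, None)
```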

Build your own AI, do not buy it off the shelf

You should own the development of your AI agents, because they run on your knowledge and your business processes. Outside help is fine, but the secret sauce stays yours.

European bureaucracy turns out to be an asset here. Process descriptions, role descriptions, and value stream maps are already written down, and you can reuse them to spot which activities are worth augmenting with an agent and where the return is highest.

The richer source is people. Interview a handful of practitioners and they will tell you how they actually do the work, the best practices, the small things they care about. That is what you feed into the agent.

Consider what is already gone. Your data sits in the cloud, so it has been handed over in part. Give away your processes on top of that, and the question becomes what your company still is.

An off-the-shelf agent works as a starting point. Take it, but understand how it works and improve from there rather than treating it as finished.

Do we actually trust humans more than AI?

The trust debate has a blind spot. We declare that we do not trust AI while assuming we trust humans, and that assumption does not survive scrutiny.

We never extended humans unconditional trust either. We built selection gates, probation, and continuous performance review precisely because trust has to be earned and maintained. Apply the same machinery to AI and the trust question becomes manageable.

The real fear is a different one. Szilard is open about it: the moment an AI boss starts telling him what to do unsettles him more than an AI colleague does.

DevOps 2.0: swarms of small agents solving big problems

The next phase is DevOps 2.0, the same hyper-fast feedback cycles and ways of working, now augmented with AI agents. Expect many small agents, each handling a small but clever task, each carrying a kind of CV that states what it is good for.
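One way to picture an agent CV is as a record of declared skills that a coordinator can match against a job. The sketch below is purely illustrative: `AgentCV`, `assemble_swarm`, and the skill names are invented for the example, and the greedy selection is just one simple strategy, not how any particular agent framework does it.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class AgentCV:
    """What a small agent declares it is good for."""
    name: str
    skills: FrozenSet[str]


def assemble_swarm(needed: Set[str],
                   agents: List[AgentCV]) -> Tuple[List[AgentCV], Set[str]]:
    """Greedily pick agents until the needed skills are covered.

    Returns the chosen team and any skills nobody could cover, so a
    gap is reported instead of silently ignored.
    """
    remaining = set(needed)
    team: List[AgentCV] = []
    while remaining:
        best = max(agents, key=lambda a: len(a.skills & remaining),
                   default=None)
        if best is None or not (best.skills & remaining):
            break  # no agent covers anything that is still missing
        team.append(best)
        remaining -= best.skills
    return team, remaining


if __name__ == "__main__":
    roster = [
        AgentCV("listener", frozenset({"transcription", "sentiment"})),
        AgentCV("planner", frozenset({"feature-proposal"})),
        AgentCV("tester", frozenset({"test-execution"})),
    ]
    team, gaps = assemble_swarm(
        {"sentiment", "feature-proposal", "deployment"}, roster)
    print([a.name for a in team], gaps)
```

Returning the uncovered skills is the part that keeps you in control: you can see what the swarm cannot do before it starts.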

These agents form swarms that combine to solve large problems. The risk is not being “vibed out” by automation. The risk is losing sight of what is happening and why, and no longer being in control. Szilard frames this as the next step in an evolution where we have already given away a lot of that control.

So agents, focusing on solving small tasks, but clever, smart tasks, working together, solving big problems. — Szilard Szell

Picture the loop end to end. An agent listens to a customer feedback call and reads the tone of voice, even whether the caller was angry. It proposes the next feature or improvement, which moves through a chain of agents.

The change runs through automated tests and checks, then deploys. It can go out as an A/B test or a canary release for a persona group resembling the caller, selected by AI. An agent might even run crowd testing with people similar to the original complainant.

Within hours or days, a proposal runs in production. Continuous monitoring, telemetry, and observability tell you whether the change is better or worse, the same way they do for any human-made change. If it works, it stays.
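The "if it works, it stays" decision can be reduced to a small rule over telemetry. The sketch below is a simplification under assumptions: it compares only error rates, and the function name `canary_decision`, the allowed increase of one percentage point, and the minimum sample size of 100 are illustrative values you would tune for your own system.

```python
def canary_decision(baseline_errors: int, baseline_total: int,
                    canary_errors: int, canary_total: int,
                    max_increase: float = 0.01,
                    min_samples: int = 100) -> str:
    """Decide whether a canary release stays, based on error rates.

    Returns "wait" when there is too little canary traffic to judge,
    "rollback" when the canary's error rate exceeds the baseline by
    more than `max_increase`, and "keep" otherwise.
    """
    if baseline_total <= 0:
        raise ValueError("baseline_total must be positive")
    if canary_total < min_samples:
        return "wait"
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    if canary_rate - baseline_rate > max_increase:
        return "rollback"
    return "keep"


if __name__ == "__main__":
    print(canary_decision(50, 10_000, 3, 500))   # similar error rate
    print(canary_decision(50, 10_000, 20, 500))  # clearly worse
    print(canary_decision(50, 10_000, 1, 10))    # not enough traffic yet
```

The same check applies whether a human or an agent proposed the change, which is the point of treating both alike.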

What testers should learn now to stay in charge

Start with how quality is built from the very beginning, and what actually affects it. From there, learn how agents and agentic workflows operate and where their risks sit.

Apply your risk management skills up front, before deployment. Once the AI is in production, keep the ability to evaluate how it performs and to react fast. If features start getting worse, you need to be able to switch the code back to the original version, so build those control points in from the start.

The personal move matters as much as the organizational one. Get your own assistant, your own agent, and grow more hands for yourself.

So have the control points in there, but learn how AI works and learn how AI works for you. — Szilard Szell