Flaky tests destroy trust in CI systems and lengthen feedback loops, but deleting them immediately might mean missing critical production issues. The flakiness often signals real problems – transient network errors, race conditions, or dependency failures – that production code will eventually face. Effective strategies include running tests 100 times before allowing them into the build, maintaining a separate blocklist file instead of scattered ignore annotations, and assigning individual ownership to ensure flaky tests get resolved rather than forgotten. When a test remains unstable despite investigation, deletion becomes a valid option – especially when underlying risks are already covered by smaller, more reliable tests.
In this episode, I talk with Simon Stewart, professional software developer and former lead of the Selenium project for over 10 years, about one of the most frustrating problems in software testing: flaky tests. Simon reveals why a flaky test isn't always a bad test – sometimes it's actually exposing real production risks that your team needs to address. We dive into practical strategies for handling flakiness in CI pipelines, from gatekeeping techniques used at Meta to knowing when it's actually okay to delete tests. You'll learn why assigning ownership to individuals (not teams) is crucial, and how to use test flakiness as valuable signal rather than just noise.
"A flaky test can actually sometimes be a good test because it's highlighting things." - Simon Stewart
Simon Stewart has been a professional software developer since before the millennium began. He was the lead of the Selenium project for over a decade and is the co-editor of the W3C WebDriver and WebDriver BiDi specs.

As well as browser automation, Simon is also interested in monorepos, blazing fast byte-for-byte reproducible builds, and scaling software development efficiently. He draws on his experience working in Open Source, ThoughtWorks, Google, and Facebook. He was the tech lead of Facebook’s build tool team, and is currently working on projects using Bazel, for which he’s the maintainer of several rulesets.

Simon lives in London with his family and dog.
Flaky tests are a nearly universal pain point for modern software teams. During a lively episode of Software Testing Unleashed at the HUSTEF Conference in Budapest, Simon Stewart—best known for leading the Selenium project—joined Richie for an honest look at what flaky tests actually are, why they matter far more than we often admit, and how teams can address them for meaningful software quality improvements.
Many teams treat flaky tests as little more than background noise: annoying, unpredictable, and best ignored. Simon Stewart brought much-needed clarity, defining flaky tests not by how irritating they are, but by their failure to be repeatable. In other words, a flaky test is one that sometimes passes and sometimes fails, even when nothing in the code or environment has changed. This undermines repeatability, a cornerstone of valuable automated testing.
A truly valuable test is self-checking, fast, isolated, and—most importantly—repeatable. If any of those pillars are missing, the feedback loop for developers—and the trust in the system—breaks down quickly.
The problem with flaky tests runs deeper than simply irritating engineers or slowing pull request reviews. As Simon Stewart explained, “Anything which doesn't give consistent signal, be that a pass or a fail, destroys trust”. When developers no longer trust the results of their CI pipeline, they lose confidence that automated checks offer meaningful protection. People stop checking CI statuses, treat green builds with suspicion, and waste time re-running builds until they "get lucky." These scenarios lengthen feedback loops and undermine the entire automation investment.
While the symptoms of flakiness are obvious, root causes are many and varied. According to Simon Stewart, the top two patterns are:
Concurrent Access to Shared Data: When tests or systems access shared (often mutable) state—such as static variables in Java or shared database tables—order of execution affects outcomes.
Race Conditions & Timing Problems: Certain bugs only reveal themselves occasionally, often due to timing mismatches, network dependencies, or APIs behaving inconsistently.
These are compounded by realities such as unreliable third-party services, network blips, dependency outages, and tricky setups (like clock drift or timestamp rounding errors).
Remove Known Flaky Tests From CI Immediately: The first defense is clarity: flaky tests must be taken out of the mainline CI runs without delay to preserve signal. Use annotations or, better yet, a centralized "skipped tests" file. In projects like Selenium, a single text file lists all flaky tests, simplifying both tracking and eventual rehabilitation.
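A centralized skip list can be as simple as a text file checked into the repo. The sketch below assumes a hypothetical format (one test name per line, `#` for comments, optional trailing notes); the file name and layout are illustrative, not Selenium's exact convention.

```python
from pathlib import Path

def load_skipped(path="skipped_tests.txt"):
    """Return the set of test names listed in the blocklist file."""
    entries = set()
    p = Path(path)
    if not p.exists():
        return entries
    for line in p.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            # First token is the test name; anything after it
            # (e.g. an owner's name) is treated as a note.
            entries.add(line.split()[0])
    return entries

def should_run(test_name, skipped):
    return test_name not in skipped
```

Because the blocklist lives in one file, a code review of any change to it is trivial, and "how many tests are we skipping?" is answerable at a glance.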
Assign Ownership: Unowned work rarely gets done. Every skipped or ignored test should be owned by a real person, not a team or "the commons," or it risks languishing indefinitely.
Investigate or Delete: Once a flaky test is out of rotation, run it repeatedly to try to identify the issue and write small, targeted tests to validate underlying faults. If a test cannot be fixed or loses relevance, consider deleting it—source control allows recovery if truly necessary.
Use AI Thoughtfully: AI tools can support debugging by reviewing intricate code, but always require human oversight. Trust, but verify—sometimes a second AI opinion helps validate suggestions.
Simon Stewart underscored prevention: never allow a flaky test into the build. Gatekeeping through repeated pre-merge test runs (Meta runs some tests 100 times before “promotion” to CI) is highly effective. Even smaller organizations can adopt similar approaches by running new or changed tests multiple times before accepting them.
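The gatekeeping idea can be sketched as a small harness: run a candidate test N times and only promote it to CI if every run passes. The default of 100 runs mirrors the Meta practice mentioned above; the function name and return shape are assumptions for illustration.

```python
def promote_if_stable(test_fn, runs=100):
    """Run test_fn `runs` times; return (promoted, failure_count).

    A single failure is enough to block promotion -- the whole point
    is that a flaky test never reaches the mainline CI signal.
    """
    failures = 0
    for _ in range(runs):
        try:
            test_fn()
        except AssertionError:
            failures += 1
    return failures == 0, failures
```

A deterministic test promotes cleanly; a test that fails even occasionally is caught before it can pollute CI.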
Unfixed tests that linger serve no one. Many teams apply a time-to-live (TTL): if a flaky test isn’t restored to stability in a set period, delete it. Also, if coverage is duplicated elsewhere—particularly in lower, more stable layers—consider retiring that test.
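A TTL policy is straightforward to automate if each blocklist entry records when the test was skipped. The sketch below assumes a mapping of test name to skip date and a 30-day TTL; both the data shape and the threshold are illustrative assumptions, not a standard.

```python
from datetime import date, timedelta

def expired_entries(entries, ttl_days=30, today=None):
    """Return names of tests skipped longer than ttl_days.

    These are the candidates for deletion (or escalation to their
    owner) -- entries is a dict of test name -> date it was skipped.
    """
    today = today or date.today()
    cutoff = today - timedelta(days=ttl_days)
    return [name for name, skipped_on in entries.items() if skipped_on < cutoff]
```

Wiring this into a scheduled CI job turns the TTL from a team agreement into an enforced policy.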
Flaky tests reveal true system weaknesses, degrade trust, and lead to wasted engineering hours if not addressed systematically. Prioritize feedback stability, transparency about ignored tests, individual ownership, and gatekeeping strategies to reclaim your CI signal—and your developers’ confidence.