Why Your CI Pipeline Is Lying to You

Flaky tests are automated tests that fail the repeatability property: they pass sometimes and fail other times for the same code, without any change to the codebase. This destroys trust in CI (continuous integration) pipelines, because engineers can no longer tell whether a failure signals a real bug. The two most common causes are concurrent access to shared, mutable state in memory and tests that change shared external state, such as a database.

Key Takeaways

A flaky test breaks the foundational property of repeatability, and once developers stop trusting CI results, the entire value of continuous integration collapses.
Flaky tests sometimes expose real production risks: transient network errors, unavailable dependencies, and race conditions in tests are the same failures users will encounter in the live system.
Removing a flaky test from CI rotation is necessary but not sufficient; ownership must be assigned to a named individual, not a team, or the fix will never happen.
Meta’s practice of running every new test a hundred times concurrently overnight before admitting it to CI is a concrete, automated gatekeeping strategy that prevents flaky tests from entering the build at all.
Deleting a flaky test is a legitimate resolution, especially when lower-level tests already cover the same risk, because shipping changes quickly and safely matters more than preserving test code.

What makes a test flaky

A flaky test is one that breaks the rule of repeatability. Run a good test again and again against the same code, and it gives the same answer every time: always a pass or always a fail. A flaky test does not. It passes, it fails, it passes, it passes, it fails. That inconsistency is the whole problem.

Good tests share a few qualities. They are self-checking, so no human has to confirm the result. They are fast, because feedback should come back quickly. They are isolated, so they can run concurrently without stepping on each other. And they are repeatable. Flakiness attacks that last quality and nothing else.

The damage shows up in your continuous integration system. You set up CI for one reason: to verify whether a change is safe. When the build goes green, every automated check says the change is as safe as it can be. A flaky test poisons that signal. You stop knowing whether red means a real failure or just noise.

Why flaky tests cost you trust, not just time

The deeper cost of a flaky test is trust in the CI pipeline. Anything that fails to give consistent signal, whether pass or fail, erodes that trust. Once you no longer believe the build, you might as well not run it.

The time loss is real too. Most teams have lived through the ritual: the build goes red, someone shrugs and says “oh yeah, it just does that,” and reruns it. Sometimes it passes, sometimes it fails. Only on the second or third run does anyone start to wonder if this is a genuine failure. By then you have stretched your feedback loop and burned the value CI was supposed to give you.

Fast feedback on a change is one of the central aims of software engineering. Flaky tests work directly against that aim.

A flaky test can be a useful test

Flakiness does not automatically mean the test is bad. Sometimes it is pointing at something worth knowing. When the test passes, it tells you that when everything lines up, the code works. When it fails, it may be exposing a real weakness in your system.

The cause of the failure is rarely the test framework itself. More often it is a transient network error, a dependency that is briefly down, a database or message queue that is unavailable, or a race condition. Those are exactly the conditions your production code will face once it is in users’ hands.

So a flaky test can be a blessing. It pushes you to ask why the test is failing and whether your code is robust enough to survive the same situation in production. You cannot expect 100 percent uptime from your dependencies. If a transient failure breaks your test, the same failure will reach your users.

The common reflex is wrong. Developers find a flaky test and blame the test or the infrastructure: “it should have just worked.” Often the honest answer is that the code should have handled the hiccup and did not.
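One way code can absorb a transient hiccup is a bounded retry with backoff around the dependency call. The following is a minimal sketch, not a prescription: `TransientError` and `call_with_retry` are hypothetical names invented for illustration, and the attempt count and delays are assumptions you would tune for your own system.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff.

    Re-raises the last TransientError once the attempts are exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then twice that, then four times that, ...
            sleep(base_delay * (2 ** attempt))
```

Passing `sleep` as a parameter lets a test replace the real delay with a recorder, so the retry logic itself can be tested quickly and repeatably.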

Why flaky tests are hard to fix

Tracking down the cause of flakiness takes real time, and that is why engineers avoid it. A repeatable failure is far easier to debug than one that surfaces once in every twenty runs. When the symptom barely shows up, the investigation drags.

Two causes dominate. The first is concurrent access to shared, mutable state. In Java, for example, a static value lives across the entire JVM. One test modifies it, a later test reads or changes it, and everything is fine in that order. Run the tests in a different order, or in parallel, and failures appear out of nowhere.
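The same order dependence can be reproduced in a few lines. This sketch uses Python rather than Java, with a module-level list standing in for a static field. All names here (`_registry`, `register`, the two `check_` functions, `run_in_order`) are hypothetical and exist only for illustration.

```python
# Module-level state plays the role of a Java static: one copy per process.
_registry = []


def register(name):
    """Add a name to the shared registry and return its 1-based position."""
    _registry.append(name)
    return len(_registry)


def check_first_registration_gets_id_one():
    # Passes only if nothing has touched _registry before this check runs.
    assert register("alice") == 1


def check_registering_two_users():
    register("bob")
    assert register("carol") >= 2


def run_in_order(checks):
    """Start from an empty registry, run the checks, and list the failures."""
    _registry.clear()
    failed = []
    for check in checks:
        try:
            check()
        except AssertionError:
            failed.append(check.__name__)
    return failed


declared_order = run_in_order(
    [check_first_registration_gets_id_one, check_registering_two_users]
)
reversed_order = run_in_order(
    [check_registering_two_users, check_first_registration_gets_id_one]
)
```

In the declared order nothing fails. In the reversed order the first check fails, even though neither check changed. Giving each check its own registry, instead of sharing one, removes the dependence.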

The second is changing shared state in something like a database. If the first line of your test drops everything in a table while another test is midway through modifying that same table, you have a disaster. A reliable way to spot both patterns: the test passes in isolation but fails when run alongside others. That is your signal that you have found something.
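The database case can be sketched with an in-memory SQLite database. This is an illustration under stated assumptions: the `orders` table and the helper names are hypothetical, and the interleaving of two tests is simulated deterministically with a callback rather than with real concurrency.

```python
import sqlite3


def new_database():
    """Create a private in-memory database with an empty orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
    return conn


def wipe_orders(conn):
    # First line of "test A": drop everything in the table.
    conn.execute("DELETE FROM orders")


def count_after_insert(conn, interfere=None):
    """Body of "test B": insert two rows, then count them.

    interfere simulates another test running between the two steps.
    """
    conn.executemany(
        "INSERT INTO orders (item) VALUES (?)", [("tea",), ("milk",)]
    )
    if interfere is not None:
        interfere()
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


# Shared database: test A wipes the table while test B is midway through.
shared = new_database()
shared_count = count_after_insert(shared, interfere=lambda: wipe_orders(shared))

# Isolated databases: test A wipes its own copy, so test B is unaffected.
db_a, db_b = new_database(), new_database()
isolated_count = count_after_insert(db_b, interfere=lambda: wipe_orders(db_a))
```

With the shared database, test B counts zero rows instead of two. With one database per test, it counts two every time.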

Race conditions are the worst to pin down. Simon Stewart once wrote a custom zip compression implementation in Java that passed, then failed twice, then passed again, with code that looked solid. The cause was timestamp granularity. The zip format stores timestamps with two-second granularity, not one. A run that started on an even second passed; a run that started on an odd second failed. Normalizing the timestamps made the test rock solid, but the hunt was brutal.
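The two-second granularity is easy to observe. The sketch below is not Stewart's code. It uses Python's standard `zipfile` module to write an entry with a chosen timestamp and read it back. The helper names `round_trip_seconds` and `normalize` are hypothetical, and rounding down to an even second is one possible normalization, assumed here for illustration.

```python
import io
import zipfile


def round_trip_seconds(seconds):
    """Write one entry stamped with the given seconds value and read it back."""
    stamp = (2024, 1, 1, 12, 0, seconds)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(zipfile.ZipInfo("a.txt", date_time=stamp), b"hello")
    with zipfile.ZipFile(buffer) as archive:
        return archive.getinfo("a.txt").date_time[5]


def normalize(stamp):
    """Round a (year, month, day, hour, minute, second) tuple down to an even second."""
    return stamp[:5] + (stamp[5] - stamp[5] % 2,)


even = round_trip_seconds(10)  # comes back as 10
odd = round_trip_seconds(11)   # comes back as 10, not 11
```

A test that compares the timestamp it wrote with the one it read back passes for `even` and fails for `odd`. Comparing `normalize(expected)` with the value read back removes the dependence on when the test started.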

How to handle a flaky test, step by step

The first move is to take the test out of CI. Your CI must give strong signal, which means the flaky test has to come out of rotation. From there, a clear sequence keeps the test from being lost.

You have two ways to remove it. The simplest is an ignore annotation, available in most test frameworks. If you use one, link it to an entry in your bug tracker so the test stays traceable. The alternative is a block list, kept in the repo or elsewhere, that names the tests to skip on any given run.

The Selenium project uses a skipped-tests file fed into CI. A single file collecting every test you have skipped beats scattered annotations, which are hard to find and easy to forget. One place to look means you can always go back and find what you suspended.
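A skipped-tests file needs very little machinery. The sketch below assumes a hypothetical format, one test id per line with an optional `#` note for the bug reference and owner. It is not the format the Selenium project uses, and the test ids, bug numbers, and owner names are made up.

```python
SKIPPED_TESTS = """\
# Flaky tests removed from CI. One bug and one owner per entry.
checkout.PaymentFlowTest  # BUG-101, owner: dana
search.IndexRebuildTest   # BUG-102, owner: lee
"""


def load_skipped(text):
    """Parse a skipped-tests file into {test_id: note}.

    Blank lines and lines that hold only a comment are ignored.
    """
    skipped = {}
    for line in text.splitlines():
        entry, _, note = line.partition("#")
        entry = entry.strip()
        if entry:
            skipped[entry] = note.strip()
    return skipped


def select_tests(all_tests, skipped):
    """Return the tests to run, in their original order, minus the skipped ones."""
    return [test for test in all_tests if test not in skipped]


skipped = load_skipped(SKIPPED_TESTS)
to_run = select_tests(
    ["checkout.CartTest", "checkout.PaymentFlowTest", "search.IndexRebuildTest"],
    skipped,
)
```

Because every suspended test sits in one file with its note, reviewing what is skipped is a matter of reading that file.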

Step	What you do	Why it matters
Remove	Take the test out of CI via annotation or block list	Protects the strength of your CI signal
Track	Link to a bug or keep a skipped-tests file	Keeps the test visible and recoverable
Own	Assign the test to one named person	Avoids the tragedy of the commons
Resolve	Debug, shrink, retry, or delete	Returns trustworthy signal or removes dead weight

Most companies stop after removing the test. The build is stable again, so they feel done. But unless someone owns the test, nothing happens next.

Assign every flaky test to a person, not a team

Ownership of a flaky test belongs to an individual, never a team. Hand it to a team and you get the tragedy of the commons, where everyone assumes someone else will deal with it.

Owning the test does not mean doing all the work. It means being responsible for figuring out how it gets fixed. Maybe the owner reassigns it. Maybe they treat it like any other incoming bug. But one named person carries it forward.

Once a test has an owner, you resolve it. Run it repeatedly and try to isolate the cause. Find out why it is flaky, then write the smallest possible test that proves your fix works. Do that consistently and you drift toward the classic test pyramid: many small tests at the base, few large ones at the top.

Deleting a test is a legitimate fix

You can also delete the flaky test. As long as you use source control, deletion is not destruction. If the test turns out to be valuable, you bring it back.

People resist this because the test “covers an important workflow.” That argument cuts both ways. If a workflow is genuinely crucial, it is worth spending the engineering time to make the test rock solid. If it is not worth that time, the test is probably not worth keeping. The more important you claim it is, the higher the priority to get it stable.

Some teams put a time to live on a flaky test. If it is not fixed before the deadline, it gets deleted as dead code. It provides no value and adds weight, so removing it is fine.
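A time to live can be enforced with a small check that runs alongside the build. This is a sketch under assumptions: the `expired_entries` helper, the test ids, and the dates are all hypothetical.

```python
from datetime import date


def expired_entries(deadlines, today):
    """Return the test ids whose fix-by date has passed, sorted by name.

    deadlines maps a test id to the date by which it must be fixed.
    """
    return sorted(test for test, fix_by in deadlines.items() if fix_by < today)


deadlines = {
    "checkout.PaymentFlowTest": date(2024, 3, 1),
    "search.IndexRebuildTest": date(2024, 6, 1),
}
overdue = expired_entries(deadlines, today=date(2024, 4, 15))
```

Here only `checkout.PaymentFlowTest` is overdue, so it is the candidate for deletion.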

Layered testing strategies give you another reason to delete. When earlier, smaller tests already cover the risk a large test was guarding, the large test becomes redundant. Flakiness tends to live in those large tests, so removing one whose coverage exists elsewhere is a clean win.

When retries are acceptable, and when they cost too much

Rerunning a test in the same build is an acceptable compromise, not an ideal. If a test has a five percent chance of flaky failure, a second run drops the odds of a false red to five percent of five percent, or 0.25 percent, provided the two failures are independent. The probability of a false red decays fast with each retry.
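The arithmetic can be written down directly. This sketch assumes each run flakes independently with the same probability, which will not hold when the cause is systemic, such as a dependency that stays down for several minutes. The function name is hypothetical.

```python
def false_red_probability(flake_rate, runs):
    """Chance that every one of `runs` independent attempts flakes."""
    if not 0.0 <= flake_rate <= 1.0:
        raise ValueError("flake_rate must be between 0 and 1")
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return flake_rate ** runs


one_run = false_red_probability(0.05, 1)     # 5 percent
two_runs = false_red_probability(0.05, 2)    # 0.25 percent
three_runs = false_red_probability(0.05, 3)  # 0.0125 percent
```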

The cost is build time. A flaky test is usually a large one, because it has the most moving parts. Retry a four-minute test five times and you blow your time budget for getting results back.

Around three attempts work well on average, but judge it against your own service level objective. If you want CI results within ten minutes and an average test takes three minutes, you can run it three times (nine minutes) but not four (twelve minutes). There is a balancing act between confidence and feedback speed.
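The budget check is a single division. This sketch assumes the attempts run one after another and that nothing else competes for the time budget. The function name is hypothetical.

```python
def max_runs_within_budget(budget_minutes, test_minutes):
    """Largest number of back-to-back runs that fits in the time budget."""
    if test_minutes <= 0:
        raise ValueError("test_minutes must be positive")
    return int(budget_minutes // test_minutes)


three_minute_test = max_runs_within_budget(10, 3)  # three runs, nine minutes
four_minute_test = max_runs_within_budget(10, 4)   # two runs, eight minutes
```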

How to stop flaky tests before they enter the build

Gatekeeping keeps flakiness out in the first place. Do not let a flaky test into your build at all.

Meta runs this at scale. Before a test is allowed to affect CI, it runs a hundred times concurrently overnight against every other test. Only if it stays completely stable does it join the real CI runs, and the whole process is automated. No human could carry that load.

You do not need Meta’s resources to apply the idea. Before adding a test, run it ten times locally and watch for instability. Automating that gate does mean keeping a record of which tests already exist, so that new ones can be detected, and trivial changes like renaming a test will make it look new and trigger a rerun. That is acceptable, because a stable test passes the gate and a flaky one should never have gone in.
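A local gate can be as small as a loop. The sketch below is an assumption-laden illustration, not Meta's system: `is_stable` runs a test callable a fixed number of times in sequence, and `make_every_nth_failure` is a hypothetical stand-in for a flaky test that fails on every nth call.

```python
def is_stable(test, runs=10):
    """Run test() `runs` times and admit it only if every run passes.

    A run counts as a failure if it raises any exception.
    """
    for _ in range(runs):
        try:
            test()
        except Exception:
            return False
    return True


def make_every_nth_failure(n):
    """Build a fake flaky test that fails on every nth call."""
    calls = {"count": 0}

    def test():
        calls["count"] += 1
        assert calls["count"] % n != 0, "simulated flake"

    return test


caught = is_stable(make_every_nth_failure(4), runs=10)  # False: run 4 fails
missed = is_stable(make_every_nth_failure(4), runs=3)   # True: too few runs
```

The second call shows the limit of the technique: a gate with too few runs lets a flaky test through, so the run count should reflect how rare a failure you want to catch.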

The second preventive habit is rerunning known-flaky tests periodically. Sometimes a systemic cause gets fixed elsewhere and a whole batch of tests quietly starts passing again. A separate file listing skipped tests makes this easy, since walking a list and removing one line beats hunting through code to strip annotations. Once those tests prove stable, they rejoin the general CI runs and provide trustworthy signal again.
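Walking the list can be automated along these lines. This is a sketch: `recheck_skipped` and the `run_test` callback are hypothetical, and the default of twenty consecutive passes is an assumed threshold, not a figure from any particular team.

```python
def recheck_skipped(skipped_ids, run_test, runs=20):
    """Split skipped tests into (still_skipped, ready_to_rejoin).

    run_test(test_id) returns True on a pass. A test is ready to rejoin
    CI only if it passes `runs` times in a row.
    """
    still_skipped, ready = [], []
    for test_id in skipped_ids:
        if all(run_test(test_id) for _ in range(runs)):
            ready.append(test_id)
        else:
            still_skipped.append(test_id)
    return still_skipped, ready


# Hypothetical runner: pretend the payment test was fixed by an unrelated change.
still_skipped, ready = recheck_skipped(
    ["checkout.PaymentFlowTest", "search.IndexRebuildTest"],
    run_test=lambda test_id: test_id == "checkout.PaymentFlowTest",
)
```

The `ready` list holds the lines to remove from the skipped-tests file, and `still_skipped` is what remains in it.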

Where AI helps with flaky tests, and where it does not

AI earns its place in debugging, not in writing your tests. It analyzes intricate code well and can point out that a value is never set or that a problem sits on a particular line. For chasing down the cause of a flaky failure, that is genuinely useful.

The risk is false confidence.

“It’s like a drunk man in a bar with an opinion, and they will say it very emphatically, and they might well be wrong.” (Simon Stewart)

So treat every AI output as something a human reviews. One countermeasure is to ask a second model to verify the first model’s diagnosis. If both agree that a given line explains the failure, you have a reasonable place to start.

AI is good at generating many plausible test cases, but Simon Stewart prefers to define the conditions himself rather than hand that over. The framing that holds up: AI is a power tool for your work, not a replacement for it. And the broader aim still stands. Get a safe change into users’ hands as fast as possible, which means the less code you carry, the better.