How Property Based Testing helps

A bug that only appears after 17 exact steps sounds untestable. Property-based testing finds it anyway, by generating cases humans would never think to write.

Property-based testing is a method where you describe the behavioral properties of a system and let automated tools generate thousands of test cases from that description. It replaces manually written test cases with a three-part process: a specification model, a generator that creates combinations automatically, and a shrinker that reduces a discovered failure to its exact cause for human review.

Key Takeaways

  • Property-based testing generates thousands of test combinations automatically from a formal description of system behavior, reaching failure modes no human-authored test suite can anticipate.
  • In distributed systems with many services, the number of possible failure combinations grows combinatorially (roughly as the cube of the service count when two or three services fail together), making manual test case authorship impractical at scale.
  • The shrinker component is what makes property-based testing usable: it reduces a long generated failure sequence to the precise step where the fault occurred, so engineers can act on it.
  • Property-based testing becomes costly where each test execution consumes billable resources, such as cloud storage writes or live network calls, because the sheer volume of automated tests can drive costs out of control.
  • Introducing property-based testing in a large organization is primarily a cultural challenge: winning one willing team first, letting them advocate to peers, is more effective than a top-down rollout.

What property-based testing actually does

Property-based testing generates test cases automatically from a description of how a system should behave. Instead of writing fixed inputs and expected outputs by hand, you state a rule that must always hold true, and the tool produces thousands of combinations to challenge that rule.

The starting point is scale. In large microservice landscapes, a single failing service can pass its failure down the chain, and you may not notice until the fifth or sixth service collapses. Nikhil Barthwal frames the math plainly: with n services and failures that surface when two or three of them fail together, the number of failure modes grows roughly as the cube of n. With thousands of services, that pushes into the millions or billions of combinations.
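To make the arithmetic concrete, here is a small Python sketch. It assumes that a "failure mode" means any unordered set of two or three services failing together, which is one reading of the framing above. The function name `failure_modes` is hypothetical.

```python
from math import comb


def failure_modes(n: int) -> int:
    """Count the pairs and triples of services that could fail together.

    Assumption: a failure mode is any unordered set of 2 or 3 services.
    """
    return comb(n, 2) + comb(n, 3)


# Print the count for a few landscape sizes.
for n in (10, 100, 1000, 5000):
    print(n, failure_modes(n))
```

Under this assumption, 1,000 services already give 166,666,500 combinations, and the count is dominated by the triples, which grow roughly as n³/6.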

No one writes a million test cases by hand. That gap is the reason property-based testing exists.

A simple example makes the idea concrete

Take a service that accepts two inputs, A and B, and returns A plus B. The conventional approach is to feed it zero, one, two, some negatives, some positives, check the results, and move on.

The trouble starts at the edges. Integer overflows and buffer overflows show up only at certain combinations, and a handful of hand-picked inputs will sail straight past them.

A property-based approach describes the rule instead. If A plus B equals C, then C minus B must equal A. You state that relationship once. The tool then generates thousands of combinations of A and B, computes the output, subtracts, and checks that it lands back on A. You never enumerate the inputs yourself. You describe the property, and the system writes the cases.
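The round-trip rule can be sketched in a few lines of Python. This is a hand-rolled illustration, not the API of any real framework: a plain random loop stands in for the generator, and `saturating_add` is a hypothetical service that clamps results to the 32-bit range (Python's own integers never overflow, so the overflow has to be simulated).

```python
import random

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def saturating_add(a: int, b: int) -> int:
    """Hypothetical service under test: adds, clamping to the 32-bit range."""
    return max(INT32_MIN, min(INT32_MAX, a + b))


def round_trip_holds(add, a: int, b: int) -> bool:
    """The property: if add(a, b) == c, then c - b must equal a."""
    return add(a, b) - b == a


def find_counterexample(add, trials: int = 10_000, seed: int = 0):
    """Generate random 32-bit inputs and return the first pair that breaks the property."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.randint(INT32_MIN, INT32_MAX)
        b = rng.randint(INT32_MIN, INT32_MAX)
        if not round_trip_holds(add, a, b):
            return a, b
    return None


print(find_counterexample(saturating_add))       # a pair whose sum leaves the 32-bit range
print(find_counterexample(lambda a, b: a + b))   # None: exact addition satisfies the rule
```

Hand-picked inputs such as zero, one and two never reach the clamped range, while inputs drawn from the whole 32-bit range hit it quickly.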

The three components: model, generator, shrinker

Every property-based setup rests on three parts that work in sequence.

The first is a modeling language. For the addition example, that is a one-line equation. For real systems, it is a formal specification. Nikhil uses TLA+, though several formal specification languages exist for the same purpose.

The second is a generator. It reads the specification and produces the test cases, hunting for violations. You can write a custom generator or rely on a standard one shipped with the framework.

The third is a shrinker, and it solves a problem the generator creates. Because the generated cases are long and machine-built, a failure can arrive as a sequence of many steps. The shrinker reduces that sequence down to the smallest reproduction, so a human can see exactly where things broke.
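As a rough illustration of the first two parts working together, the sketch below uses a plain Python dict as the model and a random operation generator. Everything here is an assumption made for the example: `BuggyStore` is a hypothetical key-value store with a planted bug, and a real setup would use a specification language and a framework's generator instead.

```python
import random


class BuggyStore:
    """Hypothetical system under test with a planted bug in compact()."""

    def __init__(self):
        self.live, self.deleted = {}, {}

    def put(self, key, value):
        self.live[key] = value

    def delete(self, key):
        if key in self.live:
            self.deleted[key] = self.live.pop(key)

    def compact(self):
        # Planted bug: compaction brings deleted keys back to life.
        self.live.update(self.deleted)
        self.deleted.clear()

    def get(self, key):
        return self.live.get(key)


def generate(rng, length):
    """Generator: a random sequence of (operation, key, value) steps."""
    kinds = ["put", "delete", "compact", "get"]
    return [(rng.choice(kinds), rng.choice("ab"), rng.randint(0, 9))
            for _ in range(length)]


def agrees_with_model(ops):
    """Model: a dict. Every get on the store must match the dict."""
    store, model = BuggyStore(), {}
    for kind, key, value in ops:
        if kind == "put":
            store.put(key, value)
            model[key] = value
        elif kind == "delete":
            store.delete(key)
            model.pop(key, None)
        elif kind == "compact":
            store.compact()  # compaction must not change visible contents
        elif store.get(key) != model.get(key):
            return False
    return True


def find_failure(trials=500, seed=0):
    """Run generated sequences until one disagrees with the model."""
    rng = random.Random(seed)
    for _ in range(trials):
        ops = generate(rng, 20)
        if not agrees_with_model(ops):
            return ops
    return None
```

`find_failure()` returns a 20-step sequence that exposes the planted bug. Most of those steps are irrelevant to the fault, which is the problem the shrinker exists to solve.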

Why long failure sequences need a shrinker

Generated test cases often fail only after a long chain of steps, and a raw chain tells a human almost nothing.

Nikhil points to a test run against Google's LevelDB key-value store. A bug surfaced only after seventeen specific steps were executed in sequence, the kind of sequence no person would think to anticipate. A property-based tool found it within an hour.

But seventeen steps is not a usable bug report. You know something failed, yet not which step caused it. The shrinker takes the failing sequence and trims it down until the actual fault is isolated. That reduction is what turns a machine finding into something a developer can act on.
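A minimal sketch of the idea in Python follows. It assumes the simplest possible strategy, dropping one step at a time for as long as the sequence still fails; real shrinkers are more sophisticated. The names `has_bug`, `shrink` and `FAILING_RUN`, and the fault pattern itself, are hypothetical.

```python
def has_bug(steps):
    """Hypothetical fault: triggered by a put, later a delete, later a get."""
    remaining = iter(steps)
    # `op in remaining` consumes the iterator, so this checks for an ordered subsequence.
    return all(op in remaining for op in ("put", "delete", "get"))


def shrink(steps, fails):
    """Greedily remove single steps while the shortened sequence still fails."""
    current = list(steps)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if fails(candidate):
                current = candidate
                changed = True
                break
    return current


# A seventeen-step failing run, as a generator might produce it.
FAILING_RUN = ["get", "put", "get", "put", "compact", "get", "delete", "put",
               "compact", "get", "put", "delete", "compact", "get", "put",
               "get", "compact"]

print(shrink(FAILING_RUN, has_bug))  # ['put', 'delete', 'get']
```

The seventeen steps collapse to the three that matter, which is the form a developer can actually debug.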

The frameworks all trace back to one idea

Property-based testing is a concept, not a single product, and the implementations vary by language and stack.

The approach began with QuickCheck, written in Haskell, and was then ported widely. There are frameworks for .NET, such as FsCheck, and for Python, such as Hypothesis, among others. Commercial tools exist too, including ones used in automotive software where a failure can cost lives.

Different teams implement the generation and reduction differently, which is why the framework landscape is broad. The underlying contract stays the same: describe the system, let the tool test it.

The real goal is to go past human imagination

The point of property-based testing is reliability, and the route to it is coverage no human could write by hand.

A person writing test cases hits a ceiling. The number of permutations that can go wrong in a system at scale runs far beyond what anyone can picture, let alone encode. Formal methods in general exist to cross that ceiling.

Concurrency shows the same principle from another angle. Threads and locks fail intermittently, only when a particular set of external conditions lines up. Your unit tests pass. Integration tests pass. Production runs clean for two months, then breaks on the first day of the third.

TLA+ addresses exactly this: its model checker explores the state space of every reachable combination of moves, then checks for deadlocks and concurrency faults. Nikhil notes that AWS has published work on using TLA+ formal methods to uncover bugs that would otherwise surface only intermittently, sometimes revealing design flaws that demand a rewrite.
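The state-space idea can be illustrated without TLA+ itself. The Python sketch below is not TLA+ and not how its tooling is implemented. It is a toy breadth-first exploration under stated assumptions: two hypothetical threads each acquire two locks in a fixed order and release both on completion, and a deadlock is any reachable state where an unfinished thread exists but nobody can move.

```python
from collections import deque

# Program counter per thread: 0 = holds nothing, 1 = holds first lock, 2 = finished.
OPPOSITE_ORDER = {0: ("A", "B"), 1: ("B", "A")}  # classic deadlock-prone ordering
SAME_ORDER = {0: ("A", "B"), 1: ("A", "B")}      # consistent lock ordering


def held_locks(state, order):
    """A thread at pc 1 holds its first lock; finished threads hold nothing."""
    return {order[t][0] for t, pc in enumerate(state) if pc == 1}


def successors(state, order):
    """All states reachable in one step by letting one thread acquire its next lock."""
    held = held_locks(state, order)
    for t, pc in enumerate(state):
        if pc == 2:
            continue
        if order[t][pc] not in held:
            nxt = list(state)
            nxt[t] = pc + 1
            yield tuple(nxt)


def find_deadlocks(order, initial=(0, 0)):
    """Visit every reachable state and collect the ones that are stuck."""
    seen, queue, deadlocks = {initial}, deque([initial]), []
    while queue:
        state = queue.popleft()
        nxt = list(successors(state, order))
        if not nxt and any(pc != 2 for pc in state):
            deadlocks.append(state)
        for s in nxt:
            if s not in seen:
                seen.add(s)
                queue.append(s)
    return deadlocks


print(find_deadlocks(OPPOSITE_ORDER))  # [(1, 1)]: each thread holds one lock and waits
print(find_deadlocks(SAME_ORDER))      # []
```

The deadlocked state is reached only under one particular interleaving, which is why a test run that happens to schedule the threads differently passes cleanly.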

The key idea remains the same, that human imagination works to a certain point. Once you go out at the scale, the number of combinations of things that can go wrong is so huge that it is not practical for a human to think or implement those test cases. — Nikhil Barthwal

Describe properties, not transactions

You define what must always be true about the system, and you can do it at different levels.

Consider an e-commerce purchase. Buy n boxes of chocolate, and two properties must hold: the inventory drops by exactly n, and revenue rises by n times the price. Stated simply, this seems trivial.

It stops being trivial under eventual consistency. Distributed systems run across multiple databases and data centers, so a property might hold in one data center and break in another. Those are the conditions you want generated and checked, not assumed away.
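The two purchase properties can be written down directly. The Python sketch below assumes a hypothetical single-node `Shop` class with prices in integer cents, so it covers only the trivial case. Checking the same properties across replicas under eventual consistency would need a model of replication that this example does not attempt.

```python
import random


class Shop:
    """Hypothetical single-node shop; prices are kept in integer cents."""

    def __init__(self, stock: int, price_cents: int):
        self.stock = stock
        self.price_cents = price_cents
        self.revenue_cents = 0

    def purchase(self, n: int) -> None:
        if not 0 < n <= self.stock:
            raise ValueError("invalid quantity")
        self.stock -= n
        self.revenue_cents += n * self.price_cents


def purchase_properties_hold(stock: int, price_cents: int, n: int) -> bool:
    """Inventory must drop by exactly n and revenue must rise by n times the price."""
    shop = Shop(stock, price_cents)
    shop.purchase(n)
    return shop.stock == stock - n and shop.revenue_cents == n * price_cents


def check(trials: int = 1000, seed: int = 0):
    """Return the first (stock, price, n) that violates the properties, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        stock = rng.randint(1, 500)
        n = rng.randint(1, stock)
        price = rng.randint(1, 10_000)
        if not purchase_properties_hold(stock, price, n):
            return stock, price, n
    return None


print(check())  # None: this simple implementation satisfies both properties
```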

The level you specify at decides the kind of test. A specification at the function or service level behaves like a unit test. A specification at the system level behaves like an integration test. Most teams do both, and doing both makes sense.

Where property-based testing does not belong

The method breaks down when each test execution carries a real cost, because a million generated cases means a million real charges.

Nikhil recalls testing at BlackBerry, where a test meant placing an actual billed phone call. A million test cases meant a million calls and a bill to match.

The same trap applies to cloud resources. If your tests write to storage on AWS, you pay for every resource the tests consume. Generate a million cases automatically and the cost can spiral out of control. Where testing destroys resources or runs up charges, property-based testing is the wrong tool.

How to start: small first, then grow

Begin with one service, prove the approach works in practice, then expand scope. This advice holds for any new testing tool, not only property-based testing.

Growing organically surfaces roadblocks early, in line with the fail-fast idea. If the tool works for one service, it will likely work for the next, and that gives you something concrete to show.

The harder part is rarely the code. It is the culture. Introducing a new method forces people to think differently, and in a large organization that means changing the minds of many experienced engineers who already trust their own way of working.

Nikhil treats this as a selling problem. You are the seller, the developers are the customers, and they have to buy the idea. A vendor praising their own product convinces no one. A peer who says the tool genuinely helped them carries far more weight. So pick one willing team, show them the value, and let them recommend it to the rest.

| Decision point | Favors property-based testing | Argues against it |
|---|---|---|
| Combination space | Millions of failure modes, too many to hand-write | Few, well-understood inputs |
| Failure type | Intermittent, edge-case, concurrency bugs | Simple, deterministic paths |
| Cost per test run | Cheap and repeatable | Each run is billed or destructive |
| Rollout scope | One service first, then expand | All-at-once across many teams |

The tool assists the tester; it does not replace them

Property-based testing narrows a vast space of possible failures down to a short list a human can actually examine.

A million things can go wrong, and no person can hold that in their head. The system picks out the twenty cases that look like real problems, and you focus there instead of on the full space.

Some of those twenty may be false positives, where the reported failure is actually a flaw in how you described the system rather than a genuine bug. Human supervision stays in the loop. When a case fails, you get the full sequence, you verify whether it is a real failure, and the shrinker points you to where things went wrong. When you confirm a genuine fault, you can write a dedicated test case for it.

Nikhil draws the parallel to the wider debate about generative AI replacing people. His position is the same in both cases: the tool makes you more productive; it does not replace you.
