Property-Based Testing

Millions of test cases generated automatically instead of laboriously written by hand: What property-based testing is and what its limitations are.

Property-based testing refers to a method in which individual test cases are not written manually, but instead the system automatically generates test cases from described properties. Three components work together: a modeling language for system properties, a generator for test cases and a shrinker that reduces errors to their exact origin.

Key Takeaways

  • Property-based testing automatically generates test cases from a formal description of system behavior, so that you do not have to write each case manually. This matters because the number of possible fault combinations in distributed systems far exceeds human capacity.
  • The three core components of the approach are the generator, which generates test cases from the specification; the verifier, which detects violations; and the shrinker, which reduces a long error chain to the smallest reproducible sequence of steps.
  • Property-based testing reaches its limits when each generated test case incurs direct costs, such as phone calls or cloud resources, because with millions of automatically generated cases, expenses can get out of control.
  • The concept started with the Haskell framework QuickCheck and has been ported to many languages; today, there are separate implementations for .NET, Python and other stacks, including commercial tools for safety-critical areas such as the automotive industry.
  • The biggest hurdle when introducing property-based testing is not the technology, but the change in mentality within the team: gaining a single pilot team as a reference is more effective at convincing colleagues than any internal recommendation.

What is property-based testing?

Property-based testing does not describe individual test cases, but the properties of a system. From this description, a tool automatically generates the low-level test cases, often thousands or millions of them.

The difference from the classic approach is fundamental. Normally, you define inputs and expected outputs: input I1 results in output O1, I2 results in O2 and so on. With property-based testing, you instead formulate a rule that must hold for any input.

A simple example makes this clear. A service takes two inputs A and B and returns the sum C. The property then reads: If you subtract the value B from the output C, the result must be A. That’s all you describe. The system generates any number of combinations of A and B and checks each one against this rule.
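The addition example can be sketched in a few lines of Python. This is a hand-rolled illustration using only the standard library, not the API of any particular framework; `add_service` is a hypothetical stand-in for the service under test, and the value range and number of cases are arbitrary assumptions.

```python
import random


def add_service(a: int, b: int) -> int:
    """Hypothetical stand-in for the service under test."""
    return a + b


def check_subtraction_property(cases: int = 10_000, seed: int = 0) -> int:
    """Generate random (a, b) pairs and check that c - b == a for each one."""
    rng = random.Random(seed)  # fixed seed so a failing run can be reproduced
    for _ in range(cases):
        a = rng.randint(-10**9, 10**9)
        b = rng.randint(-10**9, 10**9)
        c = add_service(a, b)
        # The property: subtracting b from the output must give back a.
        assert c - b == a, f"property violated for a={a}, b={b}"
    return cases


if __name__ == "__main__":
    print(check_subtraction_property(), "cases checked")
```

Note that the test never states an expected sum for any specific pair. It only states the rule, and the loop supplies the inputs.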

This becomes valuable precisely at the edges. Buffer or integer overflows can occur with certain combinations that approach boundary values. A human writes a few test cases for this with zero, one, two, negative and positive numbers. The generated variant covers far more of the input space than a tester would ever go through by hand.
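To illustrate the boundary-value point, here is a minimal sketch under stated assumptions: `saturating_add` is a hypothetical service that clamps its result to the 32-bit integer range, and `gen_int32` is a hand-written generator that deliberately favors boundary values. Neither comes from a real library.

```python
import random

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
BOUNDARIES = [INT32_MIN, INT32_MIN + 1, -1, 0, 1, INT32_MAX - 1, INT32_MAX]


def saturating_add(a: int, b: int) -> int:
    """Hypothetical service: adds two numbers, clamping to the int32 range."""
    return max(INT32_MIN, min(INT32_MAX, a + b))


def gen_int32(rng: random.Random) -> int:
    """Return a boundary value about half of the time, else a uniform int32."""
    if rng.random() < 0.5:
        return rng.choice(BOUNDARIES)
    return rng.randint(INT32_MIN, INT32_MAX)


def find_counterexample(cases: int = 1000, seed: int = 0):
    """Return the first (a, b) that violates c - b == a, or None."""
    rng = random.Random(seed)
    for _ in range(cases):
        a, b = gen_int32(rng), gen_int32(rng)
        if saturating_add(a, b) - b != a:
            return a, b
    return None


if __name__ == "__main__":
    print("counterexample:", find_counterexample())
```

The property from the addition example holds for small numbers and breaks as soon as the sum leaves the 32-bit range, which is exactly the kind of case a handful of handwritten tests tends to miss.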

Why distributed systems require generated tests

In large microservice landscapes, the number of error combinations exceeds human imagination. This is exactly where property-based testing comes in.

The math is simple. With n services, where an error only occurs if two or three other services fail at the same time, the number of relevant combinations is on the order of n to the power of 3. With a thousand services, that is about 500,000 possible pairs and more than 160 million possible triples. Nobody writes millions of test cases by hand for this.
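The order of magnitude can be checked with a short calculation. The sketch below counts unordered groups of failing services with the binomial coefficient; treating failures as unordered sets is a simplifying assumption, and counting ordered sequences would give even larger numbers.

```python
import math


def failure_combinations(n: int, k: int) -> int:
    """Number of ways to pick k failing services out of n (order ignored)."""
    return math.comb(n, k)


if __name__ == "__main__":
    n = 1000
    print(failure_combinations(n, 2))  # 499500 pairs
    print(failure_combinations(n, 3))  # 166167000 triples
    print(n**3)                        # 1000000000, the n^3 upper bound
```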

Then there is the way in which errors occur in distributed systems. One service fails, the error propagates, and it only becomes noticeable when the fifth server goes down. Such chain reactions can only be detected if the test cases are run in large numbers and in unexpected sequences.

The three components: specification, generator, shrinker

In practice, property-based testing consists of three parts that work together.

The first component is the modeling language or specification. This is where you describe the behavior and properties of your system. In the simplest case, this is an equation; in real systems, formal specification languages such as TLA+ are used.

The second component is the generator. It takes the specification and derives the test cases from it. If a generated case violates a property, you know that something is wrong. You can write your own generator or use a standard generator.

The third component is the shrinker, also known as the reducer. Generated test cases often run as long sequences of steps. If the system finds an error after 17 individual steps, this information alone is of little help: you do not know which step caused the problem. The shrinker reduces the sequence to its core and shows you exactly where it goes wrong.
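How a shrinker narrows down a long failing sequence can be sketched as follows. This is a simple greedy strategy written for illustration, not the algorithm of any specific framework; the step names and the failure condition in `hypothetical_bug` are invented for the example.

```python
from typing import Callable, List


def shrink(steps: List[str], fails: Callable[[List[str]], bool]) -> List[str]:
    """Greedily drop single steps as long as the shorter sequence still fails."""
    assert fails(steps), "shrinking needs a failing sequence to start from"
    current = list(steps)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if fails(candidate):
                current = candidate  # keep the shorter failing sequence
                changed = True
                break
    return current


def hypothetical_bug(steps: List[str]) -> bool:
    """Toy failure condition: any 'write' that happens after a 'close'."""
    seen_close = False
    for step in steps:
        if step == "close":
            seen_close = True
        elif step == "write" and seen_close:
            return True
    return False


# A failing run of 17 steps, as a generator might produce it.
LONG_RUN = [
    "open", "write", "read", "write", "flush", "read", "seek", "read", "close",
    "read", "flush", "seek", "read", "open", "write", "read", "flush",
]

if __name__ == "__main__":
    print(shrink(LONG_RUN, hypothetical_bug))  # ['close', 'write']
```

The 17-step run collapses to the two steps that actually matter. Real shrinkers use more elaborate strategies, but the principle of repeatedly retrying smaller inputs is the same.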

A real example shows the effect. During a test of a Google database, an error only occurred after 17 repeated steps, a combination that no one would guess. With property-based testing, the error was found within an hour.

What frameworks and tools are available

The concept is language- and stack-independent, and various frameworks are used to implement it. They differ in how they generate and reduce test cases.

It all started with QuickCheck, a library for the Haskell language. From there, the principle was transferred to numerous languages. FsCheck exists for .NET, and there are also suitable libraries for Python, such as Hypothesis. There are both commercial and open source versions.

Commercial tools are used in safety-critical areas such as the automotive industry. The software there is not just business-critical: if it fails, people may die. The broadest possible coverage of the combinations directly serves reliability.

Describing properties means thinking about the entire system

A property is a rule that must apply after every transaction. It forces you to precisely define the behavior of your system.

Take an online store. If someone buys n bars of chocolate, the stock level after the purchase must be exactly n lower than before. At the same time, revenue must increase by exactly the price of these n bars. Both are properties that you give the system.
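The two shop properties can be written as a check that runs after every transaction. The sketch below is an assumption-laden toy: the `Shop` class, its price and its stock level are hypothetical, and a real system would be exercised through its actual API.

```python
import random
from dataclasses import dataclass


@dataclass
class Shop:
    """Hypothetical in-memory shop selling one product."""
    stock: int
    revenue_cents: int = 0
    price_cents: int = 199

    def buy(self, n: int) -> bool:
        """Sell n bars; reject the order if there is not enough stock."""
        if n < 0 or n > self.stock:
            return False
        self.stock -= n
        self.revenue_cents += n * self.price_cents
        return True


def check_purchase_properties(cases: int = 1000, seed: int = 0) -> int:
    """Run random purchases and check both properties after each one."""
    rng = random.Random(seed)
    shop = Shop(stock=5000)
    for _ in range(cases):
        n = rng.randint(0, 50)
        stock_before, revenue_before = shop.stock, shop.revenue_cents
        if shop.buy(n):
            # Property 1: stock falls by exactly n.
            assert shop.stock == stock_before - n
            # Property 2: revenue rises by exactly n times the price.
            assert shop.revenue_cents == revenue_before + n * shop.price_cents
        else:
            # A rejected order must leave the state untouched.
            assert (shop.stock, shop.revenue_cents) == (stock_before, revenue_before)
    return cases


if __name__ == "__main__":
    print(check_purchase_properties(), "transactions checked")
```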

In a simple system, this is trivial and almost never a problem. However, real systems work in a distributed manner, across multiple databases and data centers, with eventual consistency. A property can apply to a data record in data center one and not in data center two. How do you test these conditions? You describe the property, and the system writes the tests for it.

The level at which you specify determines the test type. If you describe properties at service level, the result is a unit test. If you describe them at system level, the result is an integration test. In practice, most people do both.

Formal methods solve a related problem from the design side, while property-based testing checks finished software. The core idea is the same.

Parallel systems with threads and locks have an unpleasant peculiarity: errors only occur sporadically. Unit tests pass, integration tests pass, and in production everything runs smoothly for two months. On the first day of the third month, the system crashes because a certain combination of external conditions has occurred.

TLA+ addresses this by exploring the state space of all possible processes and combinations. This makes it possible to recognize where a thread deadlock occurs or which concurrency problem is imminent. Sometimes the consequence is that part of the system has to be redesigned because there is a fundamental design problem. Amazon Web Services (AWS) uses this method and has published about it.
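The idea of exhaustive state space exploration can be illustrated without TLA+ itself. The following Python sketch is not TLA+ and not how its tooling is implemented; it is a toy breadth-first search over a hypothetical model in which each thread acquires its locks in a fixed order and then releases them all at once.

```python
from collections import deque


def held(program, pc):
    """Locks held at program counter pc; all are released after the last step."""
    return set(program[:pc]) if pc <= len(program) else set()


def successors(programs, state):
    """All states reachable by letting exactly one thread take one step."""
    result = []
    for t, pc in enumerate(state):
        program = programs[t]
        if pc > len(program):
            continue  # this thread has finished
        if pc < len(program):
            wanted = program[pc]
            taken = any(
                wanted in held(programs[o], state[o])
                for o in range(len(state)) if o != t
            )
            if taken:
                continue  # blocked: another thread holds the lock
        # Either acquire the next lock or release everything and finish.
        result.append(state[:t] + (pc + 1,) + state[t + 1:])
    return result


def find_deadlocks(programs):
    """Breadth-first search for states where no thread can move but not all are done."""
    start = tuple(0 for _ in programs)
    done = tuple(len(p) + 1 for p in programs)
    seen, queue, deadlocks = {start}, deque([start]), []
    while queue:
        state = queue.popleft()
        nxt = successors(programs, state)
        if not nxt and state != done:
            deadlocks.append(state)
        for s in nxt:
            if s not in seen:
                seen.add(s)
                queue.append(s)
    return deadlocks


if __name__ == "__main__":
    print(find_deadlocks((("A", "B"), ("B", "A"))))  # [(1, 1)]
    print(find_deadlocks((("A", "B"), ("A", "B"))))  # []
```

With opposite lock orders, the search finds the state in which each thread holds one lock and waits for the other. With a consistent lock order, no deadlock state is reachable, which is the kind of design-level insight the paragraph above describes.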

When property-based testing does not fit

The line is drawn at tests that are destructive or consume real resources. If every test case costs money, the automatic generation of millions of cases gets out of control.

Two examples illustrate the problem. In telephony, one test triggers a real call. One million test cases means one million calls and a correspondingly high bill. It’s the same in the cloud: if your test writes to storage media, you are charged for all the resources used. With automatically generated masses of test cases, the costs explode.

In such cases, property-based testing is the wrong choice. The method's strength, generating any number of cases, becomes a financial risk here.

How to get started: start small, grow organically

The most important advice is simple: start with one service and then expand step by step. This applies not only to property-based testing, but to any new method.

Try out the method on a manageable part of your distributed system and see how well it works. If it grows organically, risks and obstacles will become apparent early on. This is the agile idea of failing fast. If the tool works for one service, it will probably also work for the next.

You can find plenty of sources today. There are experience reports and recommendations for almost every framework in books and online. Information is no longer a bottleneck.

Introducing a new method is sales work, not programming work

The hardest part is not the code, but convincing people. A new tool requires a change in mentality, and that is difficult to achieve in a large workforce.

“The hardest part of my job is convincing people to do something. And these are people who, like me, have twenty years of experience.”

Nikhil Barthwal

Experience can be an obstacle. People who have been around for a long time know exactly how they work and react skeptically to new tools. Nikhil experienced this when he built a new build system, showed it to everyone and nobody used it.

His approach: Find a single team out of fifty that is willing to participate. Show them the concrete benefits and let this team pass on the value to the others. When introducing a technology, you are the seller, your developers are the customers. A recommendation from another team weighs more heavily than any self-praise, because the provider always has a conflict of interest.

How to trust generated tests

Property-based testing does not replace testers, it supports them. If a generated test case fails, you get the full sequence of events and can check whether the error is real.

This is the answer to the confidence problem. With self-designed test cases, you know every step. With millions of generated combinations, you don’t see everything, but the system filters for you. Out of a million possible errors, you might be left with twenty specific cases to focus on.

False positives do occur: a generated test fails although the system behaves correctly. Often the actual error lies in the way you have described your system. That’s why it needs human oversight: you need to understand what happened before you fix it. This is where the shrinker provides the groundwork by showing you exactly where it went wrong.

The parallel with the AI debate is obvious. Property-based testing doesn’t replace testers, it makes them more productive. That is precisely its promise.
