Skip to main content

Search...

Correctness through formal methods

Testing proves bugs, but never freedom from bugs. Formal methods can do this, from type systems to verified kernels, and the effort is more scalable than expected.

9 min read
Cover for Correctness through formal methods

Formal methods are mathematically rigorous techniques to prove software properties for all possible inputs, not just for selected test cases. They range from type systems and model checking to proof assistants such as Isabelle. They can be used to formulate guarantees that cannot be achieved through tests, such as that money will not be lost in the event of any conceivable transaction interruption.

Key Takeaways

  • Testing cannot prove freedom from errors, because tests can only cover a finite number of examples, never all possible inputs.
  • Formal methods scale from lightweight type systems to complete mathematical proofs with tools such as Isabelle or model checkers.
  • Anyone planning formal verification must build it into the development process from the start, adding it on later does not work.
  • Verified code is no substitute for testing: The verified C compiler CompCert only checks the compiler core, the boundary areas are additionally secured by fuzzing.
  • AI-supported proof search, such as the Sledgehammer in Isabelle, is risk-free in formal systems because each generated proof is automatically checked again independently.

Why testing cannot guarantee freedom from errors

Testing provides examples, not evidence. If you run through a test case, you check exactly one constellation. This is followed by the second, the third, perhaps the fifth case. What remains is a list of random samples, never a statement about all possible inputs.

Edsger Dijkstra has already formulated this: Testing can be used to show the presence of errors, not their absence. This is sufficient for many applications. Not for some.

In payment systems, in automotive or in medicine, there are properties that must apply to every conceivable constellation. An example from payment transactions: a card is stopped via NFC, then the connection is lost. Regardless of when this happens, the transaction must be able to be rolled back and repeated without money being lost. This guarantee applies to all termination times. A random testing does not cover it.

What formal methods can do

Formal methods think about software in a mathematically rigorous way. Instead of individual test cases, the system is strictly modeled, and properties such as invariants are formulated in such a way that they make statements about all possible inputs and outputs.

The computer takes on the role of the tester. In a proof assistant, the machine proves that the proof does not contain any gaps, false conclusions or omissions. The result is a guarantee that the worst case cannot slip through.

Bianca Lutz gets to the heart of the matter: it is about the guarantee that the worst case cannot even exist, and about proving this guarantee. This is exactly where formal methods come in, where testing stops in principle.

Formal methods are not all-or-nothing

There is a broad spectrum between complete proof and no formalization at all. Most developers use the most lightweight formal method on a daily basis without calling it that: the type system.

A type checker says with certainty that a value is a string and not a number, and that for all calls to a function. The result comes almost for free, the checker simply runs along. But the statement is limited: A type says little about the correctness of the behavior.

If you want to talk about correctness, you have to go deeper into the semantics of the system. The gradations range from type systems and model checkers to complete proof assistants. More expressive power means more expressiveness, but also more effort.

This overview classifies the levels according to effort and expressiveness:

LevelEffortExpressiveness
type systemvery low, runs withlimited, e.g. string instead of number
Model checkinglow to medium, highly automatedfinds forbidden paths, counterexamples
Proof assistant (e.g. Isabelle)high, often manualcomplete proof of correctness

How much effort a complete proof costs

Complete verification is expensive. For the verified microkernel seL4, which was proven with the Isabelle tool, the estimated overhead is a factor of 10 to 100, depending on the programming language. With a factor of 10, this means that every line of code requires ten lines of proof.

In practice, this is a tough challenge for project planning. A feature that took two months to develop can mean two additional years for validation. Whether this is worth it depends on the level of suffering and criticality.

This is why full verification is rarely the right default. It makes more sense to ask which level is suitable for the specific use case and how much you are prepared to invest. The comparison with test coverage is valid: 80 percent is better than 0, and sometimes 80 percent is completely sufficient.

Verification belongs in development, not behind it

Formalization does not work as a downstream test. The idea of throwing a finished program over the fence to specialists who then demonstrate correctness misses the point.

The example of seL4 shows why. If you think about verification early on, you get feedback during development from someone who thinks about the domain in a more abstract way. This results in hints about interesting system properties or the recommendation not to use a special lock because it generally destroys certain guarantees.

This changes the development process for the better. Only if the proof is involved from the outset can a system be built that is functionally mature and at the same time remains accessible to proof.

Automatic or manual: the whole spectrum

There is not just one verification tool, but many different ones, from fully automatic to completely manual. Which one is right for you depends on the property you want to verify.

Drivers under Windows are an automated example. Microsoft checks the drivers supplied for correctness properties, for example to ensure that there are no endless loops or null pointer accesses. The manufacturer supplies C or C++ code, automated tools analyze it statically.

Sometimes these tools need support. In the case of a non-trivial while loop, a machine-readable comment helps, which states, for example, that a counter is reduced in each run. The tool then only checks whether the counter decreases and recognizes the termination.

This approach is close to the code that is actually executed. It’s the gold standard, but it only proves what it proves: that the driver doesn’t crash, not that it does what it’s supposed to.

Model, specification and code are different things

Correctness has several points of reference. Anyone who writes down an abstract model of the algorithm and implements it in a formal tool is not checking the real code, but the model. Nevertheless, this is usually helpful because many errors start in the specification.

Everyone knows the pattern: a user story sounds simple until you think about what should come out in the borderline cases. A well thought-out model uncovers these gaps before code is created.

The furthest level is the pure specification, formulated in natural language and supplemented by a few formulas. This also sharpens understanding, even if the implementation may ultimately deviate from the model.

Bulletproof does not mean invulnerable

Formal methods are not silver bullets. They do not replace every other measure, and not everything is automatically proven and beautiful afterwards. They are an enrichment for better understanding the system and arriving at a viable specification.

The verified C compiler CompCert shows the limit in concrete terms. The compiler core is verified. The translation at the input and the step to real machine code at the output are not formalized. These edges can be secured with compiler fuzzing, which sends randomized C programs through, or with a second reference implementation for comparison.

My head is still not protected, but my whole torso, a very large area, is covered. This makes me safer than without it.

Bianca Lutz

The image of the bulletproof vest is a good one: it covers a large area, not the whole body. Verification and testing complement rather than exclude each other.

A simple implementation alongside the complex

Performant implementations are rarely obviously correct. There are two ways: either a system is so simple that there are obviously no errors, or so complex that there are no obvious errors. Usually you need the complex variant because performance or other requirements demand it.

This leads to a pragmatic technique. Place a simple, slow implementation next to the complex, fast one and compare the two. If the comparison run takes an hour, this is bearable as long as the high-performance variant is proven to work correctly.

What changes to the verified system cost

The cost of a change depends on where it starts. If only one implementation detail changes without changing the guarantees, such as a more efficient algorithm within a module, a local adjustment is usually sufficient.

This is exacerbated by the lack of modularity of the proofs. In contrast to software, there is no real information hiding because you have to be able to talk about the internals. A data structure that you want to prove something about later cannot be hidden behind a loose coupling.

If you change a basic definition, you can feel it through the entire proof base. Bianca reports that it took a year of work to update all proofs to the new language after such a change.

A concrete C problem is the shared memory. Because in principle anyone can write to any memory in C, it must first be proven that module A does not write to the area of module B. For this, there is separation logic. This is where the separation logic with separate heap areas comes in. If the pointer structure changes, the separation proofs must be repeated. This problem does not even occur in a functional language.

AI helps with proofs without you having to believe it

Machine learning has been supporting proofs for over ten years, long before it was called AI. In Isabelle, the Sledgehammer tool helps to write proofs faster.

Sledgehammer searches for relevant facts from the context, similar to an IDE that suggests a suitable method, and throws them into automatic proof tools. Machine learning recognizes which facts work surprisingly often and suggests them more often. The main task is to prevent the point at which the computer stops the search because otherwise it won’t finish.

The decisive advantage: Proving is the ideal use case for AI because the result is appraised independently. Hallucinations are not a problem. If a generated proof is not correct, the proof checker rejects it and you try again.

Isabelle does not believe her own AI.

Bianca Lutz

No matter what crude stuff falls out of a language model, it will be checked again anyway. This is exactly why you can go wild in the search for evidence without hesitation: The safety net checks every suggestion against the logic of the system.

Frequently Asked Questions

Formal methods are systematic approaches to the specification, verification and validation of software. They include techniques such as model checking, formal specifications and theorematic checks. The benefit of these methods lies in the improvement of software quality, as they detect errors at an early stage and increase reliability. Through precise mathematical models, they enable clear communication between developers and stakeholders, which reduces misunderstandings and makes the development process more efficient.

Formal methods are crucial for software correctness as they use mathematical approaches to identify and correct errors in software. Through precise specifications and verification, they ensure that programs only perform the intended functions. This reduces the risk of bugs and increases reliability. Formal methods also enable systematic analysis and improve understanding of the software architecture. Overall, they make a significant contribution to quality assurance in software development.

Frequently used formal methods for software verification are model checking, theorem proving and abstract interpretation. These methods make it possible to analyze programs mathematically and systematically identify errors. Model checking checks models for certain properties, while theorem proving provides evidence for the correctness of programs. Abstract Interpretation analyzes programs by approximating their behavior. The use of formal methods increases the reliability of software, especially in safety-critical applications, as they enable precise and complete verification.

The main differences between model checking and theorem proving in the formal methods lie in their approach. Model checking systematically examines models to ensure that properties are satisfied, checking all possible states. Theorem Proving, on the other hand, uses logical deduction to derive properties from axioms and often requires manual intervention. While model checking is automated, theorem proving requires more interaction from the user. In addition, model checking is efficient for finite systems, while theorem proving can also be used for infinite structures.

When using formal methods, practical challenges arise such as the high level of effort required for modeling and specification, which is often time-consuming. In addition, understanding formal methods requires specialist knowledge, which makes it difficult to train employees. Complex systems can be difficult to verify, leading to incomplete analysis. Integrated development environments are often inadequate, which hinders the use of the methods. Finally, acceptance within the team can vary, which makes implementation even more difficult.

Formal methods are systematic approaches to the specification, verification and validation of software and systems. They use mathematical models to check the correctness of system behavior. Model checking is a key technique in this context, which automatically examines all possible states of a system to ensure that certain properties are fulfilled. This allows errors to be detected and eliminated at an early stage, which increases the reliability of software.

Formal methods are particularly advantageous in safety-critical software projects such as aviation, medical technology or the automotive industry. In these areas, freedom from errors is crucial, as mistakes can have serious consequences. Formal methods also increase the reliability and traceability of software in highly regulated areas such as the financial sector. They also support the development of complex systems where requirements need to be precisely defined and verified. Their application leads to higher software quality and security.

Formal methods use specialized software tools such as Coq, Isabelle, and Alloy to verify and model software. These tools help to create mathematically precise specifications and identify errors early on in the development process. In addition, tools such as TLA+ and Spin provide support for model checking and temporal logics. By using such software tools, developers significantly improve the reliability and security of their software projects, as formal methods work systematically and rigorously.

Temporal logics are formal methods that formulate statements about the temporal aspects of systems. They make it possible to derive behavior over time by using steps such as always, sometimes or until. Temporal logics are used in verification to ensure that systems fulfill certain temporal properties such as safety and liveness. This allows errors in software and hardware to be detected and rectified at an early stage, which increases the reliability of the systems.

Formal methods are mathematical techniques for specifying, verifying and analyzing software and systems. They provide precise rules and models to prove the correctness of programs and detect errors at an early stage. The use of formal languages increases clarity and comprehensibility. Formal methods help to understand complex systems and provide a secure foundation for critical applications, for example in aviation or medical technology.

Share this page

Related Posts