What is meant by a reference implementation as a test oracle?

A reference implementation as a test oracle is a reliable reference model that is used in the software test process to verify the correctness of developer code through automated test cases and to identify errors.

What role does the reference implementation play in the software test process?

The reference implementation serves as a test oracle: automated test cases feed the same inputs to the developer code and to the reference model, and any deviation between the two outputs is flagged for quality assurance.

What are the advantages of using a reference model for quality assurance?

The use of a reference model enables automated error detection, speeds up error correction and raises overall productivity through optimized test procedures.

How is the reference model implemented and applied in software development?

The reference model is used in software development for code review, supports traceability between requirements and implementation and acts as a test oracle to ensure software quality.

What future developments can be expected in the area of reference implementations as test oracles?

Future developments include the integration of AI support to optimize test techniques and the further development of the reference model to meet current and future requirements in software testing.

How are AI and new technologies influencing the future of quality assurance with reference implementations?

AI and new technologies enable improved automation and optimization of the test process, strengthening the role of reference implementations as a test oracle and making quality assurance more efficient.

Reference implementation as a test oracle

A reference implementation as a test oracle means that the tester implements the same functionality independently of the developer in a simple high-level language such as Python or Matlab. Both results are compared via back-to-back testing. This replaces manual target value calculation, simplifies reviews at code level and scales even with thousands of signals.

Key Takeaways

A reference implementation as a test oracle means that the tester rebuilds the software independently of the developer and both versions are compared in back-to-back testing.
Testers must not have access to the original source code when creating the reference implementation so that the developer’s coding errors are not unknowingly adopted.
The review of a reference implementation is a structured code review with a checklist, which requires significantly less effort than the manual recalculation of target values.
Automatically generated stimuli for structure-based code coverage can be run against the reference implementation in order to secure the last missing percentage points without creating a self-fulfilling prophecy.
With this approach, requirement changes only require a code change at one point in the reference implementation instead of manual adjustments to dozens of target values in test cases.

What is a pseudo test oracle with reference implementation?

A pseudo test oracle with reference implementation is a test approach in which the tester recreates the function to be tested a second time: independently of the developer, in a simplified form. This replica serves as a benchmark. The delivered software and the reference implementation run on the same input data, and their outputs are checked against each other.

The developer implements the production code, which later goes into the vehicle. The tester implements the same logic in parallel, but in a simple high-level language such as Python or Matlab script. If the outputs of both implementations match, the behavior is considered to conform to the specification.
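
A minimal sketch of such a comparison in Python, under stated assumptions: the function names, the signal names and the 0.5 m limit are hypothetical, and `production_warning` is only a stand-in for the delivered software, with a deliberate defect so that the comparison has something to find.

```python
from typing import Callable, Dict, List

Signals = Dict[str, float]

def reference_warning(inputs: Signals) -> Signals:
    """Hypothetical reference logic: warn when the lateral offset exceeds a limit."""
    return {"warning": 1.0 if abs(inputs["lateral_offset_m"]) > 0.5 else 0.0}

def production_warning(inputs: Signals) -> Signals:
    """Stand-in for the delivered software (assumption); it forgets the abs()."""
    return {"warning": 1.0 if inputs["lateral_offset_m"] > 0.5 else 0.0}

def back_to_back(sut: Callable[[Signals], Signals],
                 ref: Callable[[Signals], Signals],
                 stimuli: List[Signals],
                 tol: float = 1e-9) -> List[str]:
    """Run both implementations on the same stimuli and list every deviation."""
    deviations = []
    for step, inputs in enumerate(stimuli):
        actual, expected = sut(inputs), ref(inputs)
        for name in sorted(expected.keys() | actual.keys()):
            a, e = actual.get(name), expected.get(name)
            if a is None or e is None or abs(a - e) > tol:
                deviations.append(f"step {step}: {name} expected {e}, got {a}")
    return deviations

stimuli = [{"lateral_offset_m": 0.2},
           {"lateral_offset_m": 0.8},
           {"lateral_offset_m": -0.8}]
report = back_to_back(production_warning, reference_warning, stimuli)
print(report)
```

Only the step with the negative offset is reported, because that is the only input on which the two implementations disagree.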

Stefanie Leitner works for an automotive supplier in the Volkswagen Group that develops embedded software for autonomous driving and vehicle safety. This approach has been used there for over ten years to test complex safety-critical functions such as lane departure warning and emergency braking assistants.

Why manual target values no longer work with high complexity

Manual calculation of target values does not scale as soon as the number of signals to be tested reaches into the thousands. This is precisely the problem that the reference implementation solves.

The first software components alone, which verify incoming bus signals, do not check 10 or 20 signals, but 2500 to 3000. The format, validity and value range must be correct for each of these. Anyone who calculates these target values manually cannot keep up. If even a small detail changes, the tests lag behind and cannot provide early feedback.
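
To illustrate why an automated oracle scales where manual calculation does not, the following sketch derives the expected validity of each signal from a table of value ranges. The signal names and limits are invented for illustration. In a real project such a catalogue would presumably be derived from the bus description, and the actual checks may well look different.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass(frozen=True)
class SignalSpec:
    minimum: float
    maximum: float

# Hypothetical catalogue with two entries; the same loop would handle thousands.
SPECS: Dict[str, SignalSpec] = {
    "vehicle_speed_kph": SignalSpec(0.0, 300.0),
    "steering_angle_deg": SignalSpec(-720.0, 720.0),
}

def expected_validity(frame: Dict[str, Optional[float]]) -> Dict[str, bool]:
    """Return the expected validity flag for every specified signal."""
    result = {}
    for name, spec in SPECS.items():
        value = frame.get(name)  # a missing signal counts as invalid
        result[name] = (
            isinstance(value, (int, float))
            and not isinstance(value, bool)
            and spec.minimum <= value <= spec.maximum
        )
    return result

flags = expected_validity({"vehicle_speed_kph": 120.0, "steering_angle_deg": 900.0})
print(flags)
```

Adding a signal then means adding one catalogue entry rather than recalculating expected values by hand.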

The second driver is the algorithms themselves. The trajectory calculation in autonomous driving, such as the question of whether the lane is clear and an evasive maneuver is possible, consists of many successive calculation steps. For each test step, the tester would have to determine the expected value individually. Both problems led to the same solution: an automated oracle that calculates the target values instead of specifying them manually.

How the reference implementation is structured in detail

The reference implementation maps the full logic of the function, not just a dummy. At the end comes a back-to-back test, which compares the original software signal by signal against the replica.

Unlike the production code, the reference implementation does not have to go through the entire development process. Coding guidelines do not play a role, nor does optimization for resources, because the code never goes into the vehicle. Python libraries provide many functions ready-made, which speeds up implementation.

Fixed rules apply to ensure that the approach works:

Independence: Developer and tester must be two different people. Nobody builds their own model and then tests it against itself.
No access to the original code: Testers do not see the production source code at the time of implementation. This prevents them from copying solution approaches or reproducing the same coding error.
Traceability: As in the original code, the requirements are also linked in the reference implementation. This makes completeness verifiable and helps to ensure consistency with the specification.

The size of the reference implementation depends on the respective unit, typically between 100 and 1000 lines. It is deliberately kept small at unit level.

The review turns from a computational marathon into a code review

The biggest everyday gain lies in the review. Where testers used to have to recalculate every manually calculated target value, this is now a code review with a checklist.

ISO 26262 requires an inspection of test cases from ASIL B onwards. In the classic approach, this meant recalculating all manually determined values a second time: time-consuming and error-prone. With the reference implementation, the reviewer checks the code instead and can use the linked requirements to check whether the tester has interpreted them correctly, thus ensuring completeness and consistency.

How much this advantage weighs became clear in one project where the project manager shied away from the effort and went back to the classic script-based approach. By the time of the review, the whole team was complaining. The lesson learned was clear: never do it that way again.

Developers and testers start in parallel, not one after the other

As soon as the requirements are released, developers and testers start at the same time. One implements the production software and its developer tests, the other the reference implementation and the test case specification.

This is preceded by a double review of the requirements, once from the developer’s perspective and once from the tester’s perspective. A lot of the vagueness is already caught here. Ideally, both sides are completed at the same time and can go straight into test execution.

If the comparison fails later, the question arises as to which side is responsible for the error. The default: The tester first checks his own reference implementation. Only when he is sure that it behaves as specified does an error ticket go to the developer, who debugs the software.

The real benefit lies in test depth and maintainability

The approach compares every output signal at all times, not just the few signals in the scope of a single requirement. This significantly increases the depth of testing.

This is primarily used at unit and integration test level. Because a value is available for each signal at each test step, side effects emerge that would have been overlooked if only the specified output signals had been considered.
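
The difference between a requirement-scoped check and a full comparison can be sketched as follows. The signal names and the recorded values are invented, and exact equality is used only to keep the example short.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Step = Dict[str, float]

def mismatches(actual: List[Step], expected: List[Step],
               scope: Optional[Iterable[str]] = None) -> List[Tuple[int, str]]:
    """Return (step, signal) pairs that differ; scope=None compares every signal."""
    scope_names = sorted(scope) if scope is not None else None
    found = []
    for i, (a, e) in enumerate(zip(actual, expected)):
        names = scope_names if scope_names is not None else sorted(a.keys() | e.keys())
        found.extend((i, n) for n in names if a.get(n) != e.get(n))
    return found

# Expected values as the reference implementation would produce them (hypothetical).
expected = [{"brake_request": 0.0, "warning_lamp": 0.0},
            {"brake_request": 1.0, "warning_lamp": 1.0}]
# Outputs of the software under test: the lamp is lit one step too early.
actual = [{"brake_request": 0.0, "warning_lamp": 1.0},
          {"brake_request": 1.0, "warning_lamp": 1.0}]

scoped = mismatches(actual, expected, scope=["brake_request"])  # requirement-only view
full = mismatches(actual, expected)                             # every signal, every step
print(scoped, full)
```

The scoped check passes, while the full comparison exposes the side effect on the second signal.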

This also pays off in the event of changes. If a requirement changes, one line in the reference code is often enough, such as a greater than or equal to instead of greater than. After the back-to-back test, the target values are updated without anyone having to manually adjust 60 expected values in the test cases. Variants, coding and calibration data can be programmed directly: Swap data set, done.
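
A small sketch of both effects, assuming a hypothetical speed-limit requirement and two invented calibration data sets. The comparison operator lives in exactly one line, and the expected values for each variant are regenerated rather than edited by hand.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Calibration:
    speed_limit_kph: float

def reference_warning(speed_kph: float, cal: Calibration) -> bool:
    # If the requirement changed from "greater than" to "greater than or equal to",
    # this comparison would be the only place to edit.
    return speed_kph > cal.speed_limit_kph

def expected_values(stimuli: List[float], cal: Calibration) -> List[bool]:
    """Regenerate all expected values for one calibration data set."""
    return [reference_warning(s, cal) for s in stimuli]

VARIANT_A = Calibration(speed_limit_kph=60.0)  # hypothetical data sets
VARIANT_B = Calibration(speed_limit_kph=80.0)

stimuli = [50.0, 60.0, 70.0, 90.0]
expected_a = expected_values(stimuli, VARIANT_A)
expected_b = expected_values(stimuli, VARIANT_B)
print(expected_a, expected_b)
```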

Another effect showed up in structure-based coverage. Decision coverage is required from ASIL B and MC/DC from ASIL C. The last five percent up to full coverage is hard for testers to reach.

If you generate the target value, it’s a self-fulfilling prophecy. But we have our reference implementation. We take the stimuli, run them over the reference implementation and thus also get for the last five percent whether the software behaves as it should.
Stefanie Leitner

The tool only generates the stimuli from the code. The target values come from the reference implementation, otherwise the code would confirm itself.
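
This division of labor can be sketched as follows. The stimuli stand in for the output of a coverage tool, and the time-to-collision rule with its 2.0 s threshold is an invented example, not the logic used in the project.

```python
import math
from typing import Callable, Dict, List

def reference_brake(distance_m: float, speed_mps: float) -> bool:
    """Hypothetical reference logic: brake when time to collision is below 2.0 s."""
    ttc = distance_m / speed_mps if speed_mps > 0 else math.inf
    return ttc < 2.0

def attach_expected(stimuli: List[Dict[str, float]],
                    reference: Callable[..., bool]) -> List[dict]:
    """Turn bare stimuli into test cases whose expected values come from the reference."""
    return [{"inputs": s, "expected": reference(**s)} for s in stimuli]

# Stand-in for stimuli that a coverage tool derived from the production code.
generated_stimuli = [
    {"distance_m": 10.0, "speed_mps": 10.0},
    {"distance_m": 50.0, "speed_mps": 10.0},
    {"distance_m": 10.0, "speed_mps": 0.0},
]

test_cases = attach_expected(generated_stimuli, reference_brake)
print([case["expected"] for case in test_cases])
```

Because the expected values never come from the code under test, the generated test cases can still fail and are therefore meaningful.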

Initial skepticism in the team: Is the error in the code or in the model?

The biggest hurdle is not technical, but a question of trust. Initially, developers doubt whether a reported error is really in their software or in the reference model.

This skepticism can only be resolved by the upstream mechanisms: independent implementation, review, traceability. Everyone makes mistakes, and occasionally something slips through. In practice, however, the comparison helps, and when in doubt, both parties look at the reference implementation together. It often becomes clear that someone has understood the requirement differently or has forgotten something in the code.

For the developers, the reference implementation even has a secondary benefit when debugging. They can see how the tester has solved the requirement and recognize more quickly which of the two has not implemented it as intended.

What skills testers need for this approach

Testers need to be able to program, but they do not need specialist knowledge. Programming basics such as loops are sufficient as a foundation; the rest can be learned.

Many testers in the embedded environment come from a computer science background anyway, so the basics are there. Python and Matlab scripting have not caused any problems to date and can be taught via internal training courses. Instead of strict coding standards, there is a template and a few guidelines, especially for traceability. This allows one tester to quickly find their way around the work of another if someone is absent.

When the effort is not worth it

Not every function deserves a reference implementation. For very simple units with only a few mathematical calculations, the effort is often disproportionate to the benefit.

That is why there are criteria for determining when the approach makes sense and when it does not. These criteria are to be further refined. At the same time, it will be examined whether AI can accelerate the creation of the reference implementation so that the threshold is also shifted in borderline cases.