What is synthetic test data and why is it used in software development?

Synthetic test data is artificially generated data that mimics real data but contains no sensitive information. It is used to comply with data protection rules, cover complex test scenarios and replace live data with a safe alternative.

What are the challenges of generating test data, especially in the context of electronic invoices?

Generating test data often involves complex data requirements, especially for electronic invoices. Creating the data by hand is time-consuming and expensive. In addition, data protection regulations and anonymization requirements must be strictly observed, which makes it difficult to use production data.

Which methods are used to generate synthetic test data?

Various methods are used to generate synthetic test data, including format descriptions such as XML or JSON schemas, mathematical models that simulate different data structures, and the generation of structured, valid data from database schemas.

What advantages does synthetic test data offer over production data in software testing?

Synthetic test data ensures data protection compliance because it contains no real personal data. It can represent diverse structures and extreme values as well as deliberate specification violations for robustness tests. It therefore enables more comprehensive and safer test scenarios.

How is synthetic test data used in practical applications such as electronic invoicing?

Synthetic test data is used in projects for electronic invoicing and bank transfers, for example for tests with XML-based formats such as CAMT. Validation tools such as Mustang, which checks German e-invoice formats such as XRechnung, are tested with such data.

What are the future prospects for the use of synthetic test data in electronic invoicing?

The mandatory introduction of electronic invoices in various industries is expected by 2028. Synthetic test data will play a central role in quality assurance and system compatibility, especially when dealing with legacy systems and data-driven test scenarios.

Synthetic test data

Synthetic test data is machine-generated input data derived from formal schemas such as XML or JSON schemas, without using production data. It fulfills data protection requirements, represents complex structures correctly and systematically covers extreme values and structural special cases in order to find software errors that manually created test data would miss.

Key Takeaways

Generating synthetic test data from schemas solves two core problems at the same time: complex data structures can be generated automatically, and data protection requirements according to GDPR are met because no production data is required.
Errors have been found in every system that has been tested with this generated test data to date, because software that has not been tested automatically almost always contains undetected defects.
Valid electronic invoices are rejected as invalid by common implementations such as Mustang, for example due to unimplemented rounding factors for certain currencies, resulting in unpaid invoices and unexpected reminders.
Beyond sheer data volume, the approach optimizes the information content per test data package: extreme values, empty fields and structural variants are deliberately interspersed, so that even 1,000 data records can achieve high coverage.

Why test data so often becomes a bottleneck

Test data is one of the recurring problems in software projects. It causes effort, blocks pipelines and in the end often fails to deliver the insight that was hoped for. There are two main reasons for this.

The first reason is complexity. If you want to test a system with realistic inputs, you have to build data that follows an often highly complex format. One example is the electronic invoice, which is gradually becoming mandatory in the B2B sector. If you look at the underlying standard, you quickly realize that generating such data records manually is extremely time-consuming.

The second reason is data privacy. An invoice refers to people and therefore contains personal data. Under the GDPR, such data cannot simply be used for testing. The usual reaction is to alter the data: blank out fields, replace values with generic placeholders and shuffle records.

This is exactly where a dilemma arises. If you obfuscate too little, you are on legally thin ice. If you obfuscate heavily, you will hardly find any new errors, because the system has already seen the data in a similar form. Data that contains nothing new does not trigger any new behavior.

Synthetic test data from schemas instead of production data

An alternative approach generates test data directly from schemas, not from production data. It builds on descriptions of structured data formats such as XML schemas, JSON schemas or table structures from databases. From such a description, large volumes of data can be generated that look correct and have the necessary integrity.
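As a rough illustration of the idea, the following sketch generates one record from a tiny subset of JSON Schema. It is not the tool described here: the `generate` function, the supported keywords and the `INVOICE_SCHEMA` example are assumptions invented for this sketch, and a real generator would cover far more of the schema language.

```python
import random
import string


def generate(schema, rng):
    """Return one random value that conforms to a small JSON Schema subset."""
    if "enum" in schema:
        return rng.choice(schema["enum"])
    kind = schema["type"]
    if kind == "object":
        return {name: generate(sub, rng) for name, sub in schema["properties"].items()}
    if kind == "array":
        count = rng.randint(schema.get("minItems", 0), schema.get("maxItems", 3))
        return [generate(schema["items"], rng) for _ in range(count)]
    if kind == "string":
        length = rng.randint(schema.get("minLength", 0), schema.get("maxLength", 12))
        return "".join(rng.choice(string.ascii_letters) for _ in range(length))
    if kind == "integer":
        return rng.randint(schema.get("minimum", 0), schema.get("maximum", 1000))
    raise ValueError(f"unsupported type: {kind}")


# Hypothetical invoice schema, invented for illustration only.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_id": {"type": "string", "minLength": 5, "maxLength": 10},
        "currency": {"enum": ["EUR", "HUF"]},
        "lines": {
            "type": "array",
            "minItems": 1,
            "maxItems": 4,
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string", "minLength": 1},
                    "quantity": {"type": "integer", "minimum": 1, "maximum": 50},
                    "unit_price_cents": {"type": "integer", "minimum": 0, "maximum": 100000},
                },
            },
        },
    },
}

# A fixed seed keeps the generated record reproducible between test runs.
invoice = generate(INVOICE_SCHEMA, random.Random(42))
print(invoice)
```

Only the schema enters the generator, so no production record is ever read. That is the property the data protection argument relies on.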

The advantage lies in data privacy. If you only need the description of a format, you don’t need any production data. This means that the process is fully compliant with data protection regulations because no real personal data is involved. This distinguishes this approach from methods that train AI models on live data.

This difference is more than just a technical nuance. If a model is trained on real data, it remains opaque what ultimately flows into the generated test data, especially at large scale. If you never process production data in the first place, you cannot make this mistake.

We don’t need production data, we just need a schema. That’s why we can’t get into mischief.
Dominic Steinhöfel

How a format description becomes test data

In essence, a schema is translated into a formal model on which the generation takes place. This model describes the syntax of the data format and, at the same time, all kinds of restrictions. Any additional requirements are encoded into this model, and the test data is then produced by solving the model.

Additional conditions can be added on top of the pure schema. For example, a field must have a certain value, or a name must have a minimum length. The user can specify such requirements, and they are incorporated into the generation process.

For XML, complex constraints can also be expressed as so-called Schematron rules and solved. A concrete example: the individual items of an invoice must add up to the correct total, including the rounding factor and tax rate. Such calculation conditions, totals and maximums can be resolved to a certain extent.

The hurdle lies not in the technology but in usability. Mastering Schematron is demanding and cannot be expected of ordinary users. The open task is to close the gap between a simple requirement such as “the IBAN should be valid” and the formal rule behind it. At the moment, such requirements are still translated by hand into post-processing steps or Schematron rules, in direct contact with the pilot partners.
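The totals condition can be sketched as follows. This is a minimal illustration, not a Schematron engine: the invoice is built so that the constraint holds by construction, and a separate check plays the role of the rule. The function names, the 19 percent tax rate and the dictionary layout are assumptions made for this sketch. The permitted deviation of 0.05 is the rounding factor for totals that the article mentions.

```python
import random
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def net_total(lines):
    """Sum of quantity times unit price over all line items."""
    return sum((line["quantity"] * line["unit_price"] for line in lines), Decimal("0"))


def build_invoice(rng, tax_rate=Decimal("0.19")):
    """Generate random line items and derive totals that satisfy the constraint."""
    lines = [
        {"quantity": rng.randint(1, 10), "unit_price": Decimal(rng.randint(1, 99999)) / 100}
        for _ in range(rng.randint(1, 5))
    ]
    net = net_total(lines)
    tax = (net * tax_rate).quantize(CENT, rounding=ROUND_HALF_UP)
    return {"lines": lines, "tax_rate": tax_rate, "net": net, "tax": tax, "gross": net + tax}


def totals_consistent(invoice, tolerance=Decimal("0.05")):
    """Check that the stated gross total matches the line items within the tolerance."""
    net = net_total(invoice["lines"])
    expected_gross = net + net * invoice["tax_rate"]
    return abs(invoice["gross"] - expected_gross) <= tolerance


invoice = build_invoice(random.Random(1))
# A deliberately broken copy, as one would use for a robustness test.
broken = dict(invoice, gross=invoice["gross"] + Decimal("1.00"))
print(totals_consistent(invoice), totals_consistent(broken))
```

`Decimal` is used instead of `float` so that cent amounts stay exact. With binary floating point, the check itself could introduce the kind of rounding error it is meant to detect.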

What constitutes good coverage for test data

Mass alone does not make good test data. The decisive factor is how much a data package covers. This is why we track what has already been varied in order to accommodate as much information as possible per data package instead of blindly generating thousands of similar data sets.

Coverage guarantees can be derived from this. One statement can be that all structural subtleties of a format have been tested, or in the case of very large formats, that around 70 percent of them have been tested. This measurability raises the quality of the test data beyond mere quantity generation.
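One way to picture this tracking is a greedy selection that keeps a record only if it covers a structural variant not seen before. The sketch below is an assumption about how such bookkeeping could look, not the method actually used. The `VARIANTS` table and all function names are hypothetical.

```python
import random

# Hypothetical structural variants of an invoice format (assumption for illustration).
VARIANTS = {
    "currency": ["EUR", "HUF", "USD"],
    "payment_terms": ["present", "absent"],
    "line_count": ["one", "many"],
}
ALL_TARGETS = {(field, value) for field, values in VARIANTS.items() for value in values}


def random_record(rng):
    """Pick one variant per field at random."""
    return {field: rng.choice(values) for field, values in VARIANTS.items()}


def select_informative(rng, budget, candidates_per_round=20, max_rounds=100):
    """Keep only records that add coverage; return them with the coverage ratio."""
    covered, chosen = set(), []
    for _ in range(max_rounds):
        if len(chosen) >= budget or covered == ALL_TARGETS:
            break
        candidates = [random_record(rng) for _ in range(candidates_per_round)]
        # Prefer the candidate that covers the most variants not yet seen.
        best = max(candidates, key=lambda rec: len(set(rec.items()) - covered))
        gain = set(best.items()) - covered
        if gain:
            chosen.append(best)
            covered |= gain
    return chosen, len(covered) / len(ALL_TARGETS)


chosen, coverage = select_informative(random.Random(0), budget=1000)
print(f"{len(chosen)} records cover {coverage:.0%} of the structural variants")
```

The returned ratio is the kind of number behind a statement such as "around 70 percent of the structural variants have been tested". For a real format, the set of targets would be derived from the schema rather than written by hand.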

Practical experience with quantity is instructive. Not everyone wants huge amounts of data, because every input has to run through the team's own pipeline and every output has to be analyzed. Some teams name a thousand inputs as their maximum. Those who generate few inputs that cover a lot save themselves from having to analyze mountains of redundant data.

Test methodology: Randomness and extreme values belong together

Good test data mixes randomness with targeted extreme values. Purely controlled data overlooks errors, which is why chance is a kind of superpower. At the same time, targeted borderline cases are needed, because that’s often where the bugs are.

For XML, a domain analysis of the primitive data types was carried out in order to inject specific extreme values. Classic boundary value analysis is covered in this way: fields that were not previously empty are set to empty, and a floating point value that was not yet infinite is set to infinity.

Security is part of the approach, but not its core. Within the constraints of a format, documented attack patterns can be built into string fields, such as SQL injections or XML attacks from catalogs like those of OWASP. However, the focus is on robustness: deliberately causing a system to crash in testing so that this does not happen in production.

The approach thus occupies its own niche between two poles. It tests what developers don’t think about and what pentesters don’t cover because they are looking specifically for backdoors. Pentesters and their tools remain justified, but what lies in between often remains untested.
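The boundary value idea can be sketched by taking one valid record and replacing a single field at a time with an extreme value for its type. The sketch uses Python types as a stand-in for the primitive XML Schema types. The `EXTREMES` table and the function name are assumptions for illustration.

```python
import copy
import math

# Extreme values per primitive type; an illustrative selection, not a complete catalog.
EXTREMES = {
    str: ["", " ", "x" * 10_000],
    int: [0, -1, 2**31 - 1, 2**63],
    float: [0.0, -0.0, math.inf, -math.inf, math.nan],
}


def boundary_variants(record):
    """Yield (field, variant) pairs, each differing from the record in one field."""
    for field, value in record.items():
        for extreme in EXTREMES.get(type(value), []):
            variant = copy.deepcopy(record)
            variant[field] = extreme
            yield field, variant


base = {"name": "ACME", "quantity": 3, "amount": 19.99}
variants = list(boundary_variants(base))
print(len(variants), "boundary variants generated from one valid record")
```

Changing exactly one field per variant keeps failures easy to attribute: if the system under test crashes, the field that was changed is the prime suspect.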
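A sketch of how attack patterns could be placed into string fields while still respecting a format's length limits might look like this. The payload list is a small, widely documented sample chosen for illustration, not an OWASP catalog. The `max_lengths` parameter and the function name are assumptions.

```python
# A few well-known probe strings; real catalogs are much larger.
PAYLOADS = [
    "' OR '1'='1",
    "'; DROP TABLE invoices; --",
    "<![CDATA[<script>alert(1)</script>]]>",
]


def inject_payloads(record, max_lengths):
    """Yield (field, variant) pairs with one string field replaced by a payload.

    Payloads longer than the field's maximum length are skipped, so each
    variant still satisfies the length constraint of the format.
    """
    for field, value in record.items():
        if not isinstance(value, str):
            continue
        limit = max_lengths.get(field)
        for payload in PAYLOADS:
            if limit is None or len(payload) <= limit:
                yield field, {**record, field: payload}


record = {"buyer": "ACME", "note": "thanks", "quantity": 2}
attack_variants = list(inject_payloads(record, max_lengths={"buyer": 12}))
print(len(attack_variants), "variants carry an attack pattern")
```

Because the variants stay within the format's constraints, they pass structural validation and reach the code behind it. That is where such a string can actually cause harm.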

Why every untested system contains bugs

A pragmatic rule of thumb: if you run synthetic test data against a system that has never been tested automatically, you will find errors. In every system tested so far, errors have been found. If none are found, the software is usually simply not interesting enough.

This is true even with high quality standards. A pilot partner from the telecommunications sector is testing an XML protocol for number portability that used to run by fax. Despite a standard far above the norm, there are cases where the team realizes that something should not have happened.

Electronic invoices: a practical case with real damage

Electronic invoices show how expensive poorly tested processing can be. The publicly available Mustang tool from the ZUGFeRD ecosystem was one of the tools tested. ZUGFeRD is one of the two German formats for electronic invoices, alongside XRechnung.

The errors found range from minor inaccuracies to crashes or the rejection of valid invoices. A concrete example: a rounding factor of 0.05 is permitted for totals. This factor is significantly higher for the Hungarian forint because the currency is worth less. This is exactly what was not implemented.

The result is real damage in day-to-day business. A valid invoice from Hungary is dropped by the recipient system without a response. The invoicing party does not receive its money on time. The recipient receives a reminder instead and doesn’t understand why, even though everything should have worked.

Such cases are becoming increasingly critical because electronic invoices are gradually becoming mandatory. In the B2B sector, they will be mandatory by 2028. Anyone who sends or receives invoices wants to be spared this kind of trouble. The new process should work at least as reliably as the old one.

Dependencies, references and protocols as the next step

Complex test data needs more than isolated values. Formal modeling can be used to encode keys, dependencies and references. In XML, for example, there are ID fields that must be consistent with each other, and it is precisely these relationships that can be resolved.
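The ID relationship can be sketched with Python's standard XML library: references are drawn only from IDs that already exist in the document, and a small check reports dangling references. The element and attribute names (`Document`, `Party`, `Line`, `partyRef`) are invented for this sketch and do not come from any invoice standard.

```python
import random
import xml.etree.ElementTree as ET


def build_document(rng, party_count=3, line_count=5):
    """Create a document whose Line elements reference existing Party IDs."""
    root = ET.Element("Document")
    party_ids = []
    for i in range(party_count):
        party_id = f"party-{i}"
        ET.SubElement(root, "Party", id=party_id)
        party_ids.append(party_id)
    for i in range(line_count):
        # The reference is chosen from known IDs, so it is valid by construction.
        ET.SubElement(root, "Line", id=f"line-{i}", partyRef=rng.choice(party_ids))
    return root


def dangling_references(root):
    """Return all partyRef values that do not point to an existing id."""
    ids = {el.get("id") for el in root.iter() if el.get("id") is not None}
    return [el.get("partyRef") for el in root.iter("Line") if el.get("partyRef") not in ids]


root = build_document(random.Random(7))
print(ET.tostring(root, encoding="unicode"))
print("dangling references:", dangling_references(root))
```

The same check works in the other direction: setting one `partyRef` to an unknown ID yields a deliberately inconsistent document for robustness tests.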

Protocols present their own difficulties. There are dependencies between different packets, which makes protocol testing a challenging field with a lot of research behind it. What is implemented next depends on the specific needs of the pilot partners.

The range of practical cases shows how different the requirements are. The systems under test include extensive OpenAPI specifications for cloud services as well as the CAMT format from cash management, which also covers SEPA credit transfers and is conveniently an XML format as well.

When there is more logic in data than intended

Disclosing schemas remains a tricky point. The approach is conceived as test data as a service, for which the schema must come from the customer. Confidentiality is governed by NDAs, and there are no plans to run the solution in the customer’s in-house infrastructure.

The argument: the data format, i.e. structure and requirements, is transferred, not the real data, which usually contains the actual business secrets. In many cases, this significantly lowers the hurdle.

However, this dividing line cannot always be drawn cleanly. Particularly in legacy systems that have grown over many years, there is sometimes more logic in the data structure than one would like. Anyone handing schemas to an external party should check in advance how much domain knowledge the mere format description already reveals.