Data and data processes
The generation of test data poses challenges that many companies face in software development. The need to create complex, privacy-compliant test data is becoming increasingly urgent, especially in light of new regulatory requirements such as the GDPR. One approach combines basic research with practical application to generate structured data through automated models. This not only ensures the quality and integrity of the test data but also enables efficient error identification. The introduction of electronic invoices shows the potential to optimize existing systems and identify possible sources of error at an early stage.
In this episode, I talk to Dominic Steinhöfel about a topic that affects many software projects: test data. As the founder of a start-up that specializes in generating synthetic test data, he has a fresh perspective. We discuss how his methodology works and the benefits it brings to companies. It becomes clear that the topic of test data is not only important, but also extremely complex.
"Testing has always been important, and it's somehow becoming more and more important, especially now with AI." - Dominic Steinhöfel
Dominic Steinhöfel is CEO and co-founder of InputLab, an innovative spin-off of the CISPA Helmholtz Center for Information Security. InputLab develops synthetic test data generated from existing specifications such as XML or JSON schemas to make software testing more efficient and effective. InputLab's data-driven solutions help to discover new bugs without using sensitive customer data - a fully GDPR-compliant approach that guarantees both security and flexibility.
Synthetic test data are artificially generated data sets that are used to test software and systems. They are created through targeted data modelling and simulation of realistic but anonymized data sets.
Synthetic test data enables extensive testing without the use of sensitive production data. This test data plays a central role in modern software development, as it streamlines test automation and helps to detect data errors.
The creation requires a deep understanding of the data structures as well as the legal data protection requirements. A well thought-out data architecture can help overcome these challenges.
Production data often contains personal information whose use in a test environment is legally problematic. Synthetic data ensures data privacy while fulfilling complex requirements for validity and structure.
In software development and system testing, they are used to test functions under realistic conditions, detect errors and test robustness against extreme situations.
The ability to generate test data accurately forms an important basis for secure and reliable IT systems. Effective test data management is therefore essential for companies to successfully meet this challenge.
Test data generation faces several significant challenges, especially when it comes to complex data requirements such as electronic invoices. The complexity arises from the variety of formats and specific regulations that must be met in order to generate realistic and valid test cases.
Manual data generation is a time-consuming and costly process. Employees have to create a large amount of test data with different properties, which not only ties up resources but also increases sources of error. In such scenarios, the role of the test manager in agile projects becomes particularly important, as they play a key role in quality assurance and efficient team collaboration.
Legal requirements for data privacy and anonymization pose further hurdles. Personal or sensitive data from production systems may not simply be used for testing. Although this protects privacy, it makes it more difficult to use real data for realistic tests.
The use of production data often leads to problems: In addition to data protection concerns, undetected errors can occur in the system if the test data does not reflect all relevant boundary conditions or extreme situations. Synthetic test data therefore offers a safe alternative that can be tailored precisely to the requirements.
There are various methods for generating synthetic test data that can be used depending on the use case and requirements. Here are some of the most common methods:
One way to generate test data is to use format descriptions such as XML or JSON. These formats provide a clear structure and definition for the data to be generated. By using tools or libraries that support these formats, developers can automatically generate synthetic data that conforms to the defined schemas.
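As a minimal sketch of this idea, the following Python snippet generates records from a small, hand-written schema description. The schema, its field names, and the generation rules are illustrative assumptions (a simplified stand-in for a real JSON Schema), not part of any standard or of InputLab's actual approach:

```python
import json
import random
import string

# Hypothetical, simplified schema description for an invoice record.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_id": {"type": "string", "pattern_length": 8},
        "amount": {"type": "number", "minimum": 0.01, "maximum": 10000.0},
        "currency": {"type": "string", "enum": ["EUR", "USD", "HUF"]},
    },
}

def generate_from_schema(schema: dict, rng: random.Random) -> dict:
    """Generate one synthetic record that conforms to the schema sketch."""
    record = {}
    for name, spec in schema["properties"].items():
        if "enum" in spec:
            # Pick one of the allowed values.
            record[name] = rng.choice(spec["enum"])
        elif spec["type"] == "number":
            # Draw a value inside the declared range.
            record[name] = round(rng.uniform(spec["minimum"], spec["maximum"]), 2)
        else:
            # Random alphanumeric string of the declared length.
            record[name] = "".join(
                rng.choices(string.ascii_uppercase + string.digits,
                            k=spec.get("pattern_length", 10)))
    return record

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed: reproducible test data
    print(json.dumps(generate_from_schema(INVOICE_SCHEMA, rng), indent=2))
```

A fixed random seed is deliberate here: reproducible test data makes failing test cases easy to replay.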
Mathematical models can also be used to simulate a wide variety of data structures. By applying statistical or probabilistic methods, developers can generate realistic test data that exhibit certain distributions or patterns. This is particularly useful when it comes to generating large amounts of data or testing specific scenarios.
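To illustrate the statistical approach, the sketch below draws synthetic invoice amounts from a log-normal distribution, whose right-skewed shape is a common assumption for payment data. The distribution choice and its parameters are illustrative, not taken from the episode:

```python
import random
import statistics

def simulate_amounts(n: int, seed: int = 1) -> list:
    """Draw n synthetic invoice amounts from a log-normal distribution.

    The log-normal shape (many small amounts, a long tail of large ones)
    mimics real payment data; mu and sigma are illustrative assumptions.
    """
    rng = random.Random(seed)
    return [round(rng.lognormvariate(4.0, 1.0), 2) for _ in range(n)]

if __name__ == "__main__":
    sample = simulate_amounts(1000)
    # The median of a log-normal(mu=4, sigma=1) is exp(4), about 54.6.
    print(round(statistics.median(sample), 2))
```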
For applications that work with databases, test data can also be generated directly from existing database schemas. The structures and relationships of the tables are analyzed and corresponding synthetic data is generated. This ensures that the test data is both structurally and semantically correct and enables realistic test scenarios.
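A minimal version of this can be sketched with SQLite, whose `PRAGMA table_info` statement exposes each column's name and declared type. The type-to-generator mapping below is a deliberately simple assumption (real tools would also honor constraints and foreign keys):

```python
import random
import sqlite3

def synthesize_rows(conn: sqlite3.Connection, table: str, n: int, seed: int = 0):
    """Inspect a table's schema and generate n rows matching the column types."""
    rng = random.Random(seed)
    # PRAGMA table_info yields (cid, name, type, notnull, default, pk) per column.
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    rows = []
    for i in range(n):
        row = []
        for _cid, name, ctype, *_rest in columns:
            ctype = ctype.upper()
            if ctype == "INTEGER":
                row.append(rng.randint(1, 10_000))
            elif ctype == "REAL":
                row.append(round(rng.uniform(0.01, 10_000.0), 2))
            else:  # TEXT and anything else: a labeled placeholder string
                row.append(f"{name}_{i}")
        rows.append(tuple(row))
    return rows

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (id INTEGER, amount REAL, customer TEXT)")
    for row in synthesize_rows(conn, "invoices", 3):
        print(row)
```

Because the generator reads the live schema, the synthetic rows stay structurally valid even when columns are added or retyped.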
Synthetic test data is characterized by a high data protection compliance, as it contains anonymized information and therefore does not allow any conclusions to be drawn about real persons. This is particularly important in order to comply with legal requirements and data protection guidelines such as the GDPR.
The generated data covers a wide variety of structures: it not only reflects standard data patterns but also integrates complex structural subtleties and extreme situations. Such edge cases are essential for meaningful tests, as they check the robustness of systems against unusual or incorrect inputs.
Another strength lies in the ability to create controlled specification violations: developers can deliberately incorporate deviations from specifications into the test data. This hardens systems against unexpected or incorrect inputs and uncovers vulnerabilities at an early stage.
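A controlled violation can be produced by mutating a known-valid record so that exactly one rule is broken, while keeping a label naming the broken rule for the test report. The record layout and the three rules below are illustrative assumptions:

```python
import copy
import random

# A known-valid record (hypothetical example, not a real invoice format).
VALID_INVOICE = {"invoice_id": "INV-001", "amount": 120.50, "currency": "EUR"}

def inject_violation(record: dict, rng: random.Random):
    """Return a copy of the record with one deliberate specification
    violation, plus a label naming the violated rule."""
    mutated = copy.deepcopy(record)  # never modify the valid original
    rule = rng.choice(["negative_amount", "unknown_currency", "missing_field"])
    if rule == "negative_amount":
        mutated["amount"] = -abs(mutated["amount"])   # amounts must be positive
    elif rule == "unknown_currency":
        mutated["currency"] = "XXX"                   # not an allowed code
    else:
        del mutated["invoice_id"]                     # required field removed
    return mutated, rule
```

A system under test is then expected to reject each mutated record; the returned rule label tells the tester which check failed to fire.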
The combination of data protection security, structured diversity and targeted error induction makes synthetic test data an indispensable tool in the quality assurance of modern software systems.
In practice, synthetic test data is used especially in areas such as electronic invoicing and financial transactions. These applications show how versatile its uses are.
The concept of a service for the automated provision of synthetic test data, also known as Test Data as a Service (T-DAS), offers an efficient solution for companies that rely on high-quality test data. These services allow customers to customize their data requirements via self-service platforms and receive tailored test data.
By using these technological implementations, companies can save time and resources as the generation of synthetic test data is automated and adaptable. This helps to make the process of software development and system testing more efficient and reliable.
Synthetic test data plays a central role in system robustness testing, which goes beyond classic penetration tests. They make it possible to map scenarios that rarely or never occur in real production data, such as extreme edge cases or specific deviations from specifications. This allows vulnerabilities in the system to be identified at an early stage before they lead to errors in productive use.
Extensive validation is essential to ensure the quality of the test data before it is used. Validation includes structural checks as well as checks of data consistency and compliance with the defined specifications. Only such careful checks ensure that the synthetic data represents realistic test scenarios and delivers valid results.
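Such a validation step can be as simple as a function that checks each generated record and returns a list of violations (an empty list meaning "valid"). The fields and rules here are the same illustrative assumptions as above, not a real specification:

```python
def validate_record(record: dict) -> list:
    """Check a synthetic invoice record against illustrative rules;
    return a list of violation messages (empty list = valid)."""
    errors = []
    # Structural check: required fields must be present.
    for field in ("invoice_id", "amount", "currency"):
        if field not in record:
            errors.append(f"missing field: {field}")
    # Consistency check: amount must be a positive number.
    if "amount" in record and not (
        isinstance(record["amount"], (int, float)) and record["amount"] > 0
    ):
        errors.append("amount must be a positive number")
    # Specification check: currency must be an allowed code.
    if "currency" in record and record["currency"] not in {"EUR", "USD", "HUF"}:
        errors.append("unknown currency code")
    return errors

if __name__ == "__main__":
    print(validate_record({"invoice_id": "INV-001", "amount": -3, "currency": "EUR"}))
```

Running every generated record through such a gate before handing it to testers keeps invalid data out of the test suite unless it was injected deliberately.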
The combination of targeted generation of validated test data and systematic robustness tests significantly increases the reliability of the software. Synthetic test data creates the basis for reliable test results and supports developers in effectively identifying and eliminating potential sources of error.
The Mustang tool plays a central role in the processing of electronic invoices in Germany, especially for XRechnung formats. It serves as a publicly available testing tool whose key functionalities include checking compliance with format requirements and syntax rules.
Despite its extensive features, the tool has limitations: its handling of the 0.05 rounding step for Hungarian currency amounts, for example, leads to valid invoices from Hungary being rejected. Such cases illustrate the need for synthetic test data in order to recognize and correct edge cases and country-specific peculiarities at an early stage.
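The Hungarian rounding issue suggests a whole class of edge cases worth generating deliberately: amounts exactly on, and just off, a cash-rounding boundary. The 0.05 step comes from the text; the helper function and the sample amounts are hypothetical illustrations, not part of the Mustang tool:

```python
from decimal import Decimal

def rounds_cleanly(amount: str, step: str = "0.05") -> bool:
    """True if the amount is an exact multiple of the cash-rounding step.

    Decimal avoids the binary-float artifacts that make 0.05 arithmetic
    unreliable with plain floats.
    """
    quotient = Decimal(amount) / Decimal(step)
    return quotient == quotient.to_integral_value()

# Amounts on and just off the rounding boundary make useful test inputs.
EDGE_AMOUNTS = ["10.00", "10.02", "10.025", "10.05", "9.975"]

if __name__ == "__main__":
    for a in EDGE_AMOUNTS:
        print(a, rounds_cleanly(a))
```

Feeding both the cleanly rounding and the off-boundary amounts into a validator exercises exactly the kind of country-specific logic where the rejection bug was observed.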
The use of the Mustang tool in conjunction with synthetically generated test data helps companies to ensure the compatibility of electronic invoice formats and avoid process interruptions in accounting.
Electronic invoicing is expected to become mandatory in various industries by 2028, with direct implications for the development and use of synthetic test data. This development shows that synthetic test data will continue to play an important role in the electronic invoicing industry in the future.
Synthetic test data are artificially generated data sets that simulate real data but do not contain any sensitive information. They are used to comply with data protection guidelines, map complex test scenarios and replace live data with secure alternatives.
The generation of test data is often characterized by complex data requirements, especially for electronic invoices. Manual data generation is time-consuming and cost-intensive. In addition, data protection regulations and anonymization requirements must be strictly observed, which makes the use of production data more difficult.
Various methods are used to generate synthetic test data, including the use of format descriptions such as XML or JSON, mathematical models to simulate different data structures and the generation of structured and valid data based on database schemas.
Synthetic test data ensures data protection compliance through anonymization and allows diverse structures, extreme values, and targeted specification violations to be mapped for robustness tests. It therefore enables more comprehensive and secure test scenarios.
Synthetic test data is used in projects for electronic invoicing and bank transfers, for example in tests with XML-based formats such as CAMT. Tools such as the Mustang tool support the validation of XRechnung formats in Germany using such data.
The mandatory introduction of electronic invoices in various industries is expected by 2028. Synthetic test data will play a central role in quality assurance and system compatibility, especially when dealing with legacy systems and data-driven test scenarios.