Data and data processes
The generation of test data poses challenges that many companies face in software development. The need to create complex, privacy-compliant test data is becoming increasingly urgent, especially in light of new regulatory requirements such as the GDPR. One approach combines basic research with practical application to generate structured data through automated models. This not only ensures the quality and integrity of the test data but also enables efficient error identification. The introduction of electronic invoices shows the potential to optimize existing systems and identify possible sources of error at an early stage.
In this episode, I talk to Dominic Steinhöfel about a topic that affects many software projects: test data. As the founder of a start-up that specializes in generating synthetic test data, he has a fresh perspective. We discuss how his methodology works and the benefits it brings to companies. It becomes clear that the topic of test data is not only important, but also extremely complex.
"Testing has always been important, and it's somehow becoming more and more important, especially now with AI." - Dominic Steinhöfel
Dominic Steinhöfel is CEO and co-founder of InputLab, an innovative spin-off of the CISPA Helmholtz Center for Information Security. InputLab develops synthetic test data generated from existing specifications such as XML or JSON schemas to make software testing more efficient and effective. InputLab's data-driven solutions help to discover new bugs without using sensitive customer data - a fully GDPR-compliant approach that guarantees both security and flexibility.
Synthetic test data are artificially generated data sets that are used to test software and systems. They are created through targeted data modelling and simulation of realistic but anonymized data sets.
Synthetic test data enables extensive testing without the use of sensitive production data. This test data plays a central role in modern software development, as it streamlines test automation and helps to detect data errors.
The creation requires a deep understanding of the data structures as well as the legal data protection requirements. A well thought-out data architecture can help overcome these challenges.
Production data often contains personal information whose use in a test environment is legally problematic. Synthetic data ensures data privacy while fulfilling complex requirements for validity and structure.
In software development and system testing, they are used to test functions under realistic conditions, detect errors and test robustness against extreme situations.
The ability to generate test data accurately forms an important basis for secure and reliable IT systems. Effective test data management is therefore essential for companies to successfully meet this challenge.
Test data generation faces several significant challenges, especially when it comes to complex data requirements such as electronic invoices. The complexity arises from the variety of formats and specific regulations that must be met in order to generate realistic and valid test cases.
Manual data generation is a time-consuming and costly process. Employees have to create a large amount of test data with different properties, which not only ties up resources but also increases sources of error. In such scenarios, the role of the test manager in agile projects becomes particularly important, as they play a key role in quality assurance and efficient team collaboration.
Legal requirements for data privacy and anonymization pose further hurdles. Personal or sensitive data from production systems may not simply be used for testing. Although this protects privacy, it makes it more difficult to use real data for realistic tests.
The use of production data often leads to problems: In addition to data protection concerns, undetected errors can occur in the system if the test data does not reflect all relevant boundary conditions or extreme situations. Synthetic test data therefore offers a safe alternative that can be tailored precisely to the requirements.
There are various methods for generating synthetic test data that can be used depending on the use case and requirements. Here are some of the most common methods:
One way to generate test data is to use format descriptions such as XML or JSON. These formats provide a clear structure and definition for the data to be generated. By using tools or libraries that support these formats, developers can automatically generate synthetic data that conforms to the defined schemas.
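As a minimal sketch of this idea, the following Python snippet generates records from a small, hand-written schema description. The schema, its field names, and the generation rules are illustrative assumptions (a simplified stand-in for a real JSON Schema), not part of any standard or of InputLab's actual approach:

```python
import json
import random
import string

# Hypothetical, simplified schema description for an invoice record.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_id": {"type": "string", "pattern_length": 8},
        "amount": {"type": "number", "minimum": 0.01, "maximum": 10000.0},
        "currency": {"type": "string", "enum": ["EUR", "USD", "HUF"]},
    },
}

def generate_from_schema(schema: dict, rng: random.Random) -> dict:
    """Generate one synthetic record that conforms to the schema sketch."""
    record = {}
    for name, spec in schema["properties"].items():
        if "enum" in spec:
            # Pick one of the allowed values.
            record[name] = rng.choice(spec["enum"])
        elif spec["type"] == "number":
            # Draw a value inside the declared range.
            record[name] = round(rng.uniform(spec["minimum"], spec["maximum"]), 2)
        else:
            # Random alphanumeric string of the declared length.
            record[name] = "".join(
                rng.choices(string.ascii_uppercase + string.digits,
                            k=spec.get("pattern_length", 10)))
    return record

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed: reproducible test data
    print(json.dumps(generate_from_schema(INVOICE_SCHEMA, rng), indent=2))
```

A fixed random seed is deliberate here: reproducible test data makes failing test cases easy to replay.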
Mathematical models can also be used to simulate a wide variety of data structures. By applying statistical or probabilistic methods, developers can generate realistic test data that exhibit certain distributions or patterns. This is particularly useful when it comes to generating large amounts of data or testing specific scenarios.
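To illustrate the statistical approach, the sketch below draws synthetic invoice amounts from a log-normal distribution, whose right-skewed shape is a common assumption for payment data. The distribution choice and its parameters are illustrative, not taken from the episode:

```python
import random
import statistics

def simulate_amounts(n: int, seed: int = 1) -> list:
    """Draw n synthetic invoice amounts from a log-normal distribution.

    The log-normal shape (many small amounts, a long tail of large ones)
    mimics real payment data; mu and sigma are illustrative assumptions.
    """
    rng = random.Random(seed)
    return [round(rng.lognormvariate(4.0, 1.0), 2) for _ in range(n)]

if __name__ == "__main__":
    sample = simulate_amounts(1000)
    # The median of a log-normal(mu=4, sigma=1) is exp(4), about 54.6.
    print(round(statistics.median(sample), 2))
```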
For applications that work with databases, test data can also be generated directly from existing database schemas. The structures and relationships of the tables are analyzed and corresponding synthetic data is generated. This ensures that the test data is both structurally and semantically correct and enables realistic test scenarios.
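A minimal version of this can be sketched with SQLite, whose `PRAGMA table_info` statement exposes each column's name and declared type. The type-to-generator mapping below is a deliberately simple assumption (real tools would also honor constraints and foreign keys):

```python
import random
import sqlite3

def synthesize_rows(conn: sqlite3.Connection, table: str, n: int, seed: int = 0):
    """Inspect a table's schema and generate n rows matching the column types."""
    rng = random.Random(seed)
    # PRAGMA table_info yields (cid, name, type, notnull, default, pk) per column.
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    rows = []
    for i in range(n):
        row = []
        for _cid, name, ctype, *_rest in columns:
            ctype = ctype.upper()
            if ctype == "INTEGER":
                row.append(rng.randint(1, 10_000))
            elif ctype == "REAL":
                row.append(round(rng.uniform(0.01, 10_000.0), 2))
            else:  # TEXT and anything else: a labeled placeholder string
                row.append(f"{name}_{i}")
        rows.append(tuple(row))
    return rows

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (id INTEGER, amount REAL, customer TEXT)")
    for row in synthesize_rows(conn, "invoices", 3):
        print(row)
```

Because the generator reads the live schema, the synthetic rows stay structurally valid even when columns are added or retyped.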
Synthetic test data is characterized by a high data protection compliance, as it contains anonymized information and therefore does not allow any conclusions to be drawn about real persons. This is particularly important in order to comply with legal requirements and data protection guidelines such as the GDPR.
The generated data covers a wide variety of structures: it not only reflects standard data patterns but also integrates complex structural subtleties and extreme situations. Such edge cases are essential for meaningful tests, as they check the robustness of systems against unusual or incorrect inputs.
Another strength lies in the ability to create controlled specification violations: developers can deliberately incorporate deviations from specifications into the test data. This hardens systems against unexpected or incorrect inputs and uncovers vulnerabilities at an early stage.
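A controlled violation can be produced by mutating a known-valid record so that exactly one rule is broken, while keeping a label naming the broken rule for the test report. The record layout and the three rules below are illustrative assumptions:

```python
import copy
import random

# A known-valid record (hypothetical example, not a real invoice format).
VALID_INVOICE = {"invoice_id": "INV-001", "amount": 120.50, "currency": "EUR"}

def inject_violation(record: dict, rng: random.Random):
    """Return a copy of the record with one deliberate specification
    violation, plus a label naming the violated rule."""
    mutated = copy.deepcopy(record)  # never modify the valid original
    rule = rng.choice(["negative_amount", "unknown_currency", "missing_field"])
    if rule == "negative_amount":
        mutated["amount"] = -abs(mutated["amount"])   # amounts must be positive
    elif rule == "unknown_currency":
        mutated["currency"] = "XXX"                   # not an allowed code
    else:
        del mutated["invoice_id"]                     # required field removed
    return mutated, rule
```

A system under test is then expected to reject each mutated record; the returned rule label tells the tester which check failed to fire.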
The combination of data protection security, structured diversity and targeted error induction makes synthetic test data an indispensable tool in the quality assurance of modern software systems.
In practice, synthetic test data is used especially in areas such as electronic invoicing and financial transactions. These applications show how versatile its uses are.
The concept of a service for the automated provision of synthetic test data, also known as Test Data as a Service (T-DAS), offers an efficient solution for companies that rely on high-quality test data. These services allow customers to customize their data requirements via self-service platforms and receive tailored test data.
By using these technological implementations, companies can save time and resources as the generation of synthetic test data is automated and adaptable. This helps to make the process of software development and system testing more efficient and reliable.
Synthetic test data plays a central role in system robustness testing, which goes beyond classic penetration tests. They make it possible to map scenarios that rarely or never occur in real production data, such as extreme edge cases or specific deviations from specifications. This allows vulnerabilities in the system to be identified at an early stage before they lead to errors in productive use.
Extensive validation is essential to ensure the quality of the test data before it is used. Validation includes structural checks as well as checks of data consistency and compliance with the defined specifications. Only such careful checks ensure that the synthetic data represents realistic test scenarios and delivers valid results.
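Such a validation step can be as simple as a function that checks each generated record and returns a list of violations (an empty list meaning "valid"). The fields and rules here are the same illustrative assumptions as above, not a real specification:

```python
def validate_record(record: dict) -> list:
    """Check a synthetic invoice record against illustrative rules;
    return a list of violation messages (empty list = valid)."""
    errors = []
    # Structural check: required fields must be present.
    for field in ("invoice_id", "amount", "currency"):
        if field not in record:
            errors.append(f"missing field: {field}")
    # Consistency check: amount must be a positive number.
    if "amount" in record and not (
        isinstance(record["amount"], (int, float)) and record["amount"] > 0
    ):
        errors.append("amount must be a positive number")
    # Specification check: currency must be an allowed code.
    if "currency" in record and record["currency"] not in {"EUR", "USD", "HUF"}:
        errors.append("unknown currency code")
    return errors

if __name__ == "__main__":
    print(validate_record({"invoice_id": "INV-001", "amount": -3, "currency": "EUR"}))
```

Running every generated record through such a gate before handing it to testers keeps invalid data out of the test suite unless it was injected deliberately.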
The combination of targeted generation of validated test data and systematic robustness tests significantly increases the reliability of the software. Synthetic test data creates the basis for reliable test results and supports developers in effectively identifying and eliminating potential sources of error.
The Mustang tool plays a central role in the processing of electronic invoices in Germany, especially for XRechnung formats. It serves as a publicly available testing tool whose key functionalities include checking compliance with format requirements and syntax rules.
Despite its extensive features, the tool has limitations: its handling of the 0.05 rounding step for Hungarian currency amounts, for example, leads to valid invoices from Hungary being rejected. Such cases illustrate the need for synthetic test data in order to recognize and correct edge cases and country-specific peculiarities at an early stage.
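The Hungarian rounding issue suggests a whole class of edge cases worth generating deliberately: amounts exactly on, and just off, a cash-rounding boundary. The 0.05 step comes from the text; the helper function and the sample amounts are hypothetical illustrations, not part of the Mustang tool:

```python
from decimal import Decimal

def rounds_cleanly(amount: str, step: str = "0.05") -> bool:
    """True if the amount is an exact multiple of the cash-rounding step.

    Decimal avoids the binary-float artifacts that make 0.05 arithmetic
    unreliable with plain floats.
    """
    quotient = Decimal(amount) / Decimal(step)
    return quotient == quotient.to_integral_value()

# Amounts on and just off the rounding boundary make useful test inputs.
EDGE_AMOUNTS = ["10.00", "10.02", "10.025", "10.05", "9.975"]

if __name__ == "__main__":
    for a in EDGE_AMOUNTS:
        print(a, rounds_cleanly(a))
```

Feeding both the cleanly rounding and the off-boundary amounts into a validator exercises exactly the kind of country-specific logic where the rejection bug was observed.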
The use of the Mustang tool in conjunction with synthetically generated test data helps companies to ensure the compatibility of electronic invoice formats and avoid process interruptions in accounting.
Electronic invoicing is expected to become mandatory in various industries by 2028, with direct implications for the development and use of synthetic test data. This development shows that synthetic test data will continue to play an important role in the electronic invoicing industry in the future.
Synthetic test data are artificially generated data sets that simulate real data but do not contain any sensitive information. They are used to comply with data protection guidelines, map complex test scenarios and replace live data with secure alternatives.
The generation of test data is often characterized by complex data requirements, especially for electronic invoices. Manual data generation is time-consuming and cost-intensive. In addition, data protection regulations and anonymization requirements must be strictly observed, which makes the use of production data more difficult.
Various methods are used to generate synthetic test data, including the use of format descriptions such as XML or JSON, mathematical models to simulate different data structures and the generation of structured and valid data based on database schemas.
Synthetic test data ensures data protection compliance through anonymization and allows diverse structures, extreme values, and targeted specification violations to be mapped for robustness tests. It therefore enables more comprehensive and secure test scenarios.
Synthetic test data is used in projects for electronic invoicing and bank transfers, for example in tests with XML-based formats such as CAMT. Tools such as the Mustang tool support the validation of XRechnung formats in Germany using such data.
The mandatory introduction of electronic invoices in various industries is expected by 2028. Synthetic test data will play a central role in quality assurance and system compatibility, especially when dealing with legacy systems and data-driven test scenarios.