Test data and data processes

Production data often only covers 70% of test cases. How to systematically test data processes and get business and IT on the same page.

Richard Seidl • Nov 12, 2024 • 9 min read

Testing data processes means checking machine-generated output data against a target result that is derived directly from the functional specification. The case distinctions in that specification form equivalence partitions, and these partitions define the test coverage to be achieved. Experience shows that production data covers around 70 percent of the cases; the remaining 30 percent is added synthetically.

Key Takeaways

Production data only covers around 70 percent of the case constellations in testing because rare borderline cases simply do not occur there and therefore remain untested.
Specialist departments can only reliably detect data errors in processing procedures if the results are presented visually and comprehensibly without SQL knowledge.
Automatically generated target results based on the functional specification replace manual individual test cases and provide higher test coverage for the same effort.
The set of rules for data comparison can also serve as a detailed functional specification because it describes the mapping logic precisely and in a way that is understandable for non-technicians.
AI can be used as a useful assistant in this context, for example to generate SQL queries and display them visually, but does not relieve the specialist department of responsibility for checking the content.

Data errors appear too late if the specialist department cannot access the data

In many companies, errors in data processing are discovered too late because knowledge of the data and access to it sit in different places. The IT department has no problem accessing the data technically, but lacks the business understanding to recognize errors in its content. The specialist department has that understanding, but not the access.

In practice, this leads to a cumbersome ping-pong. The specialist department comes up with individual test cases, receives data for them and checks them manually. This only covers a fraction of the possible cases and costs a lot of time.

The later an error is discovered, the more effort it generates. This is precisely why it pays to start testing data as early as possible, ideally during the development of the processing logic.

What distinguishes data quality from testing data processes?

Data quality and testing data processes are two different tasks. In the case of pure data quality, data is delivered without knowing how it was created and can only be checked for plausibility.

An example: In a list of living persons, a date of birth of 1810 is implausible. Nothing beyond this plausibility check is possible because neither the input data nor the mapping rule is known.
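A plausibility check of this kind can be sketched in a few lines of Python. The age threshold of 120 years and the function name are assumptions made for illustration; the article does not prescribe either.

```python
from datetime import date

# Assumed threshold: nobody in a list of living persons is older than this.
MAX_PLAUSIBLE_AGE_YEARS = 120


def is_plausible_birth_date(birth_date: date, today: date) -> bool:
    """Return True if the birth date could belong to a living person."""
    if birth_date > today:
        return False  # a birth date in the future is never plausible
    # Subtract one year if the birthday has not yet occurred this year.
    birthday_pending = (today.month, today.day) < (birth_date.month, birth_date.day)
    age_years = today.year - birth_date.year - int(birthday_pending)
    return age_years <= MAX_PLAUSIBLE_AGE_YEARS


reference_day = date(2024, 11, 12)
print(is_plausible_birth_date(date(1810, 5, 1), reference_day))  # False
print(is_plausible_birth_date(date(1985, 5, 1), reference_day))  # True
```

The check can only say that a value looks wrong. It cannot say what the correct value would have been, which is exactly the limit of pure data quality checks.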

When testing data processes, on the other hand, there is a functional specification of how input data should be transformed into output data. Here, a target result can be formed and held against the actual result. The question is then: Was the output data produced in accordance with the functional specification?

Systematically generate target results instead of building individual test cases

The more effective approach is to systematically generate target results and fully compare them with the actual results instead of defining individual test cases by hand. This is based on business rules that the specialist department has to define anyway in order to describe the data mapping.

These rules can be used to automatically generate a target, which can even be applied to the complete production database. Every row and every column can be checked in this way.

The effort involved remains manageable. If you want to achieve the same coverage with manual test cases, it will take at least as long and the test quality will still be poorer because you cannot map as many cases.
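As a minimal sketch of this idea, the following Python snippet derives the target from a rule and compares it cell by cell with the actual output. The rule (customers aged 65 or older belong to the segment "senior"), the column names and the function names are hypothetical and serve only as an illustration.

```python
from typing import Callable

Row = dict[str, object]


def compare_target_to_actual(
    inputs: list[Row],
    actual: dict[object, Row],
    rule: Callable[[Row], Row],
    key: str = "id",
) -> list[tuple[object, str, object, object]]:
    """Return (key, column, expected, actual) for every deviating cell."""
    deviations = []
    for source_row in inputs:
        row_key = source_row[key]
        expected_row = rule(source_row)  # target derived from the rule
        actual_row = actual.get(row_key)
        if actual_row is None:
            deviations.append((row_key, "<row missing>", expected_row, None))
            continue
        for column, expected_value in expected_row.items():
            if actual_row.get(column) != expected_value:
                deviations.append(
                    (row_key, column, expected_value, actual_row.get(column))
                )
    return deviations


def customer_rule(row: Row) -> Row:
    """Hypothetical business rule: age 65 or older means segment 'senior'."""
    return {"segment": "senior" if row["age"] >= 65 else "standard"}


inputs = [{"id": 1, "age": 70}, {"id": 2, "age": 30}, {"id": 3, "age": 65}]
actual = {
    1: {"segment": "senior"},
    2: {"segment": "standard"},
    3: {"segment": "standard"},  # the system under test got the boundary wrong
}
print(compare_target_to_actual(inputs, actual, customer_rule))
# [(3, 'segment', 'senior', 'standard')]
```

The sketch checks every input row and every column the rule produces. It does not report rows that appear in the actual output without a matching input row; a complete comparison would also need that direction.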

Production data only covers part of the cases

If automatic target generation is applied to the production data, this often only covers around 70 percent of the relevant cases in practice. The missing 30 percent must be added in order to achieve extensive coverage.

The reason for this lies in the business logic itself. Its case distinctions naturally form equivalence partitions. If there is no suitable data for a case distinction, for example for the case “age greater than 90”, it is not known whether the test item works correctly in this case.

Such gaps only become apparent when a real case occurs later and goes wrong. Those who only use production data simply do not test these constellations. This is why the anonymized 70 percent from production is specifically enriched with the missing constellations.

For ongoing testing, you build a test set that covers every case constellation with one representative each and runs quickly. The full data set is tested separately, as quality assurance for the production data.
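One way to make such gaps visible is to express each case distinction as a named condition and to look for a representative row per partition. The following Python sketch assumes three hypothetical age partitions; only “age greater than 90” comes from the text above.

```python
from typing import Callable, Optional

Row = dict[str, int]

# Hypothetical partitions derived from the case distinctions of a specification.
PARTITIONS: dict[str, Callable[[Row], bool]] = {
    "age under 18": lambda row: row["age"] < 18,
    "age 18 to 90": lambda row: 18 <= row["age"] <= 90,
    "age greater than 90": lambda row: row["age"] > 90,
}


def pick_representatives(
    rows: list[Row], partitions: dict[str, Callable[[Row], bool]]
) -> dict[str, Optional[Row]]:
    """Map each partition to its first matching row, or None if uncovered."""
    return {
        name: next((row for row in rows if matches(row)), None)
        for name, matches in partitions.items()
    }


production_rows = [{"id": 1, "age": 34}, {"id": 2, "age": 12}, {"id": 3, "age": 57}]
representatives = pick_representatives(production_rows, PARTITIONS)
uncovered = [name for name, row in representatives.items() if row is None]
print(uncovered)  # ['age greater than 90']

# Enrich the production data with a synthetic row for the missing partition.
synthetic_rows = [{"id": 9001, "age": 95}]
test_set = pick_representatives(production_rows + synthetic_rows, PARTITIONS)
print(all(row is not None for row in test_set.values()))  # True
```

The resulting `test_set` holds one representative per partition and is small enough for fast, repeated runs.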

It is not a double implementation if you only replicate the mapping

A common objection to data comparisons is that the system is built a second time, which would introduce the same sources of error. This is not the case with the approach described here, because only the business mapping rule is reproduced, not the system.

Most mappings are extensive, but not complex. It is all about case distinctions: If this, then that, otherwise something else. Such logic can be mapped directly from the specification without having to worry about performance or other constraints.

The difference in effort is clear. Where generating the target result takes a day, implementing the system, with more people, takes one to two weeks. The work is not done twice, and the additional effort is small.

“We don’t rebuild the entire system. We use the specification and recreate the mapping rule directly with the specification. It’s not about when exactly what has to happen, we just test the mapping.”
Joshua Claßen

For really complex calculations, a different approach applies. In a complex simulation, you get verified target results from an independent source and compare them with a defined tolerance.

Data from many delivery systems must be merged before testing

In heterogeneous system landscapes, there is rarely just one input system. Several delivery systems feed in data that must be harmonized and transferred to a central interface.

One example is a money laundering check on transaction data. The input data from the various delivery systems is brought together and delivered to this interface. It is precisely this process of merging that can be tested using the same principles as a single mapping.
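The merge can be treated like any other mapping: each delivery system gets its own rule into the common interface format, and the combined result is the target. The sketch below is in Python; the two record layouts and all field names are assumptions made purely for illustration.

```python
from decimal import Decimal


def from_system_a(record: dict) -> dict:
    """Assumed layout A: amounts as integer cents, lower-case currency codes."""
    return {
        "transaction_id": f"A-{record['txn']}",
        "amount": Decimal(record["amount_cents"]) / 100,
        "currency": record["ccy"].upper(),
    }


def from_system_b(record: dict) -> dict:
    """Assumed layout B: amounts as decimal strings, upper-case currency codes."""
    return {
        "transaction_id": f"B-{record['id']}",
        "amount": Decimal(record["amount"]),
        "currency": record["currency"],
    }


def expected_interface_rows(system_a: list[dict], system_b: list[dict]) -> list[dict]:
    """Build the target for the central interface, sorted for a stable comparison."""
    rows = [from_system_a(r) for r in system_a] + [from_system_b(r) for r in system_b]
    return sorted(rows, key=lambda row: row["transaction_id"])


system_a = [{"txn": 17, "amount_cents": 1250, "ccy": "eur"}]
system_b = [{"id": 4, "amount": "99.90", "currency": "USD"}]
target = expected_interface_rows(system_a, system_b)
for row in target:
    print(row["transaction_id"], row["amount"], row["currency"])
```

The target built this way is then compared with what actually arrives at the interface, in the same way as for a single mapping.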

Why data comparisons need tolerances

A data comparison is not always an exact one-to-one comparison. In certain cases, small deviations are acceptable from a business perspective and must be tolerated via defined thresholds.

In a trading system at banks, for example, cash flows are generated algorithmically. Due to numerical properties, the same result can deviate by one or two cents depending on the sequence of calculation operations. Such deviations are not critical as long as they do not rise above acceptable thresholds.

Tolerances do not only affect numbers. A tolerance can also be useful for text fields, for example if upper and lower case should not play a role in the comparison.
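Both kinds of tolerance can be expressed in a single comparison function. The following Python sketch assumes a threshold of two cents and an optional case-insensitive text comparison; the function name and the default values are illustrative, not part of any particular tool.

```python
from decimal import Decimal


def values_match(
    expected: object,
    actual: object,
    amount_tolerance: Decimal = Decimal("0.02"),  # assumed threshold: two cents
    ignore_case: bool = True,
) -> bool:
    """Compare two cell values, allowing the configured tolerances."""
    if isinstance(expected, Decimal) and isinstance(actual, Decimal):
        return abs(expected - actual) <= amount_tolerance
    if ignore_case and isinstance(expected, str) and isinstance(actual, str):
        return expected.casefold() == actual.casefold()
    return expected == actual


print(values_match(Decimal("100.00"), Decimal("100.02")))  # True
print(values_match(Decimal("100.00"), Decimal("100.03")))  # False
print(values_match("Meier", "MEIER"))                      # True
print(values_match("Meier", "MEIER", ignore_case=False))   # False
```

Amounts are held as `Decimal` so that the tolerance check itself does not introduce floating-point rounding effects.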

Business rules that are also the specification

The set of rules used to generate the target can also be the detailed specification. In this way, the test basis and the functional specification are combined in one artifact instead of living in separate, divergent documents.

Such a rule requires no technical knowledge, only plain language. An example of a naming rule:

If the first name and last name are not empty, both are output separately.
If only the first name is filled in, only the first name is output.
If only the last name is filled in, only the last name is output.
In all other cases, an error is displayed.
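The naming rule above translates almost line by line into code. The Python sketch below makes two assumptions that the rule itself leaves open: whitespace-only values count as empty, and the error case is represented as an exception. The function name and the output field names are hypothetical.

```python
def map_name(first_name: str, last_name: str) -> dict[str, str]:
    """Apply the naming rule and return the output fields."""
    first = first_name.strip()  # assumption: whitespace-only counts as empty
    last = last_name.strip()
    if first and last:
        return {"first_name": first, "last_name": last}  # both, as separate fields
    if first:
        return {"first_name": first}
    if last:
        return {"last_name": last}
    # "In all other cases" leaves only one case: both values are empty.
    raise ValueError("neither first name nor last name is filled in")


print(map_name("Ada", "Lovelace"))  # {'first_name': 'Ada', 'last_name': 'Lovelace'}
print(map_name("Ada", ""))          # {'first_name': 'Ada'}
print(map_name("", "Lovelace"))     # {'last_name': 'Lovelace'}
```

Each branch corresponds to one case distinction and therefore to one equivalence partition, so the rule needs four test rows for full coverage.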

This approach has a double benefit. If the department sees the results of the defined logic early on, it immediately recognizes if a case distinction is missing. An implementation can be correct to the specification and still deliver incorrect results because the specification itself is incomplete.

Visualization brings business and IT on the same wavelength

Data comparisons must be visualized so that the business department understands them and both sides can work on the same object. A technically expressed document separates business and IT, a visible definition of the data comparison connects them.

The same patterns recur throughout data testing. Testing always means comparing a target with an actual result. Most data is available in tabular form, but XML or JSON can be compared as well. Because these patterns occur constantly, they can be implemented more quickly and easily with low-code components than by reprogramming them for each department.

It is precisely this redundancy that can be observed in large banking customers with many departments. Each one builds its own CSV comparator, each one reinvents the wheel. Nobody writes their own word processing either.

Even those who know SQL benefit from good visualization. It is quicker to grasp than pure code. If you want to query 99 out of 100 columns, you have to specify all 99 in SQL; in a graphical user interface, you click away one column.

AI makes testing more efficient, not superfluous

AI does not replace data testing, it speeds it up. An AI system can make suggestions, such as expressing case distinctions as intervals or designing a test, but the result must remain verifiable.

This is precisely where visualization is the lever. If an AI only generates code, you have to be able to read code in order to verify the suggestion. If, on the other hand, the result is presented visually, even someone without in-depth coding knowledge can quickly assess whether it fits and correct the rest themselves.

There is a hard limit for sensitive company data. Such data is not sent to an external service on the Internet. Development is therefore moving toward more efficient language models that run locally and can generate SQL from a natural-language requirement and a visualized no-code query.

AI is also suitable as a co-pilot for operating the tool itself. It can suggest how to use a tool or answer a question on how best to solve a specific task. However, fully autonomous testing along the lines of “here is my system, test it” is not to be expected in the near future.

Frequently Asked Questions

How do you ensure that test data remains up-to-date and relevant?

To ensure that test data remains current and relevant, regular reviews and updates are crucial. Test data should be regularly compared with real data and adapted to changes in the requirements or in the software. Automated tests can help to generate test data dynamically and ensure that it always corresponds to the current usage scenarios. Close cooperation with the specialist departments also helps to identify and create relevant test data. This guarantees the quality of the tests.

When does it make sense to use synthetic test data instead of real production data?

It makes sense to use synthetic test data when privacy is important or when real production data is not available. Synthetic test data can also be used to test specific scenarios that are difficult to reproduce with real data. It also enables tests to be carried out cost-effectively and securely without the risk of data leakage or misuse. When developing and testing new functions, it is often more flexible to use than real production data.

What are the most common challenges in test data management?

The most common challenges in test data management are ensuring data quality and integrity. There is often a lack of realistic test data that covers relevant scenarios. Data protection regulations make it difficult to use real data, while generating synthetic test data can be time-consuming. In addition, the management and provision of test data for different test environments is often uncoordinated. Ultimately, a lack of automation leads to inefficient processes that lengthen the test cycle.

What different types of test data are there and how are they used?

There are different types of test data that are used in software development. These include:

1. Real-time data: real data from production for accurate simulation.
2. Batch or dummy data: randomly generated data for extensive testing.
3. Limit value data: test data that lies at the limits of the input values.
4. Positive and negative testing: data representing correct and incorrect inputs.

This test data helps to effectively test the functionality, security and performance of software solutions.

What is meant by the pseudonymization of test data?

Pseudonymization of test data means changing personal data in such a way that it can no longer be assigned to a specific person without additional information. In the case of test data, identification features are replaced by fictitious values to ensure data protection. This allows test results to be analyzed without endangering the privacy of the persons concerned. Pseudonymization is particularly important in software development and data analysis in order to comply with legal requirements.

What is test data anonymization and why is it important?

Test data anonymization is the process of masking or altering personal or sensitive information in test data to protect the identity of individuals. It is important to comply with privacy laws and build trust, while allowing developers and testers to use realistic data without compromising privacy. Anonymization allows companies to work more securely while ensuring the quality of their software.

What are test data generators and how do they create test data?

Test data generators are tools that automatically create test data to test software applications. They generate structured data in various formats based on defined rules or templates. This allows developers to simulate realistic scenarios without having to enter data manually. Test data can be generated randomly or rule-based to meet specific requirements. As a result, these generators save time and minimize sources of error while ensuring that the tested systems function in real time.

What is synthetic test data and how is it created?

Synthetic test data is artificially generated data that is used to analyze and validate software applications. It imitates real data without containing confidential information. It is created using algorithms that mimic the patterns and structures of real data. This can be done using techniques such as data anonymization, statistical modelling or scripting. Synthetic test data makes it possible to carry out tests without taking data protection risks and is particularly useful for software development and quality assurance.

What is test data and how is it used?

Test data is specific data that is used to check software applications during the testing process. It helps to evaluate the functionality, performance and security of an application. Test data is created to represent different scenarios, including normal and exception conditions. It can be generated manually or automatically. The use of test data enables testers to identify errors and ensure that the software meets the desired requirements before it is released.
