Test data management

Test data management refers to the structured provision, transfer and anonymization of test data across multiple systems. The aim is to enable testers to order data themselves without involving a central team. A standardized application model describes the domain building blocks, such as customer or contract, across systems and prevents redundant solutions.

Key Takeaways

Without consistent test data, test results are not meaningful, no matter how sophisticated the test methods and automation tools behind them are.
Five or six individual tools developed in-house for test data lead to redundancy and high maintenance costs, because every small change in the target system forces adjustments in all tools.
A central test data management system makes manual processes such as the complete environment setup superfluous, because testers can order data themselves via a web interface and schedule it with a click.
Data inconsistencies in test systems often remain undetected until they cause program crashes, because no mechanism actively reports when referenced entries are missing.

Why test data is the foundation of every test result

Test data determines the validity of a test. If the data is wrong, the test results are invalid too, no matter how well the methods and tools behind them work. In discussions about test management, the focus often ends up on automation and tooling, while the underlying data is tacitly taken for granted.

Patrick Olcha from Union Investment describes the starting point using the example of a highly parameterizable system. Parameters deviated between production and the test environment, and these parameters affect the behavior of the application. Without identical data, it is not possible to say reliably that a result obtained in testing will also hold in production.

This led to a chain of further requirements. Environments had to be provided with fresh data. It had to be possible to reproduce production errors. And data could not be moved between environments at will because requirements from different areas restricted this.

How isolated solutions become a maintenance problem

Self-built tools solve individual problems, but generate follow-up costs. At Union Investment, this resulted in five to six different tools for test data. Maintenance was time-consuming, and even small changes in the target system meant that a lot of adjustments had to be made to the individual tools.

There was also redundancy. Two or three tools worked similarly, but not identically. This duplication was not intentional, but was forced by technical constraints. At this point, it became clear that test data had to be managed differently.

A second pattern exacerbates the problem in distributed landscapes: Data diverges across systems. At Union Investment, the core portfolio management system, a database management system, is connected to adjacent systems such as a data warehouse and various auxiliary databases. Cross-system data transfer was not possible with the old solutions.

Data inconsistency only becomes visible when it hurts

Inconsistent data within a system often goes unnoticed until a program stumbles across it. A typical case: A program expects related entries in two tables, but only finds the entry in table A because the corresponding entry in table B has been manually deleted or changed. The result is a program abort.

The tricky thing about this is the lack of feedback. The person who caused the inconsistency may not even know about it, and the system does not actively report it. The damage only becomes visible after the program has aborted.

A pragmatic countermeasure is a clear definition of consistency combined with an escalation rule: if a data record does not meet the definition, it is deleted completely when in doubt. A cleanly deleted data record is better than an inconsistent one that continues to have an uncontrolled effect.
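
The rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the table names `table_a` and `table_b`, the columns `id` and `a_id`, and the function name are hypothetical, and SQLite stands in for whatever database the real system uses. Consistency is defined here as "a record must have an entry in both tables".

```python
import sqlite3

def delete_inconsistent_records(conn: sqlite3.Connection) -> list[int]:
    """Delete every record that exists in only one of the two tables.

    Returns the affected IDs so the cleanup is actively reported
    instead of surfacing later as a program abort.
    """
    ids_a = {row[0] for row in conn.execute("SELECT id FROM table_a")}
    ids_b = {row[0] for row in conn.execute("SELECT a_id FROM table_b")}
    inconsistent = sorted(ids_a ^ ids_b)  # present on one side only
    for record_id in inconsistent:
        # Remove the record completely, whichever side still holds it.
        conn.execute("DELETE FROM table_a WHERE id = ?", (record_id,))
        conn.execute("DELETE FROM table_b WHERE a_id = ?", (record_id,))
    conn.commit()
    return inconsistent

# Hypothetical demo data: record 2 lacks its entry in table_b,
# record 3 lacks its entry in table_a.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE table_b (a_id INTEGER, detail TEXT);
    INSERT INTO table_a VALUES (1, 'complete'), (2, 'missing in B');
    INSERT INTO table_b VALUES (1, 'detail'), (3, 'missing in A');
""")
removed = delete_inconsistent_records(conn)
remaining = [row[0] for row in conn.execute("SELECT id FROM table_a ORDER BY id")]
print(removed, remaining)
```

Returning the deleted IDs addresses the missing feedback described above: whoever runs the check learns which records were inconsistent.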

What model-based test data management must achieve

The central requirements can be summarized in a few points. They show what is important for a comprehensive solution:

Multiple database management systems must be supported, not just the core system.
Self-service for users: Activities that previously could only be carried out by a central group should now be carried out by the users themselves.
Mapping of the application model: The user selects what they need on a module-by-module basis, such as customer, contract or sales.
Reusability: Building blocks, data models or anonymization methods that have been described once should not have to be redefined for each system.

Reusability addresses the pain of old stand-alone solutions. A method for anonymizing names is defined once and then used across systems instead of having to build it again for each database system.
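
A minimal sketch of this idea, assuming hypothetical system names (`core`, `crm`), table and column names, and a small pseudonym list. The name anonymization method is written once and only referenced in the per-system rules. The hash-based mapping is an illustrative choice, not the method used by the tool described here, and an unsalted hash is not strong anonymization.

```python
import hashlib

# Hypothetical pool of replacement surnames.
SURNAMES = ["Meyer", "Schmidt", "Fischer", "Weber", "Wagner"]

def anonymize_name(value: str) -> str:
    """Map a real name to a stable pseudonym from SURNAMES."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return SURNAMES[digest[0] % len(SURNAMES)]

# Per-system rules only reference the method; they do not redefine it.
RULES = {
    "core": {("customer", "last_name"): anonymize_name},
    "crm": {("contact", "surname"): anonymize_name},
}

def anonymize_row(system: str, table: str, row: dict) -> dict:
    """Apply every rule registered for this system and table to a copy of row."""
    result = dict(row)
    for (rule_table, column), method in RULES[system].items():
        if rule_table == table and column in result:
            result[column] = method(result[column])
    return result

core_row = anonymize_row("core", "customer", {"customer_no": 7, "last_name": "Mustermann"})
crm_row = anonymize_row("crm", "contact", {"customer_no": 7, "surname": "Mustermann"})
print(core_row["last_name"] == crm_row["surname"])
```

Because both systems share one deterministic method, the same customer receives the same pseudonym everywhere, which keeps cross-system data consistent after anonymization.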

How a modeling layer works across system boundaries

An abstraction layer separates the domain-oriented model from the concrete database behind it. Danny Tamm from UBS Hainer describes the principle as follows: individual modules are assembled into a model without the underlying database playing a role. Where the data is physically located is defined once per environment.

Above this model is another layer for the end user. End users see neither the database nor the model; they simply order their data. The descriptive model layer sits in between, and the database is at the bottom.

The link across system boundaries arises at precisely one point: a shared key, the common ordering term. For example, if there is a customer number in the core system and other data for the same customer number in the CRM, this key connects the application models of both systems. Each model can also be used on its own. If you are only testing in one system at a lower test level, you simply omit the others.
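
The layering can be illustrated with a small sketch. All class names, table names, environment names and the `db://` locations below are hypothetical placeholders. The point is only that modules are described without reference to a physical database, locations are defined once per environment, and the shared customer number links the models.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Module:
    name: str                 # domain building block, e.g. "customer"
    tables: tuple[str, ...]   # tables that make up the building block
    key_column: str           # column holding the shared key

@dataclass(frozen=True)
class ApplicationModel:
    system: str
    modules: tuple[Module, ...]

# Physical locations are defined once per environment (placeholder values).
ENVIRONMENTS = {
    "test1": {"core": "db://test1-core", "crm": "db://test1-crm"},
}

def plan_order(models, environment: str, customer_no: int):
    """List (location, table, key column, key value) for every table to read."""
    plan = []
    for model in models:
        location = ENVIRONMENTS[environment][model.system]
        for module in model.modules:
            for table in module.tables:
                plan.append((location, table, module.key_column, customer_no))
    return plan

core = ApplicationModel("core", (Module("customer", ("customer", "address"), "customer_no"),))
crm = ApplicationModel("crm", (Module("contact", ("contact",), "customer_no"),))

full_plan = plan_order([core, crm], "test1", 4711)   # both systems, linked by key
core_only = plan_order([core], "test1", 4711)        # lower test level: omit CRM
print(len(full_plan), len(core_only))
```

Omitting a model from the list is all it takes to test in a single system, which mirrors the statement that each model can be used individually.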

Order test data instead of requesting it

The ordering principle works like an online store for data. The user uses a web interface to specify which data they need, such as a specific customer or account, as well as the source and destination of the movement. A click on Order triggers the processes in the background that read the source, anonymize it and write it to the destination.

The content of the store is defined by the team itself. The team defines what is anonymized and in which tables, which tasks are available to end users and who is allowed to do what. Some areas are reserved for a specific group of people, others can be executed by any user or submitted as a request.

Operation is deliberately kept simple, even if a lot happens in the background. When copying between environments, the screen has three fields: Source environment, target environment and the object to be transferred, such as a specific number.
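
As a rough sketch, the three fields map onto a small order object, and executing the order reads from the source, anonymizes and writes to the target. The environments are in-memory dictionaries here, and the field names, the sample record and the masking rule are assumptions made purely for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Order:
    source: str        # source environment
    target: str        # target environment
    customer_no: int   # object to be transferred

# In-memory stand-ins for two environments (hypothetical sample data).
environments = {
    "prod": {4711: {"customer_no": 4711, "name": "Erika Mustermann", "balance": 120}},
    "test": {},
}

def anonymize(record: dict) -> dict:
    """Return a copy with the personal name replaced by a neutral label."""
    masked = dict(record)
    masked["name"] = f"Customer {record['customer_no']}"
    return masked

def execute(order: Order, envs: dict) -> None:
    """Read from the source, anonymize, and write to the target."""
    record = envs[order.source][order.customer_no]
    envs[order.target][order.customer_no] = anonymize(record)

execute(Order(source="prod", target="test", customer_no=4711), environments)
print(environments["test"][4711])
```

The source record is left untouched; only the copy written to the target is anonymized.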

With the simplest process, it feels like every area of the software is run through and then you have the data the way you want it.

Patrick Olcha, Union Investment

From simple deletion to a complete environment refresh

The use cases vary in complexity and in the authorization they require. The simplest case is the deletion of a single inconsistent data record. This function did not exist in the past: either the entire environment was rebuilt or the inconsistent state remained.

The next case is copying between environments, for example between test levels or from production to recreate a production error. The most complex case is the refresh of an entire environment. This process consists of around eight consecutive steps, each of which accesses 20 to 300 tables.

This refresh was previously a manual process in a central group. Now it can be triggered with a click and scheduled for off-peak times via a scheduler. There is a tangible reason for this: data records in the tens of millions are transferred via JDBC for each table, and an insert of around 60 million data records takes time.

Use case	Complexity	Who executes
Delete individual data record	Low	End user
Copy data between environments	Medium	End user
Refresh entire environment	High (approx. 8 steps, 20-300 tables per step)	Restricted group of people
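
The authorization column of the table could be expressed as a simple lookup. The role names `end_user` and `refresh_admin` are hypothetical, and the sketch assumes that members of the restricted group may also run the simpler use cases, which the source does not state.

```python
from enum import Enum

class UseCase(Enum):
    DELETE_RECORD = "delete individual data record"
    COPY_DATA = "copy data between environments"
    REFRESH_ENVIRONMENT = "refresh entire environment"

# Hypothetical roles; the restricted group is modeled as "refresh_admin".
ALLOWED_ROLES = {
    UseCase.DELETE_RECORD: {"end_user", "refresh_admin"},
    UseCase.COPY_DATA: {"end_user", "refresh_admin"},
    UseCase.REFRESH_ENVIRONMENT: {"refresh_admin"},
}

def may_execute(role: str, use_case: UseCase) -> bool:
    """Check whether a role is allowed to trigger a use case."""
    return role in ALLOWED_ROLES[use_case]

print(may_execute("end_user", UseCase.COPY_DATA))
print(may_execute("end_user", UseCase.REFRESH_ENVIRONMENT))
```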

A proof of concept builds on an existing basis

Migration is faster if experience and a model are already available. Union Investment had a starting point from its own fund system and was able to migrate it to the target system. Only two or three errors occurred, after which the model was mapped in the target system.

The model can be gradually improved from this basis. The current status is a monolith of around 300 tables that belong together, but are not yet broken down by customer and area. Parallel to the secured initial basis, smaller application models are created as islands that can be linked together as required.

A final state is not the goal. New ideas and possibilities are constantly emerging as the software is used. This openness is intentional, because that is exactly what the tool is there for.

Acceptance arises through self-service and new functions

The positive response comes from two main sources: Users can do things themselves, and they get functions that didn’t exist before. The independent deletion of individual securities accounts is an example of a function that was simply missing before. Instead of making a request or contacting others, users click through themselves.

The rollout starts with friendly users. Initially, 20 to 30 users are using the system intensively; the next step is to roll it out more broadly so that significantly more people in the organization can access it. At the same time, the manual processes surrounding the inventory management system will be completely replaced.

Other systems are also being connected. In addition to those already connected, there is a fourth candidate. How quickly this happens depends on resources and budget, such as whether the respective database management systems are already licensed.