Assessing Data Quality 

 March 1, 2012

This article is based on an industrial report stemming from practical experience in assessing the quality of customer databases. The process it describes unites three automated audits: an audit of the database schema, an audit of the database structure and an audit of the database content. The audit of the database schema checks for design smells and rule violations. The audit of the database structure measures the size, complexity and quality of the database model. The audit of the database content processes the data itself to uncover invalid data values, missing records and redundant records. The purpose of these audits is to assess the quality of the database and to determine whether a data reengineering or data clean-up project is required.

The Loss of Data Quality

The existing databases of most IT users are the product of a long evolution. It began with their conception by a database designer on the basis of the information available at that time. This was followed by the hand coding or the automatic generation of a database schema reflecting the design. After that the database was installed and applications began to use it. As new applications came along, the structure was altered or extended to accommodate them. These structural changes were often made in an ad hoc manner, so that the complexity of the structure grew and the quality sank, much like software systems according to the laws of software evolution 1. In addition to this structural deterioration, the contents of the database became corrupted by erroneous programs storing incorrect values and deleting essential records. With large databases it is difficult to recognize such quality erosion, but over time it spreads like a cancerous infection causing more and more system failures. Thus not only program quality but also data quality suffers from erosion 2.

Failure to Foresee Data Evolution

At the beginning of a database life cycle the designer is obliged to make assumptions about the data quantity and the way the data will be used. He must also foresee how the data will evolve. The design of the database is based on such assumptions. The same assumptions are carried over into a hand coded database schema in which the tables are defined and linked to one another by foreign keys and indexes. If the assumptions prove to be right, the database can evolve with no problems. If not, the database structure will soon be obsolete and the form will no longer fit the content. It will also become increasingly difficult to adapt. If the data volume grows much more than expected and the data usage turns out to be quite different from what was originally foreseen, it becomes necessary to reengineer the database 3.

Lack of Data Independence

Often enough database designers cannot predict the actual growth of a database, nor can they foresee how it will really be used. When they first design the database, they are likely to design it solely from the viewpoint of a single application and not from a global viewpoint of all possible applications. Thus, from the beginning, the data is not independent of the applications using it. Later the structure has to be adapted to accommodate more and different applications, but the further the design is stretched, the more brittle it becomes. At some point it is no longer adequate for any of the applications 4.

Violation of the Normal Forms

Each provisional patch to the database structure has negative consequences. Instead of creating new sub-tables when additional attributes are required, the local database administrator will simply add them to existing tables. As a result, the rows of the tables become longer and harder to handle. When new keys are needed, they are added as indexes, so there are more and more indexes. Data groups are added to existing rows, thereby violating the second normal form, and the same data items are repeated in order to avoid creating new sub-tables, thus violating the principle of first normal form. Additional foreign keys are added, which increase the number of dependencies between tables, and sub-keys are added to the primary key which often do not pertain to the rows as a whole; this amounts to a violation of the principle of third normal form. In the end the database no longer conforms to the normal forms for relational databases, with a consequent loss of quality and increase in complexity 5.

Misuse of Stored Procedures

Another feature contributing to a loss of portability and data independence is the use of stored procedures. If stored procedures were used only for access operations and to ensure the integrity of the data, they would be acceptable. However, they are often misused to perform business logic. Rather than putting business rules in separate rule modules or placing checking routines in the client components, developers hide them in stored procedures where they are no longer visible to the tools processing the programs. Stored procedures full of selections and loops, sets and declares, are a definite sign of misuse. They bind the database to a particular application 6.

Inconsistent and Incorrect Data Updates

The contents of a database suffer not only from errors made when executing the programs but also from inconsistent changes. The type of a data attribute contained in two different tables may be changed in one table but not in the other, or the value of an attribute may no longer be compatible with the type of that attribute in the program using it. The Frankfurt Stock Market once had to be closed down for half a day because of an invalid value in the database leading to a data exception error in the processing job 7. Unfortunately, there is much more bad data in existing databases than users assume, and this bad data can lead to a system crash at any time.

Another problem with the content of databases is that of missing and redundant records. Missing records are those which are still referenced, but which have been either intentionally or unintentionally deleted. They may have been deleted by one application, but are still needed by another. Redundant records are those which should have been deleted, but for some reason were not. They remain as corpses in the database occupying valuable space, even though they are no longer used 8.

The quality deficiencies noted here are only a subset of the many quality problems with application databases. Low database quality has become a major concern for IT users and is the motivation for the three audit processes presented in this article:

  • Analysis of the database schema
  • Measurement of the database structure
  • Validation of the database content

Static Analysis of Database Schemas

The database model needs to be statically analyzed to detect quality deficiencies as well as to measure the size, complexity and quality of the database structure. To this end, an automated auditing process is required.

Static Database Analysis

The first step in auditing a database is the static analysis of the database schema with the goal of detecting rule violations. Just as with program code, database schema code needs to be governed by rules to enforce good coding practice 9. The rules are, on the one hand, directed toward the fulfillment of certain data quality goals such as security, compatibility, performance and portability. On the other hand, they are intended to make it easier to change and extend the database.

In order to assign rule violations different weights, it is necessary to classify them. The database auditor distinguishes between major, medium and minor rule violations. Major rule violations like violating the first normal form are a major barrier to the further development and re-use of the data. They may also cause failures. Medium rule violations like exceeding a given size limit reduce the modularity and maintainability of the database. Minor rule violations such as naming convention violations make it difficult to maintain the data and keep it consistent with the code. Major deficiencies are weighted by 2, medium ones by 1 and minor ones by 0.5.
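For illustration, this weighting scheme could be expressed in a few lines of Python. This is a minimal sketch, not the auditing tool itself; the severity labels and the sample violations are assumptions.

from collections import Counter

# Weights as described above: major = 2, medium = 1, minor = 0.5.
WEIGHTS = {"major": 2.0, "medium": 1.0, "minor": 0.5}

def weighted_deficiency_score(violations):
    """Sum the weights of all reported rule violations."""
    counts = Counter(v["severity"] for v in violations)
    return sum(WEIGHTS[severity] * n for severity, n in counts.items())

sample = [
    {"rule": "first normal form violated", "severity": "major"},
    {"rule": "row length limit exceeded",  "severity": "medium"},
    {"rule": "naming convention violated", "severity": "minor"},
]
print(weighted_deficiency_score(sample))  # 3.5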

Checking Vendor Specific Features

There are many features of a database which are vendor specific, i.e. not standard SQL. They should be used with care. Some options should be forbidden because they affect performance or hinder the transfer of the data. Others, like the NULL option, should be mandatory for the same reasons. It is up to the responsible database analyst to identify which features should be prohibited and which should be mandatory.

Checking the Normal Forms

Avoiding vendor specific features may lower performance, but it makes the database more independent of the vendor, i.e. the data becomes more portable. Having a primary key in every record may not be required by the SQL language, but it can be a rule to be enforced. Having a foreign key is also not necessary, but if the table is dependent on another table, at least one should be defined. In no case should a table have repeated data of the same type or sub-groups of data. Such obvious violations of the normal forms can be recognized and identified as bad practice by means of static analysis 10.

Setting Limits

There may also be limits as to how many attributes are allowed in a set of data and how long a row may be. It is necessary to check that these limits are not exceeded. Furthermore, it may be desirable to forbid certain data types such as packed-decimal, binary, floating-point or time-stamp data for the sake of compatibility with other databases of another vendor. Incompatible data types are one of the leading causes of errors in data migration 11.

In summary, one can conclude that there should be a set of rules for designing databases and that these rules should be enforced. Typical rule violations are:

  • Unique identifier is missing in the table
  • Foreign key missing in dependent table
  • Table contains a repeated data attribute
  • Table contains a sub-group
  • Table has no external view defined
  • Table has no indexes
  • Primary key contains too many subkeys
  • NULL option is missing
  • DELETE option is missing
  • Table has an incompatible data type
  • Number of attributes exceeds maximum limit
  • Row length exceeds the maximum permissible length
  • Schema is insufficiently commented

It is the task of the schema-auditing tool to check such rules and to report their violations. It is then the job of the database administrator to correct the database schema accordingly.
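To give an impression of how such a check might work, the following sketch tests a few of the rules listed above against a simplified, pre-parsed table description. A real schema auditor parses the SQL DDL itself; the table metadata, the attribute limit and the forbidden types shown here are assumptions.

MAX_ATTRIBUTES = 50                       # assumed configurable limit
FORBIDDEN_TYPES = {"FLOAT", "TIMESTAMP"}  # example types banned for portability

def check_table(table):
    """Return (severity, message) findings for one pre-parsed table."""
    findings = []
    if not table.get("primary_key"):
        findings.append(("major", "unique identifier missing in the table"))
    if table.get("is_dependent") and not table.get("foreign_keys"):
        findings.append(("major", "foreign key missing in dependent table"))
    if not table.get("indexes"):
        findings.append(("medium", "table has no indexes"))
    if len(table["attributes"]) > MAX_ATTRIBUTES:
        findings.append(("medium", "number of attributes exceeds maximum limit"))
    for name, col_type in table["attributes"].items():
        if col_type.upper() in FORBIDDEN_TYPES:
            findings.append(("medium", f"incompatible data type on {name}"))
    return findings

example = {
    "name": "CUSTOMER",
    "primary_key": None,
    "is_dependent": False,
    "foreign_keys": [],
    "indexes": ["IX_CUST_NAME"],
    "attributes": {"CUST_ID": "INTEGER", "BALANCE": "FLOAT"},
}
for severity, message in check_table(example):
    print(severity, message)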

Measuring the Database Structure

To evaluate the architecture of a database, it must first be measured in order to base the evaluation on solid numbers. Without numbers it is not possible to compare, and without being able to compare, it is not possible to judge 12. The measurements of an existing database can be compared with those of a benchmark database or simply with those of another database. Data measurements can be simple counts, or they can be complex metrics for size, complexity and quality 13.

Database Size Metrics

As with the rule checking, a tool is needed to perform the measurement, to count the database attributes and to compute the metrics. There are several database attributes to be counted. In an SQL database they include:

  • Lines of SQL code
  • Number of tables
  • Number of attributes
  • Number of keys
  • Number of views
  • Number of relationships
  • Number of stored procedures
  • Number of stored access operations
  • Number of stored procedure statements
  • Number of integrity rules
  • Number of rules for stored procedures
  • Number of test cases required to cover the database

From these measures two size metrics are computed:

  • Function points
  • Data points

The function points of a database are either 15, 10 or 5, depending on the number of attributes and keys 14. Databases with over 50 attributes or more than two keys get 15 function points; databases with 20 to 50 attributes or more than one key get 10 function points; databases with fewer than 20 attributes get 5 function points. The data points are 1 for each attribute, 2 for each key and index, and 4 for each view 15. These metrics are used to predict the costs of evolution projects that involve changes to the data.
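A literal reading of these counting rules can be sketched as follows. The interpretation of the key thresholds and the sample values are assumptions, and real IFPUG counting is more differentiated.

def function_points(num_attributes, num_keys):
    # Thresholds as stated above: >50 attributes or >2 keys -> 15,
    # 20..50 attributes or >1 key -> 10, otherwise 5.
    if num_attributes > 50 or num_keys > 2:
        return 15
    if num_attributes >= 20 or num_keys > 1:
        return 10
    return 5

def data_points(num_attributes, num_keys, num_indexes, num_views):
    # 1 point per attribute, 2 per key and index, 4 per view.
    return num_attributes + 2 * (num_keys + num_indexes) + 4 * num_views

print(function_points(35, 1))      # 10
print(data_points(35, 1, 2, 1))    # 35 + 6 + 4 = 45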

Database Complexity Metrics

In addition to the size metrics, complexity and quality metrics should be computed. All metrics are ratio metrics derived from a relation of one set of attributes to another. They are computed using algorithms for comparing actual values to benchmark values based on empirical studies of similar databases. For each complexity metric there is a lower and an upper bound. The median complexity is the geometric mean of the two boundary values. The six computed complexity metrics are:

  • Content complexity as the relation of data types and data keys to the number of data attributes
  • View complexity as the relation of views to the number of tables
  • Access complexity as the relation of access paths to the number of access objects (views and tables)
  • Relational complexity as the relation of table relationships to the number of tables
  • Structural complexity as the relation of structural elements (tables, views and indexes) to the number of basic elements (attributes and keys)
  • Storage complexity as the relation of attributes stored to the size of the storage area in bytes

The weighted average of these six complexity metrics gives the overall complexity of the database on a ratio scale of 0 to 1. The complexity ratio should be as low as possible 16.
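The following sketch shows one way the normalization and averaging described above could be computed. The bounds, weights and sample values are placeholders; in the auditing tool they come from benchmark values of comparable databases.

from math import sqrt

def normalize(value, lower, upper):
    """Map a raw ratio onto the 0..1 scale spanned by its benchmark bounds."""
    if value <= lower:
        return 0.0
    if value >= upper:
        return 1.0
    return (value - lower) / (upper - lower)

def median_complexity(lower, upper):
    """Median complexity as the geometric mean of the two boundary values."""
    return sqrt(lower * upper)

def overall_complexity(normalized_metrics, weights):
    """Weighted average of the normalized complexity metrics."""
    total = sum(weights[name] for name in normalized_metrics)
    return sum(weights[name] * value
               for name, value in normalized_metrics.items()) / total

metrics = {  # assumed values, already normalized to the 0..1 scale
    "content": 0.42, "view": 0.30, "access": 0.55,
    "relational": 0.48, "structural": 0.37, "storage": 0.61,
}
weights = {name: 1.0 for name in metrics}  # equal weights as a default
print(overall_complexity(metrics, weights))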

Database Quality Metrics

The desired qualities of a database have been propagated in the literature on this subject, in particular by J-L. Hainaut 17, Peter Aiken 18 and Michael Blaha 19. These and other authors emphasize the importance of data independence, data flexibility, accessibility and data portability. One of the greatest detriments to software migration is the intertwining of business logic with access logic. One of the greatest barriers to data evolution is the inflexibility of the database structure. During the 1980s both Dominique Warnier 20 and Michael Jackson 21 propagated data-driven software design, a method by which the program structure is derived from the data structure. The results were programs that reflected the then-current hierarchical and networked data models. When it became time to move to relational databases, the program structures no longer fitted. They had to be reengineered at a high cost. What was once a good idea for arriving at a quick solution turned out to be the greatest barrier to system modernization. Even today the structure of many relational databases is based on the requirements of the original application. It was not foreseen how the data could be used in the future 22.

The solution to the data independence problem is that each application should have its own views of the data. Views were defined to make this possible 23. Of course, this increases the complexity of the database, but there are always trade-offs to be made. Here the trade-off is between data independence and data complexity. The same applies to data accessibility, which requires more keys and more indexes to be able to access the data in many ways. Data flexibility implies that the database can be easily changed and extended. For that, there should be many smaller tables rather than fewer big ones. However, as with object-oriented programming, this causes more relationships between the tables and with that higher complexity 24. Storage efficiency means storing more data in less space, but that too increases complexity. This shows that there is hardly any way to raise the quality of one database feature without causing higher complexity or sacrificing the quality of another feature. This is a dilemma which database designers have to resolve. Attaining high quality while maintaining low complexity is an optimization problem which depends on the context in which the data is used. If data is used in different contexts, the best possible compromise has to be found.

For each quality metric there is also a lower and an upper bound derived by comparing various databases. The median quality is the geometric mean of the highest and lowest quality values. The six computed qualities are:

  • Data independence as the relation of data views and indexes to the number of data tables
  • Data accessibility as the relation of access paths to the number of stored attributes
  • Data flexibility as the relation of tables, keys and indexes to the number of data attributes
  • Data storage efficiency as the inverted relation of stored attributes to the size of the storage area in bytes
  • Data conformity as the relation of rule violations to the number of schema lines

The weighted average of these quality metrics gives the quality coefficient for the database as a whole. It should be as high as possible. According to the quality measurement scale of ISO standard 9126, scores above 0.8 are excellent, from 0.6 to 0.8 good, from 0.4 to 0.6 satisfactory and under 0.4 unsatisfactory 25.
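A sketch of this roll-up with the rating thresholds quoted above follows; the individual metric values and the equal weights are assumptions.

def quality_coefficient(quality_metrics, weights):
    """Weighted average of the individual quality metrics (0..1 scale)."""
    total = sum(weights[name] for name in quality_metrics)
    return sum(weights[name] * value
               for name, value in quality_metrics.items()) / total

def iso9126_rating(score):
    # Rating thresholds as quoted above from the ISO 9126 measurement scale.
    if score > 0.8:
        return "excellent"
    if score > 0.6:
        return "good"
    if score > 0.4:
        return "satisfactory"
    return "unsatisfactory"

metrics = {  # assumed sample values
    "independence": 0.72, "accessibility": 0.65, "flexibility": 0.58,
    "storage_efficiency": 0.61, "conformity": 0.70,
}
weights = {name: 1.0 for name in metrics}
score = quality_coefficient(metrics, weights)
print(round(score, 2), iso9126_rating(score))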

Validating the Database Content

Validating the content of a given database requires an oracle 26. There must be someone or something that declares what should be in the columns of a table in general and in each row in particular. This oracle could be a person, a program, an interpretable specification or another database. A person could scan through the database and visually check the values of each row, or of selected columns and rows. There are browsers to support this, but it is a very tedious and time-consuming job, even with a good browser. In the case of very large databases, it is practically impossible. Therefore an automated data validation process is required.

Data Validation Process

Validating data content by means of a customized program

One alternative is to write a program which reads through the database, fetching the next record or selecting records according to some search criteria. The contents of each record are then compared with the rules coded into the program or with the contents of another table used as a reference. This method is appropriate, but it requires time and effort to prepare. The program must be written by a programmer with knowledge of the database and how to access it. Not only that, but the validation program itself may be erroneous and has to be debugged.
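As an illustration only, and not the tooling described in this article, such a hand-written validation program might look like the following Python sketch using sqlite3. The table names, columns and rules are hypothetical.

import sqlite3

def validate_customers(db_path):
    """Check each customer row against coded rules and a reference table."""
    conn = sqlite3.connect(db_path)
    errors = []
    for cust_id, balance, country in conn.execute(
            "SELECT cust_id, balance, country FROM customer"):
        if balance is None or balance < 0:  # rule coded into the program
            errors.append((cust_id, "invalid balance"))
        # comparison against a reference table of valid country codes
        if conn.execute("SELECT 1 FROM country_codes WHERE code = ?",
                        (country,)).fetchone() is None:
            errors.append((cust_id, "unknown country code"))
    conn.close()
    return errors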

Validating data content with a test script

Another alternative is to have an interpretable specification – a script – which can easily be composed by someone familiar with the data. In the script, rules are formulated for validating the values of selected attributes in selected records. This assumes, of course, that the auditor who writes the script knows what should be in the database. As support, the auditor can use another database as a basis of comparison, but then the connections between the two databases must be established. To write an interpretable specification, the auditor needs a language with which he/she can:

  • specify which records with which keys should be in a table. The keys can be taken from the foreign keys of associated tables
  • specify which records with which keys should not be in a table. This can be achieved by comparing tables with one another
  • compare the content of a selected column in one table with the content of a corresponding column in another table
  • compare columns of the same table with each other
  • compare the values of a column with a given constant value
  • compare the values of a column with a range of values, i.e. boundary analysis
  • compare the values of a column with a set of expected values, i.e. equivalence class
  • compare the values of a column with the result of an arithmetic or logical expression
  • compare the values of a column with the concatenated values of other columns and constants 27.

All of the asserted comparisons listed above can be made subject to a logical condition, i.e. they will only be executed if that condition is fulfilled. The condition can make the values of one column dependent on the values of another column or on the primary key, e.g.

This_Column E RANGE (1:50) if (Mainkey > 1000); 

There are many variations to the way these assertions can be formulated. The auditor can define discrete values as well as value ranges and equivalence classes, relations between data values and results of computations. The assertion language is based on the Schematron schema validation rules as defined in ISO/IEC Standard 19757 28.
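The assertion language itself is the tool's own, Schematron-based notation, but the kinds of checks it expresses can be mimicked in a few lines of Python. The column names and rules below are purely illustrative assumptions.

def check_row(row):
    """Apply assertion-like rules to one row; return the violated rules."""
    violations = []
    # range assertion, conditional on another column (cf. the example above)
    if row["MainKey"] > 1000 and not (1 <= row["This_Column"] <= 50):
        violations.append("This_Column out of range 1:50")
    # equivalence-class assertion: value must be in an expected set
    if row["Status"] not in {"A", "B", "C"}:
        violations.append("Status not in expected value set")
    # column-to-column comparison within the same row
    if row["NetPrice"] > row["GrossPrice"]:
        violations.append("NetPrice exceeds GrossPrice")
    return violations

row = {"MainKey": 1200, "This_Column": 77, "Status": "B",
       "NetPrice": 90.0, "GrossPrice": 100.0}
print(check_row(row))  # ['This_Column out of range 1:50']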

Executing the Data Validation

The data auditor formulates his assertions about the data in an assertion script for each database table and submits it to the compiler. The compiler checks the syntax and verifies the data names against the database schema. If errors are found, the auditor must correct the script. If the script is correct, an intermediate code is generated with the qualified data names and the constant values referred to. Afterwards a validation run is started for each target table. The validation job processes the target table sequentially and checks one row after the other against the specified rules. If a row is missing or redundant, or if it contains invalid values, it is listed by key in the validation report together with the expected values. With this report, the auditor can easily recognize incorrect content and prepare appropriate error reports. The data correctness ratio is the inverse of the number of erroneous attributes relative to the total number of attributes in the database 29.
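Read as one minus the share of erroneous attribute values (an assumption about the intended definition, consistent with the 95% and 100% targets mentioned below), the correctness ratio can be computed as follows; the numbers are made up.

def correctness_ratio(erroneous_attributes, total_attributes):
    """One minus the share of erroneous attribute values among all checked."""
    if total_attributes == 0:
        return 1.0
    return 1.0 - erroneous_attributes / total_attributes

print(correctness_ratio(erroneous_attributes=250, total_attributes=10_000))  # 0.975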

Summary

The two approaches to evaluating data quality have already been used in several measurement and testing projects with positive results. However, until now the two approaches have never been used together. The data assessment process proposes combining static schema analysis with data content validation 30. First the rules have to be defined together with the users and the metrics weighted according to the user goals. Secondly, the database schemas will be statically analyzed and the structures measured. Thirdly, the validation scripts will be composed together with the customer based on what the customer believes the values of the columns should be. Finally, the database contents are examined and the exceptions documented. After that a report can be written on the state of the data quality.

The outcome of a data quality assessment depends to a great extent on the setting of the benchmark measurement values and the weighting of the metric values. These may be adjusted prior to running the measurement jobs. The user may attach more importance to data accessibility and less to data independence, or he may consider data flexibility to be more important than testability. Users may also be more or less tolerant of erroneous data content. For one user it may be sufficient to have a correctness ratio of 95%; for another, nothing less than 100% is acceptable.

These processes are based on practical experience gained in many projects over recent years. Because of the lack of tools for assessing the different aspects, a set of self-developed tools has grown with these projects. For example, the tools DLIAudit, ADAAudit and SQLAudit check the rules for IMS, ADABAS, DB-2, Oracle and MS-SQL databases. The result is a deficiency report of schema rule violations. SQLAudit also counts about 35 database attributes and computes 6 complexity and 6 quality metrics. All 12 metrics are ratio metrics derived from a relation of one set of attributes to another. The database metrics are collected in the form of a metrics report. In addition, they are exported as an XML file so that they can be used by anyone, for instance to store them in a metrics database or to process them further. These metrics can form the basis for a benchmark, without which it is impossible to really evaluate anything. The tool Datatest offers a language based on predicate logic, expressed in assertions, with which the auditor can specify the data content. It has often been used for validating database content.

The bottom line is that quality is relative. It is relative to the quality goals set. That is why the Goal-Question-Metric model established by Basili and Rombach long ago is still valid 31. The user must first set goals, then ask how to judge their fulfillment, and thirdly define metrics to measure the degree of their fulfillment. This method applies equally well to the assessment of data quality. The data analyst must begin by setting goals, such as portability, extendibility and flexibility. Then metrics must be defined to measure the degree to which they are fulfilled by the current database. Finally, they must be measured and evaluated. Without clearly defined, measurable goals, it is not possible to judge anything.

  1. Belady, L., Lehman, M.: "A Model of Large Program Development," IBM Systems Journal, Vol. 15, No. 3, 1976.
  2. Blaha, M./ Premerlani, W.: "Observed Idiosyncracies of relational Database Design", IEEE Proc. of 2nd WCRE, Toronto, July, 1995, p. 116.
  3. Premerlani, W./ Blaha, M.: "An Approach for Reengineering of Relational Databases", Comm. of ACM, Vol. 37, No. 5, May, 1994, p. 42
  4. Tayi, G.-K./Ballou, D.: "Examining Data Quality", Comm. of ACM, Vol. 41, No. 2, Feb. 1998
  5. Date, C.J.: An Introduction to Database Systems, Addison-Wesley Pub. , Reading Mass., 1975
  6. Redman, T.C.: Data Quality for the Information Age, Artech House, Boston, 1996.
  7. CW: Computer Weekly, No. 26, June 1998, p. 1
  8. Yong, J.K./Kishore, R./Sanders, G.L.: "From DQ to EQ - Understanding Data Quality in the Context of E-Business Systems", Comm. of ACM, Vol. 48, No. 10, Oct. 2005, p. 75
  9. Blaha, M.: "A copper Bullet for Software Quality Improvement", IEEE Computer, Feb. 2004, p. 21
  10. Wand, Y./Wang, R.: "Anchoring Data Quality Dimensions in Ontological Foundations", Comm. of ACM, Vol. 39, No. 11, Nov. 1996, p. 86
  11. Kaplan, D./Krishnan, R./Padman, R./Peters, J.: "Assessing Data Quality in Accounting Information Systems" in Comm. of ACM, Vol. 41, No. 2, Feb. 1998.
  12. DeMarco, T.: Controlling Software Projects - Management, Measurement & Estimation, Yourdon Press, New York, 1982
  13. Kan, S.H.: Metrics and Models in Software Quality Engineering, Addison-Wesley, Boston, 2001
  14. International Function-Point User Group - IFPUG: Counting Practices Manual, Release 4.1. IFPUG, Westerville, Ohio, 1999
  15. Sneed, H.: The Data-Point Method, Online, Journal of DP, No. 5, May 1990, p. 48.
  16. Card, D., Agresti, W.: "Measuring Software Design Complexity", Journal of Systems & Software, Vol. 8, 1988, p. 185.
  17. Hainaut, J-L.: "Strategies for Data Reengineering", IEEE Proc. of 9th WCRE, Richmond, Oct. 2002, p. 211.
  18. Aiken, P.: Data Reverse Engineering, McGraw Hill, New York, 1996
  19. Blaha, M.: A Manager's Guide to Database Technology - building and purchasing better Applications, Prentice-Hall, Englewood Cliffs, 2001
  20. Orr, K.: "Data Quality and System Theory" in Comm. of ACM, Vol. 41, No. 2, Feb. 1998.
  21. Jackson, M.: Principles of Program Design, Academic Press, London, 1975
  22. Aiken, P./Muntz, A./Richards, R.: "DOD Legacy Systems Reverse Engineering Data Requirements", Comm. of ACM, Vol. 37, No. 5, May 1994, p. 26.
  23. Brathwaite, K.: Systems Design in a Database Environment, McGraw-Hill, New York, 1989, p. 106
  24. Wang, R.: "Total Data Quality Management", in Comm. of ACM, Vol. 41, No. 2, Feb. 1998.
  25. ISO/IEC: "Software Product Evaluation - Quality Characteristics and Guidelines for their Use" ISO/IEC Standard ISO-9126, Geneva, 1994
  26. Howden, W.: "The Theory and Practice of Functional Testing", IEEE Software, Vol. 2, No. 5, Sept. 1985, p. 6
  27. Fewster, M./Graham, D.: Software Test Automation, Addison-Wesley, Harlow, G.B., 1999
  28. ISO/IEC: "Rule-based Validation - Schematron", ISO-IEC Standard 19757, International Standards Organization, Zurich, 2006
  29. Sneed, H.: "Testing a Datawarehouse - an industrial challenge", IEEE Proc. of TAICPART Workshop, Windsor, G.B., August, 2006, p. 203.
  30. Sneed, H./Baumgartner,M./Seidl,R.: Software in Zahlen, Hanser Verlag, Munich/Vienna, 2010, p. 197
  31. Basili, V., Caldiera, G., Rombach, H-D.: "Goal Question Metric Paradigm", Encyclopedia of Software Engineering, Vol 1, John Wiley & Sons, New York, 1994, p. 5.