Middleware Services: Data Quality Assurance
You should establish two distinct quality assurance (QA) checkpoints in the flow of middleware services. Perform the first set of QA tasks against the extract from the data source before any other middleware services run.
Data quality assurance: part I
Try to catch (and correct) errors and problems as early in the process as possible. Moving data down the pipeline toward the data warehouse is pointless if problems are so severe that they either require far more effort to correct later in the process or simply can't be corrected at all.
So, what types of problems should you look for? Here are a few:
Values in data elements that exceed a reasonable range: A customer has submitted 150 million purchase orders in the past month, for example, or an employee has worked with the company for 4,297 years, according to the employee database and the stored hiring date.
Values in data elements that don’t fit the official and complete list of permissible values: A value might have an A code, for example, when the only permissible values for that field are M and F. (If that field were labeled GENDER, A might stand for androgynous!)
Cross-table inconsistencies: For entries in the CUSTOMER_ORDER table, no corresponding entries (as identified by CUSTOMER_ID) exist in the CUSTOMER_MASTER_TABLE.
Cross-field inconsistencies: Records that have an incorrect state or zip code for the city indicated.
Missing values: Records that have missing values in certain fields where they should have contents.
Data gaps: For example, a source table should contain one row of data that includes total units sold and sales dollars for each month over the past two years. For a large number of customers, however, no rows exist for at least one of those months.
Incomplete data: If information about every product the company sells is supposed to be available, for example, are all products included in the extract?
Violations of business rules: If a business rule states that only one wholesaler can sell products to any one of the company’s customers, you should check to see whether any customer records indicate sales made through more than one wholesaler, which could indicate incorrect data in the source.
Data corruption since the last extract: If extraction occurs monthly, for example, you should keep track of data values or sums that should be constant, such as SALES PER CUSTOMER PER MONTH. If, in a subsequent month, the value of SALES PER CUSTOMER PER MONTH changes for a given customer for a previous month, the underlying data might have been corrupted.
Spelling inconsistencies: A customer’s name is spelled several different ways, for example.
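Several of the checks above lend themselves to straightforward automation. Here is a minimal sketch of a first-pass QA routine; the field names (`orders_last_month`, `gender`, `customer_id`, `customer_name`) and the range limit are assumptions for illustration, not a real schema:

```python
def qa_check(record, master_customer_ids):
    """Return a list of problems found in one extracted record."""
    problems = []

    # Reasonable-range check: flag implausible order counts
    if record.get("orders_last_month", 0) > 10_000:
        problems.append("orders_last_month exceeds reasonable range")

    # Permissible-values check: only M and F are allowed in this field
    if record.get("gender") not in {"M", "F"}:
        problems.append(f"gender code {record.get('gender')!r} not permitted")

    # Cross-table check: CUSTOMER_ID must exist in the master table
    if record.get("customer_id") not in master_customer_ids:
        problems.append("customer_id missing from CUSTOMER_MASTER_TABLE")

    # Missing-value check: the name field must have contents
    if not record.get("customer_name"):
        problems.append("customer_name is missing")

    return problems
```

In practice you would drive a routine like this from a table of rules per data source, so that adding a new check doesn't require changing code.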
What do you do when you find problems? You can try one of the following techniques:
Apply an automatic-correction rule. When you find an inconsistent spelling, for example, do a lookup in a master table of previous spelling corrections and automatically make the change in the data.
Set aside the record for a team member to analyze and correct later. In this case, you might do the human part of the QA in conjunction with automatic correction.
For example, automatic corrections are made where possible, and a report about the remaining problems is put into a separate file and sent to the QA person. After the QA person has made all the manual corrections, you merge them back into the data that has already gone through the automatic QA process.
Cool your jets. If you discover enough problems that are serious or require an indeterminate amount of research, consider halting the entire process until you find and fix those problems.
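The first two techniques can be combined in a single pass: apply automatic-correction rules where a known fix exists, and set aside anything that still fails for a QA person. This sketch assumes a hypothetical lookup table of previous spelling corrections and a `customer_name` field:

```python
def apply_qa(records, spelling_fixes):
    """Auto-correct known misspellings; set aside records that still fail."""
    corrected, suspense = [], []
    for rec in records:
        name = rec.get("customer_name", "")
        # Automatic-correction rule: look up previous spelling corrections
        if name in spelling_fixes:
            rec = {**rec, "customer_name": spelling_fixes[name]}
        # Anything still missing a name goes to the suspense file
        # for a team member to analyze and correct later
        if rec.get("customer_name"):
            corrected.append(rec)
        else:
            suspense.append(rec)
    return corrected, suspense
```

The suspense list would be written out as the report sent to the QA person, whose manual fixes are later merged back into the corrected stream.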
You can make the QA process much more efficient, and much less problematic, if you perform a thorough source systems analysis. If you have a fairly good idea about what types of data problems you might find in each data source, you can program your QA process in advance to detect and (hopefully) correct those problems before continuing.
Historically, organizations treated the data warehouse QA process as a one-directional flow: problems were corrected before the data moved further into the flow of middleware processes but were never corrected in the data sources themselves. Most new data warehouses, however, have a built-in feedback loop from the QA process that corrects data quality issues in the source data.
Data quality assurance: part II
Following completion of the transformation processes, data must be QA'd — again. You never know what type of errors or discrepancies the transformation process might have introduced into the data. After those changes have occurred, any earlier QA results are no longer valid.
Run the consolidated, transformed data through the same type of QA steps discussed here. Although you probably won't find as many rudimentary errors (such as spelling mistakes or out-of-range values) if you did a thorough job on your first-level QA, you still want to make sure. Furthermore, verify that the code or scripts used for data transformation didn't accidentally introduce new errors.
The goal of this second-level QA is to make sure that your consolidated and transformed data is ready to load into the data warehouse — after one more step occurs, if necessary.