Ensure the Quality of Incoming External Data
When you design a data warehouse and determine what external data you need, you simply place an order (much as you'd order clothes or a fruit basket from an online site). After you begin receiving data via a stream, file transfer, or some other means, it’s smooth sailing — or is it?
What about the quality of the incoming data? You absolutely must apply the same set of quality assurance (QA) procedures to externally provided data that you do to data coming from your own internal systems. Just because you purchase the information on the open market doesn’t guarantee that the data is flawless.
Apply QA procedures to every incoming batch of data by following these steps:
Find out whether the incoming data has check values appended to the files.
Some examples of check values are the number of records in each file, the total value of each numeric column (total sales dollars for all records and total units sold for all records, for example), and subsets of the total column values (total amounts of sales and units by state, for example).
If check values are provided, they must be stored and used as part of the end-to-end loading procedures. No one should officially update the warehouse’s contents until the check totals agree with the calculations that you made when you prepared the data for loading.
If no check values are provided, request them.
Although the request might take a few cycles (a few weeks or months, for example) to fill, any data provider interested in providing a high level of customer service takes this type of request seriously and strives to make the requested control information available.
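The check-value comparison can be sketched in a few lines of code. This is a minimal illustration, not a production loader: the batch, the column names, and the format of the provider's check values are all hypothetical.

```python
from decimal import Decimal

# Hypothetical batch of incoming sales records (in practice, read from the
# transferred file). Field names and values are illustrative only.
batch = [
    {"state": "OH", "units_sold": 10, "sales_dollars": Decimal("150.00")},
    {"state": "OH", "units_sold": 5,  "sales_dollars": Decimal("75.00")},
    {"state": "PA", "units_sold": 8,  "sales_dollars": Decimal("120.00")},
]

# Check values assumed to be appended to the file by the data provider.
check_values = {
    "record_count": 3,
    "total_units": 23,
    "total_sales": Decimal("345.00"),
    "units_by_state": {"OH": 15, "PA": 8},
}

def verify_check_values(batch, check_values):
    """Return a list of discrepancies; an empty list means the batch
    agrees with the provider's control totals."""
    problems = []
    if len(batch) != check_values["record_count"]:
        problems.append("record count mismatch")
    if sum(r["units_sold"] for r in batch) != check_values["total_units"]:
        problems.append("total units mismatch")
    if sum(r["sales_dollars"] for r in batch) != check_values["total_sales"]:
        problems.append("total sales mismatch")
    by_state = {}
    for r in batch:
        by_state[r["state"]] = by_state.get(r["state"], 0) + r["units_sold"]
    if by_state != check_values["units_by_state"]:
        problems.append("state subtotal mismatch")
    return problems

problems = verify_check_values(batch, check_values)
# An empty list means the control totals agree and loading can proceed;
# anything else should block the official update of the warehouse.
```

Note that monetary totals use `Decimal` rather than floating point, so penny-level differences can't slip past the comparison.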
During your loading procedures, filter each row.
Make sure that the following conditions are true:
Keys (unique identifiers for each record) are correct across all the information. For example, if each record in the SalesMasterRecord group of data must have exactly 12 related records in SalesDetailRecord (one for each month), make sure that all the detail records are present by comparing record key values.
Ranges of values are correct. Product sales per month, for example, must be within reasonable bounds for that type of product (airplanes are different from bolts, for example).
Missing fields of information (a likely — almost inevitable — occurrence with externally provided data) don’t distort the meaning of the incoming data.
For example, the absence of supplemental data (as defined by the business rules for your specific industry or organization) might not be a serious problem. If half the incoming records have an empty space where UnitsSold, TotalSalesPrice, or some other critical piece of information should be, however, the value of the data is questionable at best.
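The three row-level checks above can be combined into one filtering pass. This sketch reuses the chapter's SalesMasterRecord/SalesDetailRecord example; the field names, the sample data, and the `max_units` bound are assumptions for illustration.

```python
from collections import Counter

# Hypothetical master keys and detail rows: P-100 is complete and clean,
# P-200 has only 11 monthly rows and is missing its UnitsSold values.
master_keys = ["P-100", "P-200"]
detail_records = (
    [{"key": "P-100", "month": m, "units_sold": 40} for m in range(1, 13)]
    + [{"key": "P-200", "month": m, "units_sold": None} for m in range(1, 12)]
)

def filter_rows(master_keys, detail_records, max_units=10_000):
    """Split detail rows into clean rows and rejects, and report master
    keys that lack the required 12 monthly detail records."""
    clean, rejects = [], []
    for row in detail_records:
        if row["key"] not in master_keys:                # key check
            rejects.append((row, "orphan key"))
        elif row["units_sold"] is None:                  # missing critical field
            rejects.append((row, "missing UnitsSold"))
        elif not 0 <= row["units_sold"] <= max_units:    # range check
            rejects.append((row, "UnitsSold out of range"))
        else:
            clean.append(row)
    # Every master key must have exactly 12 detail rows (one per month).
    counts = Counter(row["key"] for row in detail_records)
    incomplete = [k for k in master_keys if counts[k] != 12]
    return clean, rejects, incomplete
```

Keeping the rejects (with reasons) rather than silently dropping them gives you the evidence you need when you go back to the data provider.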
Especially in the early stages of acquiring external data (the first three or four months, for example), use your analytical tools, as described in Chapter 10, to perform data quality analysis before your users use the same tools to perform business analysis.
Search for oddities, anomalies, puzzling results, inconsistencies, apparent paradoxes, and anything else that just looks weird. Then, drill down to the roots of the data to check for the source of the weirdness.
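One cheap first pass at spotting the weirdness is a simple statistical screen before handing the data to users. The sketch below flags values far from the mean; the sample figures and the 2.5-standard-deviation threshold are assumptions, and a crude screen like this supplements, rather than replaces, analysis with your real tools.

```python
from statistics import mean, stdev

# Hypothetical monthly unit sales for one product; the spike in month 10
# is exactly the kind of oddity worth drilling into before users find it.
monthly_units = [100, 98, 105, 102, 99, 101, 97, 103, 100, 5000, 98, 102]

def flag_outliers(values, threshold=2.5):
    """Return (position, value) pairs more than `threshold` standard
    deviations from the mean -- a crude first pass, not real analysis."""
    m, s = mean(values), stdev(values)
    return [(i + 1, v) for i, v in enumerate(values) if abs(v - m) > threshold * s]

suspects = flag_outliers(monthly_units)
# Each suspect is a starting point for drilling down to the source data.
```

Anything this screen flags still needs a human look: a spike might be bad data, or it might be a genuine (and interesting) business event.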
Remember that you’re probably dealing with many millions of rows of incoming data. Not only can’t you personally check every single row, but you might also have difficulty setting up your filtering and QA criteria to catch every possible condition.
Anyone who has ever done anything with externally provided source data has come across all kinds of strange inconsistencies and missing data in the incoming information. By putting yourself in the place of users and using the same tools they use, you can probably discover a thing or two that you can correct, making your data warehouse a much better store of valuable business information.