Dealing with Quality Issues in Your Customer-Related Data
As you dig into your customer data to build marketing campaigns or to look for insights into your customer base, you’ll inevitably be confronted with poor data quality, sometimes called dirty data. Your customer database is built from data feeds from a variety of sources, including internal company systems as well as third-party data providers. The data from these systems varies in quality, freshness, and consistency.
Names and addresses can be formatted differently in different systems. And they change over time, which is why keeping them updated is so important.
But that sort of variation in data isn’t unique to customer contact information. Your database is full of potential data-quality landmines. By thinking critically about the data you’re analyzing, you can avoid inaccurate or misleading interpretation of your customer data.
Take out the (data) garbage
The ideal approach is to thoroughly analyze your data before it’s even loaded into your database. That way, you can correct some problems before they mess up your analysis. In some cases, you may need to reject data outright if it doesn’t meet certain quality standards.
You need to carefully analyze every individual data element. Check that it’s in the correct format, certainly. But also check that each value actually makes sense. These formats and reasonable values are different for each data element.
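The per-element checks described above can be sketched as a small validation pass. The field names and rules below are illustrative assumptions, not standards from any particular system:

```python
import re

# Illustrative format and sanity checks, one per data element.
# Each rule returns True when the raw string value looks reasonable.
RULES = {
    "zip_code": lambda v: bool(re.fullmatch(r"\d{5}(-\d{4})?", v)),
    "email":    lambda v: bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v)),
    "age":      lambda v: v.isdigit() and 0 < int(v) < 120,
}

def validate_record(record):
    """Return the field names in a record that fail their sanity check."""
    return [field for field, ok in RULES.items()
            if field in record and not ok(record[field])]

bad = validate_record({"zip_code": "1234", "email": "a@b.com", "age": "47"})
# The truncated zip code is flagged; the other fields pass.
```

Records with failures can then be corrected, quarantined, or rejected before they ever reach the database.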
For example, there is clearly a problem if you have a whole bunch of customers whose age is listed as 99. That’s probably an indication that the customer age was unavailable, and 99 was a default value in the source system. This sort of thing happens frequently with cash register data related to alcohol purchases. Cashiers are forced to enter a birthdate to complete the transaction, but they often don’t even ask the question. They simply key in some default value.
Age is something you might want to average or use in other calculations. Because this default value looks superficially like an actual age, it can screw up those calculations. In this case, it’s a much better idea to replace the default value with a missing value or null value. Virtually all database engines support this type of value.
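Turning a known default into a true missing value is a one-line cleaning step. This is a minimal sketch, using the age-99 default from the example above as the assumed sentinel:

```python
# The source system's default value, taken from the example above.
SENTINEL_AGE = 99

def clean_age(raw_age):
    """Return None (a true missing value) when age is the default value."""
    return None if raw_age == SENTINEL_AGE else raw_age

ages = [34, 99, 27, 99, 51]
cleaned = [clean_age(a) for a in ages]  # [34, None, 27, None, 51]
```

In SQL-based load processes, the equivalent step is mapping the sentinel to NULL before or during the insert.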
Most reporting engines and some statistical modeling techniques are quite adept at dealing with these null values. Some statistical techniques, though, require you to replace those missing values with some average value or exclude the records altogether. In either case, knowing that value is missing is better than mistakenly assuming the value is meaningful.
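A quick sketch shows why nulls beat sentinels in a calculation: averaging over the fake 99s inflates the result, while skipping true missing values does not. The sample ages here are made up for illustration:

```python
# The same five customers, once with the 99 default and once with nulls.
with_sentinels = [34, 99, 27, 99, 51]
with_nulls     = [34, None, 27, None, 51]

# Naively averaging over the sentinel values inflates the result.
naive_avg = sum(with_sentinels) / len(with_sentinels)   # 62.0

# Excluding missing values gives the average of the ages actually known.
known = [a for a in with_nulls if a is not None]
clean_avg = sum(known) / len(known)                      # about 37.3
```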
Be careful of stale data
Customer age is a good example of another issue you may encounter. Time marches on, as the saying goes. If you loaded a customer record five years ago that showed the customer as being 35 years old, that information is no longer accurate. Your first inclination might be to solve the issue by building a process to update customer ages each year. That works in theory, but in practice it can get rather tricky: each time you update the age, you have to track where the data came from and when.
A better approach is to store the data in such a way that the problem goes away. You can easily convert the age to a birth year when you initially load the record. You then create another data element that essentially contains the value of the current year minus the birth year. Voilà — no maintenance necessary. The database engine always knows what year it is, so this age value never needs to be updated.
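The store-the-birth-year approach can be sketched in a few lines. The function names are illustrative; in a real warehouse the derived age would typically be a computed column or view expression:

```python
from datetime import date

def to_birth_year(age_at_load, load_year):
    """Convert an age captured at load time into a birth year to store."""
    return load_year - age_at_load

def current_age(birth_year, as_of_year=None):
    """Derive age on demand: current year minus stored birth year."""
    if as_of_year is None:
        as_of_year = date.today().year
    return as_of_year - birth_year

# A customer recorded as 35 years old in 2019 gets birth year 1984 stored;
# the derived age stays correct in any later year with no maintenance.
birth_year = to_birth_year(35, 2019)   # 1984
```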
By thinking critically about your data before it gets loaded into your database, you can avoid many data quality problems. But don’t get complacent. Source systems change over time, and these changes cause downstream changes in the way data is passed to your customer database. It’s a good idea to proactively audit your data occasionally to check for suspicious data.