By Alan Anderson, David Semmelroth

Most datasets come with some sort of metadata, which is essentially a description of the data in the file. Metadata typically includes descriptions of the formats, some indication of what values are in each data field, and what these values mean.

When you are faced with a new dataset, never take the metadata at face value. The very nature of big data requires that the systems that generate it are kept up and running as much as possible. For this reason, updating the metadata for these systems when changes are implemented is not always a top priority. You need to confirm that the data really is as the metadata claims.

Checking your sources

As obvious as it may sound, it is important that you have faith in where your data is coming from. This is particularly important when you are purchasing data. Thousands of vendors out there offer every imaginable kind of data. And they are not all of equal credibility.

Before purchasing data, try to understand exactly where and how the vendor is collecting it. Mysteriousness and vagueness are red flags.

Don’t take vendors at their word. Don’t rely solely on customer satisfaction postings on the website or client references provided by the vendor. If possible, try to track down someone who is using or has used the data.

If your data is coming from internal systems, it’s still important to evaluate the sources. Different systems have different purposes and therefore focus on different data. They may also collect data at different times.

For example, it is not uncommon for some hotel chains to book reservations in a separate system from the one they use at the front desk when the guest checks in. It is possible that the guest may receive a discounted offer between booking and check-in. This means that the room rate in the reservation system may not match the rate in the front desk system. What’s more, the reservation might get cancelled and never make it to the front desk!

Now, suppose you’re performing an analysis of hotel revenues by city. It’s rather important that you know that your room rate data is being sourced from the front desk system rather than the reservation system. But what if you’re trying to analyze how many reservations were generated by your company’s Super Bowl commercial? In this case, you want to see data from the reservation system.

The hotel example illustrates that even intrinsically clean data can be problematic. Even if data is accurate and exactly what it purports to be, timing can be an issue. Data changes over time.

Verifying formats

As mentioned earlier in this chapter, one of the things that your metadata will provide for you is some indication of how the data is formatted. By formatted, we mean how each particular data element looks. Is “Product Code” a character field or a numeric one? Is “Start Date” a date, or is it really a datetime stamp?

Data types are important in statistical analysis because they dictate which statistics and statistical procedures can be applied to which data elements. If you try to take the average value of a character field like “First Name,” you’re going to get an error message every time.
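As a minimal sketch of this point (the field names and values here are invented for illustration), Python's standard `statistics` module behaves exactly as described: it averages a numeric field without complaint and rejects a character field outright.

```python
from statistics import mean

room_rates = [119.0, 149.0, 99.0]    # numeric field: averaging is valid
first_names = ["Ana", "Bo", "Carl"]  # character field: averaging is not

print(mean(room_rates))              # ≈ 122.33

try:
    mean(first_names)                # the procedure rejects character data
except TypeError as err:
    print("Error:", err)
```

The same principle holds in any statistical package: the procedure checks the format before it computes anything.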

Typically, this type of metadata is pretty accurate. It is generally stored by the system that holds the data and can be generated automatically. Verifying the formats is generally pretty straightforward. Such verification is essentially a by-product of the validation of data ranges discussed in the following section. But there are instances where it can be a bit more difficult.

We’ve seen one such scenario more times than we care to recall. It happens sometimes that when a system is first designed, the development team tries to put some flexibility into the data structures to accommodate future enhancements. Sometimes they just add a bunch of empty (and wide) alpha-numeric data columns onto the end of each record. These auxiliary columns are initially not used for anything.

Analysts will always err on the side of asking for more data rather than less — frequently, all data rather than some. This fact, combined with the need to get the data quickly, sometimes results in a data dump. This dump generally includes the auxiliary columns. In these cases, the metadata tells you something like “Fields 1–11” are formatted as “200 alphanumeric characters.”

Such information is practically useless. To make sense of a data field like this, you pretty much have to get your hands dirty. There’s not a lot you can do except page through a few dozen records and try to make an informed guess about what’s actually in the field. In most cases, these fields tend to be empty. But not always. The good news is that if the field is actually being used, you should be able to find a programmer somewhere who knows what it’s being used for.
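One way to "get your hands dirty" without literally paging record by record is to tally the distinct values in the mystery column across a sample. The records and the field name `field_7` below are hypothetical; the pattern of mostly empty values plus a few code-like entries is the typical result.

```python
from collections import Counter

# A small sample of records with an undocumented auxiliary column.
sample_records = [
    {"field_7": ""}, {"field_7": ""}, {"field_7": "GOLD"},
    {"field_7": ""}, {"field_7": "GOLD"}, {"field_7": "SILVER"},
]

# Count each distinct value to get a quick profile of the field.
values = Counter(rec["field_7"] for rec in sample_records)
print(values.most_common())
# Mostly empty, plus a few tier-like codes -- enough to frame the
# question for the programmer who actually populates the field.
```

A frequency profile like this turns "what is in this field?" into a concrete question you can take to the development team.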

Typecasting your data

One of the most critical steps in performing a statistical analysis is making sure that your data is what it purports to be. Statistical procedures will invariably crash if you don’t provide them with valid information about data formats. But these procedures are largely blind to problems with the validity of the data.

Understanding how a data field is formatted is not enough. Before turning a dataset over to a statistical procedure, you need to understand what the data actually is in each of the fields you are using.

Most data falls into one of four categories: nominal, ordinal, interval, and ratio. The data type determines what sort of statistics and statistical procedures can be applied to particular data fields. You can’t take an average of a field like “Last Name,” for example.

Confusing data types with data formats is easy (and far too common). Knowing whether a data field is stored as a character string, an integer, or a decimal number does not tell you its data type.
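The distinction can be sketched as a small field catalog. The fields below are hypothetical, but they show how storage format and data type vary independently, and why only some fields support an average.

```python
# Hypothetical catalog: storage format and statistical data type
# are recorded separately for each field.
field_catalog = {
    "last_name":    {"format": "char(30)", "data_type": "nominal"},
    "satisfaction": {"format": "integer",  "data_type": "ordinal"},   # 1-5 survey scale
    "check_in":     {"format": "datetime", "data_type": "interval"},
    "room_rate":    {"format": "decimal",  "data_type": "ratio"},
}

# Only interval and ratio fields support a meaningful average.
averageable = [field for field, meta in field_catalog.items()
               if meta["data_type"] in ("interval", "ratio")]
print(averageable)
```

Note that "satisfaction" is stored as an integer yet is only ordinal: the gap between a 4 and a 5 is not necessarily the same as the gap between a 1 and a 2.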

Character fields are sometimes used as placeholders for data that might be captured in future releases of a system. There is nothing to prevent such a field from being used to capture monetary or other numeric data.

The most common data-type mistake is assuming that a numeric field, particularly an integer-valued field, actually contains quantitative (interval or ratio) data. It is extremely common for companies to use numeric codes — nominal data — to represent products, regions, stores, and various other entities.

Airline flight codes are one example. Census regions are another. Even credit card and Social Security numbers are typically stored as integers. But all of these entities are merely identifiers. They are nominal variables. The average credit card number in a bank’s portfolio is a meaningless statistic.
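A short sketch makes the trap concrete. The store codes below are invented nominal identifiers that happen to be stored as integers: Python will compute their average without complaint, but the result is exactly the meaningless statistic described above. The right move is to treat them as labels and count them.

```python
from collections import Counter
from statistics import mean

# Hypothetical store codes: nominal identifiers stored as integers.
store_codes = [101, 205, 333, 101, 450]

# The average computes cleanly -- and means nothing.
print(mean(store_codes))                    # 238

# Treat the codes as labels instead: count, don't average.
print(Counter(store_codes).most_common(1))  # the most frequent store
```

Nothing in the data's format warns you; only knowing the data type tells you which of the two statistics is legitimate.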