By Alan Anderson, David Semmelroth

One of the most frequent and messiest data problems to deal with is missing data. Files can be incomplete because records were dropped or a storage device filled up. Or certain data fields might contain no data for some records. The first of these problems can be diagnosed by simply verifying record counts for files. The second problem is more difficult to deal with.

To put it in simple terms, when you find a field containing missing values, you have two choices:

  • Ignore it.

  • Stick something in the field.

Ignoring the problem

In some cases, you may simply find a single field with a large number of missing values. If so, the easiest thing to do is just ignore the field. Don’t include it in your analysis.

Another way to ignore the problem is to ignore the record. Simply delete the record containing the missing data. This may make sense if there are only a few rogue records. But if there are multiple data fields containing significant numbers of missing values, this approach may shrink your record count to an unacceptable level.

Another thing to look out for before simply deleting records is any sign of a pattern. For example, suppose you are analyzing a dataset related to credit card balances nationwide. You may well find a whole bunch of records showing $0.00 balances (perhaps around half the records). This is not in itself an indication of missing data. However, if all the records from, say, California are showing $0.00 balances, that indicates a potential missing values problem. And it’s not one that would be usefully solved by deleting all the records from the largest state in the country. In this case, it is probably a systems issue and indicates that a new file should be created.

In general, deleting records is an easy, but not ideal, solution to missing-value problems. If the problem is relatively small and there is no discernible pattern to the omissions, then it may be okay to jettison the offending records and move on. But frequently a more highbrow approach is warranted.

Filling in the missing data

Filling in the missing data amounts to making an educated guess about what would have been in that field. There are good and bad ways to do this. One simple (but bad) approach is to replace the missing values with the average of the non-missing ones. In non-numeric fields, you might be tempted to populate the missing records with the most common value in the other records (the mode).

These approaches are, unfortunately, still frequently used in some business applications. But they are widely regarded by statisticians as bad ideas. For one thing, the whole point of doing statistical analysis is to find data that differentiates one result from another. By replacing all the missing records with the same value, you haven’t differentiated anything.

The more highbrow approach is to try to find a way to predict in a meaningful way what value should be filled in on each record that is missing a value. This involves looking at the complete records and trying to find clues as to what the missing value might be.

Suppose you are analyzing a demographic file to predict likely purchasers of one of your products. In that file you have, among other fields, information on marital status, number of children, and number of automobiles. For some reason, the number of autos field is missing in one-third of the records.

By analyzing the other two fields — marital status and number of children — you may discover some patterns. Single people tend to have one car. Married people with no children tend to have two cars. Married people with more than one child might be more likely to have three cars. In this way, you can guess at the missing values in a way that actually differentiates the records. More on this approach to come.

There is a general term in statistics and data processing that refers to questionable data. The term noisy is used to describe data that is unreliable, corrupt, or otherwise less than pristine. Missing data is but one example of this. A detailed description of techniques for cleaning up noisy data in general is beyond the scope of this book. In fact, this is an active area of research in statistical theory. The fact that all noise is not as easy to spot as missing values makes it troublesome to deal with.