Identifying Missing Data for Machine Learning - dummies

By John Paul Mueller, Luca Massaron

Even if you have enough examples at hand to train both simple and complex machine learning algorithms, those examples must present complete values in every feature, without any missing data. An incomplete example makes it impossible to connect all the signals within and between features, and missing values make it harder for the algorithm to learn during training. You must do something about the missing data.

Most often, you can either ignore missing values or repair them by guessing a likely replacement value. However, too many missing values make predictions more uncertain, because a missing entry could conceal any possible figure. Consequently, the more values that are missing from the features, the more variable and imprecise the predictions become.

As a first step, count the number of missing cases in each variable. When a variable has too many missing cases, you may need to drop it from both the training and test datasets. A good rule of thumb is to drop a variable if more than 90 percent of its instances are missing.
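The counting step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy dataset, the use of None to mark a missing value, and the helper names missing_fraction and drop_sparse_variables are all hypothetical, and the 0.90 threshold simply encodes the rule of thumb above.

```python
# Hypothetical toy dataset: each key is a variable, None marks a missing value.
data = {
    "age": [34, None, 51, 29] * 5,                    # 5 of 20 missing
    "income": [52000, 48000, 61000, 39000, None] * 4,  # 4 of 20 missing
    "fax": [None] * 19 + ["555-0100"],                 # 19 of 20 missing
}


def missing_fraction(values):
    """Return the share of entries in `values` that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def drop_sparse_variables(table, threshold=0.90):
    """Keep only the variables whose missing share does not exceed `threshold`."""
    return {
        name: column
        for name, column in table.items()
        if missing_fraction(column) <= threshold
    }


# Step 1: count the missing cases per variable (as a fraction of all rows).
fractions = {name: missing_fraction(column) for name, column in data.items()}

# Step 2: drop any variable that is missing in more than 90 percent of rows.
cleaned = drop_sparse_variables(data)

print(fractions)
print(sorted(cleaned))
```

Here only the fax variable crosses the threshold, so it is the only one dropped. Apply the same decision to the training and test data so that both keep the same set of variables.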

Some learning algorithms do not know how to deal with missing values and report errors in both the training and test phases, whereas other models treat them as zero values, causing an underestimation of the predicted value or probability (it’s just as if part of the formula isn’t working properly). Consequently, you need to replace all the missing values in your data matrix with some suitable value for machine learning to work correctly.

Many reasons exist for missing data, but the essential point is whether the data is missing at random or according to a systematic pattern. Data that is missing at random is the ideal case because you can guess its value using a simple average, a median, or another machine learning algorithm, without too many concerns. Data that follows a pattern is harder to handle because the gaps show a strong bias toward certain kinds of examples.
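For the random case, replacing gaps with an average or a median might look like the following minimal sketch. The impute helper, the sample ages, and the use of None as the missing marker are assumptions made for illustration; the mean and median functions come from Python’s standard statistics module.

```python
from statistics import mean, median


def impute(values, strategy=median):
    """Fill None entries with a summary (median by default) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


# Hypothetical feature with two missing entries.
ages = [34, None, 51, 29, 42, None, 38]

filled_with_median = impute(ages)               # gaps become 38
filled_with_mean = impute(ages, strategy=mean)  # gaps become 38.8

print(filled_with_median)
print(filled_with_mean)
```

The median is often the safer default when a feature contains outliers, because a few extreme values can pull the mean away from the typical case.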

For instance, think of the case of studying the income of a population. Wealthy people (for taxation reasons, presumably) tend to hide their true income by reporting to you that they don’t know. Poor people, on the other hand, may say that they don’t want to report their income for fear of negative judgment. If information is missing from certain strata of the population, repairing the missing data can be difficult and misleading because you may think that such cases are just like the others.

Instead, they are quite different. Therefore, you can’t simply use average values to replace the missing values; you must use more complex approaches and tune them carefully. Moreover, identifying data that isn’t missing at random is difficult because it requires a closer inspection of how missing values are associated with other variables in the dataset.
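One simple way to start that inspection is to compare how often a value is missing across the levels of another variable. The sketch below assumes hypothetical survey records that pair a neighborhood with a reported income; the data and the missing_rate_by_group helper are invented for illustration. A large gap between groups is only a hint that the data isn’t missing at random, not proof.

```python
from collections import defaultdict

# Hypothetical survey records: (neighborhood, income); None means not reported.
records = [
    ("uptown", None), ("uptown", None), ("uptown", 180000), ("uptown", None),
    ("midtown", 52000), ("midtown", 61000), ("midtown", None), ("midtown", 48000),
    ("downtown", 31000), ("downtown", 28000), ("downtown", 35000), ("downtown", 30000),
]


def missing_rate_by_group(rows):
    """Return the share of missing values within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [missing, total]
    for group, value in rows:
        counts[group][0] += value is None  # True counts as 1, False as 0
        counts[group][1] += 1
    return {group: missing / total for group, (missing, total) in counts.items()}


rates = missing_rate_by_group(records)
print(rates)
```

In this toy data, income is missing for three of four uptown respondents but for none of the downtown ones, which suggests that the gaps depend on the neighborhood.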

When data is missing at random, you can easily repair the empty values because you obtain hints to their true value from other variables. When data isn’t missing at random, you can’t get good hints from the other available information unless you understand how the missing cases are associated with the rest of the data.

Therefore, if you have to figure out missing income in your data, and it is missing because the person is wealthy, you can’t replace the missing value with a simple average because you’ll replace it with a middling income. Instead, you should use the average income of wealthy people as a replacement.
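The difference between an overall average and a group average can be sketched as follows. The example assumes that some other variable (here a hypothetical group label) already tells you which stratum each person belongs to; the records and the impute_by_group helper are invented for illustration.

```python
from statistics import mean

# Hypothetical records; the "group" label is assumed to be known for every row.
records = [
    {"group": "wealthy", "income": 210000},
    {"group": "wealthy", "income": None},
    {"group": "wealthy", "income": 190000},
    {"group": "other", "income": 40000},
    {"group": "other", "income": 50000},
    {"group": "other", "income": 60000},
]


def impute_by_group(rows, group_key, value_key):
    """Fill missing values with the mean of the observed values in the same group.

    Assumes every group has at least one observed value.
    """
    observed = {}
    for row in rows:
        if row[value_key] is not None:
            observed.setdefault(row[group_key], []).append(row[value_key])
    group_means = {group: mean(values) for group, values in observed.items()}
    return [
        {**row, value_key: group_means[row[group_key]]}
        if row[value_key] is None
        else dict(row)
        for row in rows
    ]


overall_mean = mean(r["income"] for r in records if r["income"] is not None)
filled = impute_by_group(records, "group", "income")

print(overall_mean)         # what a simple average would have used
print(filled[1]["income"])  # what the wealthy-group average uses instead
```

With this toy data, the simple average would fill the gap with 110,000, whereas the average of the observed wealthy incomes is 200,000.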

When data isn’t missing at random, the fact that the value is missing is informative because it helps track down the missing group. You can leave the chore of working out why it’s missing to your machine learning algorithm by building a new binary feature that reports when the value of a variable is missing. The algorithm can then learn from that flag how to adjust its estimate for those cases, which in effect lets it figure out the best replacement by itself.
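Building such a binary feature can be sketched like this. The add_missing_indicator helper and the sample incomes are hypothetical, and filling the original column with the median is just one possible placeholder; it is there because most algorithms still need a number in every cell.

```python
from statistics import median


def add_missing_indicator(values):
    """Return (filled, indicator): the column with gaps filled by the median,
    plus a 0/1 flag marking which entries were originally missing."""
    indicator = [1 if v is None else 0 for v in values]
    fill = median(v for v in values if v is not None)
    filled = [fill if v is None else v for v in values]
    return filled, indicator


# Hypothetical income column with two unreported values.
incomes = [52000, None, 61000, 39000, None, 75000]

income_filled, income_missing = add_missing_indicator(incomes)

print(income_filled)
print(income_missing)
```

You would then pass both columns to the learning algorithm as separate features, so that the flag carries the information that the placeholder value hides.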