Data Science: How to Deal with Missing Data in Python

By John Paul Mueller, Luca Massaron

You can use Python to deal with the missing information that sometimes pops up in data science. The data you receive is sometimes missing information in specific fields; for example, a customer record might be missing an age. If enough records are missing entries, any analysis you perform will be skewed, and the results will be weighted in an unpredictable manner. Having a strategy for dealing with missing data is therefore important.

Finding the missing data

It’s essential to find missing data in your dataset to avoid getting incorrect results from your analysis. The following code shows how you could obtain a listing of missing values without too much effort.

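import pandas as pd
import numpy as np
# np.nan and None both become NaN in a numeric Series
s = pd.Series([1, 2, 3, np.nan, 5, 6, None])
# Boolean mask: True wherever a value is missing
print(s.isnull())
print()
# Use the mask as an index to keep only the missing entries
print(s[s.isnull()])

A dataset could represent missing data in several ways. In this example, you see missing data represented as np.nan (NumPy Not a Number) and the Python None value.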

Use the isnull() method to detect the missing values. The output shows True when the value is missing. By using that Boolean result as an index into the series, you obtain just the entries that are missing. The example shows the following output:

0    False
1    False
2    False
3     True
4    False
5    False
6     True
dtype: bool

3   NaN
6   NaN
dtype: float64
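
The same check scales to a table of data. As a minimal sketch, assuming a small hypothetical DataFrame of customer records (the column names and values are made up for illustration), you could count the missing entries per column and pull out the affected rows:

```python
import pandas as pd
import numpy as np

# Hypothetical customer records; the values are invented for illustration
df = pd.DataFrame({
    'name': ['Ann', 'Bob', None, 'Dee'],
    'age': [34, np.nan, 29, np.nan],
})

# Number of missing entries in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing entry
rows_with_gaps = df[df.isnull().any(axis=1)]
print(rows_with_gaps)
```

Counting per column first is a cheap way to see which features are sparsely populated before deciding how to handle them.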

Encoding missingness

After you figure out that your dataset is missing information, you need to consider what to do about it. The three possibilities are to ignore the issue, fill in the missing items, or remove (drop) the missing entries from the dataset. Ignoring the issue could cause all sorts of problems for your analysis, so it’s the option you use least often. The following example shows one technique for filling in missing data and one for dropping the errant entries from the dataset:

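import pandas as pd
import numpy as np
s = pd.Series([1, 2, 3, np.nan, 5, 6, None])
# Fill gaps with the mean of the valid values, truncated to an integer
print(s.fillna(int(s.mean())))
print()
# Drop the missing entries instead
print(s.dropna())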

The two methods of interest are fillna(), which fills in the missing entries, and dropna(), which drops the missing entries. When using fillna(), you must provide a value to use for the missing data. This example uses the mean of the non-missing values (3.4), truncated to the integer 3 by int(), but you could choose a number of other approaches. Here’s the output from this example:

0    1.0
1    2.0
2    3.0
3    3.0
4    5.0
5    6.0
6    3.0
dtype: float64

0    1.0
1    2.0
2    3.0
4    5.0
5    6.0
dtype: float64
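
The mean is only one possible fill value. As a sketch of two alternatives, assuming the same series as in the example, the following code fills the gaps with the median and then with the last valid observation (a forward fill, which mostly suits ordered data such as a time series):

```python
import pandas as pd
import numpy as np

s = pd.Series([1, 2, 3, np.nan, 5, 6, None])

# The median of the non-missing values (1, 2, 3, 5, 6) is 3
median_filled = s.fillna(s.median())
print(median_filled)

# Forward fill: carry the last valid value forward into each gap
forward_filled = s.ffill()
print(forward_filled)
```

Which fill value is appropriate depends on the data; the median is less sensitive to outliers than the mean, whereas a forward fill assumes that neighboring entries are related.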

Working with a Series is straightforward because the dataset is so simple. When working with a DataFrame, however, the problem becomes significantly more complicated. You still have the option of dropping any row that contains a missing entry. When a column is sparsely populated, you might drop the column instead. Filling in the data also becomes more complex because you must consider the dataset as a whole, in addition to the needs of the individual feature.
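
To make those DataFrame options concrete, here is a minimal sketch. The DataFrame, its column names, and the "at least half present" threshold are assumptions chosen for illustration, not recommendations:

```python
import pandas as pd
import numpy as np

# Hypothetical data in which 'notes' is sparsely populated
df = pd.DataFrame({
    'age': [25, np.nan, 31, 47],
    'income': [50000, 62000, np.nan, 58000],
    'notes': [np.nan, np.nan, np.nan, 1.0],
})

# Drop every row that has at least one missing entry
complete_rows = df.dropna()

# Keep only columns with at least half of their entries present
min_present = len(df) // 2
dense_columns = df.dropna(axis=1, thresh=min_present)

# Fill each remaining column with its own mean
filled = dense_columns.fillna(dense_columns.mean())
print(filled)
```

Dropping rows here leaves a single record, which shows how quickly row-wise deletion can shrink a dataset when the gaps are spread across several columns.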

Imputing missing data

The previous information hints at the process of imputing missing data (ascribing characteristics based on how the data is used). The technique you use depends on the sort of data you’re working with.

For example, when working with a tree ensemble, you may simply replace missing values with -1 and let the trees handle that sentinel value. In other cases, you can rely on an imputer (a transformer algorithm used to complete missing values) to define the best possible value for the missing data. The following example shows a technique you can use to impute missing data values:

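import pandas as pd
import numpy as np
# SimpleImputer replaces the older sklearn.preprocessing.Imputer class,
# which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer
s = pd.Series([1, 2, 3, np.nan, 5, 6, None])
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# One row of seven columns, so each position gets its own statistic
imp.fit([[1, 2, 3, 4, 5, 6, 7]])
# transform() needs 2-D input, so reshape s into a single row
x = pd.Series(imp.transform(s.to_numpy().reshape(1, -1)).tolist()[0])
print(x)

In this example, s is missing some values. The code creates a SimpleImputer to replace these missing values. The missing_values parameter defines what to look for, which is NaN. SimpleImputer always imputes along columns (the older Imputer class exposed an axis parameter that you set to 0 to impute along columns and 1 to impute along rows), so the code reshapes the data into a single row in which each position is its own column. The strategy parameter defines how to replace the missing values: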

  • mean: Replaces the values by using the mean along the axis

  • median: Replaces the values by using the median along the axis

  • most_frequent: Replaces the values by using the most frequent value along the axis

Before you can impute anything, you must provide statistics for the imputer to use by calling fit(). The code then calls transform() on s to fill in the missing values. However, the output is no longer a Series; it’s a two-dimensional NumPy array. To create a Series, you must convert the imputer output to a list and use the first row of that list as input to Series(). Here’s the result of the process with the missing values filled in:

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
dtype: float64
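
In more typical use, an imputer learns one statistic per column from a two-dimensional table. The following sketch assumes scikit-learn 0.20 or later, where SimpleImputer (the successor to the Imputer class) lives in sklearn.impute. The DataFrame and its column names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical table with a gap in each column
df = pd.DataFrame({
    'age': [20, np.nan, 40, 60],
    'visits': [1, 2, np.nan, 9],
})

imp = SimpleImputer(missing_values=np.nan, strategy='median')
# fit_transform() learns the per-column medians, then fills the gaps
filled = pd.DataFrame(imp.fit_transform(df), columns=df.columns)

# The learned statistics, one per column
print(imp.statistics_)
print(filled)
```

Because the statistics are stored on the fitted object, you can call transform() later on new data and fill its gaps with the values learned from the original table.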