Data Science: Using Python to Count for Categorical Data - dummies

By John Paul Mueller, Luca Massaron

Categorical data and Python are a data scientist’s friends. The Iris dataset is made of four metric variables and a qualitative target outcome. Just as you use means and variances as descriptive measures for metric variables, you use frequencies to describe qualitative ones.

Because the predictors in the dataset are metric measurements (widths and lengths in centimeters), you must render them qualitative by dividing them into bins according to specific intervals. The pandas package features two useful functions, cut and qcut, that can transform a metric variable into a qualitative one:

  • cut expects a series of edge values used to cut the measurements or an integer number of groups used to cut the variables into equal-width bins.

  • qcut expects a series of quantiles (values between 0 and 1, such as 0.25 for the 25th percentile) used to cut the variable.
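The difference between the two functions is easier to see on a small example. The following is a minimal sketch on made-up values (the `values` series is hypothetical and not part of the Iris dataset); it shows cut with an integer, cut with explicit edges, and qcut with quantiles.

```python
import pandas as pd

# Hypothetical toy measurements (not taken from the Iris dataset)
values = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 6.0, 10.0])

# cut with an integer: three bins of equal width over the value range
equal_width = pd.cut(values, 3)

# cut with explicit edges: bins (0, 2], (2, 5] and (5, 10]
by_edges = pd.cut(values, [0, 2, 5, 10])

# qcut with quantiles: bins holding roughly the same number of values
by_quantiles = pd.qcut(values, [0, .25, .5, .75, 1])

# Count how many values land in each bin, listed in bin order
print(equal_width.value_counts().sort_index())
print(by_edges.value_counts().sort_index())
print(by_quantiles.value_counts().sort_index())
```

With these values, the equal-width bins end up unevenly filled because most measurements sit at the low end of the range, whereas the quantile-based bins each receive the same number of values.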

You can obtain a new categorical DataFrame using the following command, which concatenates a binning for each variable:

import pandas as pd

# Bin each of the four measurements into quartiles
# (iris_dataframe is assumed to be loaded already)
quartiles = [0, .25, .5, .75, 1]
iris_binned = pd.concat([
    pd.qcut(iris_dataframe.iloc[:, 0], quartiles),
    pd.qcut(iris_dataframe.iloc[:, 1], quartiles),
    pd.qcut(iris_dataframe.iloc[:, 2], quartiles),
    pd.qcut(iris_dataframe.iloc[:, 3], quartiles),
], join='outer', axis=1)
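The command above assumes that iris_dataframe already exists. This section does not show how it is built, so the following is only a sketch under an assumption: that the data comes from the copy of Iris bundled with scikit-learn, whose feature names match the column labels used here. The way the group column is constructed is likewise a guess made for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the Iris data comes from scikit-learn's bundled copy
iris = load_iris()

# Four metric columns, named after the dataset's feature names
iris_dataframe = pd.DataFrame(iris.data, columns=iris.feature_names)

# Qualitative outcome: map each numeric target code to its species name
iris_dataframe['group'] = pd.Series(
    [str(iris.target_names[k]) for k in iris.target], dtype='category')

print(iris_dataframe.columns.tolist())
```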

This example relies on binning into quartiles. However, it can also help to explore whether a variable is above or below a single threshold value, usually the mean or the median. In this case, you pass pd.qcut the 0.5 quantile (the median) as the only inner cut point, or you pass pd.cut the mean value of the variable as the only inner bin edge.
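A two-way split around a threshold can be sketched as follows. The values and the low and high labels are hypothetical choices for illustration. For pd.cut, the minimum and maximum are supplied as outer edges, and include_lowest=True keeps the smallest value inside the first bin.

```python
import pandas as pd

# Hypothetical toy measurements (not taken from the Iris dataset)
values = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])

# Split at the median: two quantile-based groups
by_median = pd.qcut(values, [0, .5, 1], labels=['low', 'high'])

# Split at the mean: two bins whose shared edge is the mean
edges = [values.min(), values.mean(), values.max()]
by_mean = pd.cut(values, edges, labels=['low', 'high'],
                 include_lowest=True)

print(by_median.tolist())
print(by_mean.tolist())
```

Because the large value 10.0 pulls the mean (4.0) above the median (3.0), the two splits assign the value 4.0 to different groups.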

Binning transforms numerical variables into categorical ones. This transformation can improve your understanding of data and the machine-learning phase that follows by reducing the noise (outliers) or nonlinearity of the transformed variable.

Understanding frequencies

You can obtain the frequencies for each categorical variable of the dataset, both for the predictive variables and for the outcome, by using the following code:

print(iris_dataframe['group'].value_counts())
virginica     50
versicolor    50
setosa        50
print(iris_binned['petal length (cm)'].value_counts())
[1, 1.6]       44
(4.35, 5.1]    41
(5.1, 6.9]     34
(1.6, 4.35]    31

The describe method also provides some basic frequency information, such as the number of unique values in each variable (the unique row) and the mode together with its frequency (the top and freq rows in the output).

print(iris_binned.describe())
       sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
count                150              150               150              150
unique                 4                4                 4                4
top           [4.3, 5.1]         [2, 2.8]          [1, 1.6]       [0.1, 0.3]
freq                  41               47                44               41
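To read the four rows of this summary in isolation, here is a minimal sketch on a single made-up categorical series (the `sizes` data is hypothetical). For categorical data, describe reports the number of values, the number of distinct categories, the most frequent category, and how often that category occurs.

```python
import pandas as pd

# Hypothetical categorical variable (labels made up for illustration)
sizes = pd.Series(
    ['small', 'large', 'small', 'medium', 'small', 'large'],
    dtype='category')

summary = sizes.describe()

print(summary['count'])   # number of non-missing values
print(summary['unique'])  # number of distinct categories
print(summary['top'])     # the mode (most frequent category)
print(summary['freq'])    # how many times the mode occurs
```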

Frequencies can signal a number of interesting characteristics of qualitative features:

  • The mode of the frequency distribution, that is, the most frequent category

  • The other most frequent categories, especially when they are comparable with the mode (bimodal distribution) or if there is a large difference between them

  • The distribution of frequencies among categories, if rapidly decreasing or equally distributed

  • Rare categories that can be gathered together into a single group
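The characteristics listed above can be checked with a few pandas calls. The following is a minimal sketch on made-up labels; the `colors` data and the 10 percent cutoff for calling a category rare are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical categorical variable (labels made up for illustration)
colors = pd.Series(['red'] * 10 + ['blue'] * 9 + ['green'] * 3
                   + ['pink'] * 1 + ['gray'] * 1)

counts = colors.value_counts()                # absolute frequencies
shares = colors.value_counts(normalize=True)  # relative frequencies

mode = counts.idxmax()  # the most frequent category

# Assumed rule of thumb: categories under 10% of the rows are rare
rare = shares[shares < 0.10].index

# Gather the rare categories into a single 'other' group
grouped = colors.where(~colors.isin(rare), 'other')

print(mode)
print(shares.round(2))
print(grouped.value_counts())
```

Here the second most frequent category (blue) is nearly as common as the mode (red), which hints at a bimodal distribution, while pink and gray are merged into one group.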

Creating contingency tables

By matching different categorical frequency distributions, you can display the relationship between qualitative variables. The pandas.crosstab function can match variables or groups of variables, helping to locate possible data structures or relationships.

In the following example, you check how the outcome variable is related to petal length and observe that certain outcomes and binned petal-length classes never appear together:

print(pd.crosstab(iris_dataframe['group'], iris_binned['petal length (cm)']))
petal length (cm)  (1.6, 4.35]  (4.35, 5.1]  (5.1, 6.9]  [1, 1.6]
setosa                       6            0           0        44
versicolor                  25           25           0         0
virginica                    0           16          34         0

The pandas.crosstab function ignores categorical variable ordering and always displays the row and column categories in alphabetical order. This nuisance is still present in pandas version 0.15.2, but it may be resolved in a future release.