Data Science: Using Python to Modify Data Distributions

By John Paul Mueller, Luca Massaron

Python allows data scientists to modify data distributions as part of the exploratory data analysis (EDA) approach. As a by-product of data exploration, in an EDA phase you can do the following:

  • Create new features from the combination of different but related variables

  • Spot hidden groups or strange values lurking in your data

  • Try some useful modifications of your data distributions by binning (or other discretizations such as binary variables), as in the sketch that follows
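
Binning and binary discretization are easiest to see on a concrete dataset. Here's a minimal sketch, assuming you load the Iris data into a pandas DataFrame named iris_dataframe (the same name the later snippets use); the helper names binned_petal_length and wide_sepal are illustrative:

import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris data into a DataFrame; the column names match those used
# in the later snippets (for example, 'sepal width (cm)').
iris = load_iris()
iris_dataframe = pd.DataFrame(iris.data, columns=iris.feature_names)

# Binning: discretize a continuous variable into three equal-width bins.
binned_petal_length = pd.cut(iris_dataframe['petal length (cm)'], bins=3,
                             labels=['short', 'medium', 'long'])

# Binary variable: flag flowers whose sepal width exceeds the median.
wide_sepal = (iris_dataframe['sepal width (cm)'] >
              iris_dataframe['sepal width (cm)'].median()).astype(int)

print(binned_petal_length.value_counts())
print(wide_sepal.value_counts())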

When performing EDA, you need to consider the importance of transforming your data in preparation for the learning phase, which often means applying a mathematical formula. The following sections provide an overview of the most common transformations used in EDA. The transformation you choose depends on the distribution of your data, with the normal distribution being the most common, and you need to match the transformation process to the formula you use.

Using the normal distribution

The normal, or Gaussian, distribution is the most useful distribution in statistics thanks to its frequent recurrence and its particular mathematical properties. It's essentially the foundation of many statistical tests and models, some of which, such as linear regression, are widely used in data science.

During data science practice, you'll meet a wide range of different distributions — some of them named and described by probability theory, others not. For some variables, the assumption that they behave like a normal distribution may hold; for others, it may not, and that can be a problem depending on which algorithms you use for the learning process.
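
One quick way to check whether the normality assumption is plausible for a variable is to inspect its skewness and kurtosis and run a normality test. The following sketch, which isn't part of the original example, applies scipy.stats.normaltest to the Iris sepal width, assuming the iris_dataframe defined earlier:

from scipy.stats import normaltest, skew, kurtosis

col = iris_dataframe['sepal width (cm)']
# Skewness and (excess) kurtosis near zero are consistent with a normal shape.
print('skewness: %0.3f  kurtosis: %0.3f' % (skew(col), kurtosis(col)))

# D'Agostino-Pearson test: a small p-value hints at a departure from normality.
stat, p_value = normaltest(col)
print('normaltest statistic: %0.3f  p-value: %0.3f' % (stat, p_value))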

As a general rule, if your model is a linear regression or belongs to the linear model family (because it boils down to a summation of coefficient-weighted variables), you should consider both variable standardization and distribution transformation.

Creating a Z-score standardization

In your EDA process, you may have realized that your variables have different scales and are heterogeneous in their distributions. As a consequence of your analysis, you need to transform the variables so that they're easily comparable. The Z-score standardization rescales a variable to a mean of zero and a standard deviation of one:

from sklearn.preprocessing import scale
# Z-score standardization: center to mean 0 and rescale to unit variance.
stand_sepal_width = scale(iris_dataframe['sepal width (cm)'])
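
As a quick sanity check, which isn't part of the original snippet, you can verify that the standardized variable now has a mean of roughly zero and a standard deviation of roughly one:

import numpy as np

# After scale(), the variable is centered at 0 with unit standard deviation.
print('mean: %0.3f  std: %0.3f'
      % (np.mean(stand_sepal_width), np.std(stand_sepal_width)))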

Transforming other notable distributions

When you check variables with high skewness and kurtosis for their correlation, the results may disappoint you. Using a nonparametric measure of correlation, such as Spearman's, may tell you more about the association between two variables than Pearson's r does. In that case, you can turn the insight into a new, transformed feature:

import numpy as np
from scipy.stats import pearsonr

# Try each transformation of sepal width and report its Pearson correlation
# with sepal length.
transformations = {'x': lambda x: x, '1/x': lambda x: 1/x, 'x**2': lambda x: x**2,
                   'x**3': lambda x: x**3, 'log(x)': lambda x: np.log(x)}
for transformation in transformations:
    pearsonr_coef, pearsonr_p = pearsonr(
        iris_dataframe['sepal length (cm)'],
        transformations[transformation](iris_dataframe['sepal width (cm)']))
    print("Transformation: %s\tPearson's r: %0.3f"
          % (transformation, pearsonr_coef))
Transformation: x   Pearson’s r: -0.109
Transformation: x**2  Pearson’s r: -0.122
Transformation: x**3  Pearson’s r: -0.131
Transformation: log(x) Pearson’s r: -0.093
Transformation: 1/x  Pearson’s r: 0.073

In exploring various possible transformations, using a for loop may tell you that a power transformation will increase the correlation between the two variables, thus increasing the performance of a linear machine-learning algorithm. You may also try other transformations, such as the square root np.sqrt(x), the exponential np.exp(x), and combinations of transformations, such as the log inverse np.log(1/x); the sketch below tries a few of them.
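
As a minimal sketch of those extra attempts, you might extend the loop as follows. It assumes the iris_dataframe defined earlier; the dictionary name more_transformations and the added Spearman's rho column are illustrative additions rather than part of the original example.

import numpy as np
from scipy.stats import pearsonr, spearmanr

# Additional candidate transformations, including one combined transformation.
more_transformations = {'sqrt(x)': lambda x: np.sqrt(x),
                        'exp(x)': lambda x: np.exp(x),
                        'log(1/x)': lambda x: np.log(1/x)}
x = iris_dataframe['sepal length (cm)']
for name, func in more_transformations.items():
    y = func(iris_dataframe['sepal width (cm)'])
    print("Transformation: %s\tPearson's r: %0.3f\tSpearman's rho: %0.3f"
          % (name, pearsonr(x, y)[0], spearmanr(x, y)[0]))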