How to Use Python to Select the Right Variables for Data Science

By John Paul Mueller, Luca Massaron

Selecting the right variables in Python can improve the learning process in data science by reducing the amount of noise (useless information) that can influence the learner’s estimates. Variable selection, therefore, can effectively reduce the variance of predictions. In order to involve just the useful variables in training and leave out the redundant ones, you can use these techniques:

  • Univariate approach: Select the variables most related to the target outcome.

  • Greedy or backward approach: Start with all the variables and iteratively remove the ones you can drop from the learning process without damaging its performance, keeping only the variables the model really needs.

Selecting by univariate measures

If you decide to select a variable by its level of association with the target, the class SelectPercentile provides an automatic procedure for keeping only a certain percentage of the most associated features. The available metrics for association are

  • f_regression: Used only for numeric targets and based on linear regression performance.

  • f_classif: Used only for categorical targets and based on the Analysis of Variance (ANOVA) statistical test.

  • chi2: Performs the chi-square test for categorical targets, which is less sensitive to nonlinear relationships between the predictive variable and its target.

When evaluating candidates for a classification problem, f_classif and chi2 tend to provide the same set of top variables. It’s still good practice to test the selections from both association metrics.

Apart from applying a direct selection of the top percentile associations, SelectPercentile can also rank the best variables to make it easier to decide at what percentile to exclude a feature from participating in the learning process. The class SelectKBest is analogous in its functionality, but it selects the top k variables, where k is a number, not a percentile.

from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectPercentile, f_regression

# Load the Boston housing data used throughout these examples
boston = load_boston()
X, y = boston.data, boston.target

# Keep only the top 25 percent of features, ranked by F-score
Selector_f = SelectPercentile(f_regression, percentile=25)
Selector_f.fit(X, y)
for n, s in zip(boston.feature_names, Selector_f.scores_):
    print('F-score: %3.2f\t for feature %s' % (s, n))
F-score: 88.15  for feature CRIM
F-score: 75.26  for feature ZN
F-score: 153.95 for feature INDUS
F-score: 15.97  for feature CHAS
F-score: 112.59 for feature NOX
F-score: 471.85 for feature RM
F-score: 83.48  for feature AGE
F-score: 33.58  for feature DIS
F-score: 85.91  for feature RAD
F-score: 141.76 for feature TAX
F-score: 175.11 for feature PTRATIO
F-score: 63.05  for feature B
F-score: 601.62 for feature LSTAT
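
For comparison, here is a minimal sketch of the analogous SelectKBest call, reusing X, y, and boston from the preceding example (the choice of k=5 is only an assumption for illustration):

from sklearn.feature_selection import SelectKBest, f_regression

# Keep the five features with the highest F-scores
selector_k = SelectKBest(f_regression, k=5)
X_top5 = selector_k.fit_transform(X, y)
# get_support() returns a Boolean mask over the original columns
print(boston.feature_names[selector_k.get_support()])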

Using the level of association output helps you to choose the most important variables for your machine-learning model, but you should watch out for these possible problems:

  • Some variables with high association could also be highly correlated, introducing duplicated information, which acts as noise in the learning process (see the correlation check sketched after this list).

  • Some variables may be penalized, especially binary ones (variables indicating a status or characteristic using the value 1 when it is present, 0 when it is not). For example, notice that the output shows the binary variable CHAS as the least associated with the target variable, even though previous examples showed during the cross-validation phase that it’s influential.
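
To spot the duplicated-information problem from the first bullet, a quick sketch such as the following (assuming X and boston from the preceding example, and using pandas purely for convenience) lets you inspect the pairwise correlations among the features; values near 1 signal redundancy:

import pandas as pd

# Absolute pairwise correlations between the predictive features
df = pd.DataFrame(X, columns=boston.feature_names)
correlations = df.corr().abs()
print(correlations.round(2))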

The univariate selection process can give you a real advantage when you have a huge number of variables to select from and all other methods become computationally infeasible. The best procedure is to use SelectPercentile to cut the available variables by half or more, reducing them to a manageable number, and then to apply a more sophisticated and more precise method such as a greedy search.
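
As a minimal sketch of that first reduction step (reusing X and y from the earlier example; the 50-percent threshold is only an assumption), you could write:

from sklearn.feature_selection import SelectPercentile, f_regression

# Keep the better half of the variables by F-score, producing a
# smaller matrix that a greedy search can then handle
halver = SelectPercentile(f_regression, percentile=50)
X_half = halver.fit_transform(X, y)
print('Variables kept: %d of %d' % (X_half.shape[1], X.shape[1]))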

Using a greedy search

When using univariate selection, you have to decide for yourself how many variables to keep. Greedy selection automatically reduces the number of features involved in a learning model on the basis of their effective contribution to the performance, as measured by the error measure.

After fitting the data, the RFECV class can tell you the number of useful features, point them out to you, and automatically transform the X data into a reduced variable set through its transform method, as shown in the following example:

from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFECV

# Recursive feature elimination with cross-validation, using the
# linear regression model from the earlier examples as the estimator
# (recent scikit-learn versions name the scoring 'neg_mean_squared_error')
regression = LinearRegression()
selector = RFECV(estimator=regression, cv=10,
                 scoring='neg_mean_squared_error')
selector.fit(X, y)
print("Optimal number of features: %d" % selector.n_features_)
Optimal number of features: 6
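
Because the fitted selector also exposes the transform method mentioned above, you can reduce X to the selected columns directly; a brief follow-up sketch:

# Reduce X to only the columns RFECV decided to keep
X_reduced = selector.transform(X)
print(X_reduced.shape)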

You can also obtain a Boolean mask that indexes the optimal variable set by accessing the support_ attribute of the RFECV instance after you fit it.

print(boston.feature_names[selector.support_])
['CHAS' 'NOX' 'RM' 'DIS' 'PTRATIO' 'LSTAT']

Notice that CHAS is now included among the most predictive features, which contrasts with the result from the univariate search. The RFECV method can detect whether a variable is important, no matter whether it is binary, categorical, or numeric, because it directly evaluates the role played by the feature in the prediction.

The RFECV method is certainly more effective than the univariate approach because it accounts for highly correlated features and is tuned to optimize the evaluation measure (which usually is not the chi-square or F-score). Being a greedy process, it’s computationally demanding and may only approximate the best set of predictors.

Because RFECV learns the best set of variables from the data, the selection may overfit, just as can happen with any other machine-learning algorithm. Trying RFECV on different samples of the training data can confirm the best variables to use.
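
One rough way to run that check, reusing X, y, boston, and the imports from the preceding examples (the number of repetitions and the random seed are assumptions), is to refit RFECV on several bootstrap samples and count how often each feature survives:

import numpy as np

# Refit RFECV on bootstrap samples and tally how often each
# feature is retained; stable features appear in most runs
runs = 10
counts = np.zeros(X.shape[1])
rng = np.random.RandomState(101)
for run in range(runs):
    sample = rng.choice(len(X), size=len(X), replace=True)
    rfecv = RFECV(estimator=LinearRegression(), cv=10,
                  scoring='neg_mean_squared_error')
    rfecv.fit(X[sample], y[sample])
    counts += rfecv.support_
for name, count in zip(boston.feature_names, counts):
    print('%s kept in %d of %d runs' % (name, int(count), runs))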