Data Science: Cross-Validating in Python

By John Paul Mueller, Luca Massaron

Cross-validating is easy with Python. If a single test set can provide unstable results because of sampling, the solution is to systematically sample a number of test sets and then average the results. That statistical approach (observe many results and take their average) is the basis of cross-validation. The recipe is straightforward (a from-scratch sketch of these steps appears right after the list):

  1. Divide your data into folds (each fold is a container that holds an even share of the cases). Ten folds is the usual choice, but 3, 5, and 20 folds are viable alternatives.

  2. Hold out one fold as a test set and use the others as training sets.

  3. Train and record the test set result.

    If you have little data, it’s better to use a larger number of folds, because each training set then contains more of the available data, which positively affects the quality of training.

  4. Perform Steps 2 and 3 again, using each fold in turn as a test set.

  5. Calculate the average and the standard deviation of all the folds’ test results.

    The average is a reliable estimator of the quality of your predictor. The standard deviation will tell you the predictor reliability (if it is too high, the cross-validation error could be imprecise). Expect that predictors with high variance will have a high cross-validation standard deviation.
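
To make the recipe concrete, here is a bare-bones sketch of those steps in plain NumPy and Scikit-learn metrics. It assumes that the X and y arrays and the regression estimator from the running Boston example are already defined, and it exists only to show the mechanics; the Scikit-learn tools described next do the same job in a single call.

import numpy as np
from sklearn.metrics import mean_squared_error

# Step 1: shuffle the row indices and split them into 10 folds
n_folds = 10
indices = np.arange(X.shape[0])
np.random.seed(1)
np.random.shuffle(indices)
folds = np.array_split(indices, n_folds)

results = list()
for k in range(n_folds):            # Step 4: repeat for every fold
    test_idx = folds[k]             # Step 2: hold out one fold as the test set
    train_idx = np.hstack(folds[:k] + folds[k+1:])
    regression.fit(X[train_idx], y[train_idx])
    predictions = regression.predict(X[test_idx])
    # Step 3: record the test set result
    results.append(mean_squared_error(y[test_idx], predictions))

# Step 5: average the fold results and check their spread
print('Mean squared error: %.2f std: %.2f'
      % (np.mean(results), np.std(results)))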

Even though this technique may appear complicated, Scikit-learn handles it using a single function:

>>> from sklearn.cross_validation import cross_val_score

Using cross-validation on k folds

In order to run cross-validation, you first have to initialize an iterator. KFold is the iterator that implements k-folds cross-validation. There are other iterators available from the sklearn.cross_validation module, mostly derived from statistical practice, but KFold is the most widely used in data science practice.

KFold requires you to specify how many observations are in your sample (the n parameter), the number of folds (the n_folds parameter), and whether you want to shuffle the data (the shuffle parameter). As a rule, the higher the expected variance, the more that increasing the number of folds helps you obtain a better mean estimate. It’s a good idea to shuffle the data because ordered data can introduce confusion into the learning process if the first observations differ from the last ones.

After setting KFold, call the cross_val_score function, which returns an array of results containing a score (from the scoring function) for each cross-validation fold. You have to provide cross_val_score with your data (both X and y) as inputs, your estimator (the regression class), and the previously instantiated KFold iterator (the cv parameter).

In a matter of a few seconds or minutes, depending on the number of folds and data processed, the function returns the results. You average these results to obtain a mean estimate, and you can also compute the standard deviation to check how stable the mean is.

from sklearn.cross_validation import KFold  # the k-folds iterator
import numpy as np

# Ten shuffled folds over the Boston observations
crossvalidation = KFold(n=X.shape[0], n_folds=10,
 shuffle=True, random_state=1)
scores = cross_val_score(regression, X, y,
 scoring='mean_squared_error', cv=crossvalidation,
 n_jobs=1)
print 'Folds: %i, mean squared error: %.2f std: %.2f' \
 % (len(scores), np.mean(np.abs(scores)), np.std(scores))
Folds: 10, mean squared error: 23.76 std: 12.13

Cross-validating can work in parallel because no estimate depends on any other estimate. You can take advantage of the multiple cores present on your computer by setting the parameter n_jobs=-1.
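
Note that the listing above relies on the sklearn.cross_validation module found in older Scikit-learn releases. On newer releases that module is gone and the same pieces live in sklearn.model_selection, where KFold takes n_splits instead of n and the squared-error scorer is called 'neg_mean_squared_error'. A rough equivalent of the run above, assuming the same X, y, and regression objects, looks like this (here n_jobs=-1 also puts every core to work):

import numpy as np
from sklearn.model_selection import KFold, cross_val_score

# Same ten shuffled folds, expressed with the newer API
kfolds = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(regression, X, y,
 scoring='neg_mean_squared_error',  # scores come back negated
 cv=kfolds, n_jobs=-1)              # n_jobs=-1 uses all available cores
print('Folds: %i, mean squared error: %.2f std: %.2f'
      % (len(scores), np.mean(np.abs(scores)), np.std(scores)))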

Sampling stratifications for complex data

Cross-validation folds are decided by random sampling. Sometimes it may be necessary to track if and how much of a certain characteristic is present in the training and test folds in order to avoid malformed samples. For instance, the Boston dataset has a binary variable (a feature that has a value of 1 or 0) indicating whether the house bounds the Charles River.

This information is important for understanding the value of a house and for determining whether people are willing to pay more for it. You can see the effect of this variable using the following code.

import pandas as pd
df = pd.DataFrame(X, columns=boston.feature_names)
df['target'] = y
boxplot = df.boxplot('target', by='CHAS',
 return_type='axes')

A boxplot reveals that houses on the river tend to have higher values than other houses. Of course, there are expensive houses all around Boston, but you have to keep an eye on how many river houses you are analyzing, because your model has to be general for all of Boston, not just the Charles River houses.

Boxplot of the target outcome, grouped by CHAS.
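
For a numeric check of what the boxplot shows, a quick group-by on the same df computes the average house value on and off the river:

# Average target value for houses off (CHAS = 0) and on (CHAS = 1) the river
print(df.groupby('CHAS')['target'].mean())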

In situations like this, when a characteristic is rare or influential, you can’t be sure how much of it ends up in each sample because the folds are created by random sampling. Having too many or too few observations with that characteristic in a fold implies that the machine-learning algorithm may derive incorrect rules.

The StratifiedKFold class provides a simple way to control the risk of building malformed samples during cross-validation procedures. It can control the sampling so that certain features, or even certain outcomes (when the target classes are extremely unbalanced), will always be present in your folds in the right proportion. You just need to point out the variable you want to control by using the y parameter.

from sklearn.cross_validation import StratifiedKFold
# Stratify on the CHAS indicator (column 3 of X)
stratification = StratifiedKFold(y=X[:,3], n_folds=10,
 shuffle=True, random_state=1)
scores = cross_val_score(regression, X, y,
 scoring='mean_squared_error', cv=stratification,
 n_jobs=1)
print 'Stratified %i folds cross validation mean squared error: %.2f std: %.2f' \
 % (len(scores), np.mean(np.abs(scores)), np.std(scores))
Stratified 10 folds cross validation mean squared error:
 23.70 std: 6.10

Although the validation error is similar, by controlling the CHAS variable, the standard deviation of the estimates decreases, making you aware that this variable was influencing the previous cross-validation results.
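
If you want to see the stratification at work, a short check such as the following (a sketch that rebuilds the two iterators on the CHAS column, which is column 3 of X) prints the share of riverside houses that ends up in each held-out fold. With plain random folds the share varies from fold to fold, whereas the stratified folds keep it nearly constant.

from sklearn.cross_validation import KFold, StratifiedKFold
plain = KFold(n=X.shape[0], n_folds=10,
 shuffle=True, random_state=1)
stratified = StratifiedKFold(y=X[:,3], n_folds=10,
 shuffle=True, random_state=1)
for name, folding in [('KFold', plain), ('StratifiedKFold', stratified)]:
    # Share of CHAS == 1 observations in each test fold
    shares = [X[test_idx, 3].mean() for train_idx, test_idx in folding]
    print('%s riverside share per fold: %s'
          % (name, ' '.join('%.2f' % share for share in shares)))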