
Resorting to Cross-Validation in Machine Learning

By John Paul Mueller, Luca Massaron

Sometimes, machine learning requires that you resort to cross-validation. A noticeable problem with the train/test set split is that you actually introduce bias into your testing because you reduce the size of your in-sample training data. When you split your data, you may be keeping some useful examples out of training. Moreover, sometimes your data is so complex that a test set, though apparently similar to the training set, is not really similar, because combinations of values differ (which is typical of highly dimensional datasets).

These issues add to the instability of sampling results when you don't have many examples. The risk of splitting your data in an unfavorable way also explains why the train/test split isn't the solution favored by machine learning practitioners when you have to evaluate and tune a machine learning solution.

Cross-validation based on k-folds is the answer. It relies on random splitting, but this time it splits your data into a number k of folds (portions of your data) of equal size. Each fold is then held out in turn as a test set while the others are used for training. Each iteration uses a different fold as the test set, which produces an error estimate.

In fact, after completing the test on one fold, with the others used for training, a successive fold, different from the previous one, is held out and the procedure is repeated to produce another error estimate. The process continues until each of the k folds has been used once as a test set, leaving you with k error estimates that you can summarize into a mean error estimate (the cross-validation score) and a standard error of the estimates.
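The rotation described above can be sketched in a few lines of plain Python. The "model" below simply predicts the training mean, a stand-in chosen for illustration, not a real learner; the interleaved slicing used to build the folds is likewise just one simple way to partition the data.

```python
# A minimal sketch of k-fold cross-validation using only the standard
# library. The "model" here just predicts the training mean (a toy
# stand-in for illustration, not a real learner).
from statistics import mean

def k_fold_scores(data, k=5):
    """Hold out each of k folds in turn and collect one error estimate per fold."""
    folds = [data[i::k] for i in range(k)]  # k roughly equal-sized folds
    scores = []
    for i in range(k):
        test = folds[i]                     # this fold is the test set
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        prediction = mean(train)            # "train" the toy model
        mse = mean((x - prediction) ** 2 for x in test)  # error on held-out fold
        scores.append(mse)
    return scores                           # k error estimates, one per fold

scores = k_fold_scores(list(range(20)), k=5)
print(len(scores))  # one error estimate for each of the 5 folds
```

Because every fold plays the test role exactly once, the function returns k error estimates rather than the single one a train/test split would give you.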

[Figure: A graphical representation of how cross-validation works.]

This procedure provides the following advantages:

  • It works well regardless of the number of examples, because by increasing the number of folds used, you actually increase the size of your training set (larger k, larger training set, reduced bias) while decreasing the size of the test set.
  • Differences in distribution for individual folds don’t matter as much. When a fold has a different distribution compared to the others, it’s used just once as a test set and is blended with others as part of the training set during the remaining tests.
  • You are actually testing all the observations, so you are fully testing your machine learning hypothesis using all the data you have.
  • By taking the mean of the results, you get an estimate of the expected predictive performance. In addition, the standard deviation of the results tells you how much variation to expect on real out-of-sample data. Higher variation in the cross-validated performances signals extremely varied data that the algorithm is incapable of properly capturing.
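That last point, summarizing the k error estimates into a cross-validation score plus a spread, amounts to a mean and a standard deviation. The five per-fold errors below are made-up numbers used only to show the computation:

```python
# Sketch: turning k per-fold error estimates into a cross-validation
# score (the mean) and a variability estimate (the standard deviation).
# The fold_errors values are hypothetical, for illustration only.
from statistics import mean, stdev

fold_errors = [0.21, 0.18, 0.25, 0.20, 0.22]  # one error per fold (made up)

cv_score = mean(fold_errors)    # expected predictive performance
cv_spread = stdev(fold_errors)  # how much variation to expect out of sample

print(round(cv_score, 3), round(cv_spread, 3))
```

A large `cv_spread` relative to `cv_score` is the warning sign mentioned above: the algorithm's performance depends heavily on which examples it happens to see.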

Using k-fold cross-validation is always the optimal choice unless the data you're using has some kind of order that matters. For instance, it could involve a time series, such as sales. In that case, you shouldn't use a random sampling method; instead, rely on a train/test split based on the original sequence, so that the order is preserved and you test on the last examples of that ordered series.
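An order-preserving split like the one just described can be sketched as follows; the 80/20 cut-off and the `monthly_sales` data are arbitrary choices for illustration:

```python
# Sketch of an order-preserving train/test split for time-ordered data:
# train on the earlier portion, test on the most recent examples.
# The data and the 80/20 cut-off are hypothetical.
monthly_sales = list(range(24))      # two years of ordered monthly observations

cut = int(len(monthly_sales) * 0.8)  # no shuffling: the sequence is preserved
train = monthly_sales[:cut]          # the earlier examples
test = monthly_sales[cut:]           # the last examples of the ordered series

print(len(train), len(test))
```

Because the split is a simple slice rather than a random draw, every training example precedes every test example, which mirrors how the model would be used on genuinely future data.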