Training, Validating, and Testing in Machine Learning

By John Paul Mueller, Luca Massaron

In a perfect world, you could perform a test on data that your machine learning algorithm has never learned from before. However, waiting for fresh data isn’t always feasible in terms of time and costs.

As a first simple remedy, you can randomly split your data into training and test sets. The common split is from 25 to 30 percent for testing and the remaining 75 to 70 percent for training. You split your data consisting of your response and features at the same time, keeping correspondence between each response and its features.

The second remedy occurs when you need to tune your learning algorithm. In this case, the test split data isn’t a good practice because it causes another kind of overfitting called snooping. To overcome snooping, you need a third split, called a validation set. A suggested split is to have your examples partitioned in thirds: 70 percent for training, 20 percent for validation, and 10 percent for testing.

You should perform the split randomly, that is, regardless of the initial ordering of the data. Otherwise, your test won’t be reliable, because ordering could cause overestimation (when there is some meaningful ordering) or underestimation (when distribution differs by too much). As a solution, you must ensure that the test set distribution isn’t very different from the training distribution, and that sequential ordering occurs in the split data.

For example, check whether identification numbers, when available, are continuous in your sets. Sometimes, even if you strictly abide by random sampling, you can’t always obtain similar distributions among sets, especially when your number of examples is small.

When your number of examples n is high, such as n>10,000, you can quite confidently create a randomly split dataset. When the dataset is smaller, comparing basic statistics such as mean, mode, median, and variance across the response and features in the training and test sets will help you understand whether the test set is unsuitable. When you aren’t sure that the split is right, just recalculate a new one.