Machine Learning For Dummies
Book image
Explore Book Buy On Amazon
In a perfect world, you could perform a test on data that your machine learning algorithm has never learned from before. However, waiting for fresh data isn’t always feasible in terms of time and costs.

As a first simple remedy, you can randomly split your data into training and test sets. The common split is from 25 to 30 percent for testing and the remaining 75 to 70 percent for training. You split your data consisting of your response and features at the same time, keeping correspondence between each response and its features.

The second remedy occurs when you need to tune your learning algorithm. In this case, the test split data isn’t a good practice because it causes another kind of overfitting called snooping. To overcome snooping, you need a third split, called a validation set. A suggested split is to have your examples partitioned in thirds: 70 percent for training, 20 percent for validation, and 10 percent for testing.

You should perform the split randomly, that is, regardless of the initial ordering of the data. Otherwise, your test won’t be reliable, because ordering could cause overestimation (when there is some meaningful ordering) or underestimation (when distribution differs by too much). As a solution, you must ensure that the test set distribution isn’t very different from the training distribution, and that sequential ordering occurs in the split data.

For example, check whether identification numbers, when available, are continuous in your sets. Sometimes, even if you strictly abide by random sampling, you can’t always obtain similar distributions among sets, especially when your number of examples is small.

When your number of examples n is high, such as n>10,000, you can quite confidently create a randomly split dataset. When the dataset is smaller, comparing basic statistics such as mean, mode, median, and variance across the response and features in the training and test sets will help you understand whether the test set is unsuitable. When you aren’t sure that the split is right, just recalculate a new one.

About This Article

This article is from the book:

About the book authors:

John Paul Mueller is a prolific freelance author and technical editor. He's covered everything from networking and home security to database management and heads-down programming.

Luca Massaron is a data scientist who specializes in organizing and interpreting big data, turning it into smart data with data mining and machine learning techniques.

This article can be found in the category: