Avoiding Sample Bias and Leakage Traps in Machine Learning

John Paul Mueller

Luca Massaron

Updated

2016-10-06 14:00:05

From the book

Machine Learning For Dummies

The validation approach to machine learning examines a possible remedy for in-sampling bias. In-sampling bias can affect your data before machine learning is put into action, and it causes high variance in the estimates that follow. In addition, you should be aware of leakage traps, which occur when some information from the out-of-sample data passes into the in-sample data. This issue can arise when you prepare the data or after your machine learning model is ready and working.

The remedy, which is called ensembling of predictors, works well when your training sample is not completely distorted and its distribution differs from the out-of-sample distribution, but not irremediably so (for example, when all your classes are present but not in the right proportions). In such cases, your results are affected by a certain variance of the estimates that you can stabilize in one of several ways: by resampling, as in bootstrapping; by subsampling (taking a sample of the sample); or by using smaller samples (which increases bias).
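The difference between bootstrapping and subsampling is easier to see in code. The following is a minimal sketch using only Python's standard library; the function names `bootstrap` and `subsample`, the toy data, and the 0.8 fraction are illustrative assumptions, not something the book prescribes.

```python
import random

def bootstrap(data, rng):
    """Resample WITH replacement; the result has the same size as the original."""
    return rng.choices(data, k=len(data))

def subsample(data, rng, fraction=0.8):
    """Sample WITHOUT replacement; the result is a fraction of the original."""
    return rng.sample(data, int(len(data) * fraction))

rng = random.Random(42)          # fixed seed so the sketch is repeatable
data = list(range(10))           # hypothetical in-sample cases

boot = bootstrap(data, rng)      # may contain repeated cases
sub = subsample(data, rng)       # 8 distinct cases out of 10

print("bootstrap:", boot)
print("subsample:", sub)
```

A bootstrap sample usually repeats some cases and omits others, whereas a subsample never repeats a case but is smaller than the original data.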

To understand how ensembling works so effectively, picture a bull’s eye. If your sample is affecting the predictions, some predictions will be exact and others will be wrong in a random way. If you change your sample, the right predictions will stay right, but the wrong ones will start to vary across different values. Some values will be the exact prediction you are looking for; others will just oscillate around the right one.

By comparing the results, you can guess that what is recurring is the right answer. You can also take an average of the answers and guess that the right answer should be in the middle of the values. With the bull’s-eye game, you can visualize superimposing photos of different games: If the problem is variance, ultimately you will guess that the target is in the most frequently hit area or at least at the center of all the shots.

In most cases, such an approach proves to be correct and improves your machine learning predictions a lot. When your problem is bias and not variance, using ensembling doesn’t cause harm unless your subsamples are too small. A good rule of thumb for subsampling is to take a sample of 70 to 90 percent of the original in-sample data. If you want to make ensembling work, you should do the following:

Iterate a large number of times through your data and models (from a minimum of three iterations to, ideally, hundreds).
Every time you iterate, subsample (or else bootstrap) your in-sample data.
Train the machine learning model on the resampled data, and predict the out-of-sample results. Store those results away for later use.
At the end of the iterations, for every out-of-sample case you want to predict, take all its predictions and average them if you are doing a regression. Take the most frequent class if you are doing a classification.
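The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the model is a hand-written one-variable least-squares line, the toy data follows y = 2x + 1 plus noise, and the names `fit_line`, `ensemble_predict`, and `majority_vote` are hypothetical helpers. Any learning algorithm could take the place of `fit_line`.

```python
import random
from collections import Counter
from statistics import mean

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept on (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def ensemble_predict(train, test_xs, n_iter=100, fraction=0.8, seed=0):
    """Subsample, fit, predict, store; then average per out-of-sample case."""
    rng = random.Random(seed)
    k = max(2, int(len(train) * fraction))
    stored = [[] for _ in test_xs]           # one list of predictions per case
    for _ in range(n_iter):
        sample = rng.sample(train, k)        # subsample without replacement
        slope, intercept = fit_line(sample)  # train on the resampled data
        for i, x in enumerate(test_xs):
            stored[i].append(slope * x + intercept)
    return [mean(preds) for preds in stored]  # regression: average

def majority_vote(labels):
    """Classification: take the most frequent predicted class."""
    return Counter(labels).most_common(1)[0][0]

# Toy in-sample data (an assumption for illustration): y = 2x + 1 plus noise.
rng = random.Random(7)
train = [(x, 2 * x + 1 + rng.gauss(0, 1)) for x in range(50)]
test_xs = [10, 25, 40]

preds = ensemble_predict(train, test_xs)
print(preds)
print(majority_vote(["spam", "ham", "spam"]))
```

The default `fraction=0.8` sits inside the 70 to 90 percent rule of thumb; replacing `rng.sample(train, k)` with `rng.choices(train, k=len(train))` would turn the sketch into bootstrapping instead of subsampling.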

Leakage traps can surprise you because they can prove to be an unknown and undetected source of problems with your machine learning processes. The problem is snooping, or otherwise observing the out-of-sample data too much and adapting to it too often. In short, snooping is a kind of overfitting, and not just on the training data but also on the test data, which makes the overfitting problem itself harder to detect until you get fresh data.

Usually you realize that the problem is snooping only after you have applied the machine learning algorithm to your business or to a service for the public, making the problem an issue that everyone can see.

You can avoid snooping in two ways. First, when operating on the data, take care to neatly separate training, validation, and test data. Also, when processing the data, never take any information from the validation or test sets, even from the simplest and most innocent-looking examples. Worse still is applying a complex transformation that uses all the data.

In finance, for instance, it is well known that calculating the mean and the standard deviation (which can actually tell you a lot about market conditions and risk) from all training and testing data can leak precious information into your models. When leakage happens, machine learning algorithms predict well on the test set but not on the out-of-sample data from the markets, which means that they don’t really work at all and end up causing a loss of money.
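The mean and standard deviation case can be illustrated with a small sketch. The helper names `fit_scaler` and `transform` and the toy numbers are assumptions made for this example; the point is only that the statistics come from the training data and are then reused, unchanged, on the test data.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn the mean and standard deviation from the training data only."""
    return mean(train_values), pstdev(train_values)

def transform(values, mu, sigma):
    """Standardize values using statistics learned elsewhere."""
    return [(v - mu) / sigma for v in values]

# Hypothetical values: the test period behaves differently from training.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [10.0, 12.0]

# Correct: the statistics come from the training data alone.
mu, sigma = fit_scaler(train)
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)

# Leaky: the statistics also see the test data, so information about the
# test period flows back into everything the model is trained on.
leaky_mu, leaky_sigma = fit_scaler(train + test)

print("train-only mean/std:", mu, sigma)
print("leaky mean/std:     ", leaky_mu, leaky_sigma)
```

In the leaky version the mean and standard deviation shift toward the test values, which is exactly the information a model should not have seen during training.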

Second, take care when you check the performance on your out-of-sample examples. You may be tempted to bring back some information from your snooping on the test results to help you decide that certain parameters are better than others, or to choose one machine learning algorithm instead of another. For every model or parameter, base your choice on cross-validation results or on the validation sample. Never take lessons from your out-of-sample data, or you’ll regret it later.
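As a rough sketch of this discipline, the following example chooses a parameter using only the validation sample and touches the test sample once, at the very end. The k-nearest-neighbors regressor, the toy data (y = x squared plus noise), the 90/30/30 split, and the candidate values of k are all assumptions made for illustration.

```python
import random
from statistics import mean

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return mean(y for _, y in nearest)

def mse(train, holdout, k):
    """Mean squared error of k-NN predictions on a held-out sample."""
    return mean((knn_predict(train, x, k) - y) ** 2 for x, y in holdout)

# Toy data (an assumption for illustration): y = x^2 plus noise.
rng = random.Random(1)
data = []
for _ in range(150):
    x = rng.uniform(-1, 1)
    data.append((x, x * x + rng.gauss(0, 0.1)))

# Neatly separate training, validation, and test data.
train, val, test = data[:90], data[90:120], data[120:]

# Choose the parameter from validation results only.
candidates = [1, 3, 5, 9, 15]
val_scores = {k: mse(train, val, k) for k in candidates}
best_k = min(val_scores, key=val_scores.get)

# Touch the test sample once, after every choice has been made.
test_mse = mse(train, test, best_k)
print("best k:", best_k, "test MSE:", test_mse)
```

If `test_mse` were fed back into the choice of `best_k`, the test sample would stop being an honest estimate of out-of-sample performance.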

About This Article

About the book author:

John Paul Mueller is a freelance author and technical editor. He has writing in his blood, having produced 100 books and more than 600 articles to date. The topics range from networking to home security and from database management to heads-down programming. John has provided technical services to both Data Based Advisor and Coast Compute magazines.

Luca Massaron is a data scientist specialized in organizing and interpreting big data and transforming it into smart data by means of the simplest and most effective data mining and machine learning techniques. Because of his job as a quantitative marketing consultant and marketing researcher, he has been involved in quantitative data since 2000 with different clients and in various industries, and is one of the top 10 Kaggle data scientists.

This article can be found in the category:

Machine Learning
