Optimizing Cross-Validation Choices in Machine Learning

John Paul Mueller

Luca Massaron

Updated

2016-10-06 13:57:36

From the book

Machine Learning For Dummies

Being able to validate a machine learning hypothesis effectively lets you optimize your chosen algorithm further. The algorithm provides most of the predictive performance on your data, given its ability to detect signals in the data and fit the true functional form of the predictive function without overfitting and without generating much variance in the estimates. Not every machine learning algorithm is a good fit for your data, and no single algorithm suits every problem. It’s up to you to find the right one for a specific problem.

A second source of predictive performance is the data itself when appropriately transformed and selected to enhance the learning capabilities of the chosen algorithm.

The final source of performance derives from fine-tuning the algorithm’s hyper-parameters, which are the parameters that you decide before learning happens and that aren’t learned from data. Their role is to define the hypothesis a priori, whereas the other parameters specify it a posteriori: the algorithm interacts with the data and, through an optimization process, finds the parameter values that produce better predictions.
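The distinction can be sketched in a few lines of Python. This is only an illustration, assuming Scikit-learn is installed; the choice of `Ridge`, the synthetic data, and the value `alpha=1.0` are assumptions made for the example, not something the article prescribes.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data, used here only for illustration.
X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)

# alpha is a hyper-parameter: you fix it before learning starts.
model = Ridge(alpha=1.0)
hyper_params = model.get_params()

# The coefficients are ordinary parameters: fit() estimates them from the data.
model.fit(X, y)
learned_coefficients = model.coef_

print("Chosen before fitting:", hyper_params["alpha"])
print("Learned from data:", learned_coefficients)
```

Before `fit()` runs, the model has an `alpha` but no `coef_` attribute; the coefficients exist only after the algorithm has seen the data.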

Not all machine learning algorithms require much hyper-parameter tuning, but some of the most complex ones do. Such algorithms still work out of the box, yet pulling the right levers may make a large difference in the correctness of the predictions. Even though the hyper-parameters aren’t learned from data, you should consider the data you’re working on when deciding them, and you should base the choice on cross-validation and a careful evaluation of the possibilities.

Complex machine learning algorithms, the ones most exposed to variance of estimates, present many choices expressed in a large number of parameters. Twiddling with these parameters makes the algorithm adapt more or less closely to the data it learns from. Sometimes too much hyper-parameter twiddling may even make the algorithm detect false signals in the data. That makes the hyper-parameters themselves an undetected source of variance if you manipulate them too much against a fixed reference such as a test set or a repeated cross-validation schema.

Both R and Python offer functions that slice your input matrix into train, test, and validation parts. For more complex testing procedures, such as cross-validation or bootstrapping, the Scikit-learn package offers an entire module, and R has a specialized package called caret that provides functions for data splitting, preprocessing, and testing.
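As a rough Python sketch of the splitting described above, assuming a recent Scikit-learn release in which these helpers live in `sklearn.model_selection`: the dataset, the split sizes, and the use of `LogisticRegression` are illustrative assumptions only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data, used here only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hold out 40 rows as the test set, then 40 more as the validation set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=40, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=40, random_state=0
)
print(len(X_train), len(X_val), len(X_test))

# Alternatively, run 5-fold cross-validation on everything except the test set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_rest, y_rest, cv=5)
print("Mean cross-validated accuracy:", scores.mean())
```

Passing an integer as `test_size` asks for an absolute number of rows, which keeps the three part sizes explicit.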

The sheer number of combinations that hyper-parameter values can form makes it hard to decide where to look for optimizations. As described when discussing gradient descent, an optimization space may contain value combinations that perform better or worse. Even after you find a good combination, you’re not assured that it’s the best option. (This is the problem of getting stuck in local minima when minimizing the error.)

As a practical solution to this problem, the best way to verify hyper-parameters for an algorithm applied to specific data is to test all the combinations by cross-validation and pick the best one. This simple approach, called grid-search, offers indisputable advantages: it lets you systematically sample the range of possible values to input into the algorithm and spot where the global minimum occurs.
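A minimal grid-search sketch in Python, assuming Scikit-learn’s `GridSearchCV`: the support vector classifier, the synthetic dataset, and the candidate values for `C` and `gamma` are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data, used here only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Every combination of these values is tested: 3 * 3 = 9 candidates.
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}

# Each candidate is scored by 5-fold cross-validation.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("Candidates tested:", len(search.cv_results_["params"]))
print("Best combination:", search.best_params_)
print("Best mean accuracy:", search.best_score_)
```

The best combination is only the best among the values listed in the grid; anything between or beyond them is never tested.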

On the other hand, grid-search also has serious drawbacks: it’s computationally intensive and quite time consuming, although you can easily run the tests in parallel on modern multicore computers. Moreover, systematic and intensive testing increases the chance of error, because noise in the dataset can produce validation results that look good but are fake.
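The cost is easy to estimate with a little arithmetic: the number of model fits is the product of the number of values tried for each hyper-parameter, multiplied by the number of cross-validation folds. The grid below is hypothetical, with names and values picked only to show the calculation.

```python
from itertools import product

# Hypothetical grid: the names and values are illustrative only.
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0, 100.0],
    "gamma": [0.001, 0.01, 0.1, 1.0],
    "kernel": ["rbf", "poly", "sigmoid"],
}
n_folds = 10

# Grid-search visits every combination of the listed values.
combinations = list(product(*param_grid.values()))

# Each combination is fitted once per cross-validation fold.
n_fits = len(combinations) * n_folds

print(len(combinations))  # 5 * 4 * 3 = 60 combinations
print(n_fits)             # 60 * 10 = 600 model fits
```

Adding one more hyper-parameter with five values would multiply the total by five. In Scikit-learn, `GridSearchCV` accepts an `n_jobs` argument that spreads these fits over several cores, which shortens the wait but doesn’t reduce the total work.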

Some alternatives to grid-search are available. Instead of testing everything, you can explore the space of possible hyper-parameter values guided by computationally heavy and mathematically complex nonlinear optimization techniques (like the Nelder-Mead method), use a Bayesian approach (where the number of tests is minimized by taking advantage of previous results), or use random search.
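As one hedged illustration of the first alternative, the sketch below lets SciPy’s Nelder-Mead implementation search for a single hyper-parameter, the `alpha` of a ridge regression, by minimizing the cross-validated error. The model, the data, the log-scale search, the clipping range, and the iteration limit are all assumptions made for this example.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data, used here only for illustration.
X, y = make_regression(n_samples=150, n_features=10, noise=20.0, random_state=0)


def to_alpha(log_alpha):
    # Clip the exponent so the optimizer cannot wander into extreme values.
    return 10.0 ** float(np.clip(log_alpha[0], -6.0, 6.0))


def cv_error(log_alpha):
    # Mean squared error over 5 folds for the alpha encoded by log_alpha.
    model = Ridge(alpha=to_alpha(log_alpha))
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    return -scores.mean()


start = np.array([0.0])  # corresponds to alpha = 1.0
result = minimize(cv_error, start, method="Nelder-Mead", options={"maxiter": 40})

best_alpha = to_alpha(result.x)
print("alpha found:", best_alpha)
print("cross-validated error:", result.fun)
```

Every evaluation of `cv_error` is a full cross-validation run, so the search is guided rather than exhaustive, but like any local method it can settle on a local minimum.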

Surprisingly, random search works incredibly well, is simple to understand, and isn’t just based on blind luck, though it may initially appear to be. In fact, the main point of the technique is that if you run enough random tests, you have a good chance of spotting the right parameters without wasting energy on testing slightly different combinations that perform similarly.

The graphical representation below explains why random search works fine. A systematic exploration, though useful, tends to test every combination, which turns into a waste of energy if some parameters don’t influence the result. A random search actually tests fewer combinations but tries more distinct values across the range of each hyper-parameter, a strategy that proves winning if, as often happens, certain parameters are more important than others.

Comparing grid-search to random search.
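The same comparison can be sketched numerically. In this toy example, both strategies get a budget of nine tests over two hypothetical hyper-parameters; the value ranges and the random seed are arbitrary assumptions.

```python
import random
from itertools import product

rng = random.Random(0)

# Grid-search: 3 values per hyper-parameter gives 3 * 3 = 9 tests.
grid_values = [0.1, 1.0, 10.0]
grid_trials = list(product(grid_values, grid_values))

# Random search: the same 9 tests, each value drawn log-uniformly from [0.1, 10].
random_trials = [
    (10 ** rng.uniform(-1, 1), 10 ** rng.uniform(-1, 1))
    for _ in range(len(grid_trials))
]


def distinct_first_values(trials):
    # Count how many different values of the first hyper-parameter were tried.
    return len({first for first, _ in trials})


print("Grid:", distinct_first_values(grid_trials), "distinct values")
print("Random:", distinct_first_values(random_trials), "distinct values")
```

If only the first hyper-parameter matters, the grid has effectively explored three settings with its nine tests, whereas the random draws have explored a different setting on every test.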

For randomized search to perform well, you should run between 15 and 60 tests. Resorting to random search makes sense when a grid-search would require more experiments than that.
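A minimal random-search sketch in Python, assuming Scikit-learn’s `RandomizedSearchCV` and SciPy’s `loguniform` distribution: the classifier, the sampling ranges, and the choice of 20 tests (within the 15 to 60 range mentioned above) are illustrative assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic data, used here only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Instead of fixed lists, describe a range to sample from for each hyper-parameter.
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-3, 1e0),
}

# n_iter sets the number of random combinations to test.
search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    param_distributions,
    n_iter=20,
    cv=5,
    random_state=0,
)
search.fit(X, y)

print("Candidates tested:", len(search.cv_results_["params"]))
print("Best combination:", search.best_params_)
print("Best mean accuracy:", search.best_score_)
```

The cost is fixed by `n_iter` no matter how many hyper-parameters you add, which is what makes the approach attractive when a full grid would be too large.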

About This Article

About the book author:

John Paul Mueller is a freelance author and technical editor. He has writing in his blood, having produced 100 books and more than 600 articles to date. The topics range from networking to home security and from database management to heads-down programming. John has provided technical services to both Data Based Advisor and Coast Compute magazines.

Luca Massaron is a data scientist specialized in organizing and interpreting big data and transforming it into smart data by means of the simplest and most effective data mining and machine learning techniques. Because of his job as a quantitative marketing consultant and marketing researcher, he has been involved in quantitative data since 2000 with different clients and in various industries, and is one of the top 10 Kaggle data scientists.

This article can be found in the category:

Machine Learning
