Choosing the Right Algorithm for Machine Learning

By John Paul Mueller, Luca Massaron

Part of Machine Learning For Dummies Cheat Sheet

Machine learning involves the use of many different algorithms. This quick reference summarizes what each algorithm is best at, along with its pros and cons, and follows each entry with a short, illustrative code sketch.

Random Forest

Best at:
- Almost any machine learning problem
- Bioinformatics

Pros:
- Can work in parallel
- Seldom overfits
- Automatically handles missing values
- No need to transform any variable
- No need to tweak parameters
- Can be used by almost anyone with excellent results

Cons:
- Difficult to interpret
- Weaker on regression when estimating values at the extremes of the distribution of response values
- Biased toward the more frequent classes in multiclass problems
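To see how little tweaking a random forest needs, here's a minimal sketch, assuming Python with scikit-learn and a synthetic dataset (the parameter values are illustrative, not recommendations):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Illustrative synthetic dataset standing in for a real problem
    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # n_jobs=-1 grows the trees in parallel on all available cores
    model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
    model.fit(X_train, y_train)
    print("Test accuracy:", model.score(X_test, y_test))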
Gradient Boosting

Best at:
- Almost any machine learning problem
- Search engines (solving the problem of learning to rank)

Pros:
- Can approximate most nonlinear functions
- Best-in-class predictor
- Automatically handles missing values
- No need to transform any variable

Cons:
- Can overfit if run for too many iterations
- Sensitive to noisy data and outliers
- Doesn't work well without parameter tuning
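Gradient boosting rewards careful tuning. The following minimal scikit-learn sketch uses synthetic data; the values of n_estimators, learning_rate, and max_depth are illustrative starting points you would normally tune:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Too many boosting iterations can overfit, so treat n_estimators
    # and learning_rate as knobs to tune, not fixed settings
    model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                       max_depth=3, random_state=42)
    model.fit(X_train, y_train)
    print("Test accuracy:", model.score(X_test, y_test))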
Linear Regression

Best at:
- Baseline predictions
- Econometric predictions
- Modeling marketing responses

Pros:
- Simple to understand and explain
- Seldom overfits
- L1 and L2 regularization make it effective for feature selection
- Fast to train
- Easy to train on big data thanks to its stochastic version

Cons:
- You have to work hard to make it fit nonlinear functions
- Can suffer from outliers
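Here's a minimal sketch of both flavors, assuming Python with scikit-learn and synthetic regression data: the closed-form LinearRegression as a quick baseline, and SGDRegressor as the stochastic version that scales to big data.

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression, SGDRegressor
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=1000, n_features=10, noise=10.0,
                           random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Ordinary least squares: fast to train, easy to explain
    model = LinearRegression()
    model.fit(X_train, y_train)
    print("R^2 (baseline):", model.score(X_test, y_test))

    # The stochastic version learns by gradient descent, one example at
    # a time, which is what makes it practical on big data
    sgd = SGDRegressor(max_iter=1000, random_state=42)
    sgd.fit(X_train, y_train)
    print("R^2 (stochastic):", sgd.score(X_test, y_test))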
Support Vector Machines

Best at:
- Character recognition
- Image recognition
- Text classification

Pros:
- Automatic nonlinear feature creation
- Can approximate complex nonlinear functions
- Works with only a portion of the examples (the support vectors)

Cons:
- Difficult to interpret when applying nonlinear kernels
- Scales poorly with many examples; beyond roughly 10,000 examples it starts taking too long to train
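The digits dataset makes a handy small-scale character-recognition demo. A minimal sketch, assuming scikit-learn (the RBF kernel and default settings are illustrative choices):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # The RBF kernel creates nonlinear features automatically;
    # scaling matters because SVMs are sensitive to feature ranges
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))
    model.fit(X_train, y_train)
    print("Test accuracy:", model.score(X_test, y_test))

    # Only these examples (the support vectors) define the boundary
    print("Support vectors per class:", model.named_steps["svc"].n_support_)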
K-Nearest Neighbors

Best at:
- Computer vision
- Multilabel tagging
- Recommender systems
- Spell-checking problems

Pros:
- Fast, lazy training
- Naturally handles extreme multiclass problems (such as tagging text)

Cons:
- Slow and cumbersome in the prediction phase
- Can fail to predict correctly because of the curse of dimensionality
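A minimal scikit-learn sketch on synthetic multiclass data (n_neighbors=5 is an illustrative choice) shows why the training phase is called lazy:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                               n_classes=3, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # "Lazy" learning: fit() mostly just stores the training data;
    # the costly neighbor search happens at prediction time
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)
    print("Test accuracy:", model.score(X_test, y_test))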
AdaBoost

Best at:
- Face detection

Pros:
- Automatically handles missing values
- No need to transform any variable
- Doesn't overfit easily
- Few parameters to tweak
- Can leverage many different weak learners

Cons:
- Sensitive to noisy data and outliers
- Never produces best-in-class predictions
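A minimal scikit-learn sketch with synthetic data; by default AdaBoostClassifier boosts shallow decision trees (decision stumps), though you could swap in other weak learners:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Few knobs to tweak: mainly the number of boosting rounds
    model = AdaBoostClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print("Test accuracy:", model.score(X_test, y_test))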
Naive Bayes

Best at:
- Face recognition
- Sentiment analysis
- Spam detection
- Text classification

Pros:
- Easy and fast to implement; doesn't require much memory and can be used for online learning
- Easy to understand
- Takes prior knowledge into account

Cons:
- Makes strong, unrealistic feature-independence assumptions
- Fails at estimating rare occurrences
- Suffers from irrelevant features
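A minimal spam-detection sketch, assuming scikit-learn; the six-message corpus is made up purely to show the mechanics:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical toy corpus; 1 = spam, 0 = ham
    texts = ["win money now", "cheap pills offer", "meeting at noon",
             "lunch tomorrow?", "free money offer", "project status update"]
    labels = [1, 1, 0, 0, 1, 0]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)

    # Fast to fit and light on memory; partial_fit() also supports
    # online learning on streaming data
    model = MultinomialNB()
    model.fit(X, labels)
    print(model.predict(vectorizer.transform(["free money offer now"])))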
Neural Networks

Best at:
- Image recognition
- Language recognition and translation
- Speech recognition
- Vision recognition

Pros:
- Can approximate any nonlinear function
- Robust to outliers

Cons:
- Very difficult to set up
- Difficult to tune because of the many parameters; you also have to decide the architecture of the network
- Difficult to interpret
- Easy to overfit
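As a small stand-in for the deep networks used in image and speech recognition, here's a minimal scikit-learn sketch of a single-hidden-layer perceptron on the digits dataset (the architecture and iteration count are illustrative assumptions):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # hidden_layer_sizes is the architecture decision; together with the
    # learning rate, iterations, and so on, it's what makes tuning hard
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=42),
    )
    model.fit(X_train, y_train)
    print("Test accuracy:", model.score(X_test, y_test))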
Logistic Regression

Best at:
- Ordering results by probability
- Modeling marketing responses

Pros:
- Simple to understand and explain
- Seldom overfits
- L1 and L2 regularization make it effective for feature selection
- The best algorithm for predicting the probability of an event
- Fast to train
- Easy to train on big data thanks to its stochastic version

Cons:
- You have to work hard to make it fit nonlinear functions
- Can suffer from outliers
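A minimal scikit-learn sketch with synthetic data; note that predict_proba() is what gives you the event probabilities used to order results:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # penalty="l2" is the default; penalty="l1" (with the liblinear or
    # saga solver) instead zeroes out weak features for feature selection
    model = LogisticRegression(penalty="l2", max_iter=1000)
    model.fit(X_train, y_train)

    # Probability estimates for each class, usable for ranking results
    print(model.predict_proba(X_test[:3]))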
SVD

Best at:
- Recommender systems

Pros:
- Can restructure data in a meaningful way

Cons:
- Difficult to understand why data has been restructured in a certain way
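A minimal sketch of the recommender idea, assuming scikit-learn's TruncatedSVD and a hypothetical four-user, four-item ratings matrix:

    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    # Hypothetical user-by-item ratings (0 = not rated)
    ratings = np.array([[5, 4, 0, 1],
                        [4, 5, 1, 0],
                        [0, 1, 5, 4],
                        [1, 0, 4, 5]])

    # Two latent factors restructure users into "taste" dimensions;
    # interpreting what each factor means is the hard part
    svd = TruncatedSVD(n_components=2, random_state=42)
    user_factors = svd.fit_transform(ratings)
    print(user_factors)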
PCA

Best at:
- Removing collinearity
- Reducing the dimensions of the dataset

Pros:
- Can reduce data dimensionality

Cons:
- Implies strong linear assumptions (components are weighted summations of features)
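A minimal sketch, assuming scikit-learn, that builds deliberately collinear synthetic features and lets PCA compress them:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(42)
    base = rng.normal(size=(200, 2))
    # Add two features that are (nearly) linear copies of the first one
    X = np.column_stack([base, base[:, 0] * 2.0,
                         base[:, 0] + 0.1 * rng.normal(size=200)])

    # Each component is a weighted summation of the original features,
    # which is exactly the linear assumption mentioned above
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)
    print("Explained variance ratio:", pca.explained_variance_ratio_)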
K-Means

Best at:
- Segmentation

Pros:
- Fast at finding clusters
- Can detect outliers in multiple dimensions

Cons:
- Suffers from multicollinearity
- Assumes spherical clusters, so it can't detect groups with other shapes
- Unstable solutions that depend on the initialization
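A minimal segmentation sketch, assuming scikit-learn and blob-shaped synthetic data (the roughly spherical clusters K-means expects); n_init reruns the algorithm from several initializations to soften the instability problem:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Roughly spherical synthetic clusters, the shape K-means assumes
    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # n_init=10 runs K-means from ten random initializations and keeps
    # the best solution, mitigating the dependence on initialization
    model = KMeans(n_clusters=3, n_init=10, random_state=42)
    labels = model.fit_predict(X)
    print("Cluster sizes:", np.bincount(labels))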