Python Programming: Making Machine Learning Accessible with the Random Forest Algorithm - dummies

Python Programming: Making Machine Learning Accessible with the Random Forest Algorithm

By John Paul Mueller, Luca Massaron

Random Forest is a classification and regression algorithm developed by Leo Breiman and Adele Cutler that uses a large number of decision tree models to provide precise predictions by reducing both the bias and variance of the estimates. When you aggregate many models together to produce a single prediction, the result is an ensemble of models. Random Forest isn’t just an ensemble model, it’s also a simple and effective algorithm to use as an out-of-the-box algorithm. It makes machine learning accessible to nonexperts. The Random Forest algorithm uses these steps to perform its predictions:

  1. Create a large number of decision trees, each one different from the other, based on different subsets of observations and variables.
  2. Bootstrap the dataset of observations for each tree (sampled from the original data with replacement). The same observation can appear multiple times in the same dataset.
  3. Randomly select and use only a part of the variables for each tree.
  4. Estimate the performance for each tree using the observations excluded by sampling (the Out Of Bag, or OOB, estimate).
  5. Obtain the final prediction, which is the average for regression estimates or the most frequent class for prediction, after all the trees have been fitted and used for prediction.

You can reduce bias by using these steps, because the decision trees have a good fit on data and, by relying on complex splits, can approximate even the most complex relationships between predictors and predicted outcome. Decision trees can produce a great variance of estimates, but you reduce this variance by averaging many trees. Noisy predictions, due to variance, tend to distribute evenly above and below the correct value that you want to predict — and when averaged together, they tend to cancel each other, leaving, as a result, a more correct average prediction.

Leo Breiman derived the idea for Random Forest from the bagging technique. Scikit-learn has a bagging class for both regression (BaggingRegressor) and classifying (BaggingClassifier) that you can use with any other predictor you prefer to pick from Scikit-learn modules. The max_samples and max_features parameters let you decide the proportion of cases and variables to sample (not bootstrapped, but sampled, so you can use a case only once) for building each model of the ensemble. The n_estimators parameter decides the total number of models in the ensemble. Here’s an example that loads the handwritten digit dataset (used for demonstrations later with other ensemble algorithms) and then fits the model by bagging:

from sklearn.datasets import load_digits
digit = load_digits()
X, y = digit.data, digit.target
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
tree_classifier = DecisionTreeClassifier(random_state=0)
crossvalidation = KFold(n_splits=5, shuffle=True,
random_state=1)
bagging = BaggingClassifier(tree_classifier,
max_samples=0.7,
max_features=0.7,
n_estimators=300)
scores = np.mean(cross_val_score(bagging, X, y,
scoring='accuracy',
cv=crossvalidation))
print ('Accuracy: %.3f' % scores)

Here’s the cross-validated accuracy for the bagging applied to the handwritten dataset:

Accuracy: 0.967

In bagging, as in Random Forest, the more models in the ensemble, the better. You run little risk of overfitting because every model is different from the others, and errors tend to spread around the real value. Adding more models just adds stability to the result.

Another characteristic of the algorithm is that it permits estimation of variable importance while taking the presence of all the other predictors into account. In this way, you can determine which feature is important for predicting a target given the set of features that you have; also, you can use the importance estimate as a guideline for variable selection.

In contrast to single decision trees, you can’t easily visualize or understand Random Forest, making it act as a black box (a black box is a transformation that doesn’t reveal its inner workings; all you see are its inputs and outputs). Given its opacity, importance estimation is the only way to understand how the algorithm works with respect to the features.

Importance estimation in a Random Forest is obtained in a straightforward way. After building each tree, the code fills each variable in turn with junk data and the example records how much the predictive power decreases. If the variable is important, crowding it with casual data harms the prediction; otherwise, the predictions are left almost unchanged and the variable is deemed unimportant.

Working with a Random Forest classifier

The example Random Forest classifier keeps using the previously loaded digit dataset:

X, y = digit.data, digit.target
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
crossvalidation = KFold(n_splits=5, shuffle=True,
random_state=1)
RF_cls = RandomForestClassifier(n_estimators=300,
random_state=1)
score = np.mean(cross_val_score(RF_cls, X, y,
scoring='accuracy',
cv=crossvalidation))
print('Accuracy: %.3f' % score)

The cross-validated accuracy reported by this code for the Random Forest is an improvement over the bagging method:

Accuracy: 0.977

Just setting the number of estimators is sufficient for most problems you encounter, and setting it correctly is a matter of using the highest number possible given the time and resource constraints of the host computer. You can demonstrate this by calculating and drawing a validation curve for the algorithm.

from sklearn.model_selection import validation_curve
param_range = [10, 50, 100, 200, 300, 500, 800, 1000, 1500]
crossvalidation = KFold(n_splits=3,
shuffle=True,
random_state=1)
RF_cls = RandomForestClassifier(n_estimators=300,
random_state=0)
train_scores, test_scores = validation_curve(RF_cls, X, y,
'n_estimators',
param_range=param_range,
cv=crossvalidation,
scoring='accuracy')
mean_test_scores = np.mean(test_scores, axis=1)
import matplotlib.pyplot as plt
plt.plot(param_range, mean_test_scores,
'bD-.', label='CV score')
plt.grid()
plt.xlabel('Number of estimators')
plt.ylabel('accuracy')
plt.legend(loc='lower right', numpoints= 1)
plt.show()

The image below shows the results provided by the preceding code. The more estimators, the better the results. However, at a certain point the gain becomes minimal indeed.

Random Forest Python data science
Verifying the impact of the number of estimators on Random Forest.

Working with a Random Forest regressor

RandomForestRegressor works in a similar way as the Random Forest for classification, using exactly the same parameters:

X, y = boston.data, boston.target
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
RF_rg = RandomForestRegressor (n_estimators=300,
random_state=1)
crossvalidation = KFold(n_splits=5, shuffle=True,
random_state=1)
score = np.mean(cross_val_score(RF_rg, X, y,
scoring='neg_mean_squared_error',
cv=crossvalidation))
print('Mean squared error: %.3f' % abs(score))

Here is the resulting cross-validated mean squared error:

Mean squared error: 12.028

The Random Forest uses decision trees. Decision trees https://www.dummies.com/programming/big-data/data-science/how-to-create-classification-and-regression-trees-in-python-for-data-science/ segment the dataset into small partitions, called leaves, when estimating regression values. The Random Forest takes the average of the values in each leaf to create a prediction. Using this procedure causes extreme and high values to disappear from predictions because of the averaging used for each leaf of the forest, producing dumped values instead of much higher or much lower values.

Optimizing a Random Forest

Random Forest models are out-of-box algorithms that can work quite well without optimization or worrying about overfitting. (The more estimators you use, the better the output, depending on your resources.) You can always improve performance by removing redundant and less informative variables, fixing a minimum leaf size, and defining a sampling number that avoids having too many correlated predictors in the sample. The following example shows how to perform these tasks:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
X, y = digit.data, digit.target
crossvalidation = KFold(n_splits=5, shuffle=True,
random_state=1)
RF_cls = RandomForestClassifier(random_state=1)
scorer = 'accuracy'

Using the handwritten digit dataset and a first default classifier, you can optimize both max_features and min_samples_leaf. When optimizing max_features, you use preconfigured options (auto for all features, sqrt or log2 functions applied to the number of features) and integrate them using small feature numbers and a value of 1/3 of the features. Selecting the right number of features to sample tends to reduce the number of times when correlated and similar variables are picked together, thus increasing the predictive performances.

There is a statistical reason to optimize min_samples_leaf. Using leaves with few cases often corresponds to overfitting to very specific data combinations. You need to have at least 30 observations to achieve a minimal statistical confidence that data patterns correspond to real and general rules:

from sklearn.model_selection import GridSearchCV
max_features = [X.shape[1]//3, 'sqrt', 'log2', 'auto']
min_samples_leaf = [1, 10, 30]
n_estimators = [50, 100, 300]
search_grid = {'n_estimators':n_estimators,
'max_features': max_features,
'min_samples_leaf': min_samples_leaf}
search_func = GridSearchCV(estimator=RF_cls,
param_grid=search_grid,
scoring=scorer,
cv=crossvalidation)
search_func.fit(X, y)
best_params = search_func.best_params_
best_score = search_func.best_score_
print('Best parameters: %s' % best_params)
print('Best accuracy: %.3f' % best_score)

The best parameters and best accuracy obtained are then reported, highlighting that the parameters to act on is the number of trees:

Best parameters: {'max_features': 'sqrt',
'min_samples_leaf': 1,
'n_estimators': 100}
Best accuracy: 0.978