How to Create Classification and Regression Trees in Python for Data Science

By John Paul Mueller, Luca Massaron

Data scientists call trees that specialize in predicting classes classification trees; trees that estimate numeric values instead are known as regression trees. Here’s a classification problem in Python, using Fisher’s Iris dataset:

from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
features = iris.feature_names
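
As a quick sanity check (not part of the original example), you can confirm what you loaded: 150 flowers described by four predictors, with three target classes:

# Inspect the loaded data: 150 samples, 4 features, 3 classes.
print(X.shape, y.shape)
print(features)
print(iris.target_names)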

After loading the data into X, which contains the predictors, and y, which holds the classifications, you can define the cross-validation procedure used to check the results from the decision trees:

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

crossvalidation = KFold(n_splits=5, shuffle=True,
    random_state=1)

Using the DecisionTreeClassifier class, you define max_depth inside an iterative loop to experiment with the effect of increasing the complexity of the resulting tree. The expectation is to reach an ideal point quickly and then witness decreasing cross-validation performance because of overfitting:

import numpy as np
from sklearn import tree

for depth in range(1, 10):
    tree_classifier = tree.DecisionTreeClassifier(
        max_depth=depth, random_state=0)
    # Stop when the tree can't actually grow to the requested depth.
    if tree_classifier.fit(X, y).tree_.max_depth < depth:
        break
    score = np.mean(cross_val_score(tree_classifier, X, y,
        scoring='accuracy', cv=crossvalidation, n_jobs=1))
    print('Depth: %i Accuracy: %.3f' % (depth, score))
Depth: 1 Accuracy: 0.580
Depth: 2 Accuracy: 0.913
Depth: 3 Accuracy: 0.920
Depth: 4 Accuracy: 0.940
Depth: 5 Accuracy: 0.920

The best solution is a tree with a maximum depth of four. Check out the complexity of the resulting tree.

A tree model of the Iris dataset using a depth of four splits.
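
If you want to inspect those splits yourself rather than rely on the figure, one option (not part of the original example, and assuming scikit-learn 0.21 or later) is the export_text helper, which prints the fitted tree as indented decision rules:

from sklearn.tree import export_text

# Refit at the best depth found above and print the split rules.
best_tree = tree.DecisionTreeClassifier(
    max_depth=4, random_state=0)
best_tree.fit(X, y)
print(export_text(best_tree, feature_names=features))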

To obtain an effective reduction and simplification, you can set min_samples_split to 30 and avoid terminal leaves that are too small by requiring min_samples_leaf to be at least 10. These settings prune the small terminal leaves in the new resulting tree, diminishing cross-validation accuracy but increasing simplicity and the generalization power of the solution.

tree_classifier = tree.DecisionTreeClassifier(
    min_samples_split=30, min_samples_leaf=10,
    random_state=0)
tree_classifier.fit(X, y)
score = np.mean(cross_val_score(tree_classifier, X, y,
    scoring='accuracy', cv=crossvalidation, n_jobs=1))
print('Accuracy: %.3f' % score)
Accuracy: 0.913
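
As a rough way to see how much simpler the pruned tree really is (a sketch that isn't in the original text; node_count is an attribute of the fitted tree_ object), you can compare the number of nodes in the two models:

# The pruned tree should contain noticeably fewer nodes than
# the depth-4 tree fitted earlier.
deep_tree = tree.DecisionTreeClassifier(
    max_depth=4, random_state=0).fit(X, y)
pruned_tree = tree.DecisionTreeClassifier(
    min_samples_split=30, min_samples_leaf=10,
    random_state=0).fit(X, y)
print('Deep tree nodes: %i' % deep_tree.tree_.node_count)
print('Pruned tree nodes: %i' % pruned_tree.tree_.node_count)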

Similarly, using the DecisionTreeRegressor class, you can model a regression problem, such as the Boston house price dataset. When dealing with a regression tree, each terminal leaf offers the average of the cases that reach it as the prediction output.

from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor

# Note: load_boston was removed in scikit-learn 1.2; with newer
# versions, fetch the Boston data from openml.org instead.
boston = load_boston()
X, y = boston.data, boston.target
features = boston.feature_names

regression_tree = DecisionTreeRegressor(
    min_samples_split=30, min_samples_leaf=10,
    random_state=0)
regression_tree.fit(X, y)
score = np.mean(cross_val_score(regression_tree, X, y,
    scoring='neg_mean_squared_error', cv=crossvalidation,
    n_jobs=1))
print('Mean squared error: %.3f' % abs(score))
Mean squared error: 22.593
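
To see that leaf-averaging behavior directly (a quick sketch, not in the original text; apply is a standard estimator method that returns the leaf index each sample falls into), you can check that a leaf's prediction matches the mean target of its training samples:

# Map each training sample to its terminal leaf, then confirm that
# the prediction for a sample equals its leaf's mean target value.
leaf_ids = regression_tree.apply(X)
first_leaf = leaf_ids[0]
leaf_mean = y[leaf_ids == first_leaf].mean()
print('Leaf mean: %.3f Prediction: %.3f' % (
    leaf_mean, regression_tree.predict(X[:1])[0]))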