How to Create Classification and Regression Trees in Python for Data Science
Data scientists call trees that specialize in predicting classes classification trees; trees that estimate numeric values instead are known as regression trees. Here's a classification example, using Fisher's Iris dataset:
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target
features = iris.feature_names
After loading the data into X, which contains the predictors, and y, which holds the classifications, you can define a cross-validation scheme for checking the results of the decision trees:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

crossvalidation = KFold(n_splits=5, shuffle=True, random_state=1)
Using the DecisionTreeClassifier class, you define max_depth inside an iterative loop to experiment with the effect of increasing the complexity of the resulting tree. The expectation is to reach an ideal point quickly and then witness decreasing cross-validation performance because of overfitting:
from sklearn import tree

for depth in range(1, 10):
    tree_classifier = tree.DecisionTreeClassifier(
        max_depth=depth, random_state=0)
    if tree_classifier.fit(X, y).tree_.max_depth < depth:
        break
    score = np.mean(cross_val_score(tree_classifier, X, y,
                                    scoring='accuracy',
                                    cv=crossvalidation, n_jobs=1))
    print('Depth: %i Accuracy: %.3f' % (depth, score))

Depth: 1 Accuracy: 0.580
Depth: 2 Accuracy: 0.913
Depth: 3 Accuracy: 0.920
Depth: 4 Accuracy: 0.940
Depth: 5 Accuracy: 0.920
The best solution is a tree with a maximum depth of four. Check out the complexity of the resulting tree.
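One way to inspect that complexity is scikit-learn's export_text helper, which prints the splits as indented rules (a minimal sketch, assuming scikit-learn 0.21 or later; the max_depth=4 setting comes from the search above):

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Fit the depth-4 tree selected by cross-validation
tree_classifier = tree.DecisionTreeClassifier(max_depth=4, random_state=0)
tree_classifier.fit(X, y)

# Print each split as an indented if/else rule
print(tree.export_text(tree_classifier,
                       feature_names=list(iris.feature_names)))
```

Reading the printed rules shows which features the tree actually uses for its splits and how deep each branch goes.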
To obtain an effective reduction and simplification, you can set min_samples_split to 30 and avoid terminal leaves that are too small by setting min_samples_leaf to 10. These settings prune the small terminal leaves in the new resulting tree, slightly diminishing cross-validation accuracy but increasing the simplicity and generalization power of the solution.
tree_classifier = tree.DecisionTreeClassifier(
    min_samples_split=30, min_samples_leaf=10,
    random_state=0)
tree_classifier.fit(X, y)
score = np.mean(cross_val_score(tree_classifier, X, y,
                                scoring='accuracy',
                                cv=crossvalidation, n_jobs=1))
print('Accuracy: %.3f' % score)

Accuracy: 0.913
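You can verify that these settings really do simplify the tree by comparing leaf counts against an unconstrained tree (a sketch using get_n_leaves, available since scikit-learn 0.21):

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Unconstrained tree: grows until every leaf is pure
full_tree = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# Pruned tree: the min_samples settings used above
pruned_tree = tree.DecisionTreeClassifier(
    min_samples_split=30, min_samples_leaf=10,
    random_state=0).fit(X, y)

print('full tree leaves:  ', full_tree.get_n_leaves())
print('pruned tree leaves:', pruned_tree.get_n_leaves())
```

The pruned tree ends up with fewer leaves, which is exactly the trade of training fit for simplicity described above.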
Similarly, using the DecisionTreeRegressor class, you can model a regression problem, such as the Boston house price dataset. When dealing with a regression tree, the terminal leaves offer the average of the cases as the prediction output.
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor

boston = load_boston()
X, y = boston.data, boston.target
features = boston.feature_names

regression_tree = DecisionTreeRegressor(
    min_samples_split=30, min_samples_leaf=10,
    random_state=0)
regression_tree.fit(X, y)
score = np.mean(cross_val_score(regression_tree, X, y,
                                scoring='neg_mean_squared_error',
                                cv=crossvalidation, n_jobs=1))
print('Mean squared error: %.3f' % abs(score))

Mean squared error: 22.593
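You can check the claim that a regression tree predicts the average of the cases in a leaf by using the tree's apply method, which reports which leaf each sample falls into. This sketch uses the diabetes dataset rather than Boston, since load_boston was removed in scikit-learn 1.2:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

regression_tree = DecisionTreeRegressor(
    min_samples_split=30, min_samples_leaf=10,
    random_state=0)
regression_tree.fit(X, y)

# apply() returns the index of the leaf each sample lands in
leaves = regression_tree.apply(X)
first_leaf = leaves[0]

# Mean target of all training cases sharing the first sample's leaf
leaf_mean = y[leaves == first_leaf].mean()
prediction = regression_tree.predict(X[:1])[0]
print(leaf_mean, prediction)
```

With the default squared-error criterion, the two printed values coincide: each leaf stores the mean of its training targets.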