Decision Trees in R
R has a package that uses recursive partitioning to construct decision trees. It's called rpart, and its function for constructing trees is called rpart(). To install the rpart package, click Install on the Packages tab and type rpart in the Install Packages dialog box. Then, in the dialog box, click the Install button. After the package downloads, find rpart in the Packages tab and click to select its check box.
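If you prefer the console to the Packages tab, this sketch does the same job; install.packages() is the standard CRAN installer, and library() replaces selecting the check box:

```r
# One-time install from CRAN (commented out; run it once if rpart is missing)
# install.packages("rpart")
library(rpart)  # load the package for this session
```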
To create a decision tree for the iris.uci data frame, use the following code:

library(rpart)
iris.tree <- rpart(species ~ sepal.length + sepal.width + petal.length + petal.width,
                   iris.uci, method = "class")
The first argument to rpart() is a formula indicating that species depends on the other four variables. [The tilde (~) means "depends on."] The second argument is the data frame you're using. The third argument, method = "class", tells rpart() that this is a classification tree. (For a regression tree, it's method = "anova".)
You can abbreviate the whole right side of the formula with a period. So the shorthand version is species ~ .
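As a sketch of the shorthand in action, here it is with R's built-in iris data frame; note that its column names are capitalized (Species, Petal.Length, and so on), unlike the lowercase names in iris.uci:

```r
library(rpart)

# The period stands for every other column in the data frame, so this fits
# the same model as spelling out all four predictor variables.
iris.tree <- rpart(Species ~ ., data = iris, method = "class")
```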
The object on the left side of the assignment, iris.tree, is called an rpart object. So rpart() creates an rpart object.
At this point, you can type the name of the rpart object and see text output that describes the tree:

n= 150

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)
  2) petal.length< 2.45 50 0 setosa (1.00000000 0.00000000 0.00000000) *
  3) petal.length>=2.45 100 50 versicolor (0.00000000 0.50000000 0.50000000)
    6) petal.width< 1.75 54 5 versicolor (0.00000000 0.90740741 0.09259259) *
    7) petal.width>=1.75 46 1 virginica (0.00000000 0.02173913 0.97826087) *
The first line indicates that this tree is based on 150 cases. The second line provides a key for understanding the output. The third line tells you that an asterisk denotes that a node is a leaf.
Each row corresponds to a node on the tree. The first entry in the row is the node number followed by a right parenthesis. The second is the variable and the value that make up the split. The third, n, is the number of classified cases at that node. The fourth, loss, is the number of misclassified cases at the node. Misclassified? Compared to what? Compared to the next entry, yval, which is the tree's best guess of the species at that node. The final entry is a parenthesized set of proportions that correspond to the proportion of each species at the node.
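If you'd rather pull these numbers out programmatically, an rpart object stores the node table in its frame component; for a classification tree, the dev column holds what the printout calls loss. A sketch using the built-in iris data:

```r
library(rpart)

iris.tree <- rpart(Species ~ ., data = iris, method = "class")

# One row per node: splitting variable, cases at the node, misclassified
# cases (dev = the printed "loss"), and the predicted class code (yval).
iris.tree$frame[, c("var", "n", "dev", "yval")]
```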
You can see the perfect classification in node 2, where loss (misclassification) is 0. By contrast, in nodes 6 and 7, loss is not 0. Also, unlike node 2, the parenthesized proportions for nodes 6 and 7 do not show 1.00 in the slots that represent the correct species. So the classification rules for versicolor and virginica result in small amounts of error.
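One way to see those misclassifications directly is a confusion table built with predict(). A sketch with the built-in iris data (column names capitalized there):

```r
library(rpart)

iris.tree <- rpart(Species ~ ., data = iris, method = "class")

# Classify each case with the tree, then cross-tabulate the tree's
# guesses against the actual species. Off-diagonal cells are the losses.
predicted <- predict(iris.tree, iris, type = "class")
table(actual = iris$Species, predicted = predicted)
```

The off-diagonal counts match the loss values in the printed output: a few versicolor and virginica cases end up in each other's leaves.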
Drawing the tree in R
Now you plot the decision tree, and you can see how it corresponds to the rpart() output. You do this with a function called prp(), which lives in a separate package called rpart.plot. (The rpart package has a function called plot.rpart(), which is supposed to plot a decision tree, but you might find that your version of R can't find it. It can find the function's documentation via ?plot.rpart, but it can't find the function. Weird.) Install rpart.plot the same way you installed rpart. With rpart.plot installed, here's the code that plots the tree below:
library(rpart.plot)
prp(iris.tree, type = 2, extra = "auto", nn = TRUE, branch = 1, varlen = 0, yesno = 2)
The first argument to prp() is the rpart object. That’s the only argument that’s necessary. Think of the rpart object as a set of specifications for plotting the tree. You can add the other arguments to make the plot prettier:
type = 2 means "label all the nodes."
extra = "auto" tells prp() to include the information you see in each rounded rectangle that's in addition to the species name.
nn = TRUE puts the node number on each node.
branch = 1 indicates the lines-with-corners style of branching. These are called "square-shouldered branches," believe it or not. For slump-shouldered branches (I made that up), try a value between 0 and 1.
varlen = 0 produces the full variable names on all the nodes (instead of names truncated to 8 characters).
yesno = 2 puts yes and no on all the appropriate branches (instead of just the ones descending from the root, which is the default). Note that each left branch is yes and each right branch is no.
At the root node and the internal node, you see the split. The rounded rectangle at each node shows a species name, three proportions, and the percentage of the data encompassed at that node.
At the root, the proportions are .33 for each species, and 100 percent of the data is at the root. The split (petal.length < 2.4) puts 33 percent of the data at the setosa leaf and 67 percent at the internal node. The setosa leaf shows the proportions 1.00 .00 .00, indicating that all the cases at that leaf are perfectly classified as setosas.
The internal node shows .00 .50 .50, which means none of these cases are setosas, half are versicolor, and half are virginica. The internal node split (petal.width < 1.8) puts 36 percent of the cases into the versicolor leaf and 31 percent of the cases into the virginica leaf. Already this shows a problem: With perfect classification, those percentages would be equal, because each species shows up equally in the data.
On the versicolor leaf, the proportions are .00 .91 .09. This means 9 percent of the cases classified as versicolor are actually virginica. On the virginica leaf, the proportions are .00 .02 .98. So 2 percent of the cases classified as virginica are really versicolor.
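You can recover those leaf percentages from predict() as well. A sketch with the built-in iris data:

```r
library(rpart)

iris.tree <- rpart(Species ~ ., data = iris, method = "class")

# Percentage of the 150 cases that lands on each leaf, rounded the way
# the plot rounds them (roughly 33, 36, and 31 percent).
predicted <- predict(iris.tree, iris, type = "class")
round(100 * table(predicted) / nrow(iris))
```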
Bottom line: For the great majority of the 150 cases in the data, the classification rules in the decision tree get the job done. But the rules aren’t perfect, which is typically the case with a decision tree.