Decision Trees in R - dummies

By Joseph Schmuller

R has a package that uses recursive partitioning to construct decision trees. It’s called rpart, and its function for constructing trees is called rpart(). To install the rpart package, click Install on the Packages tab and type rpart in the Install Packages dialog box. Then, in the dialog box, click the Install button. After the package downloads, find rpart in the Packages tab and click to select its check box.
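If you prefer the console to the Packages tab, the same steps boil down to two commands:

```r
# Equivalent to clicking through the Packages tab in RStudio:
install.packages("rpart")  # download and install (one time only)
library(rpart)             # load the package for the current session
```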

Growing the tree in R

To create a decision tree for the iris.uci data frame, use the following code:

library(rpart)
iris.tree <- rpart(species ~ sepal.length + sepal.width + petal.length + petal.width, 
                iris.uci, method="class")

The first argument to rpart() is a formula indicating that species depends on the other four variables. [The tilde (~) means "depends on."] The second argument is the data frame you're using. The third argument, method = "class", tells rpart() that this is a classification tree. (For a regression tree, use method = "anova".)

You can abbreviate the whole right side of the formula with a period. So the shorthand version is

species ~ .

The object on the left side of the assignment, iris.tree, is called an rpart object. In other words, rpart() creates an rpart object.
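Here's the shorthand version in a runnable sketch. The iris.uci data frame isn't bundled with R, so as an assumption this example approximates it by renaming the columns of R's built-in iris data:

```r
library(rpart)

# The article's iris.uci isn't bundled with R; here it's approximated
# (an assumption) by renaming the columns of the built-in iris data
iris.uci <- setNames(iris,
  c("sepal.length", "sepal.width", "petal.length", "petal.width", "species"))

# The period stands in for "all other variables in the data frame"
iris.tree <- rpart(species ~ ., data = iris.uci, method = "class")
class(iris.tree)  # "rpart"
```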

At this point, you can type the rpart object

iris.tree

and see text output that describes the tree:

n= 150

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)
  2) petal.length< 2.45 50   0 setosa (1.00000000 0.00000000 0.00000000) *
  3) petal.length>=2.45 100 50 versicolor (0.00000000 0.50000000 0.50000000)
    6) petal.width< 1.75 54  5 versicolor (0.00000000 0.90740741 0.09259259) *
    7) petal.width>=1.75 46  1 virginica (0.00000000 0.02173913 0.97826087) *

The first line indicates that this tree is based on 150 cases. The second line provides a key for understanding the output. The third line tells you that an asterisk denotes that a node is a leaf.

Each row corresponds to a node on the tree. The first entry in the row is the node number followed by a right parenthesis. The second is the variable and the value that make up the split. The third is the number of classified cases at that node. The fourth, loss, is the number of misclassified cases at the node. Misclassified? Compared to what? Compared to the next entry, yval, which is the tree’s best guess of the species at that node. The final entry is a parenthesized set of proportions that correspond to the proportion of each species at the node.

You can see the perfect classification in node 2, where loss (misclassification) is 0. By contrast, in nodes 6 and 7 loss is not 0. Also, unlike node 2, the parenthesized proportions for nodes 6 and 7 do not show 1.00 in the slots that represent the correct species. So the classification rules for versicolor and virginica result in small amounts of error.
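One way to see those losses directly is to compare the tree's predictions with the actual species. This is a sketch; as an assumption, it rebuilds iris.uci by renaming the columns of R's built-in iris data:

```r
library(rpart)

# The article's iris.uci isn't bundled with R; here it's approximated
# (an assumption) by renaming the columns of the built-in iris data
iris.uci <- setNames(iris,
  c("sepal.length", "sepal.width", "petal.length", "petal.width", "species"))
iris.tree <- rpart(species ~ ., data = iris.uci, method = "class")

# predict() with type = "class" returns the tree's best-guess species for
# each case; cross-tabulating against the true species shows where the
# losses land
predicted <- predict(iris.tree, type = "class")
table(actual = iris.uci$species, predicted = predicted)

# Total misclassified cases: the leaf losses add up (0 + 5 + 1)
sum(predicted != iris.uci$species)  # 6
```

The six misclassifications match the loss column in the text output: 0 at node 2, 5 at node 6, and 1 at node 7.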

Drawing the tree in R

Now you plot the decision tree, and you can see how it corresponds to the rpart() output. You do this with a function called prp(), which lives in the rpart.plot package.

The rpart package has a function called plot.rpart(), which is supposed to plot a decision tree. You might find that your version of R can't call it by that name: it can find the function's documentation via ?plot.rpart, but it can't find the function. That's because plot.rpart() is an S3 method that isn't exported; R calls it for you when you apply plot() to an rpart object, as in plot(iris.tree).

With rpart.plot installed, here’s the code that plots the tree below:

library(rpart.plot)
prp(iris.tree, type = 2, extra = "auto", nn = TRUE, branch = 1, varlen = 0, yesno = 2)
Decision tree for iris.uci, created by rpart() and rendered by prp().

The first argument to prp() is the rpart object. That’s the only argument that’s necessary. Think of the rpart object as a set of specifications for plotting the tree. You can add the other arguments to make the plot prettier:

  • type = 2 means "label all the nodes"
  • extra = "auto" tells prp() to include the information you see in each rounded rectangle, in addition to the species name
  • nn = TRUE puts the node number on each node
  • branch = 1 indicates the lines-with-corners style of branching. These are called "square-shouldered branches," believe it or not. For slump-shouldered branches (I made that up), try a value between 0 and 1
  • varlen = 0 produces the full variable names on all the nodes (instead of names truncated to eight characters)
  • yesno = 2 puts yes or no on all the appropriate branches (instead of just the ones descending from the root, which is the default). Note that each left branch is yes and each right branch is no

At the root node and the internal node, you see the split. The rounded rectangle at each node shows a species name, three proportions, and the percentage of the data encompassed at that node.

At the root, the proportions are .33 for each species, and 100 percent of the data is at the root. The split (petal.length < 2.4) puts 33 percent of the data at the setosa leaf and 67 percent at the internal node. The setosa leaf shows the proportions 1.00, .00, and .00, indicating that all the cases at that leaf are perfectly classified as setosas.

The internal node shows .00, .50, and .50, which means none of these cases are setosas, half are versicolor, and half are virginica. The internal node split (petal.width < 1.8) puts 36 percent of the cases into the versicolor leaf and 31 percent of the cases into the virginica leaf. Already this shows a problem: With perfect classification, those percentages would be equal, because each species shows up equally in the data.

On the versicolor leaf, the proportions are .00, .91, and .09. This means 9 percent of cases classified as versicolor are actually virginica. On the virginica leaf, the proportions are .00, .02, and .98. So 2 percent of the cases classified as virginica are really versicolor.

Bottom line: For the great majority of the 150 cases in the data, the classification rules in the decision tree get the job done. But the rules aren’t perfect, which is typically the case with a decision tree.