Growing a Random Forest in R
How do you create a forest out of a dataset in R? Well, randomly. Here’s what this means: You can create a decision tree from a dataset. The
rattle package can partition a data frame into a training set, a validation set, and a test set. The partitioning takes place as a result of random sampling from the rows of the data frame. By default,
rattle randomly assigns 70 percent of the rows to the training set, 15 percent to the validation set, and 15 percent to the test set.
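Outside of rattle, a minimal base-R sketch of that 70/15/15 partition might look like this (the built-in iris data stands in for iris.uci here, and the object names are my own):

```r
# Sketch: randomly partition a data frame into training,
# validation, and test sets (70/15/15), as rattle does by default
set.seed(42)                    # rattle's default seed
n <- nrow(iris)
shuffled <- sample(n)           # random permutation of the row indices

n_train <- round(0.70 * n)
n_valid <- round(0.15 * n)

training   <- iris[shuffled[1:n_train], ]
validation <- iris[shuffled[(n_train + 1):(n_train + n_valid)], ]
testing    <- iris[shuffled[(n_train + n_valid + 1):n], ]
```

With 150 rows, that yields a training set of 105 rows, with the rest split between validation and testing.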
The random row selection proceeds from a seed value, whose
Rattle default is
42. This produces the 70 percent of the observations used to create the decision tree. What happens if you change the seed value? The result is a different 70 percent of the sample and (potentially) a different tree. Change the seed again and again, produce a decision tree each time (and save each tree), and you create a forest.
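As a sketch of that seed-changing idea, here is one way to grow several trees, each from a different random 70 percent of the rows, using the rpart package (which ships with R and is what rattle uses for decision trees); the seeds and object names are my own:

```r
library(rpart)

# Grow one decision tree per seed, each from a different
# random 70% sample of the rows
seeds <- c(42, 123, 999)
forest <- lapply(seeds, function(seed) {
  set.seed(seed)
  train_rows <- sample(nrow(iris), size = round(0.70 * nrow(iris)))
  rpart(Species ~ ., data = iris[train_rows, ])
})

length(forest)   # a (very small) forest of three trees
```

Because each tree sees a different sample, the trees can end up with different splits.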
The trees provide decision rules for the
iris.uci data frame. The data are measurements of the length and width of petals and sepals in 150 irises. They consist of 50 each of the setosa, versicolor, and virginica species. Given a flower’s measurements, a tree uses its decision rules to determine the flower’s species. You add
.uci to the data frame’s name to indicate that you downloaded it from the Machine Learning Repository of the University of California-Irvine. A little data clean-up was necessary.
Each tree has its own decision rules, and the splits aren’t all based on the same variables. Instead of having only one tree decide a flower’s species, you can have all of the trees (say, three of them) make the determination. If they don’t all reach the same decision, the majority rules.
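Majority rule is easy to sketch in base R. Suppose three trees classify one flower as follows (the votes here are made up for illustration):

```r
# Each tree casts one vote for the flower's species
votes <- c("virginica", "versicolor", "virginica")

# Majority rules: tally the votes and take the most common species
tally  <- table(votes)
winner <- names(tally)[which.max(tally)]
winner   # "virginica"
```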
Now imagine hundreds of these trees, all created from the same data frame. In this setup, though, you randomly sample rows (with replacement) from the 70 percent of the rows designated as the training set, rather than create a new training set each time, as in the preceding example.
And then you add one more dimension of randomness: In addition to random selection of the data frame rows, suppose you add random selection of the variables to consider for each split of each decision tree.
So, here are two things to consider each time you grow a tree in the forest:
- For the data, you randomly select from the rows of the training set.
- For each split, you randomly select from the columns. (How many columns do you randomly select each time? A good rule of thumb is the square root of the number of columns.)
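Those two sources of randomness can be sketched for a single tree in base R (the square-root rule and sampling with replacement follow the description above; the object names are my own):

```r
p <- ncol(iris) - 1     # number of predictor columns (exclude Species)
n <- nrow(iris)

set.seed(42)
# Randomness #1: sample the rows with replacement (a bootstrap sample)
boot_rows <- sample(n, size = n, replace = TRUE)

# Randomness #2: at each split, consider only a random subset of the
# predictors, using the square-root rule of thumb for how many
m <- floor(sqrt(p))
split_cols <- sample(names(iris)[1:p], size = m)

m            # 2 predictors considered at this split
split_cols   # e.g., "Petal.Width" and "Sepal.Length"
```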
That’s a huge forest, with a lot of randomness! A technique like this one is useful when you have a lot of variables and relatively few observations (lots of columns and not so many rows, in other words).
R can grow a random forest for you.
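The standard tool for this is the randomForest package (which is also what rattle calls behind the scenes). Here’s a minimal sketch, using the built-in iris data as a stand-in for iris.uci:

```r
library(randomForest)   # install.packages("randomForest") if necessary

set.seed(42)
# 500 trees; at each split, consider sqrt(4) = 2 randomly chosen predictors
iris.forest <- randomForest(Species ~ ., data = iris,
                            ntree = 500, mtry = 2)

print(iris.forest)   # shows the out-of-bag error rate and confusion matrix
```

Each tree is grown from a bootstrap sample of the rows, and mtry controls how many randomly selected predictors are considered at each split.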