How to Create an R Classification Predictive Analytics Model - dummies

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

You want to create a predictive analytics model that you can evaluate using known outcomes. To do that, split the seeds dataset into two sets: one for training the model and one for testing the model. A 70/30 split between training and testing datasets will suffice. The next two lines of code calculate and store the sizes of each dataset:

> trainSize <- round(nrow(seeds) * 0.7)
> testSize <- nrow(seeds) - trainSize

To output the values, type in the name of the variable that you used to store the value and press Enter. Here is the output:

> trainSize
[1] 147
> testSize
[1] 63

This code only determines the sizes of the training and testing datasets; you haven’t actually created the sets yet. Also, you don’t want to simply take the first 147 observations as the training set and the last 63 observations as the test set. That would create a bad model because the seeds dataset is ordered by the label column, so a split by position would leave some seed types underrepresented in the training set.
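To see why a split by position goes wrong on a label-ordered dataset, here is a minimal sketch. It assumes R's built-in iris data as a stand-in for seeds (iris is also sorted by its label column, Species); the names naiveTrain and naiveTest are illustrative and not part of the original example.

```r
# Stand-in data: iris ships with R and is sorted by its label column (Species).
data(iris)

# Same 70/30 size calculation as above, applied to the stand-in dataset.
trainSize <- round(nrow(iris) * 0.7)

# Naive split by position: first rows for training, remaining rows for testing.
naiveTrain <- iris[seq_len(trainSize), ]
naiveTest  <- iris[-seq_len(trainSize), ]

# Count how many observations of each class land in each set.
naiveTrainCounts <- table(naiveTrain$Species)
naiveTestCounts  <- table(naiveTest$Species)

print(naiveTrainCounts)  # virginica is almost absent from training
print(naiveTestCounts)   # the test set contains only virginica
```

A model trained on naiveTrain would see very few examples of the last class, and the test set would measure performance on that class alone.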

Thus you have to make both the training set and the test set representative of the entire dataset. One way to do that is to create the training set from a random selection of the entire dataset.

Additionally, you want to make this test reproducible so you can learn from the same example. You can do that by setting the seed for the random number generator so you get the same “random” training set every time, like this:

> set.seed(123)
> training_indices <- sample(seq_len(nrow(seeds)), size=trainSize)
> trainSet <- seeds[training_indices, ]
> testSet <- seeds[-training_indices, ]
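If you want to confirm that the split behaves as described, a quick check might look like the sketch below. It assumes the built-in iris dataset as a stand-in for seeds, and the names first_draw and second_draw are hypothetical, chosen only for this illustration.

```r
# Stand-in data: the built-in iris dataset plays the role of seeds here.
data(iris)
trainSize <- round(nrow(iris) * 0.7)

# Draw the same "random" row indices twice by resetting the seed each time.
set.seed(123)
first_draw <- sample(seq_len(nrow(iris)), size = trainSize)
set.seed(123)
second_draw <- sample(seq_len(nrow(iris)), size = trainSize)
identical(first_draw, second_draw)  # TRUE: same seed, same sample

# Build the two sets from the sampled indices.
trainSet <- iris[first_draw, ]
testSet  <- iris[-first_draw, ]

# The sets should not share rows, and together they should cover the data.
length(intersect(rownames(trainSet), rownames(testSet)))  # 0
nrow(trainSet) + nrow(testSet) == nrow(iris)              # TRUE

# Class counts per set; a random split usually keeps every class in both.
table(trainSet$Species)
table(testSet$Species)
```

The negative index in iris[-first_draw, ] is what guarantees the test set holds exactly the rows that were not sampled for training.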

The training set you get from this code contains 147 observations along with an outcome (seedType) of each observation. When you create the model, you will tell the algorithm which variable is the outcome. The classification algorithm uses those outcomes to train the model by looking at the relationships between the predictor variables (any of the seven attributes) and the label (seedType).

The test set contains the rest of the data, that is, all data not included in the training set. Notice that the test set also includes the label (seedType). When you use the predict function (from the model) with the test set, it ignores the label and only uses the predictor variables, as long as the column names are the same as they are in the training set.

The party package is one of several packages in R that create decision trees. (Other common decision-tree packages include rpart, tree, and randomForest.) The next step is to use the package to create a decision-tree model, using seedType as the target variable and all the other variables as predictor variables. To begin, install the package and load it into your R session.

Type in the following lines of code to install and load the party package:

> install.packages("party")
> library(party)

You are now ready to train the model. Type in the following line of code:

> model <- ctree(seedType~., data=trainSet)
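Because the test set keeps its known outcomes, you can compare them with the model's predictions. The following is a rough sketch of that evaluation, not the authors' own code: it assumes the party package is installed and uses the built-in iris dataset as a stand-in for seeds, with Species in place of seedType. The names testPrediction, confusion, and accuracy are illustrative.

```r
# Assumes the party package is installed; iris stands in for seeds.
library(party)
data(iris)

# Reproducible 70/30 random split, mirroring the steps above.
set.seed(123)
trainSize <- round(nrow(iris) * 0.7)
training_indices <- sample(seq_len(nrow(iris)), size = trainSize)
trainSet <- iris[training_indices, ]
testSet  <- iris[-training_indices, ]

# Train a conditional inference tree with Species as the outcome.
model <- ctree(Species ~ ., data = trainSet)

# Predict the held-out rows and compare with the known labels.
testPrediction <- predict(model, newdata = testSet)
confusion <- table(predicted = testPrediction, actual = testSet$Species)
print(confusion)

# Share of test observations whose predicted class matches the label.
accuracy <- mean(as.character(testPrediction) == as.character(testSet$Species))
print(accuracy)
```

In the confusion table, counts on the diagonal are correct predictions; counts off the diagonal show which classes the model confuses with each other.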

To make predictions with new data, you simply use the predict function with a list of the seven attribute values. The following code does that:

> newPrediction <- predict(model, list(area=11,
perimeter=13, compactness=0.855, length=5,
width=2.8, asymmetry=6.5, length2=5))

Here is the code that displays the new prediction, along with its output:

> newPrediction
  [1] 3
Levels: 1 2 3

The prediction was seed type 3, which is not surprising because the values were deliberately chosen to be close to those of observation #165.
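Besides the predicted label, it can help to see how confident the tree is. The sketch below is one way to do that, under stated assumptions: the party package is installed, the built-in iris dataset stands in for seeds, and newObservation is a hypothetical single row made up for illustration. It relies on the type="prob" option of party's predict function.

```r
# Assumes the party package is installed; iris stands in for seeds.
library(party)
data(iris)

# Reproducible training sample, as in the steps above.
set.seed(123)
training_indices <- sample(seq_len(nrow(iris)), size = round(nrow(iris) * 0.7))
model <- ctree(Species ~ ., data = iris[training_indices, ])

# One new observation; column names must match the training predictors.
newObservation <- data.frame(Sepal.Length = 5.1, Sepal.Width = 3.5,
                             Petal.Length = 1.4, Petal.Width = 0.2)

# Predicted class label (the default response type).
predict(model, newdata = newObservation)

# Estimated class probabilities; party returns a list with one vector per row.
classProbabilities <- predict(model, newdata = newObservation, type = "prob")
print(classProbabilities[[1]])
```

The probabilities come from the class proportions in the leaf node that the new observation falls into, so they sum to 1 across the classes.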