How to Create a Predictive Analytics Model with R Regression

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

You want to create a predictive analytics model that you can evaluate by using known outcomes. To do that, split the dataset into two sets: one for training the model and one for testing it. A 70/30 split between the training and test datasets will suffice. The next two lines of code calculate and store the size of each set:

> trainSize <- round(nrow(autos) * 0.7)
> testSize <- nrow(autos) - trainSize

To output the values, type in the name of the variable used to store the value and press Enter. Here is the output:

> trainSize
[1] 279
> testSize
[1] 119

This code only determines the sizes of the training and test datasets; you still haven’t created those sets. You also don’t want to simply call the first 279 observations the training set and the last 119 observations the test set. That would create a bad model because the dataset appears to be ordered. Specifically, the modelYear column runs from smallest to largest, so the test set would contain only the newest cars.
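The effect of splitting an ordered dataset by position can be sketched with a small made-up example. The toy data frame below is a hypothetical stand-in for the autos data (its values, and the assumption that weight drifts downward with modelYear, are invented for illustration); it compares how far apart the training and test sets end up under a positional split versus a random one.

```r
# Hypothetical stand-in for the autos data: 100 rows sorted by modelYear,
# with weight drifting downward over time (an assumption for illustration).
set.seed(123)
toy <- data.frame(modelYear = rep(70:79, each = 10))
toy$weight <- 4000 - 150 * (toy$modelYear - 70) + rnorm(nrow(toy), sd = 100)

n_train <- round(nrow(toy) * 0.7)

# Positional split: first rows for training, remaining rows for testing
naive_train <- toy[seq_len(n_train), ]
naive_test  <- toy[-seq_len(n_train), ]

# Random split of the same sizes
idx <- sample(seq_len(nrow(toy)), size = n_train)
random_train <- toy[idx, ]
random_test  <- toy[-idx, ]

# Gap between the training and test mean weight under each scheme
naive_gap  <- abs(mean(naive_train$weight)  - mean(naive_test$weight))
random_gap <- abs(mean(random_train$weight) - mean(random_test$weight))
print(c(naive = naive_gap, random = random_gap))

# The positional test set only ever sees the latest model years
print(range(naive_test$modelYear))
```

In this sketch the positional test set covers only the last few model years, so a model trained on the earlier rows would be evaluated on cars unlike any it has seen.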

From examining the data, you can see that most of the heavier, eight-cylinder, larger-displacement, greater-horsepower autos sit at the top of the dataset. From this observation, without having to run any algorithms on the data, you can already tell that (in general for this dataset) the older cars, compared with the newer ones:

  • Are heavier

  • Have eight cylinders

  • Have larger displacement

  • Have greater horsepower

Okay, obviously many people know something about automobiles, so a guess as to what the correlations are won’t be too farfetched after you see the data. Someone with a lot of automobile knowledge may have already known this without even looking at the data.

This is just a simple example of a domain (cars) that many people can relate to. If this were data about cancer, however, most people would not immediately understand what each attribute means.

This is where a domain expert and a data modeler are vital to the modeling process. Domain experts may have the best knowledge of which attributes may be the most (or least) important — and how attributes correlate with each other.

They can suggest to the data modeler which variables to experiment with. They can give bigger weights to more important attributes and/or smaller weights to attributes of least importance (or remove them altogether).

So you have to make a training dataset and a test dataset that are truly representative of the entire set. One way to do so is to create the training set from a random selection of the entire dataset. Additionally, you want to make this test reproducible so you can learn from the same example.

To do that, set the seed for the random number generator so that you get the same “random” training set every time. The following code does that task:

> set.seed(123)
> training_indices <- sample(seq_len(nrow(autos)), size=trainSize)
> trainSet <- autos[training_indices, ]
> testSet <- autos[-training_indices, ]
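A quick sanity check on a split like this might look like the following sketch. Because the autos data frame isn't reproduced here, R's built-in mtcars dataset is used as a stand-in (an assumption for illustration; the name autos_demo is hypothetical). It confirms that the two sets partition the data and that resetting the seed reproduces the same indices.

```r
# mtcars (built into R) stands in for the autos data frame here
autos_demo <- mtcars
trainSize <- round(nrow(autos_demo) * 0.7)

set.seed(123)
training_indices <- sample(seq_len(nrow(autos_demo)), size = trainSize)
trainSet <- autos_demo[training_indices, ]
testSet  <- autos_demo[-training_indices, ]

# Every row should land in exactly one of the two sets
sizes_add_up <- nrow(trainSet) + nrow(testSet) == nrow(autos_demo)
shared_rows  <- intersect(rownames(trainSet), rownames(testSet))
print(sizes_add_up)
print(length(shared_rows))

# Resetting the seed and sampling again yields the same indices
set.seed(123)
repeat_indices <- sample(seq_len(nrow(autos_demo)), size = trainSize)
same_split <- identical(training_indices, repeat_indices)
print(same_split)
```

Negative indexing (autos_demo[-training_indices, ]) drops the sampled rows, which is why the test set is exactly the complement of the training set.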

The training set contains 279 observations, along with the outcome (mpg) of each observation. The regression algorithm uses the outcome to train the model by looking at the relationships between the predictor variables (any of the seven attributes) and the response variable (mpg).

The test set contains the rest of the data (that is, the portion not included in the training set). You should notice that the test set also includes the response (mpg) variable.

When you use the predict function (from the model) with the test set, it ignores the response variable and only uses the predictor variables as long as the column names are the same as those in the training set.

To create a linear regression model that uses the mpg attribute as the response variable and all the other variables as predictor variables, type in the following line of code:

> model <- lm(formula=mpg ~ . , data=trainSet)
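The earlier point about predict ignoring the response column can be checked with a short sketch. As before, mtcars stands in for the autos data (an assumption; autos_demo and the other helper names are hypothetical). The sketch fits a model on the training rows, predicts on the test rows with and without the mpg column, and then uses the known test outcomes to compute a root-mean-square error.

```r
# mtcars (built into R) stands in for the autos data frame here
autos_demo <- mtcars
set.seed(123)
idx <- sample(seq_len(nrow(autos_demo)),
              size = round(nrow(autos_demo) * 0.7))
trainSet <- autos_demo[idx, ]
testSet  <- autos_demo[-idx, ]

# mpg is the response; the dot means "every other column in trainSet"
model <- lm(mpg ~ ., data = trainSet)

# Predict once with the response column present and once with it removed
predictor_cols <- setdiff(names(testSet), "mpg")
pred_with_mpg    <- predict(model, newdata = testSet)
pred_without_mpg <- predict(model, newdata = testSet[, predictor_cols])

# The predictions should match, since predict() uses only the predictors
same_predictions <- isTRUE(all.equal(pred_with_mpg, pred_without_mpg))
print(same_predictions)

# Compare predictions with the known outcomes in the test set
rmse <- sqrt(mean((testSet$mpg - pred_with_mpg)^2))
print(rmse)
```

Keeping mpg in the test set is what makes the final comparison possible: the model never sees those values when predicting, but you need them to measure how far off the predictions are.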