How to Explain the Predictive Analytical Results of R Regression

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

Once you create an R regression model for predictive analytics, you want to be able to explain the results of the analysis. To see some useful information about the model, type in the following code:

> summary(model)

The output provides information that you can explore if you want to tweak your model further. For now, we’ll leave the model as it is. Here are the last two lines of the output:

Multiple R-squared: 0.8741, Adjusted R-squared: 0.8633 
F-statistic: 80.82 on 22 and 256 DF, p-value: < 2.2e-16

A couple of data points stand out here:

  • The Multiple R-squared value tells you how well the model fits the data (goodness of fit). A value of 1 means a perfect fit. So an R-squared value of 0.874 is good; it says that 87.4 percent of the variability in mpg is explained by the model.

  • The p-value tells you whether the predictor variables, taken together, have a statistically significant effect on the response variable. A p-value of less than (typically) 0.05 means that you can reject the null hypothesis that the predictor variables collectively have no effect on the response variable (mpg). The reported p-value is less than 2.2e-16 (that is, 2.2 × 10^-16: a decimal point followed by 15 zeroes and then 22), which is much smaller than 0.05, so you can conclude that the predictors collectively have an effect on the response.
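If you'd rather pull these figures out of the summary object than read them off the printout, something like the following sketch should work. It assumes the built-in mtcars data and a small two-predictor model as stand-ins; neither is the dataset or the model used in this article.

```r
# Assumption: mtcars and this two-predictor model are illustrative stand-ins,
# not the article's dataset or its 22-predictor model.
model <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(model)

r2     <- s$r.squared       # Multiple R-squared
adj_r2 <- s$adj.r.squared   # Adjusted R-squared

# fstatistic is a named vector: the F value, numerator DF, denominator DF.
f <- s$fstatistic

# The overall p-value is the upper-tail probability of the F distribution.
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

print(c(r.squared = r2, adj.r.squared = adj_r2, p.value = p_value))
```

The printed summary only reports "< 2.2e-16" for very small p-values, so computing the value with pf() is one way to see the actual number.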

With the model created, you can make predictions against it with the test data you partitioned from the full dataset. To use this model to predict the mpg for each row in the test set, you issue the following command:

> predictions <- predict(model, testSet,
interval="predict", level=.95)

This is the code and output of the first six predictions:

> head(predictions)
       fit       lwr      upr
2 16.48993 10.530223 22.44964
4 18.16543 12.204615 24.12625
5 18.39992 12.402524 24.39732
6 12.09295  6.023341 18.16257
7 11.37966  5.186428 17.57289
8 11.66368  5.527497 17.79985

The output is a matrix that shows the predicted values in the fit column and the lower and upper bounds of the prediction interval in the lwr and upr columns, at a confidence level of 95 percent. The higher the confidence level, the wider the interval, and vice versa.

The predicted value is in the middle of the range, so changing the confidence level doesn’t change the predicted value. The unlabeled leftmost column holds the row names, which are the row numbers from the full dataset.
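To check the relationship between confidence level and interval width for yourself, you could run a sketch like this one. It assumes the built-in mtcars data, an arbitrary split, and a small model; the split and the names trainRows, p90, and p99 are hypothetical and not part of this article's code.

```r
# Assumption: mtcars and this split stand in for the article's training and
# test sets; the model is a small illustrative one.
trainRows <- 1:24
model   <- lm(mpg ~ wt + hp, data = mtcars[trainRows, ])
testSet <- mtcars[-trainRows, ]

# Same model and test data, two different confidence levels.
p90 <- predict(model, testSet, interval = "prediction", level = 0.90)
p99 <- predict(model, testSet, interval = "prediction", level = 0.99)

# The point predictions do not depend on the level.
same_fit <- all.equal(p90[, "fit"], p99[, "fit"])

# Interval width (upr - lwr) at each level.
width90 <- p90[, "upr"] - p90[, "lwr"]
width99 <- p99[, "upr"] - p99[, "lwr"]

# The fit sits at the midpoint of each interval.
midpoint99 <- (p99[, "lwr"] + p99[, "upr"]) / 2

print(cbind(fit = p90[, "fit"], width90, width99))
```

Every row should show a wider interval at 99 percent than at 90 percent, with the fit column unchanged.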

To see the actual and predicted values side by side so you can easily compare them, you can type in the following lines of code:

> comparison <- cbind(testSet$mpg, predictions[,1])
> colnames(comparison) <- c("actual", "predicted")

The first line creates a two-column matrix with the actual and predicted values. The second line changes the column names to actual and predicted. Type in the following line of code to display the first six rows of comparison:

> head(comparison) 
   actual predicted
2     15  16.48993
4     16  18.16543
5     17  18.39992
6     15  12.09295
7     14  11.37966
8     14  11.66368

You also want to see a summary of the two columns so you can compare their means. This is the code and output of the summary:

> summary(comparison)
    actual        predicted  
Min.   :10.00   Min.   : 8.849 
1st Qu.:16.00   1st Qu.:17.070 
Median :21.50   Median :22.912 
Mean   :22.79   Mean   :23.048 
3rd Qu.:28.00   3rd Qu.:29.519 
Max.   :44.30   Max.   :37.643

Next, you use the mean absolute percent error (MAPE) to measure the accuracy of the regression model. The formula for mean absolute percent error is

$$\mathrm{MAPE} = \frac{100}{N} \sum_{i=1}^{N} \frac{|Y_i - Y'_i|}{|Y_i|}$$

where Y is the actual score, Y’ is the predicted score, and N is the number of predicted scores. After plugging the values into the formula, you get an error of only 10.94 percent. Here is the code and the output from the R console:

> mape <- (sum(abs(comparison[,1]-comparison[,2]) / abs(comparison[,1]))/nrow(comparison))*100
> mape
[1] 10.93689

The following code enables you to view the results and errors in a table view:

> mapeTable <- cbind(comparison, abs(comparison[,1]-comparison[,2])/comparison[,1]*100)
> colnames(mapeTable)[3] <- "absolute percent error"
> head(mapeTable)
    actual predicted     absolute percent error
2      15  16.48993              9.932889
4      16  18.16543             13.533952
5      17  18.39992              8.234840
6      15  12.09295             19.380309
7      14  11.37966             18.716708
8      14  11.66368             16.688031

Averaging the third column of the table gives you the same mean absolute percent error as before:

> sum(mapeTable[,3])/nrow(comparison)
[1] 10.93689
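If you expect to compute this error for more than one model, you might wrap the calculation in a small function. The sketch below is one way to do that; the function name computeMape and its input checks are hypothetical and not part of this article's code. The usage example feeds it only the six actual and predicted pairs printed by head(comparison), so its result differs from the 10.94 percent obtained on the full test set.

```r
# Hypothetical helper: the name and the input checks are illustrative only.
computeMape <- function(actual, predicted) {
  # MAPE is undefined when an actual value is zero.
  stopifnot(length(actual) == length(predicted), all(actual != 0))
  mean(abs(actual - predicted) / abs(actual)) * 100
}

# The six actual/predicted pairs shown by head(comparison) in the article.
actual    <- c(15, 16, 17, 15, 14, 14)
predicted <- c(16.48993, 18.16543, 18.39992, 12.09295, 11.37966, 11.66368)

mapeFirstSix <- computeMape(actual, predicted)
print(mapeFirstSix)
```

Using mean() here is equivalent to dividing the sum by the number of rows, as the earlier commands do.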