How to Explain the Predictive Analytical Results of R Regression

Once you create an R regression model for predictive analytics, you want to be able to explain the results of the analysis. To see some useful information about the model, type in the following code:

> summary(model)

The output provides information that you can explore if you want to tweak your model further. For now, we’ll leave the model as it is. Here are the last two lines of the output:

Multiple R-squared: 0.8741, Adjusted R-squared: 0.8633
F-statistic: 80.82 on 22 and 256 DF, p-value: < 2.2e-16

A couple of data points stand out here:

• The Multiple R-squared value tells you how well the regression line fits the data (goodness of fit). A value of 1 means a perfect fit. An R-squared value of 0.874 is good: it says that the model explains 87.4 percent of the variability in mpg.

• The p-value tells you whether the predictor variables have a statistically significant effect on the response variable. A p-value of less than (typically) 0.05 means that you can reject the null hypothesis that the predictor variables collectively have no effect on the response variable (mpg). The p-value of 2.2e-16 (that is, 2.2 × 10^-16, a decimal point followed by 15 zeros and then 22) is far smaller than 0.05, so the predictors do have an effect on the response.
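For reference, here is a minimal sketch of how a model like this might be built and how those summary numbers can be pulled out programmatically. The dataset, split, and formula are assumptions for illustration (the chapter uses its own partitioned auto-mpg data); the summary() components shown are standard:

```
# Illustrative only: fit a regression on R's built-in mtcars data
set.seed(123)                                  # reproducible partition
trainRows <- sample(nrow(mtcars), 0.7 * nrow(mtcars))
trainSet  <- mtcars[trainRows, ]               # ~70% for training
testSet   <- mtcars[-trainRows, ]              # remainder for testing

model <- lm(mpg ~ ., data = trainSet)          # mpg against all predictors
s <- summary(model)

s$r.squared                                    # Multiple R-squared
s$adj.r.squared                                # Adjusted R-squared
# p-value of the overall F-test (what the last line of summary() reports)
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)
```

The same components back every line of the printed summary, so you can feed them into later calculations instead of copying numbers by hand.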

With the model created, you can make predictions against it using the test data you partitioned from the full dataset. To use this model to predict mpg for each row in the test set, you issue the following command:

> predictions <- predict(model, testSet,
    interval="prediction", level=.95)

This is the code and output for the first six predictions:

> head(predictions)
       fit       lwr      upr
2 16.48993 10.530223 22.44964
4 18.16543 12.204615 24.12625
5 18.39992 12.402524 24.39732
6 12.09295  6.023341 18.16257
7 11.37966  5.186428 17.57289
8 11.66368  5.527497 17.79985

The output is a matrix that shows the predicted values in the fit column and the prediction interval in the lwr and upr columns — with a confidence level of 95 percent. The higher the confidence level, the wider the range, and vice versa.

The predicted value sits in the middle of the interval, so changing the confidence level doesn't change the predicted value. The first column holds the row numbers from the full dataset.
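You can see the effect of the confidence level by rerunning predict() at different levels: the fit column stays the same while lwr and upr spread out. A quick sketch, reusing the model and testSet from the text:

```
# Narrower interval at 80% confidence, wider at 99%
p80 <- predict(model, testSet, interval = "prediction", level = 0.80)
p99 <- predict(model, testSet, interval = "prediction", level = 0.99)

p80[1, "upr"] - p80[1, "lwr"]      # width of the 80% interval, first row
p99[1, "upr"] - p99[1, "lwr"]      # wider than the 80% width
all(p80[, "fit"] == p99[, "fit"])  # predicted values are unchanged: TRUE
```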

To see the actual and predicted values side by side so you can easily compare them, you can type in the following lines of code:

> comparison <- cbind(testSet$mpg, predictions[,1])
> colnames(comparison) <- c("actual", "predicted")

The first line creates a two-column matrix with the actual and predicted values. The second line changes the column names to actual and predicted. To see the first six rows of comparison side by side, type the following line of code:

> head(comparison)
  actual predicted
2     15  16.48993
4     16  18.16543
5     17  18.39992
6     15  12.09295
7     14  11.37966
8     14  11.66368

We also want to see a summary of the two columns to compare their means. This is the code and output of the summary:

> summary(comparison)
actual        predicted
Min.   :10.00   Min.   : 8.849
1st Qu.:16.00   1st Qu.:17.070
Median :21.50   Median :22.912
Mean   :22.79   Mean   :23.048
3rd Qu.:28.00   3rd Qu.:29.519
Max.   :44.30   Max.   :37.643

Next you use the mean absolute percent error (MAPE) to measure the accuracy of your regression model. The formula for mean absolute percent error is

(Σ(|Y-Y’|/|Y|)/N)*100

where Y is the actual score, Y’ is the predicted score, and N is the number of predicted scores. Plugging the values into the formula gives an error of only 10.94 percent. Here is the code and the output from the R console:

> mape <- (sum(abs(comparison[,1]-comparison[,2]) /
    abs(comparison[,1]))/nrow(comparison))*100
> mape
[1] 10.93689
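As a quick sanity check, you can apply the same formula by hand to the six rows shown earlier. Over just those six rows the error works out to about 14.4 percent; the full test set pulls it down to 10.94:

```
# Actual and predicted values from the six rows of head(comparison)
actual    <- c(15, 16, 17, 15, 14, 14)
predicted <- c(16.48993, 18.16543, 18.39992, 12.09295, 11.37966, 11.66368)

sum(abs(actual - predicted) / abs(actual)) / length(actual) * 100
# roughly 14.41
```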

The following code enables you to view the results and errors in a table view:

> mapeTable <- cbind(comparison, abs(comparison[,1]-
    comparison[,2])/comparison[,1]*100)
> colnames(mapeTable) <- c("actual", "predicted",
    "absolute percent error")
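Once the table is built, a couple of one-liners help you explore it; the per-row errors are in the third column:

```
head(mapeTable)                            # first six rows of the table
summary(mapeTable[, 3])                    # distribution of the percent errors
head(mapeTable[order(-mapeTable[, 3]), ])  # rows with the largest errors first
```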