How to Evaluate Linear Data with R - dummies

By Andrie de Vries, Joris Meys

Naturally, R provides a whole set of tests and measures, both to evaluate how well your model fits your data and to check the model assumptions. Again, the overview presented here is far from complete, but it gives you an idea of what’s possible and a starting point for looking deeper into the issue.

How to summarize the model

For models constructed with aov(), the summary() function immediately returns the F test. For lm() models, the output is slightly different. Take a look:

> Model.summary <- summary(Model)
> Model.summary
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
    Min      1Q  Median      3Q     Max
-4.5432 -2.3647 -0.1252  1.4096  6.8727
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10

That’s a whole lot of useful information. Here you see the following:

  • The distribution of the residuals, which gives you a first idea about how well the assumptions of a linear model hold

  • The coefficients accompanied by a t-test, telling you whether each coefficient differs significantly from zero

  • The goodness-of-fit measures R² and the adjusted R²

  • The F-test that gives you an idea about whether your model explains a significant portion of the variance in your data
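The pieces listed above can also be pulled out of the summary object one at a time. The following is a minimal sketch, assuming that Model was fitted with the formula shown in the output above (mpg ~ wt on the built-in mtcars data set); the element names are those of the list that summary() returns for an lm() model.

```r
# Assumption: this reproduces the Model object used in the text,
# based on the formula printed in the summary output.
Model <- lm(mpg ~ wt, data = mtcars)
Model.summary <- summary(Model)

# The summary of an lm() model is a list; its elements hold the
# individual measures that the printed output displays.
Model.summary$r.squared      # multiple R-squared
Model.summary$adj.r.squared  # adjusted R-squared
Model.summary$sigma          # residual standard error
Model.summary$fstatistic     # F value with numerator and denominator df
```

Working with the list elements rather than the printed text is handy when you want to reuse a value, such as R², in a later calculation.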

You can use the coef() function to extract a matrix with the estimates, standard errors, t-values, and p-values for the coefficients from the summary object, like this:

> coef(Model.summary)
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 37.285126   1.877627 19.857575 8.241799e-19
wt          -5.344472   0.559101 -9.559044 1.293959e-10
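To see how the columns of this matrix relate to each other, you can recompute the t-values and p-values by hand. This is only an illustrative sketch, again assuming Model is the mpg ~ wt regression on mtcars; the names coefs, t.manual, and p.manual are made up for the example.

```r
# Assumption: same model as in the text.
Model <- lm(mpg ~ wt, data = mtcars)
coefs <- coef(summary(Model))

# Pull a single cell by row and column name.
coefs["wt", "Pr(>|t|)"]

# Each t-value is the estimate divided by its standard error.
t.manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]
all.equal(t.manual, coefs[, "t value"])

# Each p-value is the two-sided tail probability of a t distribution
# with the residual degrees of freedom of the model.
p.manual <- 2 * pt(abs(t.manual), df = df.residual(Model),
                   lower.tail = FALSE)
all.equal(p.manual, coefs[, "Pr(>|t|)"])
```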

If these terms are unfamiliar, look them up in a good source about modeling. For an extensive introduction to applying and interpreting linear models correctly, check out Applied Linear Statistical Models, 5th Edition, by Michael Kutner et al. (McGraw-Hill/Irwin).

How to test the impact of model terms

To get an analysis of variance table, like the one summary() produces for an aov() model, you simply pass the lm() model object to the anova() function, like this:

> Model.anova <- anova(Model)
> Model.anova
Analysis of Variance Table
Response: mpg
          Df Sum Sq Mean Sq F value    Pr(>F)
wt         1 847.73  847.73  91.375 1.294e-10 ***
Residuals 30 278.32    9.28
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here, the resulting object is a data frame that allows you to extract any value from that table using the subsetting and indexing tools. For example, to get the p-value, you can do the following:

> Model.anova['wt','Pr(>F)']
[1] 1.293959e-10

You can interpret this value as the probability of seeing an F value at least this large if adding the variable wt to the model made no real difference. The low p-value here indicates that the weight of a car (wt) explains a significant portion of the difference in mileage (mpg) between cars. This shouldn’t come as a surprise; a heavier car does, indeed, need more power to drag its own weight around.
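If you want to check where that p-value comes from, you can recompute it from the F value and the degrees of freedom in the same table. The snippet below is a rough sketch that assumes Model is the mpg ~ wt regression on mtcars; F.value, df.num, df.den, and p.manual are names chosen only for this illustration.

```r
# Assumption: same model as in the text.
Model <- lm(mpg ~ wt, data = mtcars)
Model.anova <- anova(Model)

# Extract the ingredients of the F test by row and column name.
F.value <- Model.anova["wt", "F value"]
df.num  <- Model.anova["wt", "Df"]         # degrees of freedom for wt
df.den  <- Model.anova["Residuals", "Df"]  # residual degrees of freedom

# The p-value is the upper tail probability of the F distribution.
p.manual <- pf(F.value, df.num, df.den, lower.tail = FALSE)
all.equal(p.manual, Model.anova["wt", "Pr(>F)"])
```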

You can also use the anova() function to compare different models, and many modeling packages provide methods for it. You can find examples on the related Help pages, such as ?anova.lm and ?anova.glm.
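As a small illustration of such a comparison, you could test whether adding a second predictor improves on the model used so far. This is a sketch under assumptions: Model is taken to be the mpg ~ wt regression on mtcars, and Model2, which adds horsepower (hp), is a hypothetical larger model that doesn't appear in the text above.

```r
# Assumption: same base model as in the text.
Model <- lm(mpg ~ wt, data = mtcars)

# Hypothetical larger model, nested around the first one.
Model2 <- lm(mpg ~ wt + hp, data = mtcars)

# With two nested models, anova() tests whether the extra term
# reduces the residual sum of squares significantly.
Model.comparison <- anova(Model, Model2)
Model.comparison

# The result is again a data frame, so the p-value of the comparison
# sits in the second row.
Model.comparison[2, "Pr(>F)"]
```

The models have to be nested and fitted to the same data for this comparison to make sense.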