How to Evaluate the Differences in Your Data with R - dummies

By Andrie de Vries, Joris Meys

To check the data model that you created with ANOVA (analysis of variance), you can use R’s summary() function on the model object like this:

> summary(AOVModel)
            Df Sum Sq Mean Sq F value Pr(>F)
spray        5   2669   533.8    34.7 <2e-16 ***
Residuals   66   1015    15.4
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R prints the analysis of variance table, which tells you whether each term in the model explains a significant portion of the variance in your data. Here, the tiny p-value for spray means that at least one spray differs from the others, but the table tells you nothing about which sprays differ. For that, you need to dig a bit deeper.
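The article doesn't show how AOVModel was created. The numbers in the table match R's built-in InsectSprays data set, so the following is a minimal sketch that assumes the model was fitted on that data with count as the response and spray as the only factor:

```r
# Assumption: the model comes from the built-in InsectSprays data set,
# which records insect counts on fields treated with six sprays (A to F).
AOVModel <- aov(count ~ spray, data = InsectSprays)

# Print the analysis of variance table for the fitted model.
summary(AOVModel)

# The residual degrees of freedom: 72 observations minus 6 group means.
df.residual(AOVModel)
```

If your own model uses a different data set or formula, only the aov() call changes; summary() works the same way.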

How to check data model tables

With the model.tables() function, you can take a look at the results for the individual levels of the factors. The function can create two different tables: one with the estimated mean for each group, and one with each group’s difference from the overall mean.

To know how much effect every spray had, you use the following code:

> model.tables(AOVModel, type='effects')
Tables of effects
     A      B      C      D      E      F
 5.000  5.833 -7.417 -4.583 -6.000  7.167

Here you see that, for example, spray E resulted, on average, in six bugs fewer than the average over all fields. On the other hand, on fields where spray A was used, the farmers found, on average, five bugs more compared to the overall mean.
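To see where these effects come from, you can compute them by hand. This sketch assumes the data is R's built-in InsectSprays data set (the article doesn't name it); groupMeans and grandMean are names made up for the illustration:

```r
# Assumption: the data behind AOVModel is the built-in InsectSprays data set.
# Mean insect count per spray.
groupMeans <- tapply(InsectSprays$count, InsectSprays$spray, mean)

# Mean insect count over all fields.
grandMean <- mean(InsectSprays$count)

# Each effect is the group mean minus the overall mean.
round(groupMeans - grandMean, 3)
```

Because every spray was used on the same number of fields, these hand-computed differences should agree with the effects table from model.tables().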

To get the modeled means per group and the overall mean, just use the argument value type='means' instead of type='effects'.
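As a short sketch of that call, again assuming the model was fitted on the built-in InsectSprays data set (Means is a name chosen for the illustration):

```r
# Assumption: AOVModel is fitted on the built-in InsectSprays data set.
AOVModel <- aov(count ~ spray, data = InsectSprays)

# Ask for the table of means instead of the table of effects.
Means <- model.tables(AOVModel, type = 'means')

# Prints the grand mean followed by the mean count for each spray.
Means

# The per-spray means are also available for further calculations.
Means$tables$spray
```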

How to look at individual differences in data

A farmer probably wouldn’t consider buying spray A, but what about spray D? Although sprays E and C seem to be better, they also can be a lot more expensive. To test whether the pairwise differences between the sprays are significant, you use Tukey’s Honest Significant Difference (HSD) test. The TukeyHSD() function allows you to do that very easily, like this:

> Comparisons <- TukeyHSD(AOVModel)

The Comparisons object now contains a list where every element is named after one factor in the model. In the example, you have only one element, called spray. This element contains, for every combination of sprays, the following:

  • The difference between the means.

  • The lower and upper bounds of the 95 percent confidence interval around that mean difference.

  • The p-value that tells you whether this difference is significantly different from zero. This p-value is adjusted using the method of Tukey (hence, the column name p adj).
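You can check that structure yourself. The following sketch assumes the model was fitted on the built-in InsectSprays data set, which the article doesn't state explicitly:

```r
# Assumption: AOVModel is fitted on the built-in InsectSprays data set.
AOVModel <- aov(count ~ spray, data = InsectSprays)
Comparisons <- TukeyHSD(AOVModel)

# One list element per factor in the model.
names(Comparisons)

# The columns: difference, lower bound, upper bound, adjusted p-value.
colnames(Comparisons$spray)

# One row per pair of sprays; six sprays give choose(6, 2) pairs.
nrow(Comparisons$spray)
```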

You can extract all that information with standard matrix indexing. For example, you get the information about the difference between D and C like this:

> Comparisons$spray['D-C',]
      diff        lwr        upr      p adj
 2.8333333 -1.8660752  7.5327418  0.4920707

That difference doesn’t look impressive, if you ask Tukey: the confidence interval includes zero, and the adjusted p-value of 0.49 is far above 0.05.
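The same indexing lets you pick out every pair that does pass the test. This is a sketch rather than part of the original example; it assumes the built-in InsectSprays data set and a cutoff of 0.05, and the names Pairs and Significant are made up for the illustration:

```r
# Assumption: AOVModel is fitted on the built-in InsectSprays data set.
AOVModel <- aov(count ~ spray, data = InsectSprays)
Comparisons <- TukeyHSD(AOVModel)

# The matrix with one row per pair of sprays.
Pairs <- Comparisons$spray

# Keep only the rows whose adjusted p-value is below the assumed 0.05 cutoff.
# drop = FALSE keeps the result a matrix even if only one row remains.
Significant <- Pairs[Pairs[, 'p adj'] < 0.05, , drop = FALSE]

# The names of the pairs that differ significantly.
rownames(Significant)
```

The pair D-C shouldn't appear in that list, which matches the adjusted p-value shown above.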

How to plot the differences

The TukeyHSD object has another nice feature: It can be plotted. Don’t bother looking for a Help page of the plot function — all you find is one sentence: “There is a plot method.” But it definitely works! Try it out like this:

> plot(Comparisons, las=1)

The resulting plot shows one horizontal line for each pair of sprays. Each line represents the confidence interval around the mean difference between the two groups. Whenever the confidence interval doesn’t include zero (the vertical line), the difference between the two groups is significant.

You can use some of the graphical parameters to make the plot more readable. Specifically, the las parameter is useful here. By setting it to 1, you make sure all axis labels are printed horizontally so you can actually read them.
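Horizontal labels take up more room, so with longer level names they may run off the left edge of the plot. One possible fix, sketched here under the assumption that the model comes from the built-in InsectSprays data set, is to widen the left margin with par() before plotting. The margin values are only a starting point, not a recommendation from the original example:

```r
# Assumption: AOVModel is fitted on the built-in InsectSprays data set.
AOVModel <- aov(count ~ spray, data = InsectSprays)
Comparisons <- TukeyHSD(AOVModel)

# Widen the left margin (second value) and remember the old settings.
# The order of the margins is bottom, left, top, right.
oldPar <- par(mar = c(5, 6, 4, 2))

# Draw the confidence intervals with horizontal axis labels.
plot(Comparisons, las = 1)

# Restore the previous graphical settings.
par(oldPar)
```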