How to Compare Two Data Samples with R - dummies

By Andrie de Vries, Joris Meys

R gives you two standard tests for comparing two groups with numerical data: the t-test with the t.test() function, and the Wilcoxon test with the wilcox.test() function. If you want to use the t.test() function, you first have to check, among other things, whether both samples are normally distributed. For the Wilcoxon test, this isn’t necessary.
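One way to run that normality check is sketched below. It assumes you use the Shapiro-Wilk test (shapiro.test() in base R) on each group of the built-in beaver2 dataset; the names groups and p_values are made up for this illustration, and other checks such as a Q-Q plot work just as well.

```r
# beaver2 ships with base R (package datasets).
# activ is 0 for inactive periods and 1 for active periods.
groups <- split(beaver2$temp, beaver2$activ)

# Run a Shapiro-Wilk test on each group and keep only the p-values.
# A small p-value suggests that the group deviates from normality.
p_values <- sapply(groups, function(x) shapiro.test(x)$p.value)
p_values
```

If either p-value is small, the Wilcoxon test is the safer choice.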

How to use R’s Wilcoxon function for non-normally distributed data

In some cases, your data deviates significantly from normality and you can’t use the t.test() function. For those cases, you have the wilcox.test() function, which you use in exactly the same way, as shown in the following example:

> wilcox.test(temp ~ activ, data=beaver2)

This gives you the following output:

 Wilcoxon rank-sum test with continuity correction
data: temp by activ
W = 15, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

Again, you get the value for the test statistic (W in this test) and a p-value. Below that, you read the alternative hypothesis, which differs a bit from the alternative hypothesis of a t-test. The Wilcoxon test looks at whether the center of your data (the location) differs between the two samples.
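If you want to see how large that location shift is, wilcox.test() can estimate it when you set the argument conf.int to TRUE. The following is a minimal sketch on the same beaver2 data; the variable name res is chosen only for this example.

```r
# Ask wilcox.test() for an estimate of the location shift
# and a confidence interval around it.
res <- wilcox.test(temp ~ activ, data = beaver2, conf.int = TRUE)

# Estimated difference in location: first group (activ == 0)
# minus second group (activ == 1).
res$estimate

# Confidence interval for that difference (95% by default).
res$conf.int
```

A negative estimate means that the first group sits lower than the second one.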

With this code, you perform the Wilcoxon rank-sum test or Mann-Whitney U test. Both tests are completely equivalent, so R doesn’t contain a separate function for Mann-Whitney’s U test.
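To see the equivalence for yourself, you can compute the Mann-Whitney U statistic by hand from the ranks and compare it with the W that wilcox.test() reports. This sketch passes the two groups as separate vectors; the names x, y, u_by_hand, and w_from_r are illustrative only.

```r
# Split the temperatures into the two groups.
x <- beaver2$temp[beaver2$activ == 0]  # inactive periods
y <- beaver2$temp[beaver2$activ == 1]  # active periods

# Rank all values together; tied values get the average rank.
ranks <- rank(c(x, y))
n_x <- length(x)

# U for the first group: its rank sum minus the smallest possible rank sum.
u_by_hand <- sum(ranks[seq_len(n_x)]) - n_x * (n_x + 1) / 2

# The statistic W as computed by R.
w_from_r <- unname(wilcox.test(x, y)$statistic)

c(by_hand = u_by_hand, from_R = w_from_r)
```

Both numbers are the same, which is why a separate function for the U test isn’t needed.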

How to use R’s t-test and Wilcoxon test to test direction

With the basic t-test and Wilcoxon test, you test whether the samples differ without specifying in which direction. Statisticians call this a two-sided test. Now imagine that you don’t want to know whether body temperature differs between active and inactive periods, but specifically whether body temperature is lower during inactive periods.

To do this, you have to specify the argument alternative in either the t.test() or wilcox.test() function. This argument can take three values:

  • By default, it has the value "two.sided", which means you want the standard two-sided test.

  • If you want to test whether the mean (or location) of the first group is lower than that of the second group, you give it the value "less".

  • If you want to test whether the mean (or location) of the first group is greater, you specify the value "greater".

If you use the formula interface for these tests, the groups are ordered in the same order as the levels of the factor you use. You have to take that into account to know which group is seen as the first group. If you give the data for both groups as separate vectors, the first vector is the first group.
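As a sketch of how this plays out on the beaver2 data: activ takes the values 0 (inactive) and 1 (active), so with the formula interface the inactive periods form the first group. The variable names below (res_formula, res_vectors, res_t, inactive, active) are made up for this example.

```r
# Formula interface: activ is treated as a factor with levels "0", then "1",
# so "less" tests whether temperature is lower when activ == 0 (inactive).
res_formula <- wilcox.test(temp ~ activ, data = beaver2, alternative = "less")

# Separate vectors: the first vector is the first group.
inactive <- beaver2$temp[beaver2$activ == 0]
active <- beaver2$temp[beaver2$activ == 1]
res_vectors <- wilcox.test(inactive, active, alternative = "less")

# Both calls test the same one-sided hypothesis.
c(formula = res_formula$p.value, vectors = res_vectors$p.value)

# The t-test takes the same argument.
res_t <- t.test(temp ~ activ, data = beaver2, alternative = "less")
res_t$p.value
```

If you swapped the two vectors, you would need "greater" instead of "less" to test the same hypothesis.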