Comparing Averages: How Situational Differences Determine Test Methods
You may wonder why there are so many tests for such a simple task as comparing averages. Well, “comparing averages” doesn’t refer to a single task; it’s a broad term that covers many situations, which differ from each other in the following ways:
Whether you’re looking at changes over time within one group of subjects or differences between groups of subjects (or both)
How many time points or groups of subjects you’re comparing
Whether or not the numeric variable you’re comparing is nearly normally distributed
Whether or not the numbers have the same spread (standard deviation) in all the groups you’re comparing
Whether you want to compensate for the possible effects of some other variable on the variable you’re comparing
These different conditions can occur in any and all combinations, so there are lots of possible situations.
Comparing the mean of a group of numbers to a hypothesized value
Comparison of an observed mean to a particular value arises in studies where, for some reason, you can’t have a control group (such as a group taking a placebo or an untreated group), so you have to compare your results to a historical control, such as information from the literature.
It also comes up when you’re dealing with data like test scores that have been scaled to have some specific mean in the general population (such as 100 for IQ scores).
Data of this kind is usually analyzed with the one-group Student t test. For non-normal data, you can use the Wilcoxon Signed-Ranks (WSR) test instead.
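As a rough illustration of what the one-group t test computes, here is a minimal Python sketch that uses only the standard library. The function name one_sample_t and the IQ scores are made up for this example; the sketch stops at the t statistic and its degrees of freedom, because turning those into a p value requires the t distribution, which a statistics package would normally supply.

```python
import math
import statistics

def one_sample_t(values, hypothesized_mean):
    """Return (t statistic, degrees of freedom) for a one-group t test."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)            # sample SD (n - 1 in the denominator)
    standard_error = sd / math.sqrt(n)       # SD of the sample mean
    t = (mean - hypothesized_mean) / standard_error
    return t, n - 1

# Hypothetical IQ scores, compared against the scaled population mean of 100.
scores = [104, 98, 110, 107, 95, 112, 101, 109]
t_stat, df = one_sample_t(scores, 100)
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```

The farther the observed mean is from the hypothesized value, relative to its standard error, the larger the absolute value of t.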
Comparing two groups of numbers
Perhaps the most common situation is one in which you’re comparing two groups of numbers. You may want to compare some proposed biomarker of a medical condition between a group of subjects known to have that condition and a group known to not have it.
Or you may want to compare some measure of drug efficacy between subjects treated with the drug and subjects treated with a placebo.
Or maybe you want to compare the blood level of some enzyme between a sample of males and a sample of females.
Such comparisons are generally handled by the famous unpaired or “independent sample” Student t test (usually just called the t test). But the t test is based on two assumptions about the distribution of data in the two groups:
The numbers are normally distributed (called the normality assumption). For non-normal data you can use the nonparametric Mann-Whitney (M-W) test, which your software may refer to as the Wilcoxon Sum-of-Ranks (WSOR) test. The WSOR was developed first but was restricted to equal-size groups; the M-W test generalized the WSOR test to work for equal or unequal group sizes.
The standard deviation (SD) is the same for both groups (called the equal-variance assumption because the variance is simply the square of the SD; thus, if the two SDs are the same, the two variances will also be the same).
If the two groups have noticeably different variances (if, for example, the SD of one group is more than 1.5 times as large as the SD of the other), then the t test may not give reliable results, especially when the groups are of unequal size. Instead, you can use a modification of the Student t test, called the Welch test (also called the Welch t test, or the unequal-variance t test).
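The difference between the two versions of the t test can be sketched in a few lines of standard-library Python. The function names (pooled_t, welch_t, sd_ratio) and the data are hypothetical, and the 1.5 cutoff is simply the rule of thumb mentioned above, not a formal test of equal variances. The equal-variance version pools the two variances into one estimate; the Welch version keeps them separate and adjusts the degrees of freedom with the Welch-Satterthwaite formula.

```python
import math
import statistics

def pooled_t(a, b):
    """Equal-variance Student t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, na + nb - 2

def welch_t(a, b):
    """Welch t statistic with Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    qa = statistics.variance(a) / na         # squared standard error, group a
    qb = statistics.variance(b) / nb         # squared standard error, group b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(qa + qb)
    df = (qa + qb) ** 2 / (qa ** 2 / (na - 1) + qb ** 2 / (nb - 1))
    return t, df

def sd_ratio(a, b):
    """Ratio of the larger SD to the smaller SD."""
    sa, sb = statistics.stdev(a), statistics.stdev(b)
    return max(sa, sb) / min(sa, sb)

# Made-up efficacy measurements for two groups of unequal size.
drug = [12.1, 15.3, 9.8, 18.4, 11.0, 20.2, 7.5, 16.7]
placebo = [10.2, 11.1, 9.7, 10.8, 10.4]

ratio = sd_ratio(drug, placebo)
t, df = welch_t(drug, placebo) if ratio > 1.5 else pooled_t(drug, placebo)
print(f"SD ratio = {ratio:.2f}, t = {t:.3f}, df = {df:.1f}")
```

When the two groups are the same size, both functions return the same t statistic and differ only in the degrees of freedom, which is one reason unequal variances matter most with unequal group sizes.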
Comparing three or more groups of numbers
Comparing three or more groups of numbers is an obvious extension of the two-group comparison in the preceding section. For example, you may compare some efficacy endpoint, like response to treatment, among three treatment groups (for example, drug A, drug B, and placebo). This kind of comparison is handled by the analysis of variance (ANOVA).
When there is one grouping variable, like treatment, you have a one-way ANOVA. If the grouping variable has three levels (like drug A, drug B, and placebo in the earlier example), it’s called a one-way, three-level ANOVA.
The null hypothesis of the one-way ANOVA is that all the groups have the same mean; the alternative hypothesis is that at least one group’s mean is different from that of at least one other group. The ANOVA produces a single p value, and if that p value is less than your chosen criterion (such as 0.05), you can conclude that something’s different somewhere.
But the ANOVA doesn’t tell you which groups are different from which others. For that, you need to follow a significant ANOVA with one or more so-called post-hoc tests, which look for differences between each pair of groups.
You can also use the ANOVA to compare just two groups; this one-way, two-level ANOVA produces exactly the same p value as the classic unpaired equal-variance Student t test.
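To make the one-way ANOVA and its link to the t test concrete, here is a small standard-library Python sketch. The function names and the data are invented for illustration. The ANOVA's F statistic is the ratio of the between-group variability to the within-group variability; with only two groups, F equals the square of the equal-variance t statistic, which is why the two tests give the same p value.

```python
import math
import statistics

def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    # Variability of each value around its own group's mean.
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g)
                    for g in groups)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

def pooled_t(a, b):
    """Equal-variance Student t statistic for two groups."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se

# Hypothetical response measurements for three treatment groups.
drug_a = [5.1, 6.3, 5.8, 7.0, 6.1]
drug_b = [6.9, 7.4, 8.1, 7.7, 6.8]
placebo = [4.8, 5.2, 5.0, 5.9, 4.6]

f3, df_b, df_w = one_way_anova_f(drug_a, drug_b, placebo)
print(f"three-level ANOVA: F = {f3:.3f} on {df_b} and {df_w} df")

# Two-level ANOVA versus the equal-variance t test on the same two groups.
f2, _, _ = one_way_anova_f(drug_a, placebo)
t = pooled_t(drug_a, placebo)
print(f"two-level ANOVA: F = {f2:.4f}, t squared = {t ** 2:.4f}")
```

As with the t statistics, converting F into a p value requires the F distribution, which a statistics package would normally provide.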