How to Handle Multiplicity in Clinical Trial Data

By John Pezzullo

Every time you perform a statistical significance test, you run a chance of being fooled by random fluctuations into thinking that some real effect is present in your data when, in fact, none exists.

This scenario is called a Type I error. When you say that you require p < 0.05 for significance, you’re testing at the 0.05 (or 5 percent) alpha level or saying that you want to limit your Type I error rate to 5 percent. But that 5 percent error rate applies to each and every statistical test you run.

The more analyses you perform on a data set, the higher your overall alpha level (the chance that at least one test comes out falsely significant) climbs. If the tests are independent and no real effects exist, that chance is 1 - (1 - 0.05)^n for n tests: about 10 percent for two tests and about 87 percent for 40 tests. This is referred to as the problem of multiplicity, or as Type I error inflation.
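The arithmetic behind those percentages can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this example, and the formula assumes the tests are independent and that no real effect exists in any of them.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one falsely significant result among n_tests.

    Assumes independent tests, each run at the same alpha level, with
    every null hypothesis actually true.
    """
    # Probability that all tests correctly stay nonsignificant is
    # (1 - alpha) ** n_tests; the complement is the overall alpha.
    return 1.0 - (1.0 - alpha) ** n_tests


for n in (1, 2, 10, 40):
    print(f"{n:>2} tests: overall alpha = {familywise_error_rate(n):.3f}")
```

When the tests are correlated (as different endpoints in the same patients usually are), the true overall alpha is generally lower than this formula gives, so treat the result as a rough upper guide rather than an exact figure.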

Some statistical methods involving multiple comparisons (like post-hoc tests following an ANOVA for comparing several groups) incorporate a built-in adjustment to keep the overall alpha at only 5 percent across all comparisons. But when you’re testing different hypotheses, like comparing different variables at different time points between different groups, it’s up to you to decide what kind of alpha control strategy (if any) you want to implement.

You have several choices, including the following:

  • Don’t control for multiplicity and accept the likelihood that some of your “significant” findings will be falsely significant. This strategy is often used with hypotheses related to secondary and exploratory objectives; the protocol usually states that no final inferences will be made from these exploratory tests. Any “significant” results will be considered only “signals” of possible real effects and will have to be confirmed in subsequent studies before any final conclusions are drawn.

  • Control the alpha level across only the most important hypotheses. If you have two co-primary objectives, you can control alpha across the tests of those two objectives.

    You can control alpha to 5 percent (or to any level you want) across a set of n hypothesis tests in several ways; following are some popular ones:

    • The Bonferroni adjustment: Test each hypothesis at the 0.05/n alpha level. So to control overall alpha to 0.05 across two primary endpoints, you need p < 0.025 for significance when testing each one.

    • A hierarchical testing strategy: Rank your endpoints in descending order of importance. Test the most important one first, and if it gives p < 0.05, conclude that the effect is real. Then test the next most important one, again using p < 0.05 for significance.

      Continue until you get a nonsignificant result (p ≥ 0.05); then stop testing (or consider all further tests to be only exploratory and don’t draw any formal conclusions about them).

    • Controlling the false discovery rate (FDR): This approach has become popular in recent years to deal with large-scale multiplicity, which arises in areas like genomic testing and digital image analysis that may involve many thousands of tests (such as one per gene or one per pixel) instead of just a few.

      Instead of trying to avoid even a single false conclusion of significance (as the Bonferroni and other classic alpha control methods do), you simply want to control the proportion of your significant results that are falsely positive, limiting that false discovery rate to some reasonable fraction of the tests you declare significant. These positive results can then be tested in a follow-up study.
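The three strategies above can be compared side by side with a short Python sketch. The function names and the four p values are invented for illustration. The article doesn’t name a specific FDR method; the sketch uses the Benjamini-Hochberg step-up procedure, which is one common choice and assumes independent (or positively dependent) tests.

```python
def bonferroni(p_values, alpha=0.05):
    """Declare each test significant only if p < alpha / n."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


def hierarchical(p_values_by_importance, alpha=0.05):
    """Test in the given order at full alpha; stop at the first failure."""
    decisions = []
    still_testing = True
    for p in p_values_by_importance:
        # Once one test is nonsignificant, every later one is too.
        still_testing = still_testing and p < alpha
        decisions.append(still_testing)
    return decisions


def benjamini_hochberg(p_values, q=0.05):
    """Control the false discovery rate at level q (Benjamini-Hochberg)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p value satisfies p <= k * q / n.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / n:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is declared significant.
    rejected = [False] * n
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p values, listed in descending order of endpoint importance.
p_values = [0.004, 0.030, 0.200, 0.020]

print(bonferroni(p_values))          # [True, False, False, False]
print(hierarchical(p_values))        # [True, True, False, False]
print(benjamini_hochberg(p_values))  # [True, True, False, True]
```

With these made-up numbers, Bonferroni is the most conservative, the hierarchical strategy never reaches the fourth endpoint because the third one fails, and the FDR approach accepts the most findings in exchange for tolerating some false positives among them.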