Alpha Correction#
Every time you run a statistical test, you take a 5% risk of falsely finding a significant result (at alpha = 0.05). That sounds acceptable — for a single test. But what happens when you run 10, 20, or 50 tests at the same time? Those risks compound, and suddenly the probability of at least one false alarm is alarmingly high. This is exactly the problem that alpha correction addresses.
The Multiple Comparisons Problem#
Worked Example: Alpha Inflation
You run 10 independent tests with alpha = 0.05. The probability of not making a Type I error on a single test is 0.95.
The probability of making no error across all 10 tests: 0.95^10 ≈ 0.599.
So the probability of at least one Type I error is: 1 − 0.95^10 ≈ 0.401.
Instead of a 5% error risk, you now face 40%! With 20 tests it would be 64%.
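The calculation above can be reproduced in a few lines. This is a minimal sketch; the `fwer` helper name is just for illustration:

```python
def fwer(alpha: float, m: int) -> float:
    """Familywise error rate for m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m

print(round(fwer(0.05, 1), 3))   # 0.05  (single test)
print(round(fwer(0.05, 10), 3))  # 0.401 (about 40%)
print(round(fwer(0.05, 20), 3))  # 0.642 (about 64%)
```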
This growing error risk is called the familywise error rate (FWER) — the probability of committing at least one Type I error within a family of tests.
Correction Methods#
Bonferroni Correction#
The simplest and best-known method. You divide your alpha level by the number of tests m: alpha_corrected = alpha / m.
With 10 tests and alpha = 0.05, the corrected threshold becomes: 0.05/10 = 0.005. A result is only significant if p < 0.005.
Pros and Cons
Pros:
- Easy to compute and explain
- Works regardless of test dependencies
- Widely recognized and accepted
Cons:
- Very conservative, especially with many comparisons
- Low statistical power — real effects are easily missed
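Applying the Bonferroni threshold is a one-liner per p-value. A minimal sketch (the function name and the example p-values are illustrative):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant at the Bonferroni-corrected threshold alpha/m."""
    m = len(p_values)
    threshold = alpha / m
    return [p < threshold for p in p_values]

# 10 tests, so the corrected threshold is 0.05/10 = 0.005
pvals = [0.003, 0.012, 0.030, 0.180, 0.001, 0.004, 0.200, 0.060, 0.015, 0.049]
print(bonferroni_significant(pvals))
```

Only the p-values below 0.005 (here 0.003, 0.001, and 0.004) survive — including several that would have been significant at the uncorrected 0.05 level.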
Holm-Bonferroni (Step-Down)#
An improved version of the Bonferroni correction that is less conservative while still controlling the FWER.
How it works:
- Sort all p-values from smallest to largest
- Compare the smallest p-value with alpha/m
- If significant, compare the second smallest with alpha/(m-1)
- Continue until a p-value fails to reach significance
- All remaining p-values are declared non-significant
Example: Holm Correction with 4 Comparisons
Four p-values (sorted): 0.003, 0.012, 0.030, 0.180
| Rank | p-value | Threshold (alpha/(m-rank+1)) | Significant? |
|---|---|---|---|
| 1 | 0.003 | 0.05/4 = 0.0125 | Yes |
| 2 | 0.012 | 0.05/3 = 0.0167 | Yes |
| 3 | 0.030 | 0.05/2 = 0.025 | No (Stop!) |
| 4 | 0.180 | 0.05/1 = 0.05 | No |
Result: The first two comparisons remain significant, the third does not (even though p = .030 < .05).
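The step-down procedure above can be sketched in a few lines of Python (the function name is illustrative; results are returned in the original input order):

```python
def holm_significant(p_values, alpha=0.05):
    """Holm step-down: test sorted p-values against alpha/(m - rank),
    stopping at the first one that fails."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, idx in enumerate(order):        # rank 0 .. m-1
        threshold = alpha / (m - rank)        # alpha/m, alpha/(m-1), ...
        if p_values[idx] < threshold:
            significant[idx] = True
        else:
            break                             # all larger p-values fail too
    return significant

print(holm_significant([0.003, 0.012, 0.030, 0.180]))
# → [True, True, False, False], matching the table above
```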
Benjamini-Hochberg (FDR Correction)#
This approach does not control the FWER but instead the False Discovery Rate (FDR) — the expected proportion of false discoveries among all significant results. This sounds more lenient, but in many situations it is the more sensible strategy.
How it works:
- Sort all p-values from smallest to largest
- Compare the largest p-value with alpha
- The second largest with alpha x (m-1)/m
- In general: p(i) with alpha x i/m
- The largest p-value meeting its criterion, and all smaller ones, are declared significant
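The step-up logic can be sketched as follows (a minimal illustration, not a production implementation; results are returned in the original input order):

```python
def benjamini_hochberg_significant(p_values, alpha=0.05):
    """BH step-up: find the largest rank i with p(i) <= alpha * i / m,
    then declare that p-value and all smaller ones significant."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):   # rank 1 .. m
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank                    # keep the largest passing rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        significant[idx] = rank <= cutoff_rank
    return significant

print(benjamini_hochberg_significant([0.003, 0.012, 0.030, 0.180]))
# → [True, True, True, False]
```

Note the contrast with the Holm example above: on the same four p-values, BH also declares p = 0.030 significant (0.030 ≤ 0.05 × 3/4 = 0.0375), illustrating its greater power.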
When FDR Instead of FWER?
- FWER (Bonferroni, Holm): When even a single false positive has serious consequences (e.g., clinical trials, genomic studies with follow-up experiments)
- FDR (Benjamini-Hochberg): When you are running an exploratory analysis and a certain proportion of false discoveries is acceptable (e.g., exploratory gene expression studies, screening studies)
Comparison of Methods#
| Method | Controls | Conservativeness | Best Application |
|---|---|---|---|
| Bonferroni | FWER | Very conservative | Few comparisons, simple presentation |
| Holm-Bonferroni | FWER | Moderately conservative | Standard method, almost always better than Bonferroni |
| Benjamini-Hochberg | FDR | Liberal | Exploratory analyses, many tests |
| No correction | — | — | Only with a single pre-planned test |
When Is Correction Necessary?#
Not every situation requires an alpha correction. Here is some guidance:
Correction recommended:
- Multiple post-hoc comparisons after an ANOVA
- Many correlations in a correlation matrix
- Subgroup analyses without pre-specified hypotheses
Correction usually not needed:
- A single pre-planned comparison (primary endpoint)
- Orthogonal contrasts (they are independent of each other)
- Confirmatory study with one primary test
Common Misconceptions#
- "Bonferroni is always the right choice." — Holm-Bonferroni controls the FWER just as well but has more power. There is rarely a reason to prefer the simple Bonferroni correction.
- "Without correction all my results are invalid." — Alpha correction concerns the overall error rate. Individual tests with p = .001 are hardly due to chance even without correction.
- "FDR correction is not rigorous." — On the contrary: for exploratory analyses with many tests, FDR correction is often the more methodologically sound choice because it preserves more power.
- "I will just test fewer hypotheses so I do not need correction." — That is actually a legitimate strategy. Few, pre-planned comparisons reduce the problem.
Practical Tips#
- Plan ahead: Define before data collection which comparisons you will make
- Use Holm instead of Bonferroni: Same FWER control, more power
- Report transparently: State how many tests were performed and which correction was used
- Look at effect sizes: A significant p-value after correction says nothing about practical relevance
Further Reading
- Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B, 57, 289–300.
- Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70.
- Bender, R. & Lange, S. (2001). Adjusting for multiple testing — when and how? Journal of Clinical Epidemiology, 54, 343–349.