P-Values#
The p-value is one of the most commonly used — and most commonly misunderstood — concepts in statistics. A correct understanding is essential for interpreting any statistical test.
Definition#
The p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.
Formally, for an observed value t_obs of a test statistic T:

p = P(T at least as extreme as t_obs | H₀ is true)

For a right-tailed test this is P(T ≥ t_obs | H₀); for a two-tailed test, deviations in both directions count as extreme.
A small p-value means: The observed result would be unlikely under the null hypothesis. This provides evidence against the null hypothesis.
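As a minimal sketch, a two-tailed p-value for a z-statistic can be computed from the normal survival function (the z-value 1.96 here is chosen because it sits exactly at the conventional 5% boundary):

```python
from scipy.stats import norm

# Hypothetical observed z-statistic from a two-tailed z-test
z_obs = 1.96

# Two-tailed p-value: probability of a result at least this extreme
# in EITHER direction, assuming H0 is true. sf(x) = 1 - cdf(x).
p_two_tailed = 2 * norm.sf(abs(z_obs))

print(round(p_two_tailed, 3))  # → 0.05
```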
The Significance Level α#
The significance level (alpha, α) is a threshold set in advance. In most disciplines, the convention is α = 0.05.
The decision rule is:
- p < α → Result is statistically significant → Null hypothesis is rejected
- p ≥ α → Result is not significant → Null hypothesis cannot be rejected
Example: t-test with p = 0.03
A t-test comparing two groups yields p = 0.03.
Correct interpretation: Assuming there is no difference between the groups (H₀), we would obtain a result this extreme or more extreme in only 3% of cases. Since 0.03 < 0.05, the result is considered statistically significant.
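A sketch of such a t-test with simulated data (the group means, standard deviation, and sample sizes below are invented for illustration, so the resulting p-value will not be exactly 0.03):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

# Hypothetical data: two groups with a modest true difference in means
group_a = rng.normal(loc=100.0, scale=15.0, size=50)
group_b = rng.normal(loc=107.0, scale=15.0, size=50)

# Independent two-sample t-test (two-tailed by default)
t_stat, p_value = ttest_ind(group_a, group_b)

alpha = 0.05
print(f"p = {p_value:.3f}, significant at alpha = 0.05: {p_value < alpha}")
```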
Incorrect interpretation: "There is a 97% probability that the effect is real." — This is not correct!
Different Alpha Levels#
| Level | Label | Usage |
|---|---|---|
| α = 0.10 | Marginally significant | Exploratory studies |
| α = 0.05 | Significant | Standard in most fields |
| α = 0.01 | Highly significant | Stricter criteria |
| α = 0.001 | Very highly significant | Very conservative tests |
One-Tailed vs. Two-Tailed Tests#
- Two-tailed test: Tests whether a difference exists in either direction. Recommended by default.
- One-tailed test: Tests only one direction (e.g., "Group A is better than Group B"). For the same data, the one-tailed p-value is half the two-tailed one (for a symmetric test distribution, when the observed effect lies in the hypothesized direction).
Important: One-tailed tests should only be used when the direction of the effect was specified before data collection.
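The halving can be seen directly for a symmetric test statistic (the z-value here is a hypothetical observation in the hypothesized direction):

```python
from scipy.stats import norm

z_obs = 2.0  # hypothetical observed z-statistic

# Two-tailed: extreme in either direction counts
p_two = 2 * norm.sf(z_obs)
# One-tailed: only the hypothesized direction counts
p_one = norm.sf(z_obs)

print(f"two-tailed: {p_two:.4f}, one-tailed: {p_one:.4f}")
```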
Multiple Testing#
When multiple tests are performed simultaneously, the probability of at least one false positive increases:
With 20 tests at α = 0.05, the probability of at least one error is already 64%.
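The 64% figure follows from the complement rule for independent tests at level α:

```python
alpha = 0.05
m = 20  # number of independent tests

# P(at least one false positive) = 1 - P(no false positive in any test)
p_at_least_one = 1 - (1 - alpha) ** m

print(round(p_at_least_one, 2))  # → 0.64
```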
Corrections:
- Bonferroni: α_adjusted = α / m for m tests — simple but conservative
- Holm-Bonferroni: Stepwise correction, less conservative
- Benjamini-Hochberg: Controls the False Discovery Rate (FDR)
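A minimal sketch of the first two corrections, using hypothetical p-values chosen so that the two methods disagree (Holm rescues one test that Bonferroni rejects):

```python
# Hypothetical p-values from m = 5 simultaneous tests
p_values = [0.001, 0.011, 0.020, 0.040, 0.300]
alpha = 0.05
m = len(p_values)

# Bonferroni: every p-value is compared against the single threshold alpha / m
bonferroni_significant = [p < alpha / m for p in p_values]

def holm_significant(p_vals, alpha=0.05):
    """Holm-Bonferroni: step down through the sorted p-values with
    progressively looser thresholds alpha / (m - rank)."""
    m = len(p_vals)
    order = sorted(range(m), key=lambda i: p_vals[i])
    significant = [False] * m
    for rank, idx in enumerate(order):
        if p_vals[idx] < alpha / (m - rank):
            significant[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return significant

holm_results = holm_significant(p_values, alpha)

print(bonferroni_significant)  # → [True, False, False, False, False]
print(holm_results)            # → [True, True, False, False, False]
```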
P-Value and Effect Size#
A significant p-value says nothing about the practical significance of an effect.
Example: Large sample, small effect
With n = 10,000 per group, a t-test finds a significant difference (p < 0.001) of 0.5 points on a 100-point scale. Statistically significant — but practically irrelevant.
This is why effect size should always be reported in addition to the p-value.
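A sketch with simulated data matching the example above: a true difference of 0.5 points with σ = 10 gives a tiny standardized effect (Cohen's d = 0.05), yet with n = 10,000 per group the test typically comes out significant. All numbers are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: a 0.5-point true difference on a 100-point scale
group_a = rng.normal(loc=70.0, scale=10.0, size=n)
group_b = rng.normal(loc=70.5, scale=10.0, size=n)

t_stat, p_value = ttest_ind(group_a, group_b)

# Cohen's d: mean difference divided by the pooled standard deviation
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
d = (group_b.mean() - group_a.mean()) / pooled_sd

print(f"p = {p_value:.4f}, Cohen's d = {d:.2f}")  # tiny d despite small p
```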
Common Misconceptions#
"The p-value is the probability that the null hypothesis is true." Wrong. The p-value says nothing about the probability of the hypothesis. It gives the probability of data at least as extreme as those observed, assuming the null hypothesis is true.
"p = 0.05 means there is a 95% chance the effect is real." Wrong. The p-value is not a probability for the hypothesis, but for the data.
"A non-significant result proves that no effect exists." Wrong. A p > 0.05 only means that the evidence is not sufficient to reject the null hypothesis. The effect might still exist (lack of power).
"p = 0.049 and p = 0.051 are fundamentally different." Wrong. The difference is minimal. The threshold at 0.05 is a convention, not a law of nature. Interpretation should not hinge on a single cutoff.
"The smaller the p-value, the larger the effect." Wrong. The p-value depends on the effect size and the sample size. A tiny effect can be highly significant with a huge sample.
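The misconceptions above can be probed by simulation: when the null hypothesis is true, p-values are uniformly distributed, so roughly α of all tests come out "significant" purely by chance (a sketch with simulated data; sample sizes and the number of simulations are arbitrary choices):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_sims = 2000
false_positives = 0

# Both groups come from the SAME distribution, so H0 is true by construction
for _ in range(n_sims):
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

rate = false_positives / n_sims
print(f"false positive rate: {rate:.3f}")  # close to alpha = 0.05
```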
Further Reading#
- Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd.
- Wasserstein, R. L. & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133.
- Field, A. (2018). Discovering Statistics Using IBM SPSS Statistics (5th ed.). SAGE.