P-Values#
The p-value is one of the most commonly used — and most commonly misunderstood — concepts in statistics. A correct understanding is essential for interpreting any statistical test.
Definition#
The p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.
Formally, for an observed value t_obs of a test statistic T:

p = P(T at least as extreme as t_obs | H₀ is true)

For a right-tailed test this is P(T ≥ t_obs | H₀); for a two-tailed test, deviations in both directions count as extreme.
A small p-value means: The observed result would be unlikely under the null hypothesis. This provides evidence against the null hypothesis.
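As a minimal sketch, a two-tailed p-value for a z-statistic can be computed from the normal survival function (the z-value 1.96 here is chosen because it sits exactly at the conventional 5% boundary):

```python
from scipy.stats import norm

# Hypothetical observed z-statistic from a two-tailed z-test
z_obs = 1.96

# Two-tailed p-value: probability of a result at least this extreme
# in EITHER direction, assuming H0 is true. sf(x) = 1 - cdf(x).
p_two_tailed = 2 * norm.sf(abs(z_obs))

print(round(p_two_tailed, 3))  # → 0.05
```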
The Significance Level α#
The significance level (alpha, α) is a threshold set in advance. In most disciplines, the convention is α = 0.05.
The decision rule is:
- p < α → Result is statistically significant → Null hypothesis is rejected
- p ≥ α → Result is not significant → Null hypothesis cannot be rejected
Example: t-test with p = 0.03
A t-test comparing two groups yields p = 0.03.
Correct interpretation: Assuming there is no difference between the groups (H₀), we would obtain a result this extreme or more extreme in only 3% of cases. Since 0.03 < 0.05, the result is considered statistically significant.
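A sketch of such a t-test with simulated data (the group means, standard deviation, and sample sizes below are invented for illustration, so the resulting p-value will not be exactly 0.03):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

# Hypothetical data: two groups with a modest true difference in means
group_a = rng.normal(loc=100.0, scale=15.0, size=50)
group_b = rng.normal(loc=107.0, scale=15.0, size=50)

# Independent two-sample t-test (two-tailed by default)
t_stat, p_value = ttest_ind(group_a, group_b)

alpha = 0.05
print(f"p = {p_value:.3f}, significant at alpha = 0.05: {p_value < alpha}")
```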
Incorrect interpretation: "There is a 97% probability that the effect is real." — This is not correct!
Different Alpha Levels#
| Level | Label | Usage |
|---|---|---|
| α = 0.10 | Marginally significant | Exploratory studies |
| α = 0.05 | Significant | Standard in most fields |
| α = 0.01 | Highly significant | Stricter criteria |
| α = 0.001 | Very highly significant | Very conservative tests |
One-Tailed vs. Two-Tailed Tests#
- Two-tailed test: Tests whether a difference exists in either direction. Recommended by default.
- One-tailed test: Tests only one direction (e.g., "Group A is better than Group B"). For the same data, the one-tailed p-value is half the two-tailed one (for a symmetric test distribution, when the observed effect lies in the hypothesized direction).
Important: One-tailed tests should only be used when the direction of the effect was specified before data collection.
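The halving can be seen directly for a symmetric test statistic (the z-value here is a hypothetical observation in the hypothesized direction):

```python
from scipy.stats import norm

z_obs = 2.0  # hypothetical observed z-statistic

# Two-tailed: extreme in either direction counts
p_two = 2 * norm.sf(z_obs)
# One-tailed: only the hypothesized direction counts
p_one = norm.sf(z_obs)

print(f"two-tailed: {p_two:.4f}, one-tailed: {p_one:.4f}")
```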
Multiple Testing#
When multiple tests are performed simultaneously, the probability of at least one false positive increases:
With 20 tests at α = 0.05, the probability of at least one error is already 64%.
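The 64% figure follows from the complement rule for independent tests at level α:

```python
alpha = 0.05
m = 20  # number of independent tests

# P(at least one false positive) = 1 - P(no false positive in any test)
p_at_least_one = 1 - (1 - alpha) ** m

print(round(p_at_least_one, 2))  # → 0.64
```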
Corrections:
- Bonferroni: α_adjusted = α / m for m tests — simple but conservative
- Holm-Bonferroni: Stepwise correction, less conservative
- Benjamini-Hochberg: Controls the False Discovery Rate (FDR)
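A minimal sketch of the first two corrections, using hypothetical p-values chosen so that the two methods disagree (Holm rescues one test that Bonferroni rejects):

```python
# Hypothetical p-values from m = 5 simultaneous tests
p_values = [0.001, 0.011, 0.020, 0.040, 0.300]
alpha = 0.05
m = len(p_values)

# Bonferroni: every p-value is compared against the single threshold alpha / m
bonferroni_significant = [p < alpha / m for p in p_values]

def holm_significant(p_vals, alpha=0.05):
    """Holm-Bonferroni: step down through the sorted p-values with
    progressively looser thresholds alpha / (m - rank)."""
    m = len(p_vals)
    order = sorted(range(m), key=lambda i: p_vals[i])
    significant = [False] * m
    for rank, idx in enumerate(order):
        if p_vals[idx] < alpha / (m - rank):
            significant[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return significant

holm_results = holm_significant(p_values, alpha)

print(bonferroni_significant)  # → [True, False, False, False, False]
print(holm_results)            # → [True, True, False, False, False]
```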
P-Value and Effect Size#
A significant p-value says nothing about the practical significance of an effect.
Example: Large sample, small effect
With n = 10,000 per group, a t-test finds a significant difference (p < 0.001) of 0.5 points on a 100-point scale. Statistically significant — but practically irrelevant.
This is why effect size should always be reported in addition to the p-value.
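A sketch with simulated data matching the example above: a true difference of 0.5 points with σ = 10 gives a tiny standardized effect (Cohen's d = 0.05), yet with n = 10,000 per group the test typically comes out significant. All numbers are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: a 0.5-point true difference on a 100-point scale
group_a = rng.normal(loc=70.0, scale=10.0, size=n)
group_b = rng.normal(loc=70.5, scale=10.0, size=n)

t_stat, p_value = ttest_ind(group_a, group_b)

# Cohen's d: mean difference divided by the pooled standard deviation
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
d = (group_b.mean() - group_a.mean()) / pooled_sd

print(f"p = {p_value:.4f}, Cohen's d = {d:.2f}")  # tiny d despite small p
```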
Common Misconceptions#
"The p-value is the probability that the null hypothesis is true." Wrong. The p-value says nothing about the probability of the hypothesis. It gives the probability of data at least as extreme as those observed, assuming the null hypothesis is true.
"p = 0.05 means there is a 95% chance the effect is real." Wrong. The p-value is not a probability for the hypothesis, but for the data.
"A non-significant result proves that no effect exists." Wrong. A p > 0.05 only means that the evidence is not sufficient to reject the null hypothesis. The effect might still exist (lack of power).
"p = 0.049 and p = 0.051 are fundamentally different." Wrong. The difference is minimal. The threshold at 0.05 is a convention, not a law of nature. Interpretation should not hinge on a single cutoff.
"The smaller the p-value, the larger the effect." Wrong. The p-value depends on the effect size and the sample size. A tiny effect can be highly significant with a huge sample.
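The misconceptions above can be probed by simulation: when the null hypothesis is true, p-values are uniformly distributed, so roughly α of all tests come out "significant" purely by chance (a sketch with simulated data; sample sizes and the number of simulations are arbitrary choices):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_sims = 2000
false_positives = 0

# Both groups come from the SAME distribution, so H0 is true by construction
for _ in range(n_sims):
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

rate = false_positives / n_sims
print(f"false positive rate: {rate:.3f}")  # close to alpha = 0.05
```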
Further Reading#
- Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd.
- Wasserstein, R. L. & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133.
- Field, A. (2018). Discovering Statistics Using IBM SPSS Statistics (5th ed.). SAGE.