# Intraclass Correlation Coefficient (ICC)
The Intraclass Correlation Coefficient (ICC) is the standard measure of agreement between two or more raters when measurements are on a continuous scale. While Cohen's and Fleiss' Kappa are designed for categorical data, the ICC captures how consistently raters score on a numerical scale. It is based on an analysis of variance decomposition and indicates what proportion of total variance is due to actual differences between subjects.
## When to Use
- You have continuous (metric) measurements, e.g., scores on a 1–100 scale
- Two or more raters evaluate the same subjects
- You want to assess consistency or absolute agreement (different ICC forms target each)
- You need a measure that is comparable across different study designs
- You want to distinguish between different sources of error (raters, subjects, residual)
## Assumptions
- Continuous (metric) measurements
- Independent subjects
- Raters representative of a larger population (for ICC(2))
- Approximately normally distributed residuals
## ICC Variants
There are several ICC forms that differ in their assumptions. The most important ones are:
- ICC(1,1): Each subject is rated by a different, randomly drawn set of raters (one-way random effects). Rarely used in practice.
- ICC(2,1): Each subject is rated by all raters, who are considered a random sample from a larger population. This is the most common variant – it accounts for both systematic differences between raters and random error.
- ICC(3,1): Each subject is rated by all raters, but the raters are the only ones of interest (fixed effect). Systematic differences between raters are partialed out. Appropriate when results should only apply to these specific raters.
The suffixes ",1" and ",k" indicate whether reliability is reported for a single measurement or for the mean across k raters.
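The step from a ",1" form to the corresponding ",k" form follows the Spearman–Brown formula. A minimal sketch in plain Python (the numeric values are illustrative, not from this document's example):

```python
def spearman_brown(icc_single: float, k: int) -> float:
    """Step a single-rater ICC up to the reliability of the mean of k raters."""
    return k * icc_single / (1 + (k - 1) * icc_single)

# Illustrative: a modest single-rater ICC of 0.29 with k = 4 raters
# steps up to roughly 0.62 for the average of the four ratings.
print(round(spearman_brown(0.29, 4), 2))
```

This is why averaging over more raters always yields higher reliability than a single measurement, as long as the single-rater ICC is positive.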
## Formula
The basic form of the ICC is:

$$\mathrm{ICC} = \frac{MS_B - MS_W}{MS_B + (k - 1)\,MS_W}$$

where $MS_B$ is the mean square between subjects, $MS_W$ is the mean square within subjects, and $k$ is the number of raters. For ICC(2,1) with absolute agreement, an additional term accounts for the rater effect:

$$\mathrm{ICC}(2,1) = \frac{MS_B - MS_E}{MS_B + (k - 1)\,MS_E + \frac{k}{n}\,(MS_R - MS_E)}$$

where $MS_R$ is the mean square for raters, $MS_E$ is the residual mean square, and $n$ is the number of subjects.
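To make the variance decomposition concrete, here is a minimal NumPy sketch of ICC(2,1) (two-way random effects, absolute agreement, single measurement), checked against the widely reproduced example data from Shrout and Fleiss (1979):

```python
import numpy as np

def icc2_1(x: np.ndarray) -> float:
    """ICC(2,1) from a (subjects x raters) matrix via two-way ANOVA mean squares."""
    n, k = x.shape
    grand = x.mean()
    ss_subjects = k * np.sum((x.mean(axis=1) - grand) ** 2)        # between-subject SS
    ss_raters = n * np.sum((x.mean(axis=0) - grand) ** 2)          # between-rater SS
    ss_error = np.sum((x - grand) ** 2) - ss_subjects - ss_raters  # residual SS
    ms_b = ss_subjects / (n - 1)           # mean square between subjects
    ms_r = ss_raters / (k - 1)             # mean square between raters
    ms_e = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (ms_b - ms_e) / (ms_b + (k - 1) * ms_e + k * (ms_r - ms_e) / n)

# Example data from Shrout & Fleiss (1979): 6 targets rated by 4 judges.
ratings = np.array([
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
], dtype=float)
print(round(icc2_1(ratings), 2))  # approximately 0.29
```

Note how strongly the systematic judge differences (a large $MS_R$) pull this value down compared to a consistency-oriented form such as ICC(3,1).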
## Example
**Practical Example: Rating Presentations**
Three judges rate the quality of 20 student presentations on a scale from 1 to 100. Each judge rates every presentation independently.
| Presentation | Judge 1 | Judge 2 | Judge 3 |
|---|---|---|---|
| 1 | 72 | 68 | 75 |
| 2 | 85 | 82 | 88 |
| 3 | 45 | 50 | 42 |
| ... | ... | ... | ... |
| 20 | 91 | 87 | 93 |
An analysis of variance yields the mean squares for subjects, judges, and residual error; substituting them into the formula gives an ICC(2,1) of approximately 0.84, indicating good reliability. The judges rate quite consistently – the differences in scores largely reflect genuine quality differences among the presentations.
## Effect Size
The ICC is itself an effect size measure. Widely used benchmarks come from Koo and Li (2016):
| ICC Value | Interpretation |
|---|---|
| < 0.50 | Poor |
| 0.50–0.75 | Moderate |
| 0.75–0.90 | Good |
| > 0.90 | Excellent |
Important: The ICC should always be reported alongside its 95% confidence interval, as the point estimate alone can be misleading, especially with small sample sizes.
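One pragmatic way to obtain such an interval is a percentile bootstrap over subjects. This is a sketch, not the exact F-based interval from Shrout and Fleiss; the data below are simulated to resemble the presentations example (judge biases and noise levels are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def icc2_1(x):
    """ICC(2,1) from a (subjects x raters) matrix via two-way ANOVA mean squares."""
    n, k = x.shape
    grand = x.mean()
    ms_b = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)
    ms_r = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)
    ss_err = np.sum((x - grand) ** 2) - (n - 1) * ms_b - (k - 1) * ms_r
    ms_e = ss_err / ((n - 1) * (k - 1))
    return (ms_b - ms_e) / (ms_b + (k - 1) * ms_e + k * (ms_r - ms_e) / n)

# Simulated data: 20 presentations, 3 judges with small biases plus noise.
true_quality = rng.uniform(40, 95, size=20)
judge_bias = np.array([0.0, -3.0, 2.0])
ratings = true_quality[:, None] + judge_bias + rng.normal(0, 5, size=(20, 3))

# Percentile bootstrap: resample subjects (rows) with replacement.
boot = np.array([
    icc2_1(ratings[rng.integers(0, 20, size=20)])
    for _ in range(2000)
])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"ICC(2,1) = {icc2_1(ratings):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With only 20 subjects the interval is noticeably wide, which is exactly why the point estimate alone can mislead.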
## Further Reading
- Shrout, P. E. & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.
- Koo, T. K. & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155–163.
- McGraw, K. O. & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46.