
Intraclass Correlation Coefficient (ICC)

The Intraclass Correlation Coefficient (ICC) measures inter-rater reliability for continuous measurements, quantifying how consistent ratings are across raters.

The Intraclass Correlation Coefficient (ICC) is the standard measure of agreement between two or more raters when measurements are on a continuous scale. While Cohen's and Fleiss' Kappa are designed for categorical data, the ICC captures how consistently raters score on a numerical scale. It is based on an analysis of variance decomposition and indicates what proportion of total variance is due to actual differences between subjects.

When to Use#

  • You have continuous (metric) measurements, e.g., scores on a 1–100 scale
  • Two or more raters evaluate the same subjects
  • You want to assess both consistency and absolute agreement
  • You need a measure that is comparable across different study designs
  • You want to distinguish between different sources of error (raters, subjects, residual)

Assumptions#

  • Continuous (metric) measurements
  • Independent subjects
  • Raters representative of a larger population (for ICC(2))
  • Approximately normally distributed residuals

ICC Variants#

There are several ICC forms that differ in their assumptions. The most important ones are:

  • ICC(1,1): Each subject is rated by a random subset of raters. Rarely used in practice.
  • ICC(2,1): Each subject is rated by all raters, who are considered a random sample from a larger population. This is the most common variant; it accounts for both systematic differences between raters and random error.
  • ICC(3,1): Each subject is rated by all raters, but the raters are the only ones of interest (fixed effect). Systematic differences between raters are partialed out. Appropriate when results should only apply to these specific raters.

The suffixes ",1" and ",k" indicate whether reliability is reported for a single measurement or for the mean across k raters.

Formula#

The basic one-way form of the ICC (ICC(1,1)) is:

\text{ICC} = \frac{MS_{between} - MS_{within}}{MS_{between} + (k - 1) \cdot MS_{within}}

where MS_{between} is the mean square between subjects, MS_{within} is the mean square within subjects, and k is the number of raters. For ICC(2,1), an additional term accounts for the rater effect:

\text{ICC}(2,1) = \frac{MS_{between} - MS_{error}}{MS_{between} + (k - 1) \cdot MS_{error} + \frac{k}{n}(MS_{raters} - MS_{error})}
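The mean squares in this formula come from a standard two-way ANOVA decomposition of an n-subjects-by-k-raters matrix. A minimal NumPy sketch (the function name `icc2_1` is ours, not from any library):

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1), absolute agreement, from an n-subjects x k-raters matrix."""
    n, k = ratings.shape
    grand_mean = ratings.mean()
    subject_means = ratings.mean(axis=1)  # row means
    rater_means = ratings.mean(axis=0)    # column means

    # Two-way ANOVA sums of squares (one observation per cell)
    ss_total = ((ratings - grand_mean) ** 2).sum()
    ss_between = k * ((subject_means - grand_mean) ** 2).sum()
    ss_raters = n * ((rater_means - grand_mean) ** 2).sum()
    ss_error = ss_total - ss_between - ss_raters

    ms_between = ss_between / (n - 1)
    ms_raters = ss_raters / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # ICC(2,1) as defined in the formula above
    return (ms_between - ms_error) / (
        ms_between + (k - 1) * ms_error + (k / n) * (ms_raters - ms_error)
    )

# Sanity check: identical ratings from every rater give ICC = 1
perfect = np.array([[1, 1], [2, 2], [3, 3]], dtype=float)
print(round(icc2_1(perfect), 3))  # 1.0
```

In practice, an established implementation such as `pingouin.intraclass_corr` is preferable, since it also reports confidence intervals; the sketch above is only meant to make the variance decomposition concrete.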

Example#

Practical Example: Rating Presentations

Three judges rate the quality of 20 student presentations on a scale from 1 to 100. Each judge rates every presentation independently.

| Presentation | Judge 1 | Judge 2 | Judge 3 |
|---|---|---|---|
| 1 | 72 | 68 | 75 |
| 2 | 85 | 82 | 88 |
| 3 | 45 | 50 | 42 |
| ... | ... | ... | ... |
| 20 | 91 | 87 | 93 |

An analysis of variance yields MS_{between} = 420, MS_{error} = 25, and MS_{raters} = 30. With k = 3 judges and n = 20 presentations, the ICC(2,1) is approximately 0.84, indicating good reliability. The judges rate quite consistently: the differences in scores largely reflect genuine quality differences among the presentations.
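Plugging the example's mean squares directly into the ICC(2,1) formula reproduces this value:

```python
# Mean squares from the worked example
ms_between, ms_error, ms_raters = 420.0, 25.0, 30.0
k, n = 3, 20  # judges, presentations

icc = (ms_between - ms_error) / (
    ms_between + (k - 1) * ms_error + (k / n) * (ms_raters - ms_error)
)
print(round(icc, 3))  # 0.839
```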

Effect Size#

The ICC is itself an effect size measure. The widely used benchmarks from Koo and Li (2016) are:

| ICC Value | Interpretation |
|---|---|
| < 0.50 | Poor |
| 0.50 – 0.75 | Moderate |
| 0.75 – 0.90 | Good |
| > 0.90 | Excellent |

Important: The ICC should always be reported alongside its 95% confidence interval, as the point estimate alone can be misleading, especially with small sample sizes.
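The benchmark table can be encoded as a small helper. Note that the boundary handling at exactly 0.50, 0.75, and 0.90 is our own choice, since the table leaves those edge cases ambiguous:

```python
def interpret_icc(icc: float) -> str:
    """Map an ICC point estimate to the Koo & Li (2016) benchmark labels."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"

print(interpret_icc(0.84))  # good
```

As the note above stresses, such a label should be applied to the whole confidence interval, not just the point estimate: an ICC of 0.84 with a CI of [0.55, 0.94] spans "moderate" to "excellent".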

Further Reading#

  • Shrout, P. E. & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.
  • Koo, T. K. & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155–163.
  • McGraw, K. O. & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46.