# Intraclass Correlation Coefficient (ICC)
The Intraclass Correlation Coefficient (ICC) is the standard measure of agreement between two or more raters when measurements are on a continuous scale. While Cohen's and Fleiss' Kappa are designed for categorical data, the ICC captures how consistently raters score on a numerical scale. It is based on an analysis of variance decomposition and indicates what proportion of total variance is due to actual differences between subjects.
## When to Use
- You have continuous (metric) measurements, e.g., scores on a 1–100 scale
- Two or more raters evaluate the same subjects
- You want to assess consistency or absolute agreement (different ICC forms target each)
- You need a measure that is comparable across different study designs
- You want to distinguish between different sources of error (raters, subjects, residual)
## Assumptions
- Continuous (metric) measurements
- Independent subjects
- Raters representative of a larger population (for ICC(2))
- Approximately normally distributed residuals
## ICC Variants
There are several ICC forms that differ in their assumptions. The most important ones are:
- ICC(1,1): Each subject is rated by a different, randomly drawn set of raters (one-way random effects). Rarely used in practice.
- ICC(2,1): Each subject is rated by all raters, who are considered a random sample from a larger population. This is the most common variant – it accounts for both systematic differences between raters and random error.
- ICC(3,1): Each subject is rated by all raters, but the raters are the only ones of interest (fixed effect). Systematic differences between raters are partialed out. Appropriate when results should only apply to these specific raters.
The suffixes ",1" and ",k" indicate whether reliability is reported for a single measurement or for the mean across k raters.
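The step from a ",1" form to the corresponding ",k" form follows the Spearman–Brown formula. A minimal sketch in plain Python (the numeric values are illustrative, not from this document's example):

```python
def spearman_brown(icc_single: float, k: int) -> float:
    """Step a single-rater ICC up to the reliability of the mean of k raters."""
    return k * icc_single / (1 + (k - 1) * icc_single)

# Illustrative: a modest single-rater ICC of 0.29 with k = 4 raters
# steps up to roughly 0.62 for the average of the four ratings.
print(round(spearman_brown(0.29, 4), 2))
```

This is why averaging over more raters always yields higher reliability than a single measurement, as long as the single-rater ICC is positive.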
## Formula
The basic form of the ICC is:

$$\mathrm{ICC} = \frac{MS_B - MS_W}{MS_B + (k - 1)\,MS_W}$$

where $MS_B$ is the mean square between subjects, $MS_W$ is the mean square within subjects, and $k$ is the number of raters. For ICC(2,1) with absolute agreement, an additional term accounts for the rater effect:

$$\mathrm{ICC}(2,1) = \frac{MS_B - MS_E}{MS_B + (k - 1)\,MS_E + \frac{k}{n}\,(MS_R - MS_E)}$$

where $MS_R$ is the mean square for raters, $MS_E$ is the residual mean square, and $n$ is the number of subjects.
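To make the variance decomposition concrete, here is a minimal NumPy sketch of ICC(2,1) (two-way random effects, absolute agreement, single measurement), checked against the widely reproduced example data from Shrout and Fleiss (1979):

```python
import numpy as np

def icc2_1(x: np.ndarray) -> float:
    """ICC(2,1) from a (subjects x raters) matrix via two-way ANOVA mean squares."""
    n, k = x.shape
    grand = x.mean()
    ss_subjects = k * np.sum((x.mean(axis=1) - grand) ** 2)        # between-subject SS
    ss_raters = n * np.sum((x.mean(axis=0) - grand) ** 2)          # between-rater SS
    ss_error = np.sum((x - grand) ** 2) - ss_subjects - ss_raters  # residual SS
    ms_b = ss_subjects / (n - 1)           # mean square between subjects
    ms_r = ss_raters / (k - 1)             # mean square between raters
    ms_e = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (ms_b - ms_e) / (ms_b + (k - 1) * ms_e + k * (ms_r - ms_e) / n)

# Example data from Shrout & Fleiss (1979): 6 targets rated by 4 judges.
ratings = np.array([
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
], dtype=float)
print(round(icc2_1(ratings), 2))  # approximately 0.29
```

Note how strongly the systematic judge differences (a large $MS_R$) pull this value down compared to a consistency-oriented form such as ICC(3,1).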
## Example
**Practical Example: Rating Presentations**
Three judges rate the quality of 20 student presentations on a scale from 1 to 100. Each judge rates every presentation independently.
| Presentation | Judge 1 | Judge 2 | Judge 3 |
|---|---|---|---|
| 1 | 72 | 68 | 75 |
| 2 | 85 | 82 | 88 |
| 3 | 45 | 50 | 42 |
| ... | ... | ... | ... |
| 20 | 91 | 87 | 93 |
An analysis of variance yields the mean squares for subjects, judges, and residual error; substituting them into the formula gives an ICC(2,1) of approximately 0.84, indicating good reliability. The judges rate quite consistently – the differences in scores largely reflect genuine quality differences among the presentations.
## Effect Size
The ICC is itself an effect size measure. Widely used benchmarks come from Koo and Li (2016):
| ICC Value | Interpretation |
|---|---|
| < 0.50 | Poor |
| 0.50–0.75 | Moderate |
| 0.75–0.90 | Good |
| > 0.90 | Excellent |
Important: The ICC should always be reported alongside its 95% confidence interval, as the point estimate alone can be misleading, especially with small sample sizes.
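One pragmatic way to obtain such an interval is a percentile bootstrap over subjects. This is a sketch, not the exact F-based interval from Shrout and Fleiss; the data below are simulated to resemble the presentations example (judge biases and noise levels are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def icc2_1(x):
    """ICC(2,1) from a (subjects x raters) matrix via two-way ANOVA mean squares."""
    n, k = x.shape
    grand = x.mean()
    ms_b = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)
    ms_r = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)
    ss_err = np.sum((x - grand) ** 2) - (n - 1) * ms_b - (k - 1) * ms_r
    ms_e = ss_err / ((n - 1) * (k - 1))
    return (ms_b - ms_e) / (ms_b + (k - 1) * ms_e + k * (ms_r - ms_e) / n)

# Simulated data: 20 presentations, 3 judges with small biases plus noise.
true_quality = rng.uniform(40, 95, size=20)
judge_bias = np.array([0.0, -3.0, 2.0])
ratings = true_quality[:, None] + judge_bias + rng.normal(0, 5, size=(20, 3))

# Percentile bootstrap: resample subjects (rows) with replacement.
boot = np.array([
    icc2_1(ratings[rng.integers(0, 20, size=20)])
    for _ in range(2000)
])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"ICC(2,1) = {icc2_1(ratings):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With only 20 subjects the interval is noticeably wide, which is exactly why the point estimate alone can mislead.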
## Further Reading
- Shrout, P. E. & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.
- Koo, T. K. & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155–163.
- McGraw, K. O. & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46.