Inter-Rater Reliability#

How to measure and evaluate agreement between raters

When two or more people judge the same thing β€” grading essays, making diagnoses, coding behaviors β€” a critical question arises: How well do the raters agree? Inter-rater reliability quantifies exactly this. Without it, we cannot tell whether our measurements depend on the subject being rated or simply on who is doing the rating.

Why Does It Matter?#

Imagine two physicians evaluate 100 X-ray images. If they agree in 90% of cases, that sounds good. But what if 80% of the images are clearly unremarkable and each physician simply checks "unremarkable"? Then the high agreement is largely due to chance (or rather, the base rate). This is exactly why we need chance-corrected measures.

Percent Agreement vs. Chance-Corrected Measures#

Why Raw Agreement Is Misleading

Two graders evaluate 100 essays as "pass" or "fail." Grader A says "pass" for 80 essays, Grader B likewise. They agree on 82 cases.

Percent agreement: 82% β€” sounds decent.

But if both independently say "pass" 80% of the time, we expect agreement by chance alone of:

P_e = (0.80 \times 0.80) + (0.20 \times 0.20) = 0.64 + 0.04 = 0.68

That is 68% agreement by chance alone! The actual agreement beyond chance is much lower than the raw 82% suggests.

Cohen's Kappa:

\kappa = \frac{P_o - P_e}{1 - P_e} = \frac{0.82 - 0.68}{1 - 0.68} = \frac{0.14}{0.32} = 0.44

A Kappa of 0.44 β€” that is only "moderate" agreement, not "good."
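The calculation can be reproduced with a small, self-contained sketch. The 2Γ—2 table below is reconstructed from the marginals in the example (80 + 80 "pass" votes, 82 agreements imply 71 joint "pass" and 11 joint "fail"); pure Python, no libraries:

```python
def cohen_kappa(table):
    """Cohen's kappa from a square contingency table (rows: rater A, cols: rater B)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(k)) / n                     # observed agreement
    row_m = [sum(row) / n for row in table]                          # rater A marginals
    col_m = [sum(table[i][j] for i in range(k)) / n for j in range(k)]  # rater B marginals
    p_e = sum(r * c for r, c in zip(row_m, col_m))                   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Essay example: 71 both "pass", 9 + 9 disagreements, 11 both "fail"
table = [[71, 9],
         [9, 11]]
print(round(cohen_kappa(table), 2))  # 0.44
```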

Which Measure for Which Situation?#

Cohen's Kappa β€” Two Raters, Categorical Data#

The standard measure when exactly two raters make categorical judgments (yes/no, diagnosis A/B/C, etc.). Kappa corrects the observed agreement for the agreement expected by chance.

Typical Applications

  • Two clinicians classify patients (depression yes/no)
  • Two coders rate interview transcripts (category system)
  • Two teachers grade essays (pass/fail)

Weighted Kappa β€” Ordinal Categories#

When the categories have a natural order (e.g., "mild β€” moderate β€” severe"), you want "mild vs. severe" to be penalized more heavily than "mild vs. moderate." Weighted Kappa does exactly that.
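A minimal sketch of weighted kappa with the two common weighting schemes (pure Python; the demo counts for mild/moderate/severe are hypothetical, not taken from any study):

```python
def weighted_kappa(table, weights="linear"):
    """Weighted kappa for a square contingency table over ordered categories.
    weights: "linear" (penalty grows with distance) or "quadratic"."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_m = [sum(row) for row in table]
    col_m = [sum(table[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            d = abs(i - j) / (k - 1)
            w = d if weights == "linear" else d ** 2
            num += w * table[i][j] / n                   # observed weighted disagreement
            den += w * (row_m[i] / n) * (col_m[j] / n)   # disagreement expected by chance
    return 1 - num / den

# Hypothetical counts: two raters classify 70 patients as mild/moderate/severe.
# "mild vs. severe" confusions (corners of the table) carry the largest weight.
table = [[20, 5, 1],
         [4, 15, 6],
         [0, 3, 16]]
print(round(weighted_kappa(table, "linear"), 2))
print(round(weighted_kappa(table, "quadratic"), 2))
```

With two categories there is only one possible distance, so linear weighted kappa reduces to ordinary Cohen's kappa.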

Fleiss' Kappa β€” More Than Two Raters#

When three or more raters categorize the same objects, Fleiss' Kappa is the right choice. It extends Cohen's Kappa to the multi-rater case.
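A compact implementation of Fleiss' kappa (pure Python). Input is a per-item table of category counts, a common way to arrange multi-rater data:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa. counts: one row per rated item, one column per category;
    each cell holds how many raters put that item into that category."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    k = len(counts[0])
    # overall proportion of all assignments falling into each category
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(k)]
    # per-item agreement: share of rater pairs that agree on the item
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)
```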

ICC (Intraclass Correlation) β€” Continuous Data#

When ratings are on a continuous scale (e.g., pain scale 0–10, points on an assessment), the ICC is the appropriate measure. Several ICC variants exist depending on whether raters are fixed or random and whether absolute agreement or consistency is of interest.
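One widely used variant, ICC(2,1) (two-way random effects, absolute agreement, single rater), can be computed from the two-way ANOVA mean squares. A sketch in pure Python following the Shrout and Fleiss (1979) formulas:

```python
def icc_2_1(data):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    data: one row per subject, one column per rater."""
    n = len(data)       # subjects
    k = len(data[0])    # raters
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(data[i][j] for i in range(n)) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between-subject
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between-rater
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
```

Because this variant measures absolute agreement, a systematic offset between raters (e.g., one rater always scoring one point higher) lowers the ICC even when the rank ordering of subjects is identical; a consistency ICC would ignore that offset.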

Interpretation Guidelines#

The following classification by Landis and Koch (1977) is most commonly cited:

Kappa / ICC     Interpretation
< 0.00          Poor (worse than chance)
0.00 – 0.20     Slight
0.21 – 0.40     Fair
0.41 – 0.60     Moderate
0.61 – 0.80     Substantial
0.81 – 1.00     Almost perfect

A word of caution: these thresholds are arbitrary. In clinical contexts, a minimum of ΞΊ β‰₯ 0.60 is often required, while screening instruments typically demand ΞΊ β‰₯ 0.80.

Decision Guide#

Question            Answer        Measure
How many raters?    2             Cohen's Kappa / ICC
How many raters?    3+            Fleiss' Kappa / ICC
Scale level?        Nominal       Kappa
Scale level?        Ordinal       Weighted Kappa
Scale level?        Continuous    ICC

Practical Example#

Grading University Essays

Two lecturers independently grade 50 term papers on a 1–6 scale.

  1. Exact percent agreement: 38% β€” sounds poor
  2. Agreement within Β± 1 grade: 84% β€” much better
  3. ICC (two-way, absolute agreement): 0.72 β€” good agreement

Exact percent agreement on a 6-point scale is almost inevitably low, because near-misses count as full disagreements. The ICC accounts for how close the ratings are to each other and provides a more realistic picture.
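Exact and within-tolerance agreement can be computed with a small helper (the grade vectors below are hypothetical, not the study data):

```python
def agreement(a, b, tol=0):
    """Share of paired ratings that differ by at most `tol`."""
    return sum(abs(x - y) <= tol for x, y in zip(a, b)) / len(a)

grades_a = [1, 2, 2, 3, 4, 5, 3, 2, 6, 4]
grades_b = [1, 3, 2, 3, 5, 5, 2, 4, 6, 3]
print(agreement(grades_a, grades_b))         # exact agreement: 0.5
print(agreement(grades_a, grades_b, tol=1))  # within one grade: 0.9
```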

Improving Agreement#

If inter-rater reliability is too low, the following steps usually help:

  1. Sharpen coding rules: Provide clear definitions and anchor examples for each category
  2. Training: Practice together on sample cases with subsequent discussion
  3. Pilot phase: Code a small sample first, check reliability, then refine
  4. Reduce complexity: Fewer categories often lead to higher agreement

Common Misconceptions#

  • "90% agreement is great." β€” Without chance correction, the raw percentage says little. With a skewed distribution, 90% agreement can correspond to a low Kappa.
  • "Kappa is the only measure." β€” For continuous data, the ICC is more appropriate. Kappa is designed only for categorical data.
  • "Low reliability means the raters are bad." β€” Sometimes the problem lies in the rating system: vague categories, too many levels, or unclear criteria make the task difficult even for experienced raters.
  • "Kappa cannot be negative." β€” It can. A negative Kappa means agreement is worse than chance β€” the raters are systematically contradicting each other.
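The first misconception can be made concrete with the X-ray scenario from the introduction (the exact counts are hypothetical):

```python
# 100 X-rays, two radiologists: both say "unremarkable" 89 times,
# both say "abnormal" once, and they split on the remaining 10 cases.
p_o = (89 + 1) / 100                     # raw agreement: 0.90
p_a = p_b = (89 + 5) / 100               # each rater says "unremarkable" 94% of the time
p_e = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement: 0.8872
kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 2))  # 0.11, i.e. only "slight" despite 90% raw agreement
```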

Reporting#

A template formulation for the methods section:

Inter-rater reliability was assessed based on 50 independently double-coded cases. Cohen's kappa was ΞΊ = .73 (95% CI: .61–.85), indicating substantial agreement according to Landis and Koch (1977).

Further Reading#

  • Landis, J. R. & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
  • Shrout, P. E. & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.
  • Hallgren, K. A. (2012). Computing inter-rater reliability for observational data: An overview and tutorial. Tutorials in Quantitative Methods for Psychology, 8, 23–34.