# Cohen's Kappa
Cohen's Kappa (κ) is a statistical measure that quantifies the agreement between exactly two raters who classify the same set of subjects into categories. Unlike simple percent agreement, Kappa accounts for the proportion of agreement expected by chance alone. This makes it a far more realistic indicator of true inter-rater agreement.
## When to Use
- You have exactly two raters who evaluate independently
- Ratings are on a categorical scale (e.g., healthy/sick, Type A/B/C)
- You want to know whether agreement exceeds chance level
- Both raters evaluate the same set of subjects (patients, images, texts, etc.)
- You need a single, easy-to-interpret measure of agreement
## Assumptions
- Exactly 2 raters
- Both raters evaluate the same subjects
- Categorical scale (nominal or ordinal)
- Independent ratings (no mutual influence)
## Formula
Cohen's Kappa is calculated from the observed agreement $p_o$ and the agreement expected by chance $p_e$:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

Here, $p_o$ is the proportion of cases where both raters agree, and $p_e$ is the proportion of agreement expected under random assignment. $p_e$ is derived from the marginal distributions of the contingency table:

$$p_e = \sum_{k} p_{1,k} \, p_{2,k}$$

where $p_{1,k}$ and $p_{2,k}$ are the relative frequencies of category $k$ for Rater 1 and Rater 2, respectively.
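The formula can be sketched in a few lines of Python. The helper name `cohens_kappa` is chosen here for illustration; it computes $p_o$ and $p_e$ directly from two lists of categorical labels:

```python
from collections import Counter

def cohens_kappa(ratings1, ratings2):
    """Cohen's kappa for two raters' labels of the same subjects (illustrative helper)."""
    assert len(ratings1) == len(ratings2), "both raters must rate the same subjects"
    n = len(ratings1)
    # Observed agreement p_o: fraction of subjects both raters label identically.
    p_o = sum(a == b for a, b in zip(ratings1, ratings2)) / n
    # Marginal category counts for each rater.
    f1, f2 = Counter(ratings1), Counter(ratings2)
    # Chance agreement p_e: sum over categories of the product of marginal frequencies.
    p_e = sum((f1[c] / n) * (f2[c] / n) for c in f1.keys() | f2.keys())
    return (p_o - p_e) / (1 - p_e)
```

For real analyses, `sklearn.metrics.cohen_kappa_score` provides the same statistic with additional options such as weighting for ordinal scales.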
## Example
Practical Example: Diagnosis by Two Doctors
Two doctors independently examine 100 patients and classify each as healthy or sick. The results:
| | Doctor 2: healthy | Doctor 2: sick | Total |
|---|---|---|---|
| Doctor 1: healthy | 40 | 10 | 50 |
| Doctor 1: sick | 5 | 45 | 50 |
| Total | 45 | 55 | 100 |
Observed agreement:

$$p_o = \frac{40 + 45}{100} = 0.85$$

Expected agreement:

$$p_e = \frac{50}{100} \cdot \frac{45}{100} + \frac{50}{100} \cdot \frac{55}{100} = 0.225 + 0.275 = 0.50$$

Kappa:

$$\kappa = \frac{0.85 - 0.50}{1 - 0.50} = 0.70$$

With $\kappa = 0.70$, this indicates substantial agreement.
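The same result can be reproduced directly from the contingency counts. The sketch below (the `kappa_from_table` helper is a name chosen for illustration) takes a square table with rows for Rater 1 and columns for Rater 2:

```python
def kappa_from_table(table):
    """Cohen's kappa from a square contingency table (rows: rater 1, cols: rater 2)."""
    n = sum(sum(row) for row in table)
    k = len(table)
    # Diagonal cells are the cases where both raters chose the same category.
    p_o = sum(table[i][i] for i in range(k)) / n
    # Marginal frequencies per category for each rater.
    row_marg = [sum(row) / n for row in table]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_marg, col_marg))
    return (p_o - p_e) / (1 - p_e)

doctors = [[40, 10],
           [5, 45]]
print(round(kappa_from_table(doctors), 4))  # prints 0.7
```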
## Effect Size
Cohen's Kappa is itself an effect size measure. The most widely used interpretation comes from Landis and Koch (1977):
| Kappa Value | Interpretation |
|---|---|
| < 0.00 | Poor (worse than chance) |
| 0.00 – 0.20 | Slight |
| 0.21 – 0.40 | Fair |
| 0.41 – 0.60 | Moderate |
| 0.61 – 0.80 | Substantial |
| 0.81 – 1.00 | Almost perfect |
A $\kappa$ of 1 means perfect agreement, a $\kappa$ of 0 corresponds to pure chance agreement, and negative values indicate systematic disagreement.
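The Landis and Koch bands translate directly into a small lookup. A sketch in Python (`interpret_kappa` is a hypothetical helper named here for illustration):

```python
def interpret_kappa(kappa):
    """Verbal label for a kappa value, per Landis & Koch (1977)."""
    if kappa < 0.0:
        return "poor (worse than chance)"
    # Upper bound of each band, in ascending order.
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")

print(interpret_kappa(0.70))  # prints substantial
```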
## Further Reading
- Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
- Landis, J. R. & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
- McHugh, M. L. (2012). Interrater reliability: the kappa statistic. Biochemia Medica, 22(3), 276–282.