Fleiss' Kappa#
Fleiss' Kappa (ΞΊ) extends Cohen's Kappa to situations where three or more raters independently classify the same subjects into categories. Like Cohen's Kappa, it corrects for chance agreement, but it works with any number of raters. It is the standard measure when multiple people independently evaluate the same cases.
When to Use#
- You have three or more raters who evaluate independently
- Ratings are on a categorical scale (e.g., normal/abnormal/unclear)
- Each subject is rated by a fixed number of raters
- You want to quantify overall agreement across all raters
- You need a chance-corrected agreement measure
Assumptions#
- Fixed number of raters per subject
- Categorical scale (nominal or ordinal)
- Independent ratings (no mutual influence)
- Each subject rated by the same number of raters
Formula#
Fleiss' Kappa compares the observed agreement $\bar{P}$ with the agreement $\bar{P}_e$ expected by chance:

$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$$

For each subject $i$, the pairwise agreement is computed as:

$$P_i = \frac{1}{n(n-1)} \left( \sum_{j=1}^{k} n_{ij}^2 - n \right)$$

where $n$ is the number of raters per subject, $k$ is the number of categories, and $n_{ij}$ is the number of raters who assigned subject $i$ to category $j$. $\bar{P}$ is the mean of all $P_i$, and $\bar{P}_e$ is derived from the overall category proportions:

$$\bar{P}_e = \sum_{j=1}^{k} p_j^2, \qquad p_j = \frac{1}{Nn} \sum_{i=1}^{N} n_{ij}$$

where $p_j$ is the overall proportion of ratings in category $j$ and $N$ is the number of subjects.
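The formula translates directly into a few lines of NumPy. A minimal sketch (the function name `fleiss_kappa` is our own, not from any library); the input is the $N \times k$ matrix of rater counts described above:

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa from an (N subjects x k categories) matrix of rater
    counts. Every row must sum to the same number of raters n."""
    counts = np.asarray(counts, dtype=float)
    N, k = counts.shape
    n = counts[0].sum()                    # raters per subject (assumed constant)
    # Per-subject pairwise agreement P_i
    P_i = (np.sum(counts**2, axis=1) - n) / (n * (n - 1))
    P_bar = P_i.mean()                     # mean observed agreement
    p_j = counts.sum(axis=0) / (N * n)     # overall category proportions
    P_e = np.sum(p_j**2)                   # chance-expected agreement
    return (P_bar - P_e) / (1 - P_e)
```

With perfect agreement on every subject the function returns 1.0, since $\bar{P} = 1$ makes the numerator and denominator equal.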
Example#
Practical Example: X-Ray Assessment by Three Radiologists
Three radiologists independently evaluate 50 chest X-rays as normal, abnormal, or unclear. The results are recorded in a table showing, for each image, how many radiologists chose each category.
| Image | Normal | Abnormal | Unclear |
|---|---|---|---|
| 1 | 3 | 0 | 0 |
| 2 | 1 | 2 | 0 |
| 3 | 0 | 1 | 2 |
| ... | ... | ... | ... |
| 50 | 2 | 1 | 0 |
For Image 1, all three agree ($P_1 = 1.0$); for Image 2, only two agree ($P_2 = 1/3$). After computing all 50 $P_i$ values and the chance expectation $\bar{P}_e$, suppose we obtain a $\kappa$ in the 0.41–0.60 range, indicating moderate agreement.
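The per-image values can be checked by hand with the $P_i$ formula. A small sketch (the helper name `pairwise_agreement` is ours):

```python
def pairwise_agreement(row):
    """P_i for one subject: the fraction of rater pairs that agree,
    computed from one row of category counts."""
    n = sum(row)
    return (sum(c * c for c in row) - n) / (n * (n - 1))

# Image 1: all three radiologists chose "normal"
print(pairwise_agreement([3, 0, 0]))  # 1.0 (all three pairs agree)
# Image 2: one "normal", two "abnormal"
print(pairwise_agreement([1, 2, 0]))  # 1/3 (one agreeing pair out of three)
```

With three raters there are three rater pairs per image, so $P_i$ can only take the values 0, 1/3, or 1.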
Effect Size#
Fleiss' Kappa is interpreted the same way as Cohen's Kappa: it is itself an effect size measure. The standard benchmarks come from Landis and Koch (1977):
| Kappa Value | Interpretation |
|---|---|
| < 0.00 | Poor (worse than chance) |
| 0.00 – 0.20 | Slight |
| 0.21 – 0.40 | Fair |
| 0.41 – 0.60 | Moderate |
| 0.61 – 0.80 | Substantial |
| 0.81 – 1.00 | Almost perfect |
Note that Fleiss' Kappa values tend to be lower than Cohen's Kappa in practice, since achieving agreement across more raters is inherently more difficult.
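For reporting, the Landis–Koch benchmarks are easy to encode as a lookup. A small sketch (the function name `landis_koch` is our own):

```python
def landis_koch(kappa):
    """Map a kappa value to its Landis & Koch (1977) interpretation."""
    if kappa < 0.0:
        return "Poor"
    if kappa <= 0.20:
        return "Slight"
    if kappa <= 0.40:
        return "Fair"
    if kappa <= 0.60:
        return "Moderate"
    if kappa <= 0.80:
        return "Substantial"
    return "Almost perfect"

print(landis_koch(0.52))  # Moderate
```

Keep in mind that these cut-offs are conventions, not statistically derived thresholds.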
Further Reading#
- Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.
- Landis, J. R. & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
- Gwet, K. L. (2014). Handbook of Inter-Rater Reliability (4th ed.). Advanced Analytics.