PickMyTest

Fleiss' Kappa

Fleiss' Kappa extends Cohen's Kappa to three or more raters, measuring inter-rater agreement for categorical data while correcting for chance.

Fleiss' Kappa

Fleiss' Kappa (ΞΊ) extends Cohen's Kappa to situations where three or more raters independently classify the same subjects into categories. Like Cohen's Kappa, it corrects for chance agreement, but it works with any number of raters. It is the standard measure when multiple people independently evaluate the same cases.

When to Use

  • You have three or more raters who evaluate independently
  • Ratings are on a categorical scale (e.g., normal/abnormal/unclear)
  • Each subject is rated by a fixed number of raters
  • You want to quantify overall agreement across all raters
  • You need a chance-corrected agreement measure

Assumptions

  • Fixed number of raters per subject
  • Categorical scale (nominal or ordinal)
  • Independent ratings (no mutual influence)
  • Each subject rated by the same number of raters

Formula

Fleiss' Kappa compares the observed agreement $\bar{P}$ with the agreement expected by chance $\bar{P}_e$:

$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$$

For each subject $i$, the pairwise agreement is computed as:

$$P_i = \frac{1}{n(n-1)} \sum_{j=1}^{k} n_{ij}(n_{ij} - 1)$$

where $n$ is the number of raters per subject, $k$ is the number of categories, and $n_{ij}$ is the number of raters who assigned subject $i$ to category $j$. $\bar{P}$ is the mean of all $P_i$, and $\bar{P}_e$ is derived from the overall category proportions:

$$\bar{P}_e = \sum_{j=1}^{k} p_j^2$$

where $p_j$ is the overall proportion of ratings in category $j$.
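As a sketch, the formulas above translate directly into a few lines of Python. The input is a subjects × categories matrix of rating counts; the function name `fleiss_kappa` is illustrative, not from any particular library:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a subjects x categories matrix of rating counts.

    counts[i][j] = number of raters who assigned subject i to category j.
    Every row must sum to the same number of raters n.
    """
    N = len(counts)     # number of subjects
    n = sum(counts[0])  # raters per subject (assumed constant across rows)
    k = len(counts[0])  # number of categories

    # Per-subject pairwise agreement P_i
    P = [sum(c * (c - 1) for c in row) / (n * (n - 1)) for row in counts]
    P_bar = sum(P) / N

    # Chance agreement P_e from the overall category proportions p_j
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(pj ** 2 for pj in p)

    return (P_bar - P_e) / (1 - P_e)
```

Note that with perfect agreement every $P_i = 1$, so the function returns exactly 1 regardless of the category proportions.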

Example

Practical Example: X-Ray Assessment by Three Radiologists

Three radiologists independently evaluate 50 chest X-rays as normal, abnormal, or unclear. The results are recorded in a table showing, for each image, how many radiologists chose each category.

| Image | Normal | Abnormal | Unclear |
|-------|--------|----------|---------|
| 1     | 3      | 0        | 0       |
| 2     | 1      | 2        | 0       |
| 3     | 0      | 1        | 2       |
| …     | …      | …        | …       |
| 50    | 2      | 1        | 0       |

For Image 1, all three agree ($P_1 = 1.0$); for Image 2, two agree ($P_2 = 0.33$). After computing all $P_i$ values and the chance expectation, suppose we obtain $\kappa = 0.58$, indicating moderate agreement.
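The per-image agreement values can be checked directly from the first three table rows. (The full 50-image data set is not shown, so the overall $\kappa$ cannot be reproduced here; this only verifies the $P_i$ calculation.)

```python
n = 3  # raters per image
# normal/abnormal/unclear counts for the first three images
rows = {1: [3, 0, 0], 2: [1, 2, 0], 3: [0, 1, 2]}

for image, counts in rows.items():
    # Pairwise agreement for one subject: sum n_ij * (n_ij - 1) / (n * (n - 1))
    P_i = sum(c * (c - 1) for c in counts) / (n * (n - 1))
    print(f"Image {image}: P_i = {P_i:.2f}")
# Image 1: P_i = 1.00
# Image 2: P_i = 0.33
# Image 3: P_i = 0.33
```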

Effect Size

Fleiss' Kappa is interpreted the same way as Cohen's Kappa — it is itself an effect size measure. The standard benchmarks come from Landis and Koch (1977):

| Kappa Value | Interpretation           |
|-------------|--------------------------|
| < 0.00      | Poor (worse than chance) |
| 0.00 – 0.20 | Slight                   |
| 0.21 – 0.40 | Fair                     |
| 0.41 – 0.60 | Moderate                 |
| 0.61 – 0.80 | Substantial              |
| 0.81 – 1.00 | Almost perfect           |

Note that Fleiss' Kappa values tend to be lower than Cohen's Kappa in practice, since achieving agreement across more raters is inherently more difficult.
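A small helper can map a kappa value onto the Landis–Koch labels above; the function name `landis_koch` is illustrative:

```python
def landis_koch(kappa):
    """Map a kappa value to its Landis & Koch (1977) benchmark label."""
    if kappa < 0.0:
        return "poor"
    # Upper bound of each band, in ascending order
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"  # kappa cannot exceed 1 in theory
```

For the example above, `landis_koch(0.58)` returns `"moderate"`, matching the interpretation in the text.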

Further Reading

  • Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.
  • Landis, J. R. & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
  • Gwet, K. L. (2014). Handbook of Inter-Rater Reliability (4th ed.). Advanced Analytics.