Fleiss' Kappa#
Fleiss' Kappa (ΞΊ) extends Cohen's Kappa to situations where three or more raters independently classify the same subjects into categories. Like Cohen's Kappa, it corrects for chance agreement, but it works with any number of raters. It is the standard measure when multiple people independently evaluate the same cases.
When to Use#
- You have three or more raters who evaluate independently
- Ratings are on a categorical scale (e.g., normal/abnormal/unclear)
- Each subject is rated by a fixed number of raters
- You want to quantify overall agreement across all raters
- You need a chance-corrected agreement measure
Assumptions#
- Fixed number of raters per subject
- Categorical scale (nominal or ordinal)
- Independent ratings (no mutual influence)
- Each subject rated by the same number of raters
Formula#
Fleiss' Kappa compares the observed agreement $\bar{P}$ with the agreement $\bar{P}_e$ expected by chance:

$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$$

For each subject $i$, the pairwise agreement is computed as:

$$P_i = \frac{1}{n(n-1)} \left( \sum_{j=1}^{k} n_{ij}^2 - n \right)$$

where $n$ is the number of raters per subject, $k$ is the number of categories, and $n_{ij}$ is the number of raters who assigned subject $i$ to category $j$. $\bar{P}$ is the mean of all $P_i$, and $\bar{P}_e$ is derived from the overall category proportions:

$$\bar{P}_e = \sum_{j=1}^{k} p_j^2, \qquad p_j = \frac{1}{Nn} \sum_{i=1}^{N} n_{ij}$$

where $p_j$ is the overall proportion of ratings in category $j$ and $N$ is the number of subjects.
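The formula translates directly into a few lines of NumPy. A minimal sketch (the function name `fleiss_kappa` is our own, not from any library); the input is the $N \times k$ matrix of rater counts described above:

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa from an (N subjects x k categories) matrix of rater
    counts. Every row must sum to the same number of raters n."""
    counts = np.asarray(counts, dtype=float)
    N, k = counts.shape
    n = counts[0].sum()                    # raters per subject (assumed constant)
    # Per-subject pairwise agreement P_i
    P_i = (np.sum(counts**2, axis=1) - n) / (n * (n - 1))
    P_bar = P_i.mean()                     # mean observed agreement
    p_j = counts.sum(axis=0) / (N * n)     # overall category proportions
    P_e = np.sum(p_j**2)                   # chance-expected agreement
    return (P_bar - P_e) / (1 - P_e)
```

With perfect agreement on every subject the function returns 1.0, since $\bar{P} = 1$ makes the numerator and denominator equal.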
Example#
Practical Example: X-Ray Assessment by Three Radiologists
Three radiologists independently evaluate 50 chest X-rays as normal, abnormal, or unclear. The results are recorded in a table showing, for each image, how many radiologists chose each category.
| Image | Normal | Abnormal | Unclear |
|---|---|---|---|
| 1 | 3 | 0 | 0 |
| 2 | 1 | 2 | 0 |
| 3 | 0 | 1 | 2 |
| ... | ... | ... | ... |
| 50 | 2 | 1 | 0 |
For Image 1, all three agree ($P_1 = 1.0$); for Image 2, only two agree ($P_2 = 1/3$). After computing all 50 $P_i$ values and the chance expectation $\bar{P}_e$, suppose we obtain a $\kappa$ in the 0.41–0.60 range, indicating moderate agreement.
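The per-image values can be checked by hand with the $P_i$ formula. A small sketch (the helper name `pairwise_agreement` is ours):

```python
def pairwise_agreement(row):
    """P_i for one subject: the fraction of rater pairs that agree,
    computed from one row of category counts."""
    n = sum(row)
    return (sum(c * c for c in row) - n) / (n * (n - 1))

# Image 1: all three radiologists chose "normal"
print(pairwise_agreement([3, 0, 0]))  # 1.0 (all three pairs agree)
# Image 2: one "normal", two "abnormal"
print(pairwise_agreement([1, 2, 0]))  # 1/3 (one agreeing pair out of three)
```

With three raters there are three rater pairs per image, so $P_i$ can only take the values 0, 1/3, or 1.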
Effect Size#
Fleiss' Kappa is interpreted the same way as Cohen's Kappa: it is itself an effect size measure. The standard benchmarks come from Landis and Koch (1977):
| Kappa Value | Interpretation |
|---|---|
| < 0.00 | Poor (worse than chance) |
| 0.00 – 0.20 | Slight |
| 0.21 – 0.40 | Fair |
| 0.41 – 0.60 | Moderate |
| 0.61 – 0.80 | Substantial |
| 0.81 – 1.00 | Almost perfect |
Note that Fleiss' Kappa values tend to be lower than Cohen's Kappa in practice, since achieving agreement across more raters is inherently more difficult.
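For reporting, the Landis–Koch benchmarks are easy to encode as a lookup. A small sketch (the function name `landis_koch` is our own):

```python
def landis_koch(kappa):
    """Map a kappa value to its Landis & Koch (1977) interpretation."""
    if kappa < 0.0:
        return "Poor"
    if kappa <= 0.20:
        return "Slight"
    if kappa <= 0.40:
        return "Fair"
    if kappa <= 0.60:
        return "Moderate"
    if kappa <= 0.80:
        return "Substantial"
    return "Almost perfect"

print(landis_koch(0.52))  # Moderate
```

Keep in mind that these cut-offs are conventions, not statistically derived thresholds.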
Further Reading#
- Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.
- Landis, J. R. & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
- Gwet, K. L. (2014). Handbook of Inter-Rater Reliability (4th ed.). Advanced Analytics.