Inter-Annotator Agreement Explained

A practical guide to inter-annotator agreement: percent agreement, Cohen's and Fleiss' kappa, and Krippendorff's alpha; when to use each; and how Potato reports them.

Inter-annotator agreement (IAA) measures how often independent annotators give the same label. It is the standard evidence that an annotation task is well-defined and the resulting labels are trustworthy. Low agreement usually means the guidelines are unclear, not that the annotators are careless.

The broader topic is known as inter-rater reliability. Potato computes agreement live in the admin dashboard; see Quality Control.

Why raw percent agreement isn't enough

The simplest measure is percent agreement: the fraction of items that annotators labeled identically. The problem is that some agreement happens by chance. If two annotators each pick "positive" 90% of the time, they will agree on about 82% of items (0.9 × 0.9 + 0.1 × 0.1) even when labeling at random. Chance-corrected measures fix this.
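As a minimal sketch of that effect, the snippet below computes percent agreement and the agreement two independent annotators would reach purely by chance, given their label rates. The function names are illustrative and are not part of Potato.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def chance_agreement(rates_a, rates_b):
    """Agreement expected if each annotator labels independently at these rates."""
    labels = set(rates_a) | set(rates_b)
    return sum(rates_a.get(label, 0.0) * rates_b.get(label, 0.0) for label in labels)


# Both annotators pick "positive" 90% of the time, regardless of the item.
skewed = {"positive": 0.9, "negative": 0.1}
expected = chance_agreement(skewed, skewed)  # 0.9 * 0.9 + 0.1 * 0.1
print(f"agreement expected by chance: {expected:.2f}")
```

A raw agreement figure only means something once it is compared against this chance baseline.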

A chance-corrected coefficient has the form:

text
        P_observed − P_expected
  κ =  ─────────────────────────
            1 − P_expected

where P_observed is the observed agreement and P_expected is the agreement expected by chance. A value of 1 is perfect agreement, 0 is chance-level, and negative values mean the annotators agree less often than chance would predict.
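The formula can be instantiated for two annotators as Cohen's kappa, where P_expected comes from each annotator's own label frequencies. This is a from-scratch sketch for illustration, with made-up labels; it is not how Potato computes its metrics.

```python
import math
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two equal-length, non-empty label lists")

    # Observed agreement: share of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement: product of the two annotators' marginal rates per label.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_expected == 1.0:
        return math.nan  # both annotators used one identical label; kappa is undefined
    return (p_observed - p_expected) / (1 - p_expected)


annotator_a = ["pos", "pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg"]
annotator_b = ["pos", "pos", "neg", "neg", "neg", "pos", "pos", "pos", "pos", "neg"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

Here the annotators agree on 8 of 10 items, but with P_expected = 0.52 the chance-corrected value is noticeably lower than 0.80.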

The three you'll actually use

  • Cohen's kappa: two annotators, categorical labels. The classic choice for a pair.
  • Fleiss' kappa: more than two annotators, categorical labels. Every item needs the same number of ratings, but different items may be judged by different raters.
  • Krippendorff's alpha: the most general option. It works with any number of annotators, handles missing data, and supports nominal, ordinal, interval, and ratio data. This is what Potato reports by default.
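To make the Fleiss' kappa case concrete, here is a small sketch that takes a table of counts (one row per item, one column per category, each cell holding how many raters chose that category). The input layout and function name are assumptions for illustration.

```python
import math


def fleiss_kappa(counts):
    """Fleiss' kappa from an items x categories table of rating counts."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # Per-item agreement: share of rater pairs that chose the same category.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of ratings in each category.
    n_categories = len(counts[0])
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(share * share for share in shares)

    if p_expected == 1.0:
        return math.nan  # only one category was ever used; kappa is undefined
    return (p_observed - p_expected) / (1 - p_expected)


# 4 items, 3 ratings each, 2 categories.
table = [[3, 0], [0, 3], [2, 1], [1, 2]]
fleiss = fleiss_kappa(table)
print(f"Fleiss' kappa = {fleiss:.3f}")
```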

Use Cohen's kappa for a simple categorical pair; reach for Krippendorff's alpha when you have many annotators, incomplete overlap, or ordered/continuous labels (where "off by one" should count less than "off by four").
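The "off by one" point can be illustrated with a compact sketch of Krippendorff's alpha built on the coincidence-matrix formulation, with a pluggable distance function. It covers only nominal and interval distances, uses `None` for a missing label, and is a teaching sketch rather than Potato's implementation; all names are illustrative.

```python
import math
from collections import defaultdict
from itertools import permutations


def nominal_distance(a, b):
    return 0.0 if a == b else 1.0


def interval_distance(a, b):
    return float(a - b) ** 2


def krippendorff_alpha(units, distance=nominal_distance):
    """units: one list of labels per item; None marks a missing label."""
    # Coincidence matrix: each ordered pair of labels within an item,
    # weighted by 1 / (number of labels on that item - 1).
    coincidences = defaultdict(float)
    for unit in units:
        values = [v for v in unit if v is not None]
        if len(values) < 2:
            continue  # an item with a single label carries no agreement information
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (len(values) - 1)

    totals = defaultdict(float)
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    d_observed = sum(w * distance(a, b) for (a, b), w in coincidences.items()) / n
    d_expected = sum(
        totals[a] * totals[b] * distance(a, b) for a in totals for b in totals
    ) / (n * (n - 1))
    if d_expected == 0:
        return math.nan  # only one label value in the data; alpha is undefined
    return 1.0 - d_observed / d_expected


# Two annotators rating on a 1-5 scale; the last item has only one rating.
ratings = [[1, 1], [2, 3], [4, 4], [5, 4], [3, None]]
alpha_nominal = krippendorff_alpha(ratings, nominal_distance)
alpha_interval = krippendorff_alpha(ratings, interval_distance)
print(f"nominal: {alpha_nominal:.3f}, interval: {alpha_interval:.3f}")
```

Both disagreements in this toy data are off by one, so the interval version scores much higher than the nominal version, which treats every mismatch as equally bad.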

Interpreting the number

There is no universal cutoff, but a common rough guide for alpha/kappa:

  • ≥ 0.80: good enough to rely on.
  • 0.67–0.80: usable for tentative conclusions; investigate disagreements.
  • < 0.67: revisit the guidelines before trusting the labels.

Treat these as prompts to investigate, not pass/fail gates. Always look at which items and which labels drive disagreement.
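One simple way to see which labels drive disagreement is to count how often each pair of labels clashes on the same item. The sketch below assumes a hypothetical mapping from item ID to the list of labels it received; the data is invented for illustration.

```python
from collections import Counter
from itertools import combinations


def disagreement_pairs(item_labels):
    """Count unordered label pairs that were assigned to the same item."""
    pairs = Counter()
    for labels in item_labels.values():
        for a, b in combinations(labels, 2):
            if a != b:
                pairs[tuple(sorted((a, b)))] += 1
    return pairs


item_labels = {
    "doc1": ["pos", "pos", "neutral"],
    "doc2": ["neg", "neutral", "neutral"],
    "doc3": ["pos", "pos", "pos"],
    "doc4": ["neutral", "pos", "neg"],
}
clashes = disagreement_pairs(item_labels)
for pair, count in clashes.most_common():
    print(pair, count)
```

In this toy data almost every clash involves "neutral", which would point at the definition of that label rather than at the task as a whole.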

Measuring it in Potato

Have annotators overlap on a shared subset, then enable agreement reporting:

yaml
agreement_metrics:
  enabled: true
  # Krippendorff's alpha is reported in the admin dashboard.

For span and structured tasks, measure agreement at the level you care about (exact span match vs. overlap), because document-level metrics hide boundary disagreements.
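The difference between exact and overlap matching can be sketched with a pairwise F1 score, one common choice for comparing two annotators' spans. Spans here are assumed to be `(start, end, label)` tuples with an exclusive end offset; this representation and the function names are illustrative, not Potato's data format.

```python
def exact_match_f1(spans_a, spans_b):
    """F1 between two annotators, counting only identical (start, end, label) spans."""
    set_a, set_b = set(spans_a), set(spans_b)
    if not set_a and not set_b:
        return 1.0  # neither annotator marked anything
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def spans_overlap(a, b):
    """True if two spans share a label and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def overlap_f1(spans_a, spans_b):
    """F1 where a span counts as matched if it overlaps a same-label span."""
    if not spans_a and not spans_b:
        return 1.0
    hits_a = sum(any(spans_overlap(a, b) for b in spans_b) for a in spans_a)
    hits_b = sum(any(spans_overlap(b, a) for a in spans_a) for b in spans_b)
    return (hits_a + hits_b) / (len(spans_a) + len(spans_b))


# Same entities found, but the annotators disagree on where the ORG span starts.
spans_a = [(0, 5, "PER"), (10, 18, "ORG")]
spans_b = [(0, 5, "PER"), (12, 18, "ORG")]
exact = exact_match_f1(spans_a, spans_b)
relaxed = overlap_f1(spans_a, spans_b)
print(f"exact: {exact:.2f}, overlap: {relaxed:.2f}")
```

A large gap between the two scores suggests the annotators agree on what to mark but not on where the boundaries go.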

When agreement is low

  1. Read the items annotators disagreed on: is the guideline ambiguous, or is the item genuinely hard?
  2. Tighten definitions and add the hard cases as examples. See Writing Annotation Guidelines.
  3. Re-pilot. If agreement stays low on truly subjective tasks, consider capturing the disagreement itself rather than forcing a single answer.
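For the last step, one way to capture disagreement is to keep the full distribution of labels per item instead of a single majority label. This is a minimal sketch with a hypothetical helper and made-up labels, not a Potato feature.

```python
from collections import Counter


def label_distribution(labels):
    """Turn one item's labels into a label -> share mapping (a soft label)."""
    if not labels:
        raise ValueError("item has no labels")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


soft_label = label_distribution(["funny", "funny", "not_funny", "funny"])
print(soft_label)
```

Downstream consumers can then decide for themselves whether to take the majority label, weight by the shares, or filter out highly contested items.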

Further reading