Agreement for Spans and Structured Outputs

Why Cohen's and Fleiss' kappa break down for span, NER, and structured annotation, and what to use instead: F1-as-agreement, exact vs partial match, and Krippendorff's unitized alpha.

Chance-corrected agreement like Cohen's kappa assumes every annotator labels the same fixed set of items from the same fixed set of categories. Span annotation breaks that assumption: annotators can disagree on where a span starts, where it ends, and whether it exists at all. For spans, the standard reliability measure is pairwise F1, not kappa, and you have to decide up front whether a partial boundary overlap counts as agreement. This guide explains why the usual metrics fail here and what to report instead.

Why kappa doesn't fit spans

A chance-corrected coefficient needs three things: a fixed list of items, a fixed list of labels, and a way to compute how often annotators would agree by chance. Span tasks supply none of them cleanly. There is no predetermined list of "items" to label: each annotator creates the spans as they read, so two people can produce different numbers of spans over the same document. And there is no meaningful negative class: the "items nobody marked" are every possible substring, an astronomically large and ill-defined set.

That last point is the killer. Hripcsak and Rothschild (2005) showed that when the negative class is very large or undefined, as in information retrieval and span extraction, the probability that two annotators agree on the same arbitrary span by chance is effectively zero. Chance correction then barely changes the result: as the number of negatives grows, kappa approaches the F-measure. Their result is the standard justification for a cleaner alternative: use the F-measure itself as the agreement statistic. Treat one annotator's spans as the reference and the other's as predictions, compute F1, and average over all annotator pairs. Swapping the two roles only swaps precision and recall, so F1 is symmetric and the pair order doesn't matter.
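The pairwise procedure can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the annotator names and spans are made up, spans are assumed to be `(start, end)` character offsets, and the convention that two empty annotations count as full agreement is a choice made here, not something the literature fixes.

```python
from itertools import combinations

def span_f1(reference, predicted):
    """Exact-match F1 between two collections of (start, end) spans."""
    reference, predicted = set(reference), set(predicted)
    if not reference and not predicted:
        return 1.0  # assumption: nothing marked by either side counts as agreement
    hits = len(reference & predicted)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)

def mean_pairwise_f1(spans_by_annotator):
    """Average span_f1 over all unordered annotator pairs."""
    pairs = list(combinations(sorted(spans_by_annotator), 2))
    scores = [span_f1(spans_by_annotator[a], spans_by_annotator[b]) for a, b in pairs]
    return sum(scores) / len(scores)

# Hypothetical annotations of one document by three annotators.
annotations = {
    "ann_a": {(0, 5), (10, 18), (30, 42)},
    "ann_b": {(0, 5), (10, 17), (30, 42)},  # one boundary differs from ann_a
    "ann_c": {(0, 5), (30, 42)},            # one span missing
}

print(round(mean_pairwise_f1(annotations), 3))
```

With these toy spans the ann_a/ann_b pair scores 2/3 and each pair involving ann_c scores 0.8, so the mean is about 0.756. Calling `span_f1` with the arguments swapped gives the same value, which is why the pairs can be unordered.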

Exact match or partial match: decide before you measure

The number you report depends entirely on what counts as a hit, and there is no universal answer, so state your choice.

  • Exact match: two spans agree only if both boundaries are identical. Strict, and the right call when boundaries carry meaning (legal citations, chemical names).
  • Partial / overlap match: two spans agree if they overlap at all, or if their overlap exceeds a chosen threshold. More forgiving, and reasonable when the presence of an entity matters more than its exact extent.
  • Boundary vs. label: for typed spans (NER), separate two questions: did annotators mark the same extent, and did they give it the same type? Reporting them together hides which one is actually causing your disagreement.
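To see how much the match rule moves the number, here is a hedged sketch that scores the same pair of annotators under exact match, any-overlap match, and a thresholded match. The overlap score (intersection over union of the character ranges) and the greedy one-to-one pairing are illustrative choices, not a standard; `min_overlap` is a hypothetical parameter name, and spans are assumed to be half-open `(start, end)` offsets.

```python
def overlap_ratio(a, b):
    """Intersection over union of two half-open (start, end) spans."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union else 0.0

def count_matches(reference, predicted, min_overlap=None):
    """Count one-to-one matches between two span collections.

    min_overlap=None -> exact match only
    min_overlap=0.0  -> any positive overlap counts
    min_overlap=t    -> overlap_ratio must be at least t
    """
    if min_overlap is None:
        return len(set(reference) & set(predicted))
    # Greedy pairing: take the best-overlapping pairs first, each span used once.
    candidates = sorted(
        ((overlap_ratio(r, p), r, p) for r in set(reference) for p in set(predicted)),
        reverse=True,
    )
    used_ref, used_pred, hits = set(), set(), 0
    for score, r, p in candidates:
        if score <= 0 or score < min_overlap:
            break  # remaining candidates overlap even less
        if r not in used_ref and p not in used_pred:
            used_ref.add(r)
            used_pred.add(p)
            hits += 1
    return hits

def f1(reference, predicted, min_overlap=None):
    """F1 = 2 * hits / (|reference| + |predicted|)."""
    total = len(set(reference)) + len(set(predicted))
    if total == 0:
        return 1.0  # assumption: two empty annotations agree
    return 2 * count_matches(reference, predicted, min_overlap) / total

ann_a = [(0, 5), (10, 18), (30, 42)]
ann_b = [(0, 5), (10, 17), (31, 40)]

print("exact      :", round(f1(ann_a, ann_b), 3))
print("any overlap:", round(f1(ann_a, ann_b, min_overlap=0.0), 3))
print("IoU >= 0.8 :", round(f1(ann_a, ann_b, min_overlap=0.8), 3))
```

On this toy pair the same annotations score 0.333 under exact match, 1.0 under any-overlap match, and 0.667 with a 0.8 threshold, which is why the match rule has to be reported alongside the figure.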

Artstein and Poesio (2008) is the standard survey of agreement for computational linguistics. It works through this "unitizing" problem (disagreement about how to segment the text into units) in detail, and it is the reference to cite when you need to defend a methodology choice.

When you do want a chance-corrected number

If you can reduce the task to a fixed set of units, chance correction becomes valid again. Two common reductions:

  • Token-level labeling: recast the span task as a label per token (the BIO scheme). Now every token is a fixed item with a small label set, so Fleiss' kappa or Krippendorff's alpha applies directly. The catch is that token-level agreement looks inflated: most tokens belong to the easy "outside" class, so a high number can hide real boundary disagreement.
  • Unitized alpha: Krippendorff (2004) developed a variant of alpha for exactly the case where annotators segment a continuum themselves. It is the principled option when you want a single chance-corrected reliability figure for segmentation, at the cost of more setup.
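The token-level reduction can be illustrated as follows. This sketch uses Cohen's kappa for two annotators rather than Fleiss' kappa or Krippendorff's alpha, simply because it is the shortest to write out; the 20-token document, the untyped B/I/O tags, and the token-index spans are all made up for the example.

```python
from collections import Counter

def spans_to_bio(num_tokens, spans):
    """Turn non-overlapping half-open (start_token, end_token) spans into B/I/O tags."""
    tags = ["O"] * num_tokens
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical 20-token document; every span boundary differs between annotators.
spans_a = [(2, 4), (10, 13)]
spans_b = [(2, 5), (11, 13)]

tags_a = spans_to_bio(20, spans_a)
tags_b = spans_to_bio(20, spans_b)

raw = sum(x == y for x, y in zip(tags_a, tags_b)) / len(tags_a)
print("raw token agreement:", raw)
print("token-level kappa  :", round(cohen_kappa(tags_a, tags_b), 3))
print("exact span matches :", len(set(spans_a) & set(spans_b)))
```

Here the two annotators agree on 17 of 20 tokens (kappa about 0.63) even though not a single span matches exactly, which is the inflation described above.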

A practical middle path: report token-level kappa and span-level F1 together. The first tells you about label consistency, the second about boundary consistency, and the gap between them tells you which problem to fix.

Doing it in Potato

Potato computes Krippendorff's alpha automatically for categorical schemes, but for a span scheme, the document-level number hides boundary disagreement, so measure at the level you actually care about. The reliable recipe is to have annotators overlap on a shared subset, export their spans, and compute pairwise F1 yourself under your chosen match rule.

```yaml
annotation_schemes:
  - name: pii_spans
    annotation_type: span
    description: "Highlight every span that reveals personal information."
    labels:
      - name: person
      - name: location
      - name: org

# Overlap a subset so agreement is measurable
automatic_assignment:
  on: true
  instance_per_annotator: 100
  labels_per_instance: 3
```

The export keeps each annotator's spans with their character offsets and labels, which is everything you need to compute exact-match or overlap F1 offline and to split boundary agreement from type agreement. If your spans are typed, compute F1 twice: once ignoring type (boundary agreement) and once requiring the type to match (full agreement).
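The two-pass computation might look like the sketch below. The record layout is hypothetical: it assumes the export has already been flattened into one dictionary per span with `annotator`, `instance`, `start`, `end`, and `label` keys, which is not necessarily how Potato writes its output files, so adapt the loading step to your actual export. Exact match is used for brevity.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical flattened export: one record per annotated span.
records = [
    {"annotator": "a", "instance": "doc1", "start": 0, "end": 5, "label": "person"},
    {"annotator": "a", "instance": "doc1", "start": 10, "end": 18, "label": "org"},
    {"annotator": "a", "instance": "doc2", "start": 3, "end": 9, "label": "location"},
    {"annotator": "b", "instance": "doc1", "start": 0, "end": 5, "label": "person"},
    {"annotator": "b", "instance": "doc1", "start": 10, "end": 18, "label": "location"},
    {"annotator": "b", "instance": "doc2", "start": 3, "end": 8, "label": "location"},
]

def group_spans(records, use_label):
    """Collect each annotator's spans as hashable keys, with or without the type."""
    by_annotator = defaultdict(set)
    for r in records:
        key = (r["instance"], r["start"], r["end"])
        if use_label:
            key += (r["label"],)
        by_annotator[r["annotator"]].add(key)
    return by_annotator

def mean_pairwise_f1(by_annotator):
    """Exact-match F1 averaged over annotator pairs.

    Note: an annotator who marked no spans at all has no records here,
    so they would need to be added explicitly with an empty set.
    """
    scores = []
    for a, b in combinations(sorted(by_annotator), 2):
        spans_a, spans_b = by_annotator[a], by_annotator[b]
        total = len(spans_a) + len(spans_b)
        scores.append(2 * len(spans_a & spans_b) / total if total else 1.0)
    return sum(scores) / len(scores)

boundary_f1 = mean_pairwise_f1(group_spans(records, use_label=False))
full_f1 = mean_pairwise_f1(group_spans(records, use_label=True))
print("boundary agreement:", round(boundary_f1, 3))
print("full agreement    :", round(full_f1, 3))
```

In this toy data boundary agreement is 0.667 and full agreement is 0.333. The drop between the two is the share of disagreement caused by type choice rather than by span extent.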

Further reading