The standard annotation pipeline is a machine for producing agreement. You write guidelines, train annotators, measure inter-annotator agreement, adjudicate the cases where people diverge, and ship a single gold label per item. Every step is designed to squeeze out disagreement, on the assumption that disagreement is error and error should be minimized. For a lot of tasks that assumption is fine. For a lot of others it quietly throws away the most interesting thing in the data.

When annotators disagree, the disagreement is sometimes error to resolve and sometimes genuine variation to keep. On objective tasks with a real correct answer, collapse to a gold label. On subjective or perspectival tasks, a single gold label erases a real distribution of human judgment, and you are better off keeping every annotator's label, storing a distribution instead of a winner, and measuring agreement without assuming that less than perfect means broken. This post is about telling the two cases apart and holding onto the disagreement when it matters.

The single-gold-label assumption

Most machine learning still assumes one correct interpretation exists for each item, which is why annotation defaults to aggregation: take three labels, take the majority, call it truth. Plank (2022) named this the "problem" of human label variation, in scare quotes, because the framing is the problem. Genuine variation in how people label is not always noise around a hidden true value. Sometimes there is no single true value, and the spread of answers is the honest description of the item.

The survey literature backs this up across a wide range of tasks. Uma and colleagues (2021) reviewed learning from disagreement across NLP and computer vision and found human disagreement everywhere, from part-of-speech tagging to natural language inference, along with a growing set of methods that learn from the disagreement instead of averaging it out. The perspectivist turn (Cabitza, Campagner, and Basile, 2021) pushes the point further: aggregating by majority vote can be actively misleading, and a better practice keeps the perspectives of the people who did the labeling.

Where disagreement comes from

Not all disagreement means the same thing, and the useful move is to ask where a given disagreement is coming from. Three sources cover most of it.

The guidelines. Two annotators read the same rule differently, or the rule does not cover the case in front of them. This disagreement is a defect, and the fix is to clarify the guideline, not to keep the spread. A pilot round exists to catch exactly this.
The annotator. Someone rushed, misread, or is a low-quality worker clicking through. This is error, and it should be caught and removed. It is not the same as genuine variation, and conflating the two is how "keep the disagreement" turns into "keep the noise."
The item. The text is genuinely ambiguous, or the judgment genuinely depends on who is reading. Is this joke offensive? Is this review positive or mixed? Here different answers are not mistakes. This is the disagreement worth keeping.

The skill is separating the third source from the first two. Guideline problems get fixed, annotator errors get filtered, and what remains, genuine item-level variation, is the signal.

Figure: Trace each disagreement to its source: fix the guideline, filter the error, keep the genuine variation.

Objective task or subjective task

The cleanest rule of thumb is whether a knowledgeable, careful person could be certain of the answer. If yes, the task is objective, a gold label is meaningful, and disagreement is something to resolve. Whether a date is April 3rd or March 4th has an answer. Whether a sentence contains a named entity has an answer, most of the time.

If a knowledgeable, careful person could still land somewhere different for legitimate reasons, the task is subjective, and forcing a gold label invents a certainty the data does not have. Offensiveness, toxicity, humor, politeness, stance, image aesthetics: these depend on who is judging, and the variation between judges is often the property you actually care about. That is also where annotator demographics show up in the labels, which is the whole reason to collect and report them.

Most real projects are not purely one or the other. A practical approach measures agreement first, then reads it: high agreement means the task is behaving objectively and you can aggregate; stubbornly moderate agreement on a subjective task is not a failure to fix but a distribution to preserve.

What keeping disagreement looks like

Preserving disagreement is mostly a decision about what you store. Instead of one label per item, you keep the disaggregated labels: every annotator's judgment, tied to the annotator. From there you can build a soft label, a distribution over categories rather than a single winner, and train or evaluate against the distribution.

Figure: Aggregate to one gold label and lose the spread, or keep the disaggregated distribution.
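A minimal sketch of the two storage choices, in plain Python. The record layout, the item and annotator ids, and the function names are all made up for illustration; they are not a Potato export format.

```python
from collections import Counter

# Hypothetical disaggregated records: (item_id, annotator_id, label).
annotations = [
    ("joke_1", "ann_a", "offensive"),
    ("joke_1", "ann_b", "not_offensive"),
    ("joke_1", "ann_c", "offensive"),
    ("joke_1", "ann_d", "not_offensive"),
    ("joke_2", "ann_a", "not_offensive"),
    ("joke_2", "ann_b", "not_offensive"),
    ("joke_2", "ann_c", "not_offensive"),
]
LABELS = ["offensive", "not_offensive"]


def soft_labels(records, labels):
    """Return {item_id: {label: share of annotators who chose it}}."""
    counts = {}
    for item_id, _annotator, label in records:
        counts.setdefault(item_id, Counter())[label] += 1
    return {
        item_id: {lab: c[lab] / sum(c.values()) for lab in labels}
        for item_id, c in counts.items()
    }


def majority_labels(records, labels):
    """Collapse to one hard label per item (ties go to the first label listed)."""
    soft = soft_labels(records, labels)
    return {
        item_id: max(labels, key=lambda lab: dist[lab])
        for item_id, dist in soft.items()
    }


soft = soft_labels(annotations, LABELS)
hard = majority_labels(annotations, LABELS)
```

Here `joke_1` splits evenly, so its soft label is 0.5 and 0.5, while the majority vote has to break the tie arbitrarily and reports "offensive" as if the item were settled. The soft label can always be collapsed later; the hard label cannot be expanded back.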

This changes evaluation too. A model that predicts a distribution can be scored against the human distribution instead of a single label, so it is rewarded for being uncertain on the items where people are uncertain. On subjective tasks that is a more honest target than accuracy against a majority vote that half the annotators disagreed with.
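One way to score against a distribution is cross-entropy between the human soft label and the model's predicted distribution. The sketch below assumes that choice; the post does not prescribe a metric, and the item, the two models, and their probabilities are invented for illustration.

```python
import math


def cross_entropy(human, predicted, eps=1e-12):
    """Cross-entropy of the predicted distribution against the human one.

    Lower is better. The minimum, reached when the two distributions match,
    is the entropy of the human distribution itself.
    """
    return -sum(
        share * math.log(max(predicted[label], eps))
        for label, share in human.items()
    )


# Hypothetical item on which annotators split evenly.
human = {"offensive": 0.5, "not_offensive": 0.5}

# Two hypothetical models that both pick "offensive" as their top label.
confident_model = {"offensive": 0.99, "not_offensive": 0.01}
uncertain_model = {"offensive": 0.55, "not_offensive": 0.45}

ce_confident = cross_entropy(human, confident_model)
ce_uncertain = cross_entropy(human, uncertain_model)
```

Both models have the same top label, so accuracy against a majority vote cannot tell them apart. Cross-entropy against the soft label can: the model that stays uncertain on a contested item gets the lower (better) score.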

None of this means abandoning inter-annotator agreement. You still measure agreement; you just stop treating any number below 1.0 as a defect to eliminate. Agreement tells you how objectively the task is behaving. Whether to aggregate is a separate decision you make with that number in hand.
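As one concrete way to get that number, here is a sketch of Fleiss' kappa, a common chance-corrected agreement statistic for several annotators. The choice of statistic is an assumption for illustration, not something this post or Potato mandates, and the counts are made up. It requires the same number of ratings on every item.

```python
def fleiss_kappa(count_rows):
    """Fleiss' kappa from rows of per-category rating counts.

    count_rows[i][j] is how many annotators put item i in category j.
    Every item must have the same number of ratings. The result is
    undefined (division by zero) if all ratings fall in one category.
    """
    n_items = len(count_rows)
    n_raters = sum(count_rows[0])
    if any(sum(row) != n_raters for row in count_rows):
        raise ValueError("every item must have the same number of ratings")

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in count_rows
    ) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(count_rows[0])
    proportions = [
        sum(row[j] for row in count_rows) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    return (p_bar - p_expected) / (1 - p_expected)


# Hypothetical counts: 4 items, 4 annotators each, 2 categories.
counts = [[4, 0], [0, 4], [2, 2], [3, 1]]
kappa = fleiss_kappa(counts)
```

For these counts kappa comes out at 11/27, roughly 0.41. On an objective task that would point to a guideline or annotator problem; on a subjective one it may simply describe the task.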

Doing it in Potato

Potato does not force consensus. When multiple annotators label the same item, their labels are stored per annotator, so the disaggregated data, the raw material for any distribution-based approach, is what you get by default. You choose whether to aggregate downstream, rather than losing the spread at collection time.

For tasks where the disagreement is really about degree, the soft_label type lets a single annotator express a distribution directly, distributing points across categories instead of picking one:

```yaml
annotation_schemes:
  - annotation_type: soft_label
    name: emotion_mix
    description: Distribute 100 points to reflect how much each emotion applies.
    labels: ["Joy", "Sadness", "Anger", "Fear", "Surprise"]
    total: 100
    show_distribution_chart: true
```

For separating genuine ambiguity from annotator error, the two sources you most need to tell apart, MACE helps. It jointly estimates a competence score per annotator and an entropy per item, so a low-competence annotator (the error source) and a high-entropy item (the genuine-variation source) show up as different things rather than one undifferentiated pile of disagreement:

```yaml
mace:
  enabled: true
  min_annotations_per_item: 3
```

An annotator sitting near 0.4 competence is probably clicking through and can be filtered. An item with high entropy across otherwise reliable annotators is genuinely contested, and that is the disagreement you keep. When a task really does need a single answer, adjudication is there for the objective cases, with MACE's predicted label as one more signal for the adjudicator. The point is that resolving disagreement becomes a choice you make per task, not the default the pipeline makes for you.
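The filter-then-keep logic can be sketched without MACE itself. The code below is a rough stand-in, not MACE: it takes per-annotator competence scores as given (for example, copied from a MACE run), drops annotators under a cutoff, and uses the plain entropy of the remaining votes as a proxy for how contested each item is. MACE's own item entropy comes from its model and will differ. The data layout, the scores, and the 0.5 cutoff are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical competence scores per annotator.
competence = {"ann_a": 0.92, "ann_b": 0.88, "ann_c": 0.90, "ann_d": 0.40}

# Hypothetical disaggregated labels: item_id -> {annotator_id: label}.
labels = {
    "review_1": {"ann_a": "positive", "ann_b": "mixed", "ann_c": "mixed", "ann_d": "negative"},
    "review_2": {"ann_a": "positive", "ann_b": "positive", "ann_c": "positive", "ann_d": "negative"},
}

MIN_COMPETENCE = 0.5  # illustrative cutoff, not a recommended value


def normalized_entropy(votes, n_categories):
    """Shannon entropy of the vote shares, scaled to [0, 1] by log(n_categories)."""
    counts = Counter(votes)
    total = sum(counts.values())
    entropy = sum((c / total) * math.log(total / c) for c in counts.values())
    return entropy / math.log(n_categories)


def contested_scores(labels, competence, n_categories, min_competence=MIN_COMPETENCE):
    """Entropy per item, computed only over annotators at or above the cutoff."""
    scores = {}
    for item_id, by_annotator in labels.items():
        kept = [
            label
            for annotator, label in by_annotator.items()
            if competence[annotator] >= min_competence
        ]
        scores[item_id] = normalized_entropy(kept, n_categories)
    return scores


entropy = contested_scores(labels, competence, n_categories=3)
```

With `ann_d` filtered out, `review_2` has entropy 0 (the reliable annotators agree), while `review_1` stays well above zero because reliable annotators still split between "positive" and "mixed". The first disagreement was annotator error; the second is the kind worth keeping.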

Where to go next

Collecting Annotator Demographics Responsibly, for why the variation between judges is often the signal.
Documenting Your Annotation Dataset, for reporting disaggregated labels and agreement together.
Inter-Annotator Agreement Explained, for measuring agreement without assuming disagreement is failure.
Adjudication and Resolving Disagreement, for the objective cases where a single label is the right call.

Subjective datasets show what preserved disagreement buys you: the fine-grained, contested emotion labels in GoEmotions and the social-norm judgments in Social Chemistry, where reasonable people genuinely disagree.