Aggregating Crowd Labels: Beyond Majority Vote

How to combine many noisy annotations into one label using annotator models like Dawid-Skene and MACE, when to trust them, and how Potato estimates competence and infers labels.

When several people label the same item, majority vote is the obvious way to combine their answers and usually the wrong one. Models that estimate each annotator's reliability recover better labels, flag spammers, and tell you how confident to be. But every one of them assumes a single correct answer, so on subjective tasks you have to decide first whether the disagreement is error to remove or signal to keep. This guide covers the main aggregation models, the assumption they share, and how to run one in Potato.

The problem majority vote pretends doesn't exist

Collect three labels for an item and take the majority. It works when annotators are roughly equal and mostly right. It breaks the moment they aren't. Majority vote counts a careful expert and a bot clicking randomly as one vote each, throws away the vote split (a 2-1 win and a 3-0 sweep come out the same), and gives you no way to tell a genuinely hard item from a lazy annotator. This is the truth inference problem: recover the latent true label and each annotator's reliability at the same time, from nothing but the label matrix.
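To make the lost vote split concrete, here is a minimal sketch of majority vote in Python. The function name `majority_vote` and the tie-breaking rule (first label seen wins) are assumptions made for illustration, not part of any library.

```python
from collections import Counter

def majority_vote(labels):
    """Return (winning label, winner's share of the votes) for one item.

    Ties go to whichever label appears first in `labels`,
    an arbitrary choice made for this sketch.
    """
    counts = Counter(labels)
    winner, top = counts.most_common(1)[0]
    return winner, top / len(labels)

# A 2-1 split and a 3-0 sweep produce the same label...
split = majority_vote(["pos", "pos", "neg"])
sweep = majority_vote(["pos", "pos", "pos"])

# ...and only the share, which plain majority vote discards, tells them apart.
print(split[0] == sweep[0], split[1], sweep[1])
```

Even with the share kept, nothing here says whether the dissenting vote came from a hard item or a careless annotator. That is the gap the models below try to fill.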

Confusion-matrix models: Dawid and Skene

The foundational method is nearly 50 years old. Dawid and Skene (1979) modeled each annotator with a confusion matrix: for each true class, the probabilities that the annotator assigns each possible label (how often a truly positive item gets labeled positive, negative, and so on). They used expectation-maximization to estimate those matrices and the true labels jointly. An annotator who confuses two categories gets a confusion matrix that says so, and their vote on that distinction is downweighted accordingly. Almost every modern aggregation model is a descendant of this idea.
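The EM loop can be sketched in a few dozen lines of plain Python. This is a simplified illustration in the spirit of Dawid and Skene, not a reference implementation. The function name `dawid_skene`, the `(item, annotator, label)` input format, the soft-majority-vote initialization, and the `smoothing` constant are all assumptions chosen for the example.

```python
import math

def dawid_skene(triples, n_classes, n_iter=50, smoothing=0.01):
    """EM for a Dawid-Skene-style model.

    triples: list of (item, annotator, label), label in range(n_classes).
    Returns (posteriors, confusion):
      posteriors[item][k]       ~ P(true label of item = k)
      confusion[annotator][k][l] ~ P(annotator says l | true label = k)
    """
    items = sorted({i for i, _, _ in triples})
    annotators = sorted({a for _, a, _ in triples})

    # Initialise posteriors with each item's vote shares (soft majority vote).
    post = {i: [0.0] * n_classes for i in items}
    for i, _, l in triples:
        post[i][l] += 1.0
    for i in items:
        total = sum(post[i])
        post[i] = [v / total for v in post[i]]

    for _ in range(n_iter):
        # M-step: class prior and one confusion matrix per annotator.
        prior = [sum(post[i][k] for i in items) / len(items)
                 for k in range(n_classes)]
        conf = {a: [[smoothing] * n_classes for _ in range(n_classes)]
                for a in annotators}
        for i, a, l in triples:
            for k in range(n_classes):
                conf[a][k][l] += post[i][k]
        for a in annotators:
            for k in range(n_classes):
                row = sum(conf[a][k])
                conf[a][k] = [v / row for v in conf[a][k]]

        # E-step: posterior over each item's true label, computed in log space.
        logp = {i: [math.log(max(prior[k], 1e-12)) for k in range(n_classes)]
                for i in items}
        for i, a, l in triples:
            for k in range(n_classes):
                logp[i][k] += math.log(conf[a][k][l])
        for i in items:
            m = max(logp[i])
            w = [math.exp(v - m) for v in logp[i]]
            total = sum(w)
            post[i] = [v / total for v in w]

    return post, conf

# Toy data: A and B always give the true label, C always gives the opposite one.
truth = [0, 1, 0, 1, 0, 1]
triples = []
for item, t in enumerate(truth):
    triples += [(item, "A", t), (item, "B", t), (item, "C", 1 - t)]

posteriors, confusion = dawid_skene(triples, n_classes=2)
inferred = [max(range(2), key=lambda k: posteriors[i][k])
            for i in range(len(truth))]
```

On this toy data majority vote would also recover the labels. The point of the sketch is the second output: `confusion["C"]` puts nearly all its mass off the diagonal, so the model has learned that C swaps the two classes, which majority vote cannot express.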

MACE: competence and spam detection

Hovy et al. (2013) introduced MACE (Multi-Annotator Competence Estimation), which adds an explicit spamming model: each annotator is treated as either knowing the answer or guessing, and MACE estimates the probability they were guessing on each item. That gives you a single competence score per annotator between 0 and 1, plus a per-item entropy that flags genuinely ambiguous items. It is fast, it is good at catching random clickers, and it is the model Potato ships.
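The knowing-or-guessing story can be written as a small generative model. The notation below roughly follows Hovy et al. (2013), with $\theta_j$ as annotator $j$'s competence and $\xi_j$ as the label distribution they fall back on when guessing. Read it as a sketch of the model's structure rather than a full specification (priors and the training procedure are omitted).

```latex
\begin{align*}
T_i    &\sim \mathrm{Uniform}\{1,\dots,K\}
       && \text{true label of item } i \\
S_{ij} &\sim \mathrm{Bernoulli}(1-\theta_j)
       && \text{is annotator } j \text{ guessing on item } i\text{?} \\
A_{ij} &= T_i \quad \text{if } S_{ij}=0,
       \qquad A_{ij} \sim \mathrm{Categorical}(\xi_j) \quad \text{if } S_{ij}=1 \\
P(A_{ij}=a \mid T_i=t) &= \theta_j\,\mathbb{1}[a=t] + (1-\theta_j)\,\xi_{j,a}
\end{align*}
```

The last line marginalizes out $S_{ij}$: with probability $\theta_j$ the annotator reports the true label, and otherwise they draw from their own guessing distribution, which can still land on the right answer by chance.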

Bayesian models and the survey evidence

The space has grown well past these two. Paun et al. (2018) compared a family of Bayesian annotation models on real datasets and found they consistently beat majority vote, especially when annotators vary a lot in quality, while also giving calibrated uncertainty you can propagate downstream. On the engineering side, Zheng et al. (2017) benchmarked 17 truth-inference methods and asked whether the problem is solved. The short answer: no single method wins everywhere, but nearly all of them beat majority vote, and the gap grows as label quality drops.

Every model above assumes there is one true label and disagreement is error. For objective tasks that is fine. For subjective ones it is exactly wrong: on offensiveness, emotion, or moral judgment, two annotators can disagree because they genuinely read the text differently, and Plank (2022) argues that this human label variation is often signal, not noise. Aggregate it away and you have thrown out the thing that made the data interesting. (We go deeper on this in Disagreement is signal, not noise.)

This is where knowing who annotated starts to matter. NUTMEG (Ivey, Gauch, and Jurgens, 2025) is a Bayesian model built for exactly this tension: it uses annotator background information to separate legitimate, systematic disagreement from noise, removing careless labels from training data while preserving the disagreement that tracks who the annotator is. That only works if you collected the background in the first place. If you run a prestudy demographics survey (see collecting annotator demographics responsibly and Potato's survey instruments), you have the annotator metadata that a NUTMEG-style model needs; without it, you are stuck treating every disagreement as either all error or all signal.

Doing it in Potato

Potato runs MACE over your multi-annotator data and reports competence and inferred labels in the admin dashboard. It works on categorical schemes (radio, likert, select, multiselect) and needs real overlap, several annotators per item, to have anything to estimate.

```yaml
mace:
  enabled: true
  trigger_every_n: 10            # re-estimate after every 10 new annotations
  min_annotations_per_item: 3    # ignore items with fewer than 3 labels
  min_items: 5                   # wait for at least 5 eligible items
```
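Before relying on these thresholds, it can help to check whether an export of your annotations would clear them. The sketch below mirrors the two `min_*` comments in the config. It is not Potato's own code: the function name `mace_ready` and the `(item, annotator, label)` input format are hypothetical, and it assumes each annotator labels an item at most once.

```python
from collections import Counter

def mace_ready(triples, min_annotations_per_item=3, min_items=5):
    """Return (ready, eligible_items) for a list of (item, annotator, label).

    An item is eligible when it has at least `min_annotations_per_item`
    labels. Estimation is considered worthwhile once at least `min_items`
    items are eligible.
    """
    per_item = Counter(item for item, _annotator, _label in triples)
    eligible = sorted(i for i, n in per_item.items()
                      if n >= min_annotations_per_item)
    return len(eligible) >= min_items, eligible

# Five items with three labels each clear the default thresholds.
full = [(i, a, "yes") for i in range(5) for a in ("A", "B", "C")]
ready_full, eligible_full = mace_ready(full)

# Drop one label from item 4 and only four items remain eligible.
thin = full[:-1]
ready_thin, eligible_thin = mace_ready(thin)
```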

After it runs, each annotator gets a competence score (near 1.0 is reliable, below 0.5 is likely a spammer) and each item gets a predicted label plus an entropy value. Low entropy means the model is confident; entropy near its maximum means no consensus, which usually flags a genuinely hard or under-specified item rather than a bad annotator. Full options are in the MACE feature reference.
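For intuition about the entropy value, here is a minimal sketch of Shannon entropy over an item's predicted label distribution. It divides by log K so the result lies between 0 and 1. That normalization is an assumption of this example, so check the feature reference for the scale Potato actually reports.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a label distribution, scaled to [0, 1].

    0 means all probability sits on one label. 1 means the distribution
    is uniform over the K labels (no consensus at all).
    """
    k = len(probs)
    if k < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(k)

confident = normalized_entropy([1.0, 0.0, 0.0])    # model is certain
no_consensus = normalized_entropy([1/3, 1/3, 1/3]) # uniform over 3 labels
leaning = normalized_entropy([0.8, 0.1, 0.1])      # somewhere in between
```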

Two practical notes. First, you can only aggregate overlap you actually collected: MACE needs multiple labels per item, so plan the overlap before the study, not after. Second, MACE gives you a single label; if your task is subjective, consider keeping the distribution instead with a soft_label scheme, and reach for adjudication only where you truly need one answer.
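Keeping the distribution can be as simple as storing each item's vote shares instead of its winner. The sketch below shows that idea in isolation. The function name `soft_label` is borrowed for readability only and says nothing about how Potato's soft_label scheme collects or stores its data.

```python
from collections import Counter

def soft_label(labels, classes):
    """Return each class's share of the votes for one item."""
    counts = Counter(labels)
    n = len(labels)
    return {c: counts[c] / n for c in classes}

classes = ["offensive", "not_offensive"]
dist = soft_label(["offensive", "offensive", "not_offensive"], classes)
# dist keeps the 2-1 split that a single aggregated label would erase.
```

A distribution like this can be used directly as a training target, so the downstream model sees that a third of readers disagreed.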

When to aggregate and when to keep the spread

A rough decision rule:

Objective task, real answer key exists → aggregate to one label. Use MACE or majority vote and move on.
Objective-ish, but some annotators are unreliable → aggregate with a competence model (MACE), not plain majority vote, so bad raters don't sway the result.
Subjective task, disagreement is meaningful → keep the full distribution (soft_label), and if you have annotator metadata, model the disagreement rather than deleting it.