Aggregating Crowd Labels: Beyond Majority Vote
How to combine many noisy annotations into one label using annotator models like Dawid-Skene and MACE, when to trust them, and how Potato estimates competence and infers labels.
When several people label the same item, majority vote is the obvious way to combine their answers and usually the wrong one. Models that estimate each annotator's reliability recover better labels, flag spammers, and tell you how confident to be. But every one of them assumes a single correct answer, so on subjective tasks you have to decide first whether the disagreement is error to remove or signal to keep. This guide covers the main aggregation models, the assumption they share, and how to run one in Potato.
The problem majority vote pretends doesn't exist
Collect three labels for an item and take the majority. It works when annotators are roughly equal and mostly right. It breaks the moment they aren't. Majority vote counts a careful expert and a bot clicking randomly as one vote each, throws away the vote split (a 2-1 win and a 3-0 sweep come out the same), and gives you no way to tell a genuinely hard item from a lazy annotator. This is the truth inference problem: recover the latent true label and each annotator's reliability at the same time, from nothing but the label matrix.
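To make the lost vote split concrete, here is a minimal Python sketch. The helper names `majority_vote` and `vote_split` are illustrative and not part of Potato.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def vote_split(labels):
    """Return the vote counts in descending order, e.g. [2, 1] for a 2-1 split."""
    return sorted(Counter(labels).values(), reverse=True)

unanimous = ["pos", "pos", "pos"]
contested = ["pos", "neg", "pos"]

# Both items collapse to the same label, although one was contested.
print(majority_vote(unanimous), vote_split(unanimous))
print(majority_vote(contested), vote_split(contested))
```

The aggregated label is identical for both items; only the split, which majority vote discards, tells them apart.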
Confusion-matrix models: Dawid and Skene
The foundational method is nearly 50 years old. Dawid and Skene (1979) modeled each annotator with a confusion matrix: for each true class, the probability that the annotator reports each possible label. They used expectation-maximization to estimate those matrices and the true labels jointly. An annotator who confuses two categories gets a confusion matrix that says so, and their vote on that distinction is downweighted accordingly. Almost every modern aggregation model is a descendant of this idea.
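The EM loop can be sketched in plain Python. This is a simplified illustration of the Dawid-Skene idea, not the paper's exact procedure and not Potato's code. The function name, the `(item, annotator, label)` input format, the vote-share initialisation, and the small `smoothing` constant are all assumptions made for the example.

```python
def dawid_skene(triples, n_classes, n_iter=50, smoothing=0.01):
    """Fit a basic Dawid-Skene-style model with EM.

    triples: iterable of (item_id, annotator_id, label), label in range(n_classes).
    Returns (posteriors, confusion): posteriors[item] is a list of class
    probabilities; confusion[annotator][true][observed] is a probability.
    """
    triples = list(triples)
    items = sorted({i for i, _, _ in triples})
    annotators = sorted({a for _, a, _ in triples})

    # Initialise posteriors from per-item vote shares (a soft majority vote).
    post = {i: [0.0] * n_classes for i in items}
    for i, _, l in triples:
        post[i][l] += 1.0
    for i in items:
        total = sum(post[i])
        post[i] = [c / total for c in post[i]]

    for _ in range(n_iter):
        # M-step: class prior and one confusion matrix per annotator.
        prior = [sum(post[i][k] for i in items) / len(items)
                 for k in range(n_classes)]
        conf = {a: [[smoothing] * n_classes for _ in range(n_classes)]
                for a in annotators}
        for i, a, l in triples:
            for k in range(n_classes):
                conf[a][k][l] += post[i][k]
        for a in annotators:
            for k in range(n_classes):
                row_total = sum(conf[a][k])
                conf[a][k] = [c / row_total for c in conf[a][k]]

        # E-step: posterior over each item's true class given all its labels.
        new_post = {i: list(prior) for i in items}
        for i, a, l in triples:
            for k in range(n_classes):
                new_post[i][k] *= conf[a][k][l]
        for i in items:
            total = sum(new_post[i])
            new_post[i] = [p / total for p in new_post[i]]
        post = new_post

    return post, conf

# Toy data: A and B always report the truth, C always reports the opposite.
truth = [0, 1, 0, 1]
triples = []
for item, t in enumerate(truth):
    triples += [(item, "A", t), (item, "B", t), (item, "C", 1 - t)]

posteriors, confusion = dawid_skene(triples, n_classes=2)
inferred = [max(range(2), key=lambda k: posteriors[i][k])
            for i in range(len(truth))]
```

On this toy data the model recovers the truth and learns that annotator C systematically flips labels, which is visible in the off-diagonal entries of `confusion["C"]`. A production implementation would work in log space to avoid underflow when items have many labels.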
MACE: competence and spam detection
Hovy et al. (2013) introduced MACE (Multi-Annotator Competence Estimation), which adds an explicit spamming model: each annotator is treated as either knowing the answer or guessing, and MACE estimates the probability they were guessing on each item. That gives you a single competence score per annotator between 0 and 1, plus a per-item entropy that flags genuinely ambiguous items. It is fast, it is good at catching random clickers, and it is the model Potato ships.
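The knowing-or-guessing story is easy to simulate. The sketch below is a simplified illustration rather than MACE itself. It assumes a guessing annotator picks uniformly at random, and the names `simulate_annotator` and `agreement_with_truth` are hypothetical.

```python
import random

def simulate_annotator(true_labels, competence, n_classes, rng):
    """Label each item: copy the truth with probability `competence`, else guess.

    Guessing is uniform over classes here; that is an assumption of this sketch.
    """
    labels = []
    for t in true_labels:
        if rng.random() < competence:
            labels.append(t)                         # annotator knows the answer
        else:
            labels.append(rng.randrange(n_classes))  # annotator guesses
    return labels

def agreement_with_truth(labels, true_labels):
    """Fraction of items on which the annotator matches the true label."""
    return sum(l == t for l, t in zip(labels, true_labels)) / len(true_labels)

rng = random.Random(0)
truth = [rng.randrange(3) for _ in range(2000)]
expert = simulate_annotator(truth, competence=1.0, n_classes=3, rng=rng)
spammer = simulate_annotator(truth, competence=0.0, n_classes=3, rng=rng)

print(agreement_with_truth(expert, truth))
print(agreement_with_truth(spammer, truth))
```

Under these assumptions a pure guesser still matches the truth about one time in three on a three-class task, so raw accuracy overstates reliability. A competence score is meant to separate the "knows" share from the lucky guesses.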
Bayesian models and the survey evidence
The space has grown well past these two. Paun et al. (2018) compared a family of Bayesian annotation models on real datasets and found they consistently beat majority vote, especially when annotators vary a lot in quality, while also giving calibrated uncertainty you can propagate downstream. On the engineering side, Zheng et al. (2017) benchmarked 17 truth-inference methods and asked whether the problem is solved. The short answer: no single method wins everywhere, but nearly all of them beat majority vote, and the gap grows as label quality drops.
The assumption all of these share
Every model above assumes there is one true label and disagreement is error. For objective tasks that is fine. For subjective ones it is exactly wrong: on offensiveness, emotion, or moral judgment, two annotators can disagree because they genuinely read the text differently, and Plank (2022) argues that this human label variation is often signal, not noise. Aggregate it away and you have thrown out the thing that made the data interesting. (We go deeper on this in Disagreement is signal, not noise.)
This is where knowing who annotated starts to matter. NUTMEG (Ivey, Gauch, and Jurgens, 2025) is a Bayesian model built for exactly this tension: it uses annotator background information to separate legitimate, systematic disagreement from noise, removing careless labels from training data while preserving the disagreement that tracks who the annotator is. That only works if you collected the background in the first place. If you run a prestudy demographics survey (see collecting annotator demographics responsibly and Potato's survey instruments), you have the annotator metadata that a NUTMEG-style model needs; without it, you are stuck treating every disagreement as either all error or all signal.
Doing it in Potato
Potato runs MACE over your multi-annotator data and reports competence and inferred labels in the admin dashboard. It works on categorical schemes (radio, likert, select, multiselect) and needs real overlap (several annotators per item) to have anything to estimate.
mace:
  enabled: true
  trigger_every_n: 10           # re-estimate after every 10 new annotations
  min_annotations_per_item: 3   # ignore items with fewer than 3 labels
  min_items: 5                  # wait for at least 5 eligible items
After it runs, each annotator gets a competence score (near 1.0 is reliable, below 0.5 is likely a spammer) and each item gets a predicted label plus an entropy value. Low entropy means the model is confident; entropy near its maximum means no consensus, which usually flags a genuinely hard or under-specified item rather than a bad annotator. Full options are in the MACE feature reference.
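To see what "entropy near its maximum" means in numbers, here is a small sketch of Shannon entropy over a per-item label distribution. The source does not say whether Potato reports raw or normalized entropy, so the normalization by log(K) below is an assumption made for readability, and the function names are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def normalized_entropy(probs):
    """Entropy divided by its maximum, log(K), so the result lies in [0, 1]."""
    if len(probs) < 2:
        return 0.0
    return entropy(probs) / math.log(len(probs))

confident = [0.97, 0.02, 0.01]  # the model is nearly sure of one label
split = [1 / 3, 1 / 3, 1 / 3]   # no consensus at all

print(normalized_entropy(confident))
print(normalized_entropy(split))
```

The uniform distribution reaches the maximum, while the confident one stays close to zero. Sorting items by this value is one way to find the hard or under-specified items worth a second look.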
Two practical notes. First, aggregate on overlap you actually collected: MACE needs multiple labels per item, so plan the overlap before the study, not after. Second, MACE gives you a single label; if your task is subjective, consider keeping the distribution instead with a soft_label scheme, and reach for adjudication only where you truly need one answer.
When to aggregate and when to keep the spread
A rough decision rule:
- Objective task, real answer key exists → aggregate to one label. Use MACE or majority vote and move on.
- Objective-ish, but some annotators are unreliable → aggregate with a competence model (MACE), not plain majority vote, so bad raters don't sway the result.
- Subjective task, disagreement is meaningful → keep the full distribution (soft_label), and if you have annotator metadata, model the disagreement rather than deleting it.
Further reading
- MACE Competence Estimation, the feature reference with API endpoints and interpretation.
- Adjudication and Disagreement, for resolving the cases you decide to collapse.
- Inter-Annotator Agreement Explained, for measuring how much your annotators diverge before you aggregate.
- How Many Annotators Do You Need?, for the overlap that makes aggregation possible.