Most annotation projects treat the annotator as interchangeable: a label is a label, whoever produced it. For a lot of tasks that holds. For a lot of others it does not, and the moment you decide to find out who your annotators are, you have taken on a small research-ethics problem. Demographic data is some of the most sensitive information a person can hand you, and collecting it because it might be useful is not a good enough reason.

Annotator background shapes labels on subjective tasks, so demographics are often worth collecting, but only with informed consent, a clear reason for each field, an easy way to decline, and a plan to anonymize and report what you gather. Collect the minimum that answers your question, prefer standardized batteries over ad-hoc questions, and treat the demographics as something you will document, not just store. This post is about doing that well. The Potato config at the end shows the consent-then-demographics flow in practice.

Why who labels shows up in the labels

The clearest evidence that annotator identity matters comes from a dataset built for exactly this question. POPQUORN (Pei and Jurgens, 2023) collected 45,000 annotations from 1,484 annotators sampled to match the US population on sex, age, and race, then asked whether background predicts how people label. It does. Age, race, and education were statistically significant factors in offensiveness and politeness judgments; Black annotators, for one, rated the same comments as more offensive than other groups did. That is not noise to be averaged away. It is a real difference in how people read the same text.

Figure: The same comment shown to three annotator groups draws three different offensiveness ratings on a five-point scale; averaging them into a single gold label of 3.3 hides the pattern that demographics would reveal.

This connects to a broader point about ground truth. Plank (2022) argues that human label variation is often genuine rather than error, and if the variation is genuine, then knowing who produced which label is part of understanding the data. On a subjective task, a single aggregated gold label quietly erases the disagreement that demographic information would let you see. (We go deeper on that in Disagreement is signal, not noise.)
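
The averaging problem is easy to reproduce. The sketch below uses invented ratings for a single comment (the group names and scores are hypothetical and are not taken from POPQUORN) and compares the pooled mean with the per-group means.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ratings for ONE comment on a 1-5 offensiveness scale.
# Group names and values are invented for illustration only.
ratings = [
    ("group_a", 2), ("group_a", 2), ("group_a", 2),
    ("group_b", 3), ("group_b", 3), ("group_b", 3),
    ("group_c", 5), ("group_c", 5), ("group_c", 5),
]

# The usual "gold label": one mean over every annotator.
gold = mean(score for _, score in ratings)

# The same data, kept apart by self-declared group.
by_group = defaultdict(list)
for group, score in ratings:
    by_group[group].append(score)
group_means = {group: mean(scores) for group, scores in by_group.items()}

print(f"pooled mean: {gold:.1f}")
for group, value in sorted(group_means.items()):
    print(f"{group}: {value:.1f}")
```

With these made-up numbers the pooled mean is 3.3, a score that no group actually gave, while the per-group means are 2.0, 3.0, and 5.0. Without a group column there is nothing to split on, which is the practical argument for recording it.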

So the case for collecting demographics is straightforward: if your task is at all subjective, the composition of your annotator pool is a property of your dataset, and you cannot report it or audit it if you never asked.

What to collect, and what to leave alone

The temptation is to ask everything and sort it out later. Resist it. Every demographic field you collect is a field you have to justify, secure, and eventually report, and some of them are legally sensitive: racial or ethnic origin, religious beliefs, political opinions, and health data are among the special categories of personal data under Article 9 of the GDPR, which carry extra obligations. The default should be the smallest set that answers your actual question.

A useful test for each field: would a difference along this dimension plausibly change how someone labels your data, and would you actually analyze it? If you are annotating offensiveness, the POPQUORN result makes age, race, and education defensible. If you are annotating whether a sentence is grammatical, none of those belong on the form. Collecting an attribute you will never look at is not thoroughness; it is risk you took on for nothing.

Two practices keep this honest:

Tie every question to an analysis. Before a field goes on the form, write down the comparison you intend to run with it. No comparison, no field.
Make everything skippable. Sensitive questions need a real "prefer not to answer" option, not a required radio button. A person who feels forced to disclose either drops out or gives you a junk answer, and both are worse than a blank.
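
Both practices can be checked mechanically before a form goes live. The sketch below is one possible pre-flight check, not a Potato feature: the `analysis` key is a hypothetical annotation added to each question for this purpose, and the example fields are invented.

```python
# Hypothetical form spec. "analysis" is NOT a Potato field; it is an assumed
# extra key that records the comparison each question is meant to support.
FIELDS = [
    {
        "name": "age_range",
        "labels": ["18-24", "25-34", "35-44", "45-54", "55+", "Prefer not to answer"],
        "analysis": "compare mean offensiveness rating across age bands",
    },
    {
        "name": "household_income",
        "labels": ["Under $50k", "$50k or more"],
        "analysis": "",  # nobody wrote down a comparison
    },
]

OPT_OUT = "prefer not to answer"


def review_field(field):
    """Return the reasons this field should not go on the form as written."""
    problems = []
    if not field.get("analysis", "").strip():
        problems.append("no planned analysis")
    labels = [label.lower() for label in field.get("labels", [])]
    if OPT_OUT not in labels:
        problems.append("no opt-out option")
    return problems


report = {field["name"]: review_field(field) for field in FIELDS}
for name, problems in report.items():
    print(name, "OK" if not problems else "; ".join(problems))
```

Here `age_range` passes and `household_income` fails on both counts, so it would come off the form until someone can say what it is for.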

Getting consent right

Demographic collection is where annotation stops being a data task and becomes human-subjects work. The baseline is informed consent: before anyone answers a demographic question, they should know what you are collecting, why, who sees it, and that they can stop at any point without penalty. This is not a formality you bury in a terms-of-service wall. It is a page the annotator reads and agrees to before the demographic questions load.

A few things that make consent real rather than nominal:

Voluntary participation, enforced by the interface. The right to decline only counts if declining is easy. That means "prefer not to answer" on every sensitive item, and a way to leave the study without losing pay already earned.
Self-declared, not inferred. Demographics should come from the annotator, never guessed from their name, location, or writing. Inferred attributes are often wrong, and inferring them is a worse privacy violation than asking.
Anonymized storage. Separate demographic responses from anything that identifies the person. You want to be able to say "raters who identified as X rated this higher" without being able to point at which individual that was.
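
One way to get that separation is to split each intake record into two tables that share only a random study ID. This is a minimal sketch under assumed names (`worker_id`, `split_records`, and the example records are all hypothetical); it is not how Potato stores data internally.

```python
import secrets


def split_records(raw_records):
    """Split intake records into an identity table and a demographics table.

    The two tables share only a random study ID, so the demographics table
    can be analysed without the platform worker ID.
    """
    identities = {}    # study_id -> worker_id; store separately, restrict access
    demographics = {}  # study_id -> self-declared answers; used for analysis
    for record in raw_records:
        study_id = secrets.token_hex(8)  # random, not derived from the worker ID
        identities[study_id] = record["worker_id"]
        demographics[study_id] = {
            key: value for key, value in record.items() if key != "worker_id"
        }
    return identities, demographics


# Hypothetical intake records.
raw = [
    {"worker_id": "W-001", "age_range": "25-34", "education": "Bachelor's"},
    {"worker_id": "W-002", "age_range": "Prefer not to answer", "education": "High school"},
]
identities, demographics = split_records(raw)
```

The ID is random rather than a hash of the worker ID, because a hash can be reversed by anyone able to enumerate worker IDs. While the identity table exists the data is pseudonymized rather than anonymous; deleting that table once payments are settled is what actually severs the link.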

If you are working through a university, this is usually an IRB conversation, and the IRB will care about exactly these points. If you are not, the points still hold.

Figure: An annotation intake flow. An informed-consent page gates a pre-study demographics survey, every sensitive question offers "prefer not to answer," and responses are anonymized before they reach the main annotation task.

Standardized batteries beat questions you invent

When you do collect a demographic, how you word it matters more than it seems. Ad-hoc questions produce categories that do not line up with anyone else's, cannot be compared across studies, and often frame options badly, most visibly on gender and race. The fix is to borrow from instruments that social scientists have already spent decades refining: the demographic batteries from the American National Election Studies (ANES) or the General Social Survey (GSS) give you question wordings and response options that are tested, defensible, and comparable to a large body of existing work.

Using a standard battery also does some of the ethics work for you. These instruments already include "prefer not to answer" options and have been reviewed for how they handle sensitive categories, so you are not reinventing a set of choices that a review board would flag.

Collect, then report

Collecting demographics and never mentioning them again defeats the purpose. The reason to gather this data is so you, and everyone who later uses the dataset, can see who produced the labels. That reporting has a standard form: a data statement (Bender and Friedman, 2018) includes an annotator-demographics section precisely so downstream users can judge how the data might generalize, and datasheets for datasets (Gebru et al., 2021) ask the same of any ML dataset. Plan the release when you plan the collection: aggregate distributions, never individual records, and enough detail that a reader can tell whether your pool resembles the population your model will serve. We cover that end in Documenting your annotation dataset.
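
"Aggregate distributions, never individual records" can be sketched as a count per category with very small cells masked. The masking step and its threshold of 5 are assumptions added here, not something the data-statement or datasheet formats prescribe, and the responses are invented.

```python
from collections import Counter

MIN_CELL = 5  # assumed threshold; choose one that suits your pool and review board


def aggregate(responses, min_cell=MIN_CELL):
    """Count responses per category, masking categories with fewer than min_cell people."""
    counts = Counter(responses)
    table = {}
    for category, n in counts.items():
        table[category] = n if n >= min_cell else f"<{min_cell}"
    return table


# Hypothetical prestudy answers to one question.
age_responses = (
    ["18-24"] * 12 + ["25-34"] * 20 + ["55+"] * 2 + ["Prefer not to answer"] * 6
)
age_table = aggregate(age_responses)
for category, value in age_table.items():
    print(f"{category}: {value}")
```

A masked cell only protects anyone if it cannot be recovered by subtraction, so publishing the exact total alongside a table with a single masked cell would undo it. Reporting percentages in bands, or merging small categories, are alternatives worth considering.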

Doing it in Potato

Potato was partly built for this. POPQUORN is the "Potato-Prolific" dataset, collected by running Potato studies on Prolific, so the consent-and-demographics flow is a first-class feature rather than something you bolt on.

The intake is a multi-phase workflow: a consent phase that gates the study, then a prestudy phase that collects demographics, then the annotation itself.

yaml

phases:
  consent:
    enabled: true
    data_file: "data/consent.json"
 
  prestudy:
    enabled: true
    data_file: "data/demographics.json"
 
  # annotation phase is always enabled

The consent page is a question with a right_label, which is the answer required to proceed. Nobody reaches the demographics or the task without agreeing first.

json

[
  {
    "name": "consent_agreement",
    "type": "radio",
    "description": "I have read the consent form, understand my responses are anonymized, and agree to participate. I may stop at any time.",
    "labels": ["I agree", "I do not agree"],
    "right_label": "I agree",
    "required": true
  }
]

For the demographics themselves, give every sensitive question a "prefer not to answer" option and lean on the built-in templates for the fiddly categories:

json

[
  {
    "name": "age_range",
    "type": "radio",
    "description": "What is your age range?",
    "labels": ["18-24", "25-34", "35-44", "45-54", "55+", "Prefer not to answer"]
  },
  {
    "name": "ethnicity",
    "type": "select",
    "description": "Which best describes you? (optional)",
    "template": "ethnicity",
    "free_response": true,
    "free_response_label": "Prefer to self-describe"
  }
]

If you would rather not hand-write the questions at all, Potato ships validated survey instruments, including eight standardized demographic batteries. Pointing a prestudy phase at ANES or GSS demographics gives you the tested wordings for free:

yaml

phases:
  prestudy:
    type: prestudy
    instrument: "anes-demographics"   # or gss-demographics, acs-demographics, ...

The demographics-with-consent showcase is a ready-to-run version of this whole flow, and validated survey instruments covers the wider library if you want to measure more than demographics.

Once the study runs, the demographic responses are stored per annotator alongside their labels, which is what lets you do the analysis that justified collecting them: break agreement down by group, and check whether a demographic predicts the labels the way POPQUORN found. Potato reports Cohen's and Fleiss' kappa over the annotation, so "does group membership move the labels" becomes a measurement rather than a hunch. When you release the data, the aggregate distributions from the prestudy phase are the annotator-demographics section of your data statement, already collected.
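
As a rough illustration of the by-group breakdown, the sketch below computes Cohen's kappa by hand for every pair of annotators and averages it separately for same-group and cross-group pairs. It is a standalone example and not Potato's own agreement report; the annotator names, groups, and labels are hypothetical, and it assumes you have already joined each annotator's demographics to their labels.

```python
from collections import Counter
from itertools import combinations
from statistics import mean


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotators: a self-declared group plus binary labels on six items.
annotators = {
    "ann1": {"group": "group_a", "labels": [1, 1, 0, 0, 1, 0]},
    "ann2": {"group": "group_a", "labels": [1, 1, 0, 0, 1, 1]},
    "ann3": {"group": "group_b", "labels": [0, 1, 1, 0, 0, 1]},
}

pair_kappas = {"same_group": [], "cross_group": []}
for (_, a), (_, b) in combinations(annotators.items(), 2):
    bucket = "same_group" if a["group"] == b["group"] else "cross_group"
    pair_kappas[bucket].append(cohen_kappa(a["labels"], b["labels"]))

summary = {bucket: mean(values) for bucket, values in pair_kappas.items() if values}
for bucket, value in summary.items():
    print(f"{bucket}: {value:.2f}")
```

With these invented labels the same-group pair agrees well above chance and the cross-group pairs do not. On real data a gap like that is a prompt for a proper statistical test with enough annotators per group, not a conclusion in itself.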

Where to go next

Disagreement is signal, not noise, for why demographic variation in labels is often the thing you want to keep.
Documenting your annotation dataset, for turning collected demographics into a data statement or datasheet.
Inter-Annotator Agreement Explained, for the statistics you use to analyze labels by group.
Running Crowdsourced Studies on Prolific and MTurk, for recruiting a demographically balanced pool in the first place.
