Validated Survey Instruments for Annotation Studies: Personality, Affect, Wellbeing, and Demographics
When who annotates matters, a validated questionnaire beats a question you invented. A tour of Potato's 55 built-in survey instruments and when each one earns a place in your study.
Once you accept that the people doing your annotation shape the labels, the next question is what to measure about them. Age and education are the obvious start, but for subjective tasks the interesting predictors are often further afield: personality, values, mood on the day, lived experience of the thing being judged. The temptation is to write a few quick questions and move on. That is usually a mistake, because a question you invent has no track record, no comparison group, and often a subtle wording flaw you will not notice until analysis.
When you want to measure something about your annotators, reach for a validated survey instrument before you write your own. Instruments like the Big Five, PANAS, or a standard demographic battery come with tested wordings, known reliability, and results comparable to a large body of prior work, none of which an ad-hoc question gives you. Potato ships 55 of them, usable in a prestudy or poststudy phase with a single config line. Collect only what you will analyze, treat mental-health screeners as sensitive, and get consent. This post is a tour of what is in the library and when each part earns its place.
Why not just write your own questions
A validated instrument is a questionnaire that researchers have tested for reliability (does it give consistent results?) and validity (does it measure what it claims?), usually across large samples and many studies. Borrowing one buys you three things a homemade question cannot: wordings that have been checked for ambiguity and bias, a scoring method with published norms, and comparability, because your numbers line up with everyone else who used the same instrument.
The cost of rolling your own shows up later. A gender question with the wrong options, a satisfaction scale that is subtly leading, a personality question that half your annotators read differently: each quietly adds noise or bias you cannot separate from signal. The instrument authors already paid that cost so you do not have to.
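To make "reliability" concrete: one common internal-consistency statistic is Cronbach's alpha, which compares the variance of the individual items with the variance of the summed score. The sketch below is illustrative only. The function name and the rows-of-responses input shape are assumptions for the example, not something Potato provides.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for one scale.

    rows: one list of numeric item responses per respondent, all the
    same length (hypothetical input shape chosen for this sketch).
    """
    if len(rows) < 2:
        raise ValueError("need at least two respondents")
    k = len(rows[0])
    if k < 2:
        raise ValueError("need at least two items")
    if any(len(r) != k for r in rows):
        raise ValueError("every respondent must answer every item")

    # Sample variance of each item across respondents.
    item_variances = [variance(col) for col in zip(*rows)]
    # Sample variance of each respondent's summed score.
    total_variance = variance([sum(r) for r in rows])
    if total_variance == 0:
        raise ValueError("summed scores have zero variance")

    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
```

Published instruments report figures like this across large samples. A homemade question set starts with none, and you would have to collect enough data to estimate it yourself.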
What you might measure, and why it shows up in labels
Not everything belongs in every study. Match the instrument to a plausible effect on your task.
- Demographics: who is annotating. The demographic batteries (ANES, GSS, ACS, and others) capture age, race, education, and the rest with standardized wordings. On offensiveness, toxicity, and politeness, these are the predictors with the most evidence behind them.
- Personality and values: how someone judges. The Big Five (Soto and John, 2017) and its ultra-brief cousin the Ten-Item Personality Inventory (Gosling et al., 2003) capture stable dispositions that can shape subjective ratings. The Moral Foundations Questionnaire (Graham et al., 2011) is a natural fit when the labels are moral judgments, since it measures the moral intuitions that drive them.
- Affect: mood at labeling time. The PANAS (Watson et al., 1988) measures positive and negative affect. Run it in a poststudy phase and you can check whether mood tracked the ratings, which matters for emotionally loaded content.
- Lived experience: standing to judge. The Everyday Discrimination Scale (Williams et al., 1997) measures day-to-day experience of discrimination. For tasks about offensiveness or hate directed at a group, whether an annotator has lived that is plausibly relevant to how they read it.
- Wellbeing: protecting the annotator. Screeners like the PHQ-9 (Kroenke et al., 2001) and GAD-7 (Spitzer et al., 2006) are not about the labels at all. On projects with harmful or distressing content, a light-touch wellbeing check helps you notice strain, provided you handle the responses with the care they demand.
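For the affect check described above (did mood track the ratings?), a first pass can be as simple as a correlation between each annotator's affect score and their mean rating. This is a minimal sketch assuming you have already reduced both to one number per annotator. The function and variable names are hypothetical, and a real analysis would likely want a mixed-effects model rather than a single coefficient.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two annotators")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")

    return cov / sqrt(var_x * var_y)
```

You would call it as pearson_r(negative_affect_scores, mean_ratings), with both lists in the same annotator order. A coefficient near zero suggests mood was not a strong driver of the ratings. A large one is a reason to look closer, not a finding on its own.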
Figure: The 55-instrument library, grouped by category, with the annotation-relevant ones highlighted.
The catch: sensitivity, burden, and consent
Measuring your annotators is not free of risk, and two costs carry real weight: the sensitivity of what you collect and the burden you place on respondents.
Mental-health screeners are sensitive personal data. A PHQ-9 score is not a diagnosis, and it should never be treated as one or used to exclude someone from work. If you run one, say why, keep it optional, store it separately from anything identifying, and have a plan for what a concerning score means before you collect it. When in doubt, this is an ethics-board conversation.
Length is its own tax. The Big Five Inventory-2 is 60 items; a full battery stack can take longer than the annotation. Every extra question costs completion and attention, so lean on the short forms (the 10-item TIPI, the 2-item PHQ-2) unless you specifically need the long version, and cut anything you will not actually analyze. As with demographics, the rule holds: if there is no comparison you plan to run with it, it does not go on the form.
Doing it in Potato
Potato includes a library of 55 validated instruments spanning personality, mental health, affect, social and political attitudes, and eight demographic batteries, all documented in Survey Instruments. You do not build these questionnaires; you name them.
Reference one instrument by ID in a prestudy or poststudy phase:
    phases:
      order: [consent, prestudy, annotation, poststudy]
      prestudy:
        type: prestudy
        instrument: "tipi"   # 10-item Big Five
      poststudy:
        type: poststudy
        instrument: "panas"  # affect, measured after the task

Stack several with the instruments: list, and append your own study-specific questions after a battery:
    phases:
      prestudy:
        type: prestudy
        instruments:
          - "gss-demographics"  # standardized demographics
          - "srh"               # single self-rated health item
        file: "surveys/study_specific.json"  # appended after the instruments

Each instrument carries its scoring metadata (method, reverse-coded items, range, and cutoffs), though Potato leaves the scoring to your analysis rather than computing it for you, which is the right call for anything clinical.

The demographics-with-consent showcase puts the whole flow together: a consent gate, a standardized demographic battery in the prestudy phase, and a subjective rating task, so the annotator background lands next to the labels where you can analyze it.
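Because scoring is left to your analysis, it helps to see what handling reverse-coded items looks like in practice. The sketch below scores the TIPI using the key published by Gosling et al. (2003): ten items on a 1 to 7 scale, two per trait, one of each pair reverse-coded. The input shape (a dict from item number to response) is an assumption for illustration. How Potato exports responses, and whether its item order matches the published one, should be checked against the instrument's own metadata.

```python
from statistics import mean

# Published TIPI key: (item number, is_reverse_coded) pairs per trait.
TIPI_KEY = {
    "extraversion": [(1, False), (6, True)],
    "agreeableness": [(2, True), (7, False)],
    "conscientiousness": [(3, False), (8, True)],
    "emotional_stability": [(4, True), (9, False)],
    "openness": [(5, False), (10, True)],
}
SCALE_MIN, SCALE_MAX = 1, 7


def score_tipi(responses: dict[int, int]) -> dict[str, float]:
    """Return the mean of each trait's two items after reverse-coding.

    responses maps item number (1-10) to a 1-7 answer; this input
    format is hypothetical, not Potato's export format.
    """
    scores = {}
    for trait, items in TIPI_KEY.items():
        values = []
        for item, reverse in items:
            raw = responses[item]
            if not SCALE_MIN <= raw <= SCALE_MAX:
                raise ValueError(f"item {item} is out of range: {raw}")
            # Reverse-coding flips the scale: 1 becomes 7, 7 becomes 1.
            values.append(SCALE_MIN + SCALE_MAX - raw if reverse else raw)
        scores[trait] = mean(values)
    return scores
```

Keeping the key as data rather than burying it in the arithmetic makes it easy to swap in another instrument's reverse-coded items and range from its metadata.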
Where to go next
- Collecting Annotator Demographics Responsibly, for the demographic batteries done right.
- Disagreement Is Signal, Not Noise, for why personality and values variation in labels is often what you want.
- Documenting Your Annotation Dataset, for reporting what you measured about your annotators.
- Survey Instruments, the full list of all 55 with IDs and item counts.