Skip to content
Diese Seite ist in Ihrer Sprache noch nicht verfügbar. Englische Version wird angezeigt.

Writing Effective Annotation Guidelines

How to write an annotation codebook that produces consistent labels, clear definitions, worked examples, edge-case rules, and a pilot-and-revise loop.

Annotation guidelines (a "codebook") are the rulebook annotators follow. The single biggest driver of label quality is whether the guidelines make two careful people choose the same label, not how detailed they are. A short, sharp codebook with good examples beats a long vague one.

This connects to content analysis in the social sciences, where codebook design has been studied for decades, and it is what ultimately determines your inter-rater reliability.

What a good codebook contains

  • A one-sentence definition per label, plus a line on what the label is not. Negative definitions prevent the most common disagreements.
  • Worked examples, including borderline ones. Real items from your data beat invented ones.
  • Tie-breaker rules. When two labels could apply, say which wins. This is where consistency is won or lost.
  • A "when in doubt" default and an escape hatch (a "can't tell" option) so annotators don't guess silently.

Put the rules where annotators see them

Potato can show instructions and per-label tooltips right next to the controls, so the guideline lives at the point of decision rather than in a separate document:

yaml
annotation_schemes:
  - annotation_type: radio
    name: toxicity
    description: "Is this comment toxic? Toxic = rude, disrespectful, or likely to make someone leave a conversation."
    labels: [Toxic, Not toxic, Can't tell]
    tooltips:
      Toxic: "Insults, threats, identity attacks, or harassment."
      Not toxic: "Disagreement or strong opinion without an attack."
      Can't tell: "Not enough context to judge."

The "Can't tell" option matters: it separates genuine ambiguity from forced guesses, which keeps your agreement metrics honest.

Pilot, then revise

Guidelines are never right on the first try. Run a small pilot, look at every item where annotators disagreed, and decide whether the guideline was unclear or the item was genuinely ambiguous. Fix the guideline and re-pilot. Two or three rounds usually stabilizes the label set.

Potato's gold standards and attention checks let you encode the settled cases as checks so future annotators stay calibrated.

Further reading