
Text Annotation

A complete guide to text annotation: classification, multi-label tagging, rating, and free-text, plus how to build each kind of text task in Potato with copy-paste config.

Text annotation means labeling written language: sorting documents into categories, tagging the topics in an article, rating a passage for quality, or writing a correction. It is the most common annotation task in natural language processing, and it is what Potato was first built for. This guide covers the whole-document text tasks; for marking regions inside text, see Span Annotation.

The text tasks at a glance

  • Document classification: one label for the whole text (text classification).
  • Multi-label tagging: several labels at once, such as topics or content warnings.
  • Rating and scoring: a position on a scale, such as quality or sentiment intensity.
  • Free-text: a written answer, paraphrase, or correction.

Classification: one label per document

The workhorse of text annotation. Use radio when the categories are mutually exclusive:

```yaml
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "What is the overall sentiment of this review?"
    labels: [Positive, Negative, Neutral]
    sequential_key_binding: true
```

sequential_key_binding maps the labels to keys 1, 2, 3, so annotators keep their hands on the keyboard. On a job of thousands of items this is a large speed-up. See the live sentiment analysis design for a working example.

Multi-label: several tags at once

When more than one label can apply, use multiselect. Bound the selection count to match your guidelines:

```yaml
annotation_schemes:
  - annotation_type: multiselect
    name: content_warnings
    description: "Select every content warning that applies."
    labels: [Violence, Profanity, Sexual content, Self-harm, None]
    min_selections: 1
    max_selections: 5
```

Content moderation is a classic multi-label text task; the toxicity detection design combines a category with a highlighted span.
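
The selection bounds in the config above can also be re-checked on exported annotations. The sketch below is not part of Potato: `check_selection` is a hypothetical helper, and it assumes each answer has already been loaded as a plain list of label strings. It also flags `None` combined with a real warning, on the assumption that count bounds alone may not rule that combination out.

```python
def check_selection(selected, min_selections=1, max_selections=5,
                    exclusive_label="None"):
    """Return a list of problems with one multi-label answer (empty if valid)."""
    problems = []
    unique = set(selected)  # ignore accidental duplicates

    if len(unique) < min_selections:
        problems.append(f"too few labels: {len(unique)} < {min_selections}")
    if len(unique) > max_selections:
        problems.append(f"too many labels: {len(unique)} > {max_selections}")
    # "None" only makes sense when it is the sole selection.
    if exclusive_label in unique and len(unique) > 1:
        problems.append(f"'{exclusive_label}' combined with other labels")
    return problems


if __name__ == "__main__":
    for answer in (["Violence", "Profanity"], [], ["None", "Violence"]):
        print(answer, "->", check_selection(answer) or "ok")
```

Running a check like this over a pilot batch is a cheap way to find out whether the guidelines, not just the interface, are being followed.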

Rating text on a scale

To capture degree rather than category, use a Likert scale:

```yaml
annotation_schemes:
  - annotation_type: likert
    name: helpfulness
    description: "How helpful is this answer?"
    size: 5
    min_label: "Not helpful"
    max_label: "Very helpful"
```

See Rating Scales for scale-design pitfalls such as acquiescence bias, and for guidance on how many points to use.

Free-text and corrections

Sometimes the most useful label is a sentence the annotator writes: a justification, a rewrite, or a transcription. Combine it with a category and leave it optional, so annotators fill it in only when relevant:

```yaml
annotation_schemes:
  - annotation_type: radio
    name: factuality
    description: "Is the claim supported by the source?"
    labels: [Supported, Contradicted, Not enough info]
  - annotation_type: text
    name: evidence
    description: "Quote the sentence that supports your choice."
    label_requirement:
      required: false
```
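
Because the `evidence` field asks for a quote, one simple sanity check is whether the quoted text actually appears in the source. The following is a minimal sketch under stated assumptions: `normalize` and `quote_in_source` are hypothetical helpers, not Potato functions, and matching is done after lowercasing and collapsing whitespace so that copy-paste differences do not count as mismatches.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial copy-paste differences vanish."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_in_source(quote, source):
    """True when a non-empty quote appears verbatim (after normalization) in source."""
    q = normalize(quote)
    return bool(q) and q in normalize(source)


if __name__ == "__main__":
    source = "Today the sky\nis blue. The grass is green."
    print(quote_in_source("the  Sky is blue", source))
    print(quote_in_source("the grass is red", source))
```

Answers that fail the check are not necessarily wrong (the annotator may have paraphrased), so treat the result as a prompt for review, not an automatic rejection.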

Getting consistent text labels

Text is ambiguous, so consistency comes from the surrounding process, not the interface:

  1. Write tight guidelines with a "can't tell" option.
  2. Have multiple annotators overlap on the same items.
  3. Track inter-annotator agreement and adjudicate disagreements.
  4. Speed up large jobs with LLM pre-annotation and verify the suggestions by hand.
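
Steps 2 and 3 can be sketched for the simplest case: two annotators who labeled the same items with a single category each. The example computes Cohen's kappa (observed agreement corrected for chance agreement) and lists the items that need adjudication. The function names and the list-based input format are illustrative assumptions, not Potato's export format or API.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Share of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(item_ids, labels_a, labels_b):
    """IDs of the items the two annotators labeled differently."""
    return [i for i, a, b in zip(item_ids, labels_a, labels_b) if a != b]


if __name__ == "__main__":
    ids = ["r1", "r2", "r3", "r4"]
    ann_a = ["Positive", "Positive", "Negative", "Neutral"]
    ann_b = ["Positive", "Negative", "Negative", "Neutral"]
    print(round(cohen_kappa(ann_a, ann_b), 3))  # 0.636
    print(disagreements(ids, ann_a, ann_b))     # ['r2']
```

With more than two annotators, or with items that not everyone labeled, a measure such as Fleiss' kappa or Krippendorff's alpha is the usual replacement.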

Further reading