# Finding Hallucinations with Span Annotation

Source: https://www.potatoannotator.com/blog/finding-hallucinations-with-span-annotation

When a model makes something up, a thumbs-down on the whole answer tells you almost nothing. You know it's wrong somewhere. You don't know which sentence, what kind of error, or how bad it is. Span annotation fixes that: the annotator highlights the exact words and labels what's wrong with them.

This is the same idea behind MQM (Multidimensional Quality Metrics), the error-span framework that machine-translation evaluation has used for years. Mark the span, categorize the error, rate the severity. The result is data you can actually act on.

## Why spans beat whole-answer flags

A whole-answer "unfaithful" label is a summary statistic. A span is a location and a diagnosis. With spans you can measure error rates per type, spot patterns across many outputs, and build targeted training data for the failure mode you care about. None of that is possible when the unit of judgment is the entire response.
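To make "error rates per type" concrete, here is a minimal sketch in Python. The record layout (`id`, `spans`, `start`, `end`, `label`) is a hypothetical one chosen for illustration, not Potato's export format, so adapt the field names to whatever your export actually contains. The label names are the ones used in this post's config.

```python
from collections import Counter

# Hypothetical records: one per model output, each with zero or more labeled spans.
annotations = [
    {"id": "a1", "spans": [{"start": 10, "end": 42, "label": "unsupported_claim"}]},
    {"id": "a2", "spans": []},
    {"id": "a3", "spans": [
        {"start": 0, "end": 18, "label": "factual_error"},
        {"start": 55, "end": 80, "label": "unsupported_claim"},
    ]},
    {"id": "a4", "spans": [{"start": 5, "end": 30, "label": "fabricated_citation"}]},
]


def error_rates(records):
    """Return the fraction of outputs that contain at least one span of each label."""
    total = len(records)
    if total == 0:
        return {}
    outputs_with_label = Counter()
    for record in records:
        # Count each label once per output, however many spans carry it.
        for label in {span["label"] for span in record["spans"]}:
            outputs_with_label[label] += 1
    return {label: count / total for label, count in outputs_with_label.items()}


rates = error_rates(annotations)
for label, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{label}: {rate:.0%}")
```

A whole-answer flag would only say that three of these four outputs are bad. The span data says that half of all outputs contain an unsupported claim, which is the kind of number you can build a targeted fix around.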

## Setting it up in Potato

The setup needs two schemes: a span scheme for highlighting the problem text and labeling the error type, and a severity judgment so a trivial slip and a dangerous fabrication don't get weighted the same.

```yaml
annotation_schemes:
  - annotation_type: span
    name: errors
    description: "Highlight each problematic span and label the error type."
    labels: [unsupported_claim, factual_error, contradiction, fabricated_citation]
    label_colors:
      unsupported_claim: "#f59e0b"
      factual_error: "#ef4444"
      contradiction: "#8b5cf6"
      fabricated_citation: "#ec4899"
  - annotation_type: radio
    name: severity
    description: "How serious is the worst error?"
    labels: [Minor, Major, Critical]
```
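The severity label only pays off if something downstream weights it. The sketch below shows one way that could look in Python, using the `Minor`, `Major`, and `Critical` labels from the config above. The numeric weights and the `weighted_penalty` helper are assumptions made for illustration; neither comes from Potato, so choose weights that reflect your own risk tolerance.

```python
# Illustrative weights only (an assumption, not a Potato default).
SEVERITY_WEIGHTS = {"Minor": 1, "Major": 5, "Critical": 10}


def weighted_penalty(severities, weights=SEVERITY_WEIGHTS):
    """Mean severity penalty per output.

    `severities` holds one entry per output: the worst-error severity label,
    or None when the annotator found no error.
    """
    if not severities:
        return 0.0
    total = sum(weights[label] for label in severities if label is not None)
    return total / len(severities)


# Hypothetical results for two models over the same four prompts.
model_a = ["Minor", "Minor", None, "Minor"]
model_b = [None, None, None, "Critical"]

print(weighted_penalty(model_a))
print(weighted_penalty(model_b))
```

Under these weights, the model with three flagged outputs scores a lower penalty than the model with a single critical fabrication. A raw count of flagged outputs would rank them the other way.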

## The rules that decide your data quality

Give annotators the source material. "Unsupported" is undefinable without it, so the documents or context have to be on screen, not behind a tab.

Decide your boundary rule once. Does the span cover the whole sentence or just the false clause? Both are defensible; pick one and write it down.

Expect subjectivity at the edges. Faithfulness judgments diverge on borderline cases, so have at least two annotators label the same sample of outputs and check their agreement before trusting the numbers.
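One way to check agreement on spans is to expand each annotator's spans into a label per character and compute Cohen's kappa over those labels. The sketch below does that in plain Python. It is one reasonable choice among several (token-level labels or span-overlap F1 are common alternatives), and the span layout (`start`, `end`, `label`, with `end` exclusive) is a hypothetical format, not Potato's export.

```python
from collections import Counter


def char_labels(text_length, spans):
    """Expand spans into one label per character; "O" marks unhighlighted text."""
    labels = ["O"] * text_length
    for span in spans:
        for i in range(span["start"], min(span["end"], text_length)):
            labels[i] = span["label"]
    return labels


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Same 100-character output: one annotator marks the whole sentence,
# the other marks only the false clause.
annotator_1 = [{"start": 10, "end": 30, "label": "factual_error"}]
annotator_2 = [{"start": 20, "end": 30, "label": "factual_error"}]

kappa = cohen_kappa(char_labels(100, annotator_1), char_labels(100, annotator_2))
print(f"kappa = {kappa:.2f}")
```

Both annotators found the same error here, yet kappa comes out at roughly 0.62 because they drew the boundary differently. Low agreement can therefore come from an unwritten boundary rule as much as from genuine disagreement about faithfulness.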

## Where to go next

The full walkthrough, including how to define each error type, is in the [Detecting Hallucinations guide](/docs/guides/detecting-hallucinations). For the retrieval-grounded version of this problem, see [RAG Evaluation](/docs/guides/rag-evaluation) and the [span annotation guide](/docs/guides/span-annotation). For implementation details, see the [error span source documentation](https://github.com/davidjurgens/potato/blob/master/docs/annotation-types/text/error_span.md).
