# How to Evaluate RAG Systems with Human Annotation

Source: https://www.potatoannotator.com/blog/rag-evaluation-with-human-annotation

A retrieval-augmented generation system can fail in two completely different ways, and a single quality score hides which one you are looking at. Either the retriever pulled the wrong documents, or the generator had good documents and ignored them. If you only measure "is the answer good," you cannot tell these apart, and you cannot tell which half of the system to fix.

The fix is to evaluate retrieval and generation separately, and to mark exactly where an answer departs from its sources.

## The three things worth annotating

1. **Retrieval relevance.** For each retrieved document, is it actually relevant to the query?
2. **Answer faithfulness.** Is the generated answer grounded in those documents, with no invented claims?
3. **Citation accuracy.** Do the answer's claims trace back to the sources it cites?

Keeping these separate turns a vague "the answer is wrong" into "the right document was retrieved but the model added a claim that isn't in it." That is a generation problem, and it points at a different fix than a retrieval failure would.
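
To make that triage concrete, here is a minimal sketch of turning the three judgments into a diagnosis for one response. The `AnnotatedItem` fields, the 0 to 3 relevance coding, and both thresholds are assumptions made for illustration; they are not a Potato export format or a recommended cutoff.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotatedItem:
    """One annotated RAG response (hypothetical field names)."""
    relevance: List[int]  # per-document rating, 0 = irrelevant .. 3 = highly relevant
    faithfulness: int     # 1 = many unsupported claims .. 5 = fully grounded
    citations_ok: bool    # True if every cited source supports its claim


def diagnose(item: AnnotatedItem,
             min_relevance: int = 2,
             min_faithfulness: int = 4) -> str:
    """Return the part of the pipeline the annotations point at.

    The thresholds are illustrative assumptions; tune them to your rubric.
    """
    # No document reached the relevance bar: the generator never had a chance.
    if not any(r >= min_relevance for r in item.relevance):
        return "retrieval"
    # A relevant document was available but the answer strayed from it.
    if item.faithfulness < min_faithfulness:
        return "generation"
    # The answer is grounded, but its citations point at the wrong sources.
    if not item.citations_ok:
        return "citation"
    return "ok"


if __name__ == "__main__":
    item = AnnotatedItem(relevance=[3, 0, 1], faithfulness=2, citations_ok=True)
    print(diagnose(item))
```

Checking retrieval first is a deliberate ordering: a low faithfulness score is hard to blame on the generator when no relevant document was retrieved in the first place.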

## Setting it up in Potato

You can put all three on one screen. Rate every retrieved document on the same scale with `multirate`, rate faithfulness with a Likert scale, and highlight problem spans in the answer.

```yaml
annotation_schemes:
  - annotation_type: multirate
    name: retrieval_relevance
    description: "Rate the relevance of each retrieved document to the query."
    labels: ["Irrelevant", "Somewhat", "Relevant", "Highly relevant"]

  - annotation_type: likert
    name: faithfulness
    description: "Is the answer faithful to the retrieved documents?"
    size: 5
    min_label: "Many unsupported claims"
    max_label: "Fully grounded"

  - annotation_type: span
    name: problems
    description: "Highlight any unsupported or incorrect claim in the answer."
    labels: [unsupported_claim, contradicted, hallucination]
```

The span scheme is what makes the data actionable. A faithfulness score of 2 out of 5 tells you something is wrong; a highlighted span tells you which sentence and why.
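
As a rough illustration of how highlighted spans become a per-sentence report, the sketch below maps each span to the sentence it overlaps. It assumes spans arrive as dictionaries with `start`, `end`, and `label` keys holding character offsets into the answer, which is a hypothetical shape and not necessarily what Potato exports. The sentence splitter is deliberately naive and will mis-split abbreviations such as "Dr.".

```python
import re
from typing import Dict, List, Tuple


def sentences_with_offsets(text: str) -> List[Tuple[int, int, str]]:
    """Naively split text into sentences, keeping character offsets.

    A sentence ends at '.', '!' or '?' followed by whitespace or end of text.
    """
    pattern = r"\S.*?(?:[.!?](?=\s|$)|$)"
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(pattern, text, flags=re.S)]


def flag_sentences(answer: str, spans: List[Dict]) -> List[Tuple[str, List[str]]]:
    """Return (sentence, sorted labels) for every sentence a span overlaps."""
    flagged = []
    for start, end, sentence in sentences_with_offsets(answer):
        # Two half-open ranges overlap when each starts before the other ends.
        labels = sorted({s["label"] for s in spans
                         if s["start"] < end and s["end"] > start})
        if labels:
            flagged.append((sentence, labels))
    return flagged


if __name__ == "__main__":
    answer = "Paris is the capital of France. It has 90 million residents."
    start = answer.index("90 million")
    spans = [{"start": start, "end": start + len("90 million"),
              "label": "hallucination"}]
    for sentence, labels in flag_sentences(answer, spans):
        print(labels, sentence)
```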

## Things that quietly wreck RAG evaluation

**Annotators can't judge faithfulness without the sources.** Show the query, the retrieved documents, and the answer on the same screen. If the documents are collapsed or on another tab, people will rate the answer on whether it *sounds* right, which is exactly the failure mode you are trying to catch.

**"Relevant" needs a definition.** Relevant to the query, or actually used in the answer? Those are different judgments and annotators will split on them unless you decide up front.

**Faithfulness is the subjective one.** Have at least two annotators rate the same sample of items, then check agreement on the faithfulness ratings specifically. If agreement is low there, tighten the definition of "unsupported" before trusting the numbers.
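
The post does not prescribe an agreement metric. One common choice for an ordinal scale like the 5-point faithfulness rating is Cohen's kappa with quadratic weights, which penalizes a 1-versus-5 disagreement more than a 4-versus-5 one. The sketch below is a plain-Python version for exactly two annotators; the function name and the handling of the degenerate case are assumptions.

```python
from collections import Counter
from typing import Sequence


def quadratic_weighted_kappa(a: Sequence[int], b: Sequence[int], k: int = 5) -> float:
    """Cohen's kappa with quadratic weights for two raters on a 1..k scale.

    Returns 1.0 for perfect agreement, about 0 for chance-level agreement,
    and negative values for worse-than-chance agreement.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(a)
    observed = Counter(zip(a, b))   # joint counts of (rating_a, rating_b)
    count_a, count_b = Counter(a), Counter(b)

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(1, k + 1):
        for j in range(1, k + 1):
            weight = (i - j) ** 2 / (k - 1) ** 2
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * count_a[i] * count_b[j] / (n * n)

    if expected_disagreement == 0:
        # Both raters used one identical category for every item. Kappa is
        # formally undefined here; this sketch treats it as full agreement.
        return 1.0
    return 1.0 - observed_disagreement / expected_disagreement


if __name__ == "__main__":
    rater_1 = [5, 4, 2, 1, 3, 5, 2]
    rater_2 = [5, 3, 2, 2, 3, 4, 1]
    print(round(quadratic_weighted_kappa(rater_1, rater_2), 3))
```

With more than two annotators per item, a metric such as Krippendorff's alpha is the usual substitute.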

## Where to go next

The full walkthrough, including how the three schemes fit together, is in the [RAG Evaluation guide](/docs/guides/rag-evaluation). For marking factual errors and hallucinations in any model output, not just RAG, see [Detecting Hallucinations with Span Annotation](/docs/guides/detecting-hallucinations). And if you are evaluating agents more broadly, start with [How to Evaluate AI Agents](/docs/guides/evaluating-ai-agents).
