# RAG Evaluation

Source: https://www.potatoannotator.com/docs/guides/rag-evaluation

**Retrieval-augmented generation (RAG) answers a question by first retrieving documents and then generating an answer from them. Evaluating RAG means judging two things separately: did it retrieve the *right* documents, and is the answer actually *supported* by them?** Conflating the two hides where the system fails.

See [retrieval-augmented generation](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) for background.

## The three things to annotate

1. **Retrieval relevance**: for each retrieved document, is it relevant to the query?
2. **Answer faithfulness**: is the generated answer grounded in the retrieved documents, with no unsupported claims?
3. **Citation accuracy**: do the answer's claims actually trace to the cited sources?

## Setting it up in Potato

Combine three schemes on one screen to rate each document, rate faithfulness, and highlight problem spans in the answer:

```yaml
annotation_schemes:
  # Retrieval relevance: every retrieved document is rated on the same scale.
  - annotation_type: multirate
    name: retrieval_relevance
    description: "Rate the relevance of each retrieved document to the query."
    labels: ["Irrelevant", "Somewhat", "Relevant", "Highly relevant"]

  # Answer faithfulness: a single 5-point rating for the whole answer.
  - annotation_type: likert
    name: faithfulness
    description: "Is the answer faithful to the retrieved documents?"
    size: 5
    min_label: "Many unsupported claims"
    max_label: "Fully grounded"

  # Problem spans: mark where the answer departs from its sources.
  - annotation_type: span
    name: problems
    description: "Highlight any unsupported or incorrect claim in the answer."
    labels: [unsupported_claim, contradicted, hallucination]
```

`multirate` rates many documents on the same scale at once; the span scheme marks exactly *where* the answer departs from its sources. See [Detecting Hallucinations](/docs/guides/detecting-hallucinations).

## Why separate retrieval from generation

A RAG system can fail in two ways: it retrieves bad context (a retrieval problem) or it ignores good context (a generation problem). Scoring them separately tells you which half to fix. A faithfulness score alone can't, because a low score looks the same in both cases.
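As a rough illustration of that split, the sketch below assigns each annotated example to a failure mode using the two scores. It is not part of Potato: the `RagJudgment` class, the `diagnose` function, the numeric encoding of the relevance labels, and both thresholds are assumptions made for this example and should be adapted to your own annotation export.

```python
from dataclasses import dataclass
from typing import List

# Assumed numeric encoding of the relevance labels used in the config above.
RELEVANCE_SCORE = {"Irrelevant": 0, "Somewhat": 1, "Relevant": 2, "Highly relevant": 3}


@dataclass
class RagJudgment:
    """Hypothetical container for one annotated query/answer pair."""
    doc_relevance: List[str]  # one relevance label per retrieved document
    faithfulness: int         # 1-5 Likert rating of the answer


def diagnose(judgment: RagJudgment, min_relevance: int = 2, min_faithfulness: int = 4) -> str:
    """Return "retrieval", "generation", or "ok" for one example.

    Thresholds are illustrative assumptions, not recommended values.
    """
    has_relevant_doc = any(
        RELEVANCE_SCORE[label] >= min_relevance for label in judgment.doc_relevance
    )
    if not has_relevant_doc:
        # Nothing useful was retrieved, so the generator had no good context.
        return "retrieval"
    if judgment.faithfulness < min_faithfulness:
        # Good context was available but the answer is not grounded in it.
        return "generation"
    return "ok"
```

Counting how often each label appears across a dataset gives a first indication of which half of the pipeline needs attention.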

## Quality considerations

- Show annotators the query, the documents, and the answer together; faithfulness can't be judged without the sources.
- "Relevant" needs a definition: relevant to the query, or actually used in the answer? Decide up front.
- Track [agreement](/docs/guides/inter-annotator-agreement) on faithfulness; it's the most subjective of the three.
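One way to track agreement on the faithfulness ratings is Cohen's kappa between two annotators. The sketch below is a minimal, self-contained version; the function name `cohens_kappa` and the input format (two equal-length lists of ratings for the same items) are assumptions for this example. It treats the Likert values as unordered categories, so a weighted kappa or another ordinal-aware measure may fit a 5-point scale better.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(ratings_a: Sequence[Hashable], ratings_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two annotators rating the same items."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("Both annotators must rate the same non-empty set of items.")
    n = len(ratings_a)

    # Observed agreement: fraction of items with identical ratings.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Example: faithfulness ratings from two annotators on four answers.
kappa = cohens_kappa([5, 4, 2, 5], [5, 4, 3, 5])
```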

## Further reading

- [Detecting Hallucinations](/docs/guides/detecting-hallucinations)
- [Rubric-Based LLM Evaluation](/docs/guides/rubric-based-llm-evaluation)
- [How to Evaluate AI Agents](/docs/guides/evaluating-ai-agents)