RAG Evaluation

How to evaluate retrieval-augmented generation with human annotation, retrieval relevance, answer faithfulness, and citation spans, using Potato.

Retrieval-augmented generation (RAG) answers a question by first retrieving documents and then generating an answer from them. Evaluating RAG means judging two things separately: did it retrieve the right documents, and is the answer actually supported by them? Conflating the two hides where the system fails.

See retrieval-augmented generation for background.

The three things to annotate

  1. Retrieval relevance: for each retrieved document, is it relevant to the query?
  2. Answer faithfulness: is the generated answer grounded in the retrieved documents, with no unsupported claims?
  3. Citation accuracy: do the answer's claims actually trace to the cited sources?

Setting it up in Potato

Combine three schemes on one screen so annotators can rate each document, rate faithfulness, and highlight problem spans in the answer:

```yaml
annotation_schemes:
  - annotation_type: multirate
    name: retrieval_relevance
    description: "Rate the relevance of each retrieved document to the query."
    labels: ["Irrelevant", "Somewhat", "Relevant", "Highly relevant"]

  - annotation_type: likert
    name: faithfulness
    description: "Is the answer faithful to the retrieved documents?"
    size: 5
    min_label: "Many unsupported claims"
    max_label: "Fully grounded"

  - annotation_type: span
    name: problems
    description: "Highlight any unsupported or incorrect claim in the answer."
    labels: [unsupported_claim, contradicted, hallucination]
```

multirate rates many documents on the same scale at once; the span scheme marks exactly where the answer departs from its sources. See Detecting Hallucinations.
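Once the ratings are collected, the per-document relevance labels can be rolled up into a per-query retrieval score. The following is a minimal sketch, assuming the ratings for one query are available as a list of label strings in retrieval order. The function name `precision_at_k`, the `min_label` cutoff, and the input format are illustrative assumptions, not part of Potato.

```python
# Label order mirrors the `labels` list of the retrieval_relevance scheme.
RELEVANCE_SCALE = ["Irrelevant", "Somewhat", "Relevant", "Highly relevant"]


def precision_at_k(labels, k, min_label="Relevant"):
    """Fraction of the top-k documents rated at least `min_label`.

    `labels` is a hypothetical export: one label string per retrieved
    document, in the order the retriever ranked them.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    threshold = RELEVANCE_SCALE.index(min_label)
    top_k = labels[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for label in top_k if RELEVANCE_SCALE.index(label) >= threshold)
    # Divide by the number of documents actually rated, which may be < k.
    return hits / len(top_k)


ratings = ["Highly relevant", "Somewhat", "Relevant", "Irrelevant"]
score = precision_at_k(ratings, k=3)  # 2 of the top 3 meet the cutoff
```

Where to put the cutoff ("Somewhat" versus "Relevant") is a judgment call; whichever you pick, keep it fixed across systems you compare.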

Why separate retrieval from generation

A RAG system can fail in two ways: it retrieved bad context (a retrieval problem) or it ignored good context (a generation problem). Scoring them separately tells you which half to fix. A faithfulness score alone can't: an answer can be fully grounded in irrelevant documents and still fail the user.
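The two-way split above can be turned into a simple triage rule over the collected scores. This is a sketch under stated assumptions: `mean_relevance` is the average label index per query (0 for "Irrelevant" up to 3 for "Highly relevant"), `faithfulness` is the 1 to 5 Likert rating, and both thresholds are hypothetical values you would tune on your own data.

```python
def diagnose(mean_relevance, faithfulness, relevance_ok=2.0, faithfulness_ok=4):
    """Attribute a failure to retrieval, generation, both, or neither.

    Thresholds are illustrative assumptions, not recommended values.
    """
    retrieval_good = mean_relevance >= relevance_ok
    generation_good = faithfulness >= faithfulness_ok
    if retrieval_good and generation_good:
        return "ok"
    if retrieval_good:
        # Good context was available, but the answer departed from it.
        return "generation"
    if generation_good:
        # The answer stuck to its sources, but the sources were poor.
        return "retrieval"
    return "both"


print(diagnose(mean_relevance=0.5, faithfulness=5))  # retrieval
print(diagnose(mean_relevance=2.5, faithfulness=2))  # generation
```

Counting how many queries land in each bucket gives a rough picture of which half of the pipeline deserves attention first.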

Quality considerations

  • Show annotators the query, the documents, and the answer together; faithfulness can't be judged without the sources.
  • "Relevant" needs a definition: relevant to the query, or actually used in the answer? Decide up front.
  • Track agreement on faithfulness; it's the most subjective of the three.
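One way to track agreement on faithfulness is Cohen's kappa between two annotators who rated the same items. The sketch below implements the unweighted form with the standard library only; the input lists are hypothetical, and it assumes both annotators rated the same items in the same order.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two annotators over the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed agreement: share of items with identical ratings.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical 1-5 faithfulness ratings from two annotators.
annotator_1 = [5, 4, 4, 2, 1, 5]
annotator_2 = [5, 4, 3, 2, 1, 4]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Unweighted kappa treats a 4-versus-5 disagreement the same as a 1-versus-5 one. Because the faithfulness scale is ordinal, a weighted kappa or Krippendorff's alpha may be a better fit.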

Further reading