# Rubric-Based LLM Evaluation

Source: https://www.potatoannotator.com/docs/guides/rubric-based-llm-evaluation

**A rubric breaks a vague judgment ("is this a good answer?") into specific, scored criteria (helpfulness, accuracy, completeness, tone, safety), each rated on a scale. It makes LLM evaluation repeatable and reveals *why* one answer beats another.** Benchmarks like [MT-Bench](https://arxiv.org/abs/2306.05685) rest on a related idea: the judge prompt names factors such as helpfulness, relevance, and accuracy before asking for a score.

A [rubric](https://en.wikipedia.org/wiki/Rubric_(academic)) turns subjective quality into a grid of defined criteria and scale points, which tends to raise inter-annotator agreement and makes results easier to interpret.

## When to use a rubric

- The output is rich enough that a single score loses information.
- You need to know which dimension is weak (accuracy vs. tone), not just an overall verdict.
- You want criteria stakeholders agree on up front.

If you only need "which is better", a [pairwise comparison](/docs/guides/pairwise-model-comparison) is cheaper. Rubrics shine when you need an absolute, multi-dimensional profile.

## Setting it up in Potato

```yaml
annotation_schemes:
  - annotation_type: rubric_eval
    name: answer_quality
    description: "Rate the answer on each criterion."
    scale_points: 5
    criteria:
      - {name: Helpfulness, description: "Does it address the user's actual need?"}
      - {name: Accuracy,    description: "Is it factually correct?"}
      - {name: Completeness, description: "Does it cover the important points?"}
      - {name: Tone,        description: "Is the style appropriate?"}
```

Potato renders this as a grid: criteria down the side, scale points across. Annotators pick one scale point for every criterion.
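
Once several annotators have filled in the grid, the scores can be collapsed into a per-criterion profile. The sketch below is one minimal way to do that in Python. The input shape (one dict of criterion name to score per annotator) and the function name `criterion_profile` are assumptions made for illustration; they are not Potato's export format or API.

```python
from collections import defaultdict
from statistics import mean


def criterion_profile(ratings):
    """Average each criterion's score across annotators.

    `ratings` is a list of dicts mapping criterion name -> integer score,
    one dict per annotator. This shape is hypothetical, not Potato's output.
    """
    by_criterion = defaultdict(list)
    for rating in ratings:
        for criterion, score in rating.items():
            by_criterion[criterion].append(score)
    return {criterion: mean(scores) for criterion, scores in by_criterion.items()}


# Two annotators scoring the same answer on the four criteria from the config.
ratings = [
    {"Helpfulness": 4, "Accuracy": 2, "Completeness": 4, "Tone": 5},
    {"Helpfulness": 5, "Accuracy": 3, "Completeness": 4, "Tone": 5},
]

profile = criterion_profile(ratings)
# The lowest-scoring criterion is the dimension that needs attention.
weakest = min(profile, key=profile.get)
```

Keeping the scores separate per criterion, rather than summing them into one number, preserves the information the rubric was designed to capture.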

## Writing good criteria

- **Make them independent.** Overlapping criteria ("helpful" and "useful") tend to receive the same score, so they add noise rather than information.
- **Anchor the scale.** Describe what a 1 and a 5 look like for each criterion individually, rather than relying on one generic definition of the scale for the whole rubric.
- **Keep it short.** Four to six criteria is usually the sweet spot; long rubrics fatigue annotators and lower agreement.
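
One rough way to check the independence guideline on pilot data is to correlate the scores of each pair of criteria across items. The sketch below assumes a hypothetical layout (criterion name mapped to a list of scores, one per item) and an arbitrary threshold of 0.9; both are illustrative choices, not recommendations from Potato. A high correlation is only a hint that two criteria may overlap, since genuinely distinct qualities can also move together.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(var_x * var_y)


def overlapping_pairs(scores, threshold=0.9):
    """Return criterion pairs whose scores correlate at or above `threshold`.

    `scores` maps criterion name -> list of scores, one per item
    (a hypothetical layout). The default threshold is an assumption.
    """
    flagged = []
    for a, b in combinations(scores, 2):
        if pearson(scores[a], scores[b]) >= threshold:
            flagged.append((a, b))
    return flagged


# Pilot scores for five items on three candidate criteria.
scores = {
    "Helpful": [1, 2, 3, 4, 5],
    "Useful": [1, 2, 3, 5, 5],
    "Tone": [3, 5, 2, 4, 3],
}

flagged = overlapping_pairs(scores)
```

Pairs that are flagged are candidates for merging or for sharper descriptions that separate them.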

## Rubrics and LLM-as-judge

The same rubric you give humans can also serve as the prompt for an "LLM judge" that pre-scores answers cheaply; humans then verify those scores, exactly as in [LLM pre-annotation](/docs/guides/llm-pre-annotation). Keep a human-scored sample to check the judge against, and watch for the judge's own biases (length, formatting, self-preference).
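
As a sketch of that workflow, the Python below turns a rubric into a judge prompt and compares judge scores with a human-scored sample. The prompt wording, the function names, the data layout, and the tolerance of one scale point are all assumptions for illustration. The call to the model itself is deliberately left out, since it depends on the provider you use.

```python
def build_judge_prompt(criteria, scale_points, question, answer):
    """Render a rubric as a plain-text prompt (hypothetical wording)."""
    lines = [
        f"Rate the answer on each criterion from 1 to {scale_points}.",
        "",
        "Criteria:",
    ]
    for name, description in criteria.items():
        lines.append(f"- {name}: {description}")
    lines += [
        "",
        f"Question: {question}",
        f"Answer: {answer}",
        "",
        "Reply with one line per criterion, formatted as 'Name: score'.",
    ]
    return "\n".join(lines)


def judge_agreement(human, judge, tolerance=1):
    """Compare judge scores with human scores, criterion by criterion.

    `human` and `judge` map criterion name -> list of scores for the same
    items in the same order (an assumed layout). Returns, per criterion,
    the fraction of items within `tolerance` points and the mean signed
    difference (positive means the judge scores higher than humans).
    """
    report = {}
    for criterion, human_scores in human.items():
        diffs = [j - h for h, j in zip(human_scores, judge[criterion])]
        report[criterion] = {
            "within_tolerance": sum(abs(d) <= tolerance for d in diffs) / len(diffs),
            "mean_diff": sum(diffs) / len(diffs),
        }
    return report


criteria = {"Accuracy": "Is it factually correct?"}
prompt = build_judge_prompt(criteria, 5, "What is 2 + 2?", "4")

# Human-scored sample vs. judge scores for the same four answers.
human = {"Accuracy": [2, 4, 5, 3]}
judge = {"Accuracy": [4, 4, 5, 5]}
report = judge_agreement(human, judge)
```

A consistently positive `mean_diff` suggests a lenient judge. Breaking the same comparison down by answer length or formatting is one way to look for the biases mentioned above.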

## Further reading

- [Pairwise Model Comparison](/docs/guides/pairwise-model-comparison)
- [Collecting RLHF and Preference Data](/docs/guides/rlhf-preference-data)
- [Rating Scales](/docs/guides/rating-scales)
