Rubric-Based LLM Evaluation
How to evaluate LLM outputs against multiple weighted criteria (MT-Bench style) using Potato's rubric evaluation type.
A rubric breaks a vague judgment ("is this a good answer?") into specific, scored criteria (helpfulness, accuracy, completeness, tone, safety), each rated on a scale. This makes LLM evaluation repeatable and reveals why one answer beats another. It is the structure behind benchmarks like MT-Bench.
A rubric turns subjective quality into a grid of defined criteria and scale points, which raises inter-annotator agreement and makes results interpretable.
When to use a rubric
- The output is rich enough that a single score loses information.
- You need to know which dimension is weak (accuracy vs. tone), not just an overall verdict.
- You want criteria stakeholders agree on up front.
If you only need "which is better", a pairwise comparison is cheaper. Rubrics shine when you need an absolute, multi-dimensional profile.
Setting it up in Potato
annotation_schemes:
  - annotation_type: rubric_eval
    name: answer_quality
    description: "Rate the answer on each criterion."
    scale_points: 5
    criteria:
      - {name: Helpfulness, description: "Does it address the user's actual need?"}
      - {name: Accuracy, description: "Is it factually correct?"}
      - {name: Completeness, description: "Does it cover the important points?"}
      - {name: Tone, description: "Is the style appropriate?"}

Potato renders this as a grid: criteria down the side, scale points across. Annotators score every cell.
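Once the grid is filled in, the scores can be reduced to a per-criterion profile and, if the criteria are weighted, a single overall number. The following is a minimal Python sketch, assuming the ratings have already been loaded into plain dictionaries. The `ratings` data, the `weights` values, and both function names are illustrative assumptions, not part of Potato's API or export format.

```python
from statistics import mean

# Hypothetical ratings: one dict per annotator, keyed by criterion name.
ratings = [
    {"Helpfulness": 4, "Accuracy": 5, "Completeness": 3, "Tone": 4},
    {"Helpfulness": 5, "Accuracy": 4, "Completeness": 3, "Tone": 4},
]

# Hypothetical weights; choose these with stakeholders before annotation starts.
weights = {"Helpfulness": 0.3, "Accuracy": 0.4, "Completeness": 0.2, "Tone": 0.1}


def criterion_means(ratings):
    """Average each criterion across annotators to get a score profile."""
    criteria = ratings[0].keys()
    return {c: mean(r[c] for r in ratings) for c in criteria}


def weighted_score(profile, weights):
    """Collapse a profile into one number, normalizing by the weight total."""
    total = sum(weights.values())
    return sum(profile[c] * weights[c] for c in weights) / total


profile = criterion_means(ratings)
overall = weighted_score(profile, weights)
print(profile)
print(round(overall, 2))
```

Report the profile alongside the overall score: the weighted number is convenient for ranking, but the per-criterion means are what show which dimension is weak.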
Writing good criteria
- Make them independent. Overlapping criteria ("helpful" and "useful") get scored together and add noise.
- Anchor the scale. Describe what a 1 and a 5 look like for each criterion, rather than labeling the two ends once for the whole rubric.
- Keep it short. Four to six criteria is usually the sweet spot; long rubrics fatigue annotators and lower agreement.
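One way to check the independence advice on pilot data is to correlate the scores of each pair of criteria: a pair that moves in lockstep is probably measuring the same thing. The sketch below is a rough heuristic, not a Potato feature. The `scores` data, the `0.9` threshold, and the function names are assumptions chosen for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(vx * vy)


# Hypothetical pilot scores: one list per criterion, one entry per rated item.
scores = {
    "Helpful": [5, 4, 2, 1, 3],
    "Useful": [5, 4, 2, 1, 3],
    "Tone": [3, 4, 4, 3, 5],
}


def overlapping_pairs(scores, threshold=0.9):
    """Return criterion pairs whose scores correlate at or above the threshold."""
    return [
        (a, b)
        for a, b in combinations(scores, 2)
        if pearson(scores[a], scores[b]) >= threshold
    ]


flagged = overlapping_pairs(scores)
print(flagged)
```

A flagged pair is a prompt to merge the two criteria or sharpen their descriptions, not proof of redundancy; some dimensions correlate simply because good answers tend to be good in several ways.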
Rubrics and LLM-as-judge
The same rubric you give humans can serve as the prompt for an "LLM judge" that pre-scores outputs cheaply; humans then verify the scores, exactly as in LLM pre-annotation. Keep a human-scored sample to check the judge against, and watch for the judge's own biases (length, formatting, self-preference).
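As a rough illustration, the rubric can be rendered into a judge prompt, and the judge's scores can be compared with the human-scored sample criterion by criterion. The sketch below makes no model call. The prompt wording, the `rubric` dictionary layout, the sample scores, and both function names are assumptions, not part of Potato.

```python
# Hypothetical in-memory copy of the rubric (two criteria shown for brevity).
rubric = {
    "scale_points": 5,
    "criteria": {
        "Helpfulness": "Does it address the user's actual need?",
        "Accuracy": "Is it factually correct?",
    },
}


def build_judge_prompt(rubric, question, answer):
    """Render the rubric as a plain-text prompt for an LLM judge."""
    lines = [f"Rate the answer on each criterion from 1 to {rubric['scale_points']}.", ""]
    for name, description in rubric["criteria"].items():
        lines.append(f"- {name}: {description}")
    lines += ["", f"Question: {question}", f"Answer: {answer}", ""]
    lines.append("Reply with one line per criterion as 'Name: score'.")
    return "\n".join(lines)


def mean_abs_gap(human, judge):
    """Per-criterion mean absolute difference between paired human and judge scores."""
    gaps = {}
    for criterion in human[0]:
        diffs = [abs(h[criterion] - j[criterion]) for h, j in zip(human, judge)]
        gaps[criterion] = sum(diffs) / len(diffs)
    return gaps


# Hypothetical scores for the same two items, in the same order.
human = [{"Helpfulness": 4, "Accuracy": 5}, {"Helpfulness": 3, "Accuracy": 2}]
judge = [{"Helpfulness": 4, "Accuracy": 4}, {"Helpfulness": 5, "Accuracy": 2}]

prompt = build_judge_prompt(rubric, "What is 2 + 2?", "4")
gaps = mean_abs_gap(human, judge)
print(gaps)
```

A large gap on one criterion suggests the judge reads that criterion differently from your annotators. To probe the length and formatting biases mentioned above, compare gaps between short and long answers, or between plain and heavily formatted ones.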