Rubric-Based LLM Evaluation

How to evaluate LLM outputs against multiple weighted criteria (MT-Bench style) using Potato's rubric evaluation type.

A rubric breaks a vague judgment ("is this a good answer?") into specific, scored criteria (helpfulness, accuracy, completeness, tone, safety), each rated on a scale. This makes LLM evaluation repeatable and reveals why one answer beats another. It is the structure behind benchmarks like MT-Bench.

A rubric turns subjective quality into a grid of defined criteria and scale points, which raises agreement between annotators and makes results interpretable.

When to use a rubric

The output is rich enough that a single score loses information.
You need to know which dimension is weak (accuracy vs. tone), not just an overall verdict.
You want criteria stakeholders agree on up front.

If you only need "which is better", a pairwise comparison is cheaper. Rubrics shine when you need an absolute, multi-dimensional profile.
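To illustrate how a single score can lose information, here is a minimal sketch with two made-up answers. The scores and the helper names `overall` and `weakest` are hypothetical and exist only for this example.

```python
from statistics import mean

# Hypothetical 1-5 scores for two answers on the same four criteria.
answer_a = {"Helpfulness": 5, "Accuracy": 2, "Completeness": 4, "Tone": 5}
answer_b = {"Helpfulness": 4, "Accuracy": 4, "Completeness": 4, "Tone": 4}


def overall(profile):
    """Collapse a profile into one number (unweighted mean)."""
    return mean(profile.values())


def weakest(profile):
    """Return the lowest-scoring criterion and its score."""
    name = min(profile, key=profile.get)
    return name, profile[name]


for label, profile in (("A", answer_a), ("B", answer_b)):
    print(label, overall(profile), weakest(profile))
```

Both answers collapse to the same overall mean, but only the per-criterion profile shows that answer A has an accuracy problem.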

Setting it up in Potato

```yaml
annotation_schemes:
  - annotation_type: rubric_eval
    name: answer_quality
    description: "Rate the answer on each criterion."
    scale_points: 5
    criteria:
      - {name: Helpfulness, description: "Does it address the user's actual need?"}
      - {name: Accuracy,    description: "Is it factually correct?"}
      - {name: Completeness, description: "Does it cover the important points?"}
      - {name: Tone,        description: "Is the style appropriate?"}
```

Potato renders this as a grid: criteria down the side, scale points across. Annotators pick one scale point for each criterion.
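Once scores are collected, they can be combined into a per-criterion profile and, if needed, a weighted overall score. The sketch below assumes the annotations have already been loaded into a nested dictionary; that layout, the annotator names, and the weights are assumptions for illustration, not Potato's export format or a Potato configuration option.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scores for one answer: annotator -> criterion -> 1-5 score.
# Potato's real output layout may differ; adapt the loading step.
scores = {
    "ann1": {"Helpfulness": 4, "Accuracy": 3, "Completeness": 5, "Tone": 4},
    "ann2": {"Helpfulness": 5, "Accuracy": 3, "Completeness": 4, "Tone": 4},
}

# Assumed analysis-time weights (not part of the config above).
weights = {"Helpfulness": 0.3, "Accuracy": 0.4, "Completeness": 0.2, "Tone": 0.1}


def criterion_means(scores):
    """Average each criterion across annotators."""
    by_criterion = defaultdict(list)
    for per_annotator in scores.values():
        for name, value in per_annotator.items():
            by_criterion[name].append(value)
    return {name: mean(values) for name, values in by_criterion.items()}


def weighted_score(means, weights):
    """Weighted average of criterion means, normalized by total weight."""
    total_weight = sum(weights[name] for name in means)
    return sum(means[name] * weights[name] for name in means) / total_weight


means = criterion_means(scores)
print(means)
print(round(weighted_score(means, weights), 2))
```

Keeping the per-criterion means next to the weighted total preserves the profile that motivates a rubric in the first place.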

Writing good criteria

Make them independent. Overlapping criteria ("helpful" and "useful") get scored together and add noise.
Anchor the scale. Describe what a 1 and a 5 look like for each criterion, rather than labeling the endpoints once for the whole rubric.
Keep it short. Four to six criteria is usually the sweet spot; long rubrics fatigue annotators and lower agreement.
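One rough way to spot overlapping criteria in pilot data is to check how strongly their scores move together. The sketch below computes Pearson correlation between every pair of criteria; the data, the criterion names, and the 0.9 threshold are assumptions chosen for illustration, and a high correlation is a prompt to review the definitions rather than proof of redundancy.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant scores: correlation is undefined, report no signal
    return cov / sqrt(var_x * var_y)


# Hypothetical pilot scores: criterion -> one score per rated item.
pilot = {
    "Helpful": [5, 4, 2, 3, 1, 4],
    "Useful": [5, 4, 2, 3, 1, 5],
    "Tone": [3, 4, 4, 2, 3, 3],
}


def overlapping_pairs(scores, threshold=0.9):
    """Return criterion pairs whose correlation meets the (assumed) threshold."""
    return [
        (a, b)
        for a, b in combinations(scores, 2)
        if pearson(scores[a], scores[b]) >= threshold
    ]


print(overlapping_pairs(pilot))
```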

Rubrics and LLM-as-judge

The same rubric you give humans can also prompt an "LLM judge" for cheap pre-scoring; humans then verify the scores, exactly as in LLM pre-annotation. Keep a human-scored sample to check the judge against, and watch for the judge's own biases (length, formatting, self-preference).
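As a minimal sketch of this workflow, the code below turns the rubric into a judge prompt and compares judge scores with a human-scored sample. It makes no model call; the prompt wording, the function names `build_judge_prompt` and `judge_vs_human`, and the sample scores are all hypothetical.

```python
criteria = {
    "Helpfulness": "Does it address the user's actual need?",
    "Accuracy": "Is it factually correct?",
    "Completeness": "Does it cover the important points?",
    "Tone": "Is the style appropriate?",
}


def build_judge_prompt(question, answer, criteria, scale_points=5):
    """Assemble a plain-text judge prompt from the rubric criteria."""
    lines = [f"Rate the answer on each criterion from 1 to {scale_points}.", "Criteria:"]
    lines += [f"- {name}: {desc}" for name, desc in criteria.items()]
    lines += [
        "",
        f"Question: {question}",
        f"Answer: {answer}",
        "",
        "Reply with one line per criterion, formatted as 'Name: score'.",
    ]
    return "\n".join(lines)


def judge_vs_human(human, judge):
    """Per-criterion exact-agreement rate and mean absolute error."""
    report = {}
    for name in human:
        pairs = list(zip(human[name], judge[name]))
        exact = sum(h == j for h, j in pairs) / len(pairs)
        mae = sum(abs(h - j) for h, j in pairs) / len(pairs)
        report[name] = {"exact": exact, "mae": mae}
    return report


# Hypothetical scores on the same four items from humans and from the judge.
human = {"Accuracy": [4, 2, 5, 3], "Tone": [4, 4, 3, 5]}
judge = {"Accuracy": [5, 4, 5, 4], "Tone": [4, 4, 3, 4]}

print(build_judge_prompt("What is 2 + 2?", "4", criteria))
print(judge_vs_human(human, judge))
```

In this made-up sample the judge tracks humans closely on tone but rates accuracy higher than they do, which is the kind of gap the human-scored sample is meant to surface.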
