
Rubric-Based LLM Evaluation

How to evaluate LLM outputs against multiple weighted criteria (MT-Bench style) using Potato's rubric evaluation type.

A rubric breaks a vague judgment ("is this a good answer?") into specific, scored criteria (helpfulness, accuracy, completeness, tone, safety), each rated on a scale. This makes LLM evaluation repeatable and reveals why one answer beats another. It is the structure behind benchmarks like MT-Bench.

A rubric turns subjective quality into a grid of defined criteria and scale points, which raises agreement and makes results interpretable.

When to use a rubric

  • The output is rich enough that a single score loses information.
  • You need to know which dimension is weak (accuracy vs. tone), not just an overall verdict.
  • You want criteria stakeholders agree on up front.

If you only need "which is better", a pairwise comparison is cheaper. Rubrics shine when you need an absolute, multi-dimensional profile.

Setting it up in Potato

```yaml
annotation_schemes:
  - annotation_type: rubric_eval
    name: answer_quality
    description: "Rate the answer on each criterion."
    scale_points: 5
    criteria:
      - {name: Helpfulness, description: "Does it address the user's actual need?"}
      - {name: Accuracy,    description: "Is it factually correct?"}
      - {name: Completeness, description: "Does it cover the important points?"}
      - {name: Tone,        description: "Is the style appropriate?"}
```

Potato renders this as a grid: criteria down the side, scale points across. Annotators score every cell.
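
Once the grid is filled in, the per-criterion scores can be averaged into a profile and, if a single number is needed, collapsed into a weighted score. The sketch below is a minimal illustration: the `ratings` shape (one dict per annotator) and the `weights` values are assumptions made for this example, not Potato's export format or configuration.

```python
from statistics import mean

# Hypothetical export: one dict of criterion -> score per annotator.
ratings = [
    {"Helpfulness": 4, "Accuracy": 2, "Completeness": 4, "Tone": 5},
    {"Helpfulness": 5, "Accuracy": 3, "Completeness": 4, "Tone": 5},
]

# Illustrative weights (an assumption; choose them with stakeholders).
weights = {"Helpfulness": 0.3, "Accuracy": 0.4, "Completeness": 0.2, "Tone": 0.1}


def criterion_profile(ratings):
    """Average each criterion across annotators."""
    criteria = ratings[0].keys()
    return {c: mean(r[c] for r in ratings) for c in criteria}


def weighted_score(profile, weights):
    """Weighted mean of the per-criterion averages."""
    total_weight = sum(weights.values())
    return sum(profile[c] * weights[c] for c in profile) / total_weight


profile = criterion_profile(ratings)
overall = weighted_score(profile, weights)
weakest = min(profile, key=profile.get)  # the dimension to look at first

print(profile)
print(round(overall, 2), weakest)
```

Keeping the profile alongside the overall score preserves the information a single number would hide, here that accuracy is the weak dimension.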

Writing good criteria

  • Make them independent. Overlapping criteria ("helpful" and "useful") get scored together and add noise.
  • Anchor the scale. Describe what a 1 and a 5 look like for each criterion separately, rather than relying on one generic label for the ends of the scale.
  • Keep it short. Four to six criteria is usually the sweet spot; long rubrics fatigue annotators and lower agreement.
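
One rough way to check the independence advice is to correlate criterion scores from a pilot round. The sketch below flags criterion pairs whose scores move together. The pilot data and the 0.9 threshold are assumptions chosen for illustration, and a high correlation is only a hint of overlap (two distinct qualities can genuinely co-occur), so flagged pairs deserve a manual look rather than automatic merging.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(vx * vy)


def overlapping_pairs(scores, threshold=0.9):
    """Return (criterion_a, criterion_b, r) for pairs with r >= threshold."""
    flagged = []
    for a, b in combinations(scores, 2):
        r = pearson(scores[a], scores[b])
        if r >= threshold:
            flagged.append((a, b, r))
    return flagged


# Hypothetical pilot scores: one list per criterion, one entry per item.
scores = {
    "Helpful": [5, 4, 2, 1, 3, 4],
    "Useful": [5, 4, 2, 1, 3, 5],
    "Tone": [3, 5, 4, 4, 2, 3],
}

flagged = overlapping_pairs(scores)
for a, b, r in flagged:
    print(f"{a} / {b}: r = {r:.2f}")
```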

Rubrics and LLM-as-judge

The same rubric you give humans can serve as the prompt for an "LLM judge" that pre-scores outputs cheaply; humans then verify the scores, exactly as in LLM pre-annotation. Keep a human-scored sample to check the judge against, and watch for the judge's own biases (length, formatting, self-preference).
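
Checking the judge against a human-scored sample can be as simple as comparing scores criterion by criterion. The sketch below is one possible check, not a prescribed method: the `human` and `judge` data, the function name `judge_agreement`, and the one-point tolerance are all assumptions for illustration. It reports exact agreement, agreement within the tolerance, and the mean signed difference, where a positive value means the judge scores higher than humans.

```python
def judge_agreement(human, judge, tolerance=1):
    """Compare judge scores to human scores for each criterion."""
    report = {}
    for criterion in human:
        pairs = list(zip(human[criterion], judge[criterion]))
        n = len(pairs)
        report[criterion] = {
            "exact": sum(h == j for h, j in pairs) / n,
            "within_tolerance": sum(abs(h - j) <= tolerance for h, j in pairs) / n,
            "mean_bias": sum(j - h for h, j in pairs) / n,
        }
    return report


# Hypothetical sample: the same four items scored by humans and by the judge.
human = {"Accuracy": [5, 3, 2, 4], "Tone": [4, 4, 5, 3]}
judge = {"Accuracy": [5, 4, 4, 4], "Tone": [4, 4, 5, 3]}

report = judge_agreement(human, judge)
for criterion, stats in report.items():
    print(criterion, stats)
```

A consistently positive `mean_bias` on one criterion suggests the judge is lenient there; comparing it across short and long answers is one way to probe for length bias.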
