Pairwise Model Comparison

How to compare two models or two responses head-to-head with human annotators, including multi-dimensional comparison and bias controls, using Potato.

To decide which of two models is better, show annotators a prompt and both responses and ask which one wins. Aggregated over many prompts, these head-to-head judgments rank models more reliably than absolute scores do. It is the method behind public model leaderboards built on human votes.

This is pairwise comparison applied to model outputs; many comparisons can be turned into a single ranking with an Elo or Bradley–Terry model.
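As a rough illustration of the ranking step, the sketch below fits Bradley-Terry strengths from a list of (winner, loser) pairs using the standard iterative update. It is independent of Potato: the function name, the input format, and the model names are assumptions made for this example, and ties are assumed to have been dropped beforehand.

```python
from collections import defaultdict

def bradley_terry(comparisons, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Hypothetical helper for illustration; ties must be removed upstream.
    Returns a dict of model -> strength, normalized to sum to 1.
    """
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    models = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for other in models:
                if other == m:
                    continue
                n = pair_counts.get(frozenset((m, other)), 0.0)
                if n:
                    denom += n / (strength[m] + strength[other])
            # A model with no comparisons keeps its previous strength.
            updated[m] = wins[m] / denom if denom else strength[m]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength

# Made-up votes: model_x beats model_y 7-3, model_y beats model_z 6-4,
# and model_x beats model_z 8-2.
comparisons = (
    [("model_x", "model_y")] * 7 + [("model_y", "model_x")] * 3
    + [("model_y", "model_z")] * 6 + [("model_z", "model_y")] * 4
    + [("model_x", "model_z")] * 8 + [("model_z", "model_x")] * 2
)
strengths = bradley_terry(comparisons)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

A model that never wins is pushed toward a strength of zero, so in practice every model should have at least a few wins and losses before the ranking is trusted.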

Basic head-to-head

yaml
annotation_schemes:
  - annotation_type: pairwise
    name: which_better
    description: "Which response is better overall?"
    mode: binary
    allow_tie: true

Multi-dimensional comparison

A single "better" verdict hides trade-offs: model A may be more accurate while model B is clearer. Compare on several dimensions at once:

yaml
annotation_schemes:
  - annotation_type: pairwise
    name: comparison
    description: "Compare the two responses on each dimension."
    mode: multi_dimension
    dimensions: [accuracy, helpfulness, safety]
    require_justification: true

A required justification makes the data auditable and surfaces cases where annotators rewarded the wrong thing.
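Once the judgments are collected, a per-dimension tally makes those trade-offs visible. The sketch below assumes each judgment has already been loaded as a dict mapping a dimension name to "A", "B", or "tie". That record shape is a hypothetical one chosen for illustration, not Potato's export format.

```python
from collections import Counter

def per_dimension_win_rates(judgments, dimensions):
    """Win rate of each side per dimension, with ties counted separately.

    judgments: list of dicts like {"accuracy": "A", "safety": "tie"}
    (an assumed shape, not a documented export format).
    """
    rates = {}
    for dim in dimensions:
        counts = Counter(j[dim] for j in judgments if dim in j)
        decided = counts["A"] + counts["B"]
        rates[dim] = {
            "A": counts["A"] / decided if decided else None,
            "B": counts["B"] / decided if decided else None,
            "ties": counts["tie"],
        }
    return rates

# Made-up judgments for four items.
judgments = [
    {"accuracy": "A", "helpfulness": "B", "safety": "tie"},
    {"accuracy": "A", "helpfulness": "B", "safety": "A"},
    {"accuracy": "B", "helpfulness": "B", "safety": "tie"},
    {"accuracy": "A", "helpfulness": "A", "safety": "tie"},
]
rates = per_dimension_win_rates(judgments, ["accuracy", "helpfulness", "safety"])
for dim, summary in rates.items():
    print(dim, summary)
```

In this toy data, A leads on accuracy while B leads on helpfulness, which a single overall vote would have collapsed into one winner.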

Controlling for bias

Head-to-head data is only as good as its bias controls:

  • Position bias: randomize which model is shown as "A"; annotators favor one side otherwise.
  • Length/style bias: annotators often prefer longer or more confident text regardless of quality. Name it in the guidelines.
  • Verbosity ≠ quality: record each response's length so you can check afterward whether length is driving wins.
  • Agreement: have several annotators label the same items (overlap) and track inter-annotator agreement.
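The controls above can be sketched outside the annotation tool. The helpers below are hypothetical and do not use any Potato API: one shuffles the two models into display slots reproducibly, one maps a vote back to the model, one measures how often the longer response wins, and one computes Cohen's kappa for two annotators who labeled the same items.

```python
import random
from collections import Counter

def assign_positions(item_id, outputs, seed=0):
    """Shuffle two model outputs into display slots 'A' and 'B'.

    outputs: dict of model name -> response text (exactly two entries).
    Seeding on the item id keeps the layout reproducible across runs.
    """
    models = sorted(outputs)
    random.Random(f"{seed}:{item_id}").shuffle(models)
    return {"A": models[0], "B": models[1]}

def resolve_vote(layout, vote):
    """Map a slot vote ('A', 'B', or 'tie') back to a model name."""
    return layout.get(vote, "tie")

def longer_response_win_rate(records):
    """Share of decided votes won by the longer response.

    records: (winner_text, loser_text) pairs; equal lengths are skipped.
    """
    decided = [(w, l) for w, l in records if len(w) != len(l)]
    if not decided:
        return None
    return sum(len(w) > len(l) for w, l in decided) / len(decided)

def cohen_kappa(labels_1, labels_2):
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(labels_1)
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    c1, c2 = Counter(labels_1), Counter(labels_2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

layout = assign_positions("item-1", {"model_x": "short", "model_y": "a longer reply"})
print(layout, resolve_vote(layout, "A"))
print(longer_response_win_rate([("long answer here", "short"),
                                ("tiny", "a much longer answer"),
                                ("abcd", "ab")]))
print(cohen_kappa(["A", "A", "B", "B"], ["A", "A", "B", "A"]))
```

A longer-response win rate far above 0.5 does not prove bias on its own, since longer answers may really be better, but it marks a sample worth reviewing against the guidelines.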

Comparison vs. rubric

Use pairwise when you need a ranking and want high agreement. Use a rubric when you need an absolute, per-dimension profile of each model. Many evaluations run both.

Further reading