Pairwise Model Comparison
How to compare two models or two responses head-to-head with human annotators, including multi-dimensional comparison and bias controls, using Potato.
To decide which of two models is better, show annotators a prompt and both responses and ask which one wins. Aggregated over many prompts, these head-to-head judgments rank models more reliably than absolute scores do. This is the method behind public model leaderboards built on human votes.
This is pairwise comparison applied to model outputs; many comparisons can be turned into a single ranking with an Elo or Bradley–Terry model.
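
As a rough illustration of turning many comparisons into one ranking, the sketch below fits Bradley–Terry strengths with the standard iterative (minorization-maximization) update. It is not part of Potato; the function name `bradley_terry` and the `(winner, loser)` input format are assumptions made for this example, and ties are assumed to have been dropped beforehand.

```python
from collections import defaultdict


def bradley_terry(comparisons, iterations=200, tol=1e-9):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Hypothetical helper for illustration. Returns {model: strength},
    with strengths normalized to sum to 1.
    """
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # number of comparisons per unordered pair
    models = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))
    if not models:
        return {}

    models = sorted(models)
    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            # Sum n_ij / (p_i + p_j) over every opponent j that m has faced.
            denom = 0.0
            for other in models:
                if other == m:
                    continue
                n = pair_counts.get(frozenset((m, other)), 0.0)
                if n:
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom if denom else strength[m]
        total = sum(updated.values())
        updated = {m: s / total for m, s in updated.items()}
        delta = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if delta < tol:
            break
    return strength


# Toy data: each tuple is (winner, loser) for one annotated prompt.
votes = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
ranking = sorted(bradley_terry(votes).items(), key=lambda kv: -kv[1])
print(ranking)
```

A model that never wins gets a strength of zero under this update, so in practice you may want a small prior or a minimum number of comparisons per pair before reading much into the ranking.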
Basic head-to-head
annotation_schemes:
  - annotation_type: pairwise
    name: which_better
    description: "Which response is better overall?"
    mode: binary
    allow_tie: true
Multi-dimensional comparison
A single "better" hides trade-offs: model A may be more accurate while model B is clearer. Compare on several dimensions at once:
annotation_schemes:
  - annotation_type: pairwise
    name: comparison
    description: "Compare the two responses on each dimension."
    mode: multi_dimension
    dimensions: [accuracy, helpfulness, safety]
    require_justification: true
A required justification makes the data auditable and surfaces cases where annotators rewarded the wrong thing.
Controlling for bias
Head-to-head data is only as good as its bias controls:
- Position bias: randomize which model is shown as "A"; annotators favor one side otherwise.
- Length/style bias: annotators often prefer longer or more confident text regardless of quality. Name it in the guidelines.
- Verbosity ≠ quality: consider capturing length so you can check whether it's driving wins.
- Agreement: collect overlap and track inter-annotator agreement.
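
Two of the controls above can be sketched in a few lines of Python run outside the annotation tool: randomizing which model appears as "A" when preparing items, and checking afterwards how often the longer response won. The function names and the record fields (`response_a`, `model_a`, `winner`, and so on) are hypothetical, chosen for this example; they are not Potato's data format, so adapt them to whatever your export looks like.

```python
import random


def assign_sides(prompt_id, response_1, response_2, rng):
    """Randomly choose which response is displayed as 'A'.

    Each response is a dict with hypothetical keys 'model' and 'text'.
    The model_a/model_b fields are for later analysis only and should
    not be shown to annotators.
    """
    swapped = rng.random() < 0.5
    first, second = (response_2, response_1) if swapped else (response_1, response_2)
    return {
        "id": prompt_id,
        "response_a": first["text"],
        "response_b": second["text"],
        "model_a": first["model"],
        "model_b": second["model"],
    }


def longer_response_win_rate(judgments):
    """Share of decided comparisons won by the longer response.

    Each judgment is a dict with 'response_a', 'response_b', and
    'winner' in {"a", "b", "tie"}. Ties and equal-length pairs are
    skipped. Returns None when nothing is left to count.
    """
    longer_wins = 0
    decided = 0
    for j in judgments:
        len_a, len_b = len(j["response_a"]), len(j["response_b"])
        if j["winner"] == "tie" or len_a == len_b:
            continue
        decided += 1
        longer = "a" if len_a > len_b else "b"
        if j["winner"] == longer:
            longer_wins += 1
    return longer_wins / decided if decided else None


# A seeded generator keeps the side assignment reproducible.
rng = random.Random(0)
item = assign_sides(
    "prompt-1",
    {"model": "model_x", "text": "A short answer."},
    {"model": "model_y", "text": "A considerably longer and more detailed answer."},
    rng,
)
print(item["model_a"], item["model_b"])
```

A win rate for the longer response well above one half does not prove length bias, since longer answers can simply be better, but it marks a subset worth auditing by hand.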
Comparison vs. rubric
Use pairwise when you need a ranking and want high agreement. Use a rubric when you need an absolute, per-dimension profile of each model. Many evaluations run both.