# Pairwise Model Comparison

Source: https://www.potatoannotator.com/docs/guides/pairwise-model-comparison

**To decide which of two models is better, show annotators a prompt and both responses and ask which one wins. Aggregated over many prompts, these head-to-head judgments rank models more reliably than absolute scores do.** This is the method behind public model leaderboards built on human votes.

This is [pairwise comparison](https://en.wikipedia.org/wiki/Pairwise_comparison) applied to model outputs; many comparisons can be turned into a single ranking with an [Elo](https://en.wikipedia.org/wiki/Elo_rating_system) or [Bradley–Terry](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) model.
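
To make the aggregation step concrete, here is a minimal sketch of fitting Bradley–Terry strengths with the standard iterative (minorization-maximization) update. The vote format `(model_x, model_y, outcome)` and the function name `bradley_terry` are hypothetical and not part of Potato's export format. Counting a tie as half a win for each side is one common convention, assumed here for simplicity.

```python
from collections import defaultdict


def bradley_terry(votes, iters=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from (model_x, model_y, outcome) votes.

    outcome is "x", "y", or "tie". A tie counts as half a win for each side
    (an assumption of this sketch, not the only way to treat ties).
    """
    wins = defaultdict(float)   # (possibly fractional) wins per model
    games = defaultdict(float)  # number of votes per unordered model pair
    for x, y, outcome in votes:
        if x == y:
            raise ValueError("a vote must compare two different models")
        wins[x] += {"x": 1.0, "y": 0.0, "tie": 0.5}[outcome]
        wins[y] += {"x": 0.0, "y": 1.0, "tie": 0.5}[outcome]
        games[frozenset((x, y))] += 1.0

    models = sorted(wins)
    if not models:
        return {}

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for m in models:
            # Sum n_ij / (p_i + p_j) over every opponent j that m has faced.
            denom = 0.0
            for pair, n in games.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom
        # Strengths are only defined up to scale, so fix the mean at 1.
        scale = len(models) / sum(updated.values())
        updated = {m: s * scale for m, s in updated.items()}
        delta = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if delta < tol:
            break
    return strength
```

Sorting the returned dictionary by value gives the ranking, and `strength[a] / (strength[a] + strength[b])` is the modeled probability that `a` beats `b`. A model that never wins gets a strength of zero, so small datasets may need more votes or a regularized variant.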

## Basic head-to-head

```yaml
annotation_schemes:
  - annotation_type: pairwise
    name: which_better
    description: "Which response is better overall?"
    mode: binary
    allow_tie: true
```

## Multi-dimensional comparison

A single "better" hides trade-offs: model A may be more accurate while model B is clearer. Compare on several dimensions at once:

```yaml
annotation_schemes:
  - annotation_type: pairwise
    name: comparison
    description: "Compare the two responses on each dimension."
    mode: multi_dimension
    dimensions: [accuracy, helpfulness, safety]
    require_justification: true
```

A required justification makes the data auditable and surfaces cases where annotators rewarded the wrong thing.
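
Once judgments are collected, a per-dimension tally shows where each model wins. The sketch below assumes each judgment has already been loaded as a dictionary mapping a dimension name to `"A"`, `"B"`, or `"tie"`, with `"A"` and `"B"` referring to fixed models rather than screen positions. That record shape and the function name `dimension_win_rates` are hypothetical, not Potato's output format.

```python
from collections import Counter, defaultdict


def dimension_win_rates(judgments):
    """Return, per dimension, the share of votes for "A", "B", and "tie".

    judgments: iterable of dicts such as
        {"accuracy": "A", "helpfulness": "tie", "safety": "B"}
    (a hypothetical record shape used only for this sketch).
    """
    counts = defaultdict(Counter)
    for judgment in judgments:
        for dimension, choice in judgment.items():
            counts[dimension][choice] += 1

    rates = {}
    for dimension, counter in counts.items():
        total = sum(counter.values())
        rates[dimension] = {
            choice: counter[choice] / total for choice in ("A", "B", "tie")
        }
    return rates
```

A model that wins on accuracy but loses on helpfulness shows up as diverging rates across dimensions, which a single overall vote would hide.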

## Controlling for bias

Head-to-head data is only as good as its bias controls:

- **Position bias**: randomize which model is shown as "A"; otherwise any tendency to favor one side is confounded with model quality.
- **Length/style bias**: annotators often prefer longer or more confident text regardless of quality. Call this out explicitly in the guidelines.
- **Verbosity ≠ quality**: record each response's length so you can check whether length, rather than quality, is driving wins.
- **Agreement**: collect overlap and track [inter-annotator agreement](/docs/guides/inter-annotator-agreement).
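
If you prepare items in your own pipeline, the position and length checks above can be sketched as follows. The helper names (`make_item`, `resolve_vote`, `bias_stats`) and the item layout are hypothetical and do not describe Potato's internals. The idea is to shuffle which model appears as "A", keep the mapping hidden from annotators, and afterwards measure how often position "A" and the longer response win.

```python
import random


def make_item(prompt, responses, rng):
    """Build a display item from exactly two responses in random order.

    responses: dict mapping model name -> response text.
    The "key" field records which model sits in which slot and should
    never be shown to annotators.
    """
    if len(responses) != 2:
        raise ValueError("expected exactly two responses")
    order = list(responses)
    rng.shuffle(order)
    return {
        "prompt": prompt,
        "A": responses[order[0]],
        "B": responses[order[1]],
        "key": {"A": order[0], "B": order[1]},
    }


def resolve_vote(item, vote):
    """Map a positional vote ("A", "B", or "tie") back to a model name."""
    return "tie" if vote == "tie" else item["key"][vote]


def bias_stats(items, votes):
    """Share of decided votes won by position "A" and by the longer text."""
    decided = [(item, vote) for item, vote in zip(items, votes) if vote != "tie"]
    if not decided:
        return None
    a_wins = sum(vote == "A" for _, vote in decided)
    longer_wins = sum(
        len(item[vote]) > len(item["B" if vote == "A" else "A"])
        for item, vote in decided
    )
    return {
        "position_a_win_rate": a_wins / len(decided),
        "longer_win_rate": longer_wins / len(decided),
    }
```

With randomized positions, a `position_a_win_rate` far from 0.5 suggests annotators are favoring a side. A `longer_win_rate` well above 0.5 is a prompt to check whether length is standing in for quality. Character count is used as a rough length proxy here; token or word counts would work equally well.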

## Comparison vs. rubric

Use pairwise when you need a *ranking* and want high inter-annotator agreement. Use a [rubric](/docs/guides/rubric-based-llm-evaluation) when you need an absolute, per-dimension profile of each model. Many evaluations run both.

## Further reading

- [Pairwise and Best–Worst Scaling](/docs/guides/pairwise-and-best-worst)
- [Rubric-Based LLM Evaluation](/docs/guides/rubric-based-llm-evaluation)
- [Collecting RLHF and Preference Data](/docs/guides/rlhf-preference-data)