# Pairwise and Best–Worst Scaling

Source: https://www.potatoannotator.com/docs/guides/pairwise-and-best-worst

**People are bad at giving stable absolute scores but good at comparing. Comparative annotation exploits this: instead of "rate this 1–5", you ask "which is better, A or B?".** The two main forms are pairwise comparison and best–worst scaling. Both are widely used to collect preference data for training and evaluating AI models.

See [Pairwise comparison](https://en.wikipedia.org/wiki/Pairwise_comparison) and [MaxDiff](https://en.wikipedia.org/wiki/MaxDiff) for background.

## Pairwise comparison

Show two items and ask which one wins. The task is simple, tends to yield high agreement, and is the format used to collect human preference data for [reinforcement learning from human feedback](https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback).

```yaml
annotation_schemes:
  - annotation_type: pairwise
    name: preference
    description: "Which response better answers the question?"
    mode: binary
    allow_tie: true
    sequential_key_binding: true
```

Allowing ties keeps annotators from inventing a difference where none exists. To capture *how much* better, switch `mode` to a scale (e.g. "A much better … B much better"). The [pairwise preference showcase](/showcase/pairwise-preference) is a working example.

Many pairwise judgments can be turned into a single ranking with a model such as the [Elo rating system](https://en.wikipedia.org/wiki/Elo_rating_system) or the [Bradley–Terry model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model).
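As a rough illustration of that aggregation step, the sketch below fits Bradley–Terry strengths with the classic iterative (minorization-maximization) update. It is not part of Potato: the function name `bradley_terry`, the `(winner, loser)` tuple format, and the choice to drop ties before fitting are all assumptions made for this example. It also assumes every item wins and loses at least once; otherwise the estimates drift toward zero or grow without bound.

```python
from collections import defaultdict


def bradley_terry(comparisons, iterations=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Hypothetical helper for illustration. Ties are assumed to have been
    filtered out, and winner != loser in every pair.
    """
    wins = defaultdict(float)         # total wins per item
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        items.update((winner, loser))
    if not items:
        return {}

    strength = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            # Sum n_ij / (p_i + p_j) over every opponent j that i has met.
            denom = 0.0
            for pair, n in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        # Strengths are only defined up to scale, so rescale to mean 1.
        total = sum(updated.values())
        updated = {k: v * len(items) / total for k, v in updated.items()}
        delta = max(abs(updated[k] - strength[k]) for k in items)
        strength = updated
        if delta < tol:
            break
    return strength


votes = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 2 + [("C", "A")]
)
scores = bradley_terry(votes)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Sorting by the fitted strength gives the ranking. The strengths themselves only carry meaning relative to each other, which is why the sketch rescales them to a mean of 1.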

## Best–worst scaling (MaxDiff)

Show a small set (often four items) and ask for the **best** and the **worst**. Each judgment is more informative than a single pairwise vote, because it fixes both ends of the set at once: with four items, one best–worst choice settles five of the six possible pairwise orderings.

```yaml
annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    description: "Pick the most and least fluent translation."
    tuple_size: 4
    best_label: "Most fluent"
    worst_label: "Least fluent"
```

Best–worst scaling produces reliable interval-like scores from simple choices and is widely used to build calibrated rankings from many annotators.
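One common way to turn best–worst choices into scores is simple counting: for each item, take the number of times it was chosen best, subtract the number of times it was chosen worst, and divide by the number of times it was shown. The sketch below assumes a hypothetical export format of `(items_shown, best, worst)` tuples; it is an illustration, not Potato's own scoring code.

```python
from collections import Counter


def best_worst_scores(judgments):
    """Counting scores in [-1, 1] from (items_shown, best, worst) tuples.

    The tuple layout is an assumption made for this example.
    """
    best, worst, shown = Counter(), Counter(), Counter()
    for items, chosen_best, chosen_worst in judgments:
        shown.update(items)      # every item in the tuple was displayed once
        best[chosen_best] += 1
        worst[chosen_worst] += 1
    return {
        item: (best[item] - worst[item]) / shown[item]
        for item in shown
    }


judgments = [
    (("a", "b", "c", "d"), "a", "d"),
    (("a", "b", "c", "e"), "a", "e"),
    (("b", "c", "d", "e"), "b", "d"),
]
bws = best_worst_scores(judgments)
```

An item that is always picked best scores 1, one that is always picked worst scores -1, and one that is never picked either way scores 0.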

## When to prefer comparisons over ratings

- Your construct is hard to anchor absolutely (humor, helpfulness, aesthetic quality).
- You need high agreement and your Likert scale is noisy.
- You are building preference data to train or align a model.

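The cost is that you get only *relative* information: a comparison tells you which item is better, not how good either one is. To place all items on a common scale you need a model (Elo, Bradley–Terry), and the resulting scores are meaningful only relative to the other items that were compared.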
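For votes that arrive one at a time, an Elo-style update is a lightweight way to maintain such a scale. The sketch below uses the standard Elo expected-score formula. The function names, the K-factor of 32, and the decision to score a tie as 0.5 are assumptions for illustration, not settings taken from Potato.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, outcome, k=32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    outcome: 1.0 if A was preferred, 0.0 if B was preferred,
    0.5 for a tie (an assumed convention for tie votes).
    """
    delta = k * (outcome - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two items start level; A is preferred once, then the pair ties.
a, b = elo_update(1500.0, 1500.0, 1.0)
a_after_tie, b_after_tie = elo_update(a, b, 0.5)
```

Because the update is zero-sum, the ratings only describe how items stand against each other. A tie between unequal ratings pulls the two slightly back together.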

## Further reading

- [Rating Scales](/docs/guides/rating-scales), the absolute-scoring alternative
- [Pairwise Model Comparison](/docs/guides/pairwise-model-comparison), comparing AI outputs
- [RLHF Preference Data](/docs/guides/rlhf-preference-data)