Pairwise and Best–Worst Scaling

When to use comparative judgments instead of ratings, how pairwise comparison and best–worst scaling (MaxDiff) work, and how to set them up in Potato.

People tend to give unstable absolute scores but are good at comparing. Comparative annotation exploits this: instead of "rate this 1–5", you ask "which is better, A or B?". The two main forms are pairwise comparison and best–worst scaling. Both are widely used to collect preference data for training and aligning AI models.

See Pairwise comparison and MaxDiff for background.

Pairwise comparison

Show two items and ask which wins. The task is simple, tends to produce high inter-annotator agreement, and is the format used to collect human preference data for reinforcement learning from human feedback (RLHF).

yaml
annotation_schemes:
  - annotation_type: pairwise
    name: preference
    description: "Which response better answers the question?"
    mode: binary
    allow_tie: true
    sequential_key_binding: true

Allowing ties keeps annotators from inventing a difference where none exists. To capture how much better one item is, switch mode to a scale (e.g. "A much better … B much better"). The pairwise preference showcase is a working example.

Many pairwise judgments can be turned into a single ranking with a model such as the Elo rating system or the Bradley–Terry model.
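As a rough illustration of that aggregation step, the sketch below fits a Bradley–Terry model with the standard iterative (minorization-maximization) update. The function name `bradley_terry` and the `(winner, loser)` input format are assumptions made for this example, not Potato's export format, and ties are assumed to have been dropped beforehand. The estimate is only well behaved when every item has at least one win and one loss.

```python
from collections import defaultdict


def bradley_terry(comparisons, iterations=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Hypothetical helper: assumes winner != loser and that ties were removed.
    """
    wins = defaultdict(float)         # total wins per item
    pair_counts = defaultdict(float)  # number of comparisons per unordered pair
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strength = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            # Sum n_ij / (p_i + p_j) over every opponent j that i has faced.
            denom = 0.0
            for pair, n in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        # Rescale so the strengths average to 1 (the model fixes only ratios).
        total = sum(updated.values())
        updated = {k: v * len(items) / total for k, v in updated.items()}
        delta = max(abs(updated[k] - strength[k]) for k in items)
        strength = updated
        if delta < tol:
            break
    return strength


# Toy data: each pair is compared four times.
comparisons = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)
strengths = bradley_terry(comparisons)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)  # ['A', 'B', 'C']
```

Under the model, the estimated probability that item i beats item j is `strengths[i] / (strengths[i] + strengths[j])`, so only the ratios between strengths carry meaning.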

Best–worst scaling (MaxDiff)

Show a small set (often four items) and ask for the best and the worst. Each judgment is more informative than a single pairwise vote, because it fixes both ends of the set at once.
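To see why, note that with four items a best and a worst choice imply five of the six possible pairwise orderings: the best item beats the other three, and the two middle items each beat the worst. Only the order of the two middle items stays unknown. The sketch below enumerates those implied pairs; `implied_pairs` is a hypothetical helper written for this illustration, not part of Potato.

```python
def implied_pairs(items, best, worst):
    """Return the (winner, loser) pairs implied by one best-worst judgment.

    Hypothetical helper: assumes best and worst are distinct members of items.
    """
    pairs = []
    for item in items:
        if item != best:
            pairs.append((best, item))   # best beats every other item
    for item in items:
        if item not in (best, worst):
            pairs.append((item, worst))  # every middle item beats worst
    return pairs


pairs = implied_pairs(["A", "B", "C", "D"], best="A", worst="D")
print(len(pairs))  # 5
```

The expanded pairs can be fed to any pairwise aggregation model if you want to combine best–worst and pairwise data.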

yaml
annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    description: "Pick the most and least fluent translation."
    tuple_size: 4
    best_label: "Most fluent"
    worst_label: "Least fluent"

Aggregated over many tuples and annotators, best–worst judgments produce reliable, interval-like scores from simple choices, which makes the method a common way to build rankings.
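A common way to turn best–worst judgments into scores is simple counting: for each item, take the number of times it was chosen best, subtract the number of times it was chosen worst, and divide by the number of times it was shown. The result lies between -1 and 1. The sketch below assumes judgments arrive as `(items, best, worst)` tuples; that layout and the name `bws_scores` are illustrative, not Potato's output format.

```python
from collections import Counter


def bws_scores(judgments):
    """Counting-based best-worst scores: (best - worst) / appearances.

    judgments: iterable of (items, best, worst), with unique items per tuple.
    """
    best_counts = Counter()
    worst_counts = Counter()
    appearances = Counter()
    for items, best, worst in judgments:
        appearances.update(items)  # each shown item counts once
        best_counts[best] += 1
        worst_counts[worst] += 1
    return {
        item: (best_counts[item] - worst_counts[item]) / appearances[item]
        for item in appearances
    }


judgments = [
    (("a", "b", "c", "d"), "a", "d"),
    (("a", "b", "c", "e"), "a", "c"),
    (("b", "c", "d", "e"), "b", "d"),
]
scores = bws_scores(judgments)
print(scores["a"], scores["d"], scores["e"])  # 1.0 -1.0 0.0
```

An item that is never picked as best or worst, like `e` here, scores 0, so each item should appear in several different tuples for the scores to be informative.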

When to prefer comparisons over ratings

  • Your construct is hard to anchor absolutely (humor, helpfulness, aesthetic quality).
  • You need high agreement and your Likert scale is noisy.
  • You are building preference data to train or align a model.

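The cost is that each judgment carries only relative information: to place items on a common scale you need to aggregate many judgments with a model (Elo, Bradley–Terry), and the resulting scores are meaningful only relative to the other items compared.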
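For a sense of what such a model involves, here is a minimal Elo-style sketch that updates ratings one judgment at a time and treats a tie as half a win. The `(a, b, score_a)` input format, the starting rating of 1000, and the step size `k=32` are assumptions chosen for illustration. Unlike a batch fit, the result depends on the order in which judgments are processed.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_ratings(outcomes, k=32.0, initial=1000.0):
    """Sequential Elo ratings from pairwise outcomes.

    outcomes: iterable of (a, b, score_a), where score_a is
    1.0 if a wins, 0.5 for a tie, and 0.0 if b wins.
    """
    ratings = {}
    for a, b, score_a in outcomes:
        r_a = ratings.setdefault(a, initial)
        r_b = ratings.setdefault(b, initial)
        e_a = expected_score(r_a, r_b)
        # Move each rating toward the observed result; the total is conserved.
        ratings[a] = r_a + k * (score_a - e_a)
        ratings[b] = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return ratings


outcomes = [
    ("A", "B", 1.0),  # A preferred over B
    ("B", "C", 0.5),  # annotator marked a tie
    ("A", "C", 1.0),  # A preferred over C
]
ratings = elo_ratings(outcomes)
print(sorted(ratings, key=ratings.get, reverse=True))
```

The ratings only say how items stand relative to each other: adding the same constant to every rating leaves all predicted win probabilities unchanged.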

Further reading