Rating Scales and Likert Design
How to design rating scales for annotation: Likert vs. sliders, how many points to use, avoiding acquiescence bias, and building rating tasks in Potato.
A rating scale captures degree (how positive, how fluent, how helpful) rather than a category. The two common forms are the discrete Likert scale (e.g. 1–5) and the continuous slider. Small design choices, such as the number of points or how they are labeled, can noticeably change the data you collect.
Likert: discrete points
Use a Likert scale when you want comparable, easy-to-aggregate ratings:
annotation_schemes:
  - annotation_type: likert
    name: fluency
    description: "How fluent is this translation?"
    size: 5
    min_label: "Not fluent at all"
    max_label: "Perfectly fluent"
Design decisions that matter:
- How many points? Five is a safe default. Seven gives more resolution if annotators can use it. An even number removes the neutral midpoint and forces annotators to lean one way, which is useful when "neutral" is a cop-out but risky when neutrality is a real answer.
- Label the ends, and ideally every point. Labeled points are interpreted more consistently than bare numbers.
- Keep the direction consistent across all your scales so annotators don't flip them by habit.
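Discrete ratings are easy to aggregate, and a small sketch makes that concrete. The function below summarizes one item's ratings from several annotators. The name `summarize_likert` and the input format (a plain list of integers) are assumptions for illustration, not part of Potato's output format.

```python
from collections import Counter
from statistics import median


def summarize_likert(ratings, size=5):
    """Summarize one item's ratings on a 1..size Likert scale.

    Hypothetical helper: assumes ratings are integers already
    extracted from the annotation output.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r != int(r) or r < 1 or r > size for r in ratings):
        raise ValueError(f"ratings must be integers in 1..{size}")
    counts = Counter(ratings)
    return {
        "n": len(ratings),
        # The median respects the ordinal nature of the scale.
        "median": median(ratings),
        # The mean treats the points as equally spaced, which is an assumption.
        "mean": sum(ratings) / len(ratings),
        # Count of ratings at each point, from 1 up to size.
        "distribution": [counts.get(point, 0) for point in range(1, size + 1)],
    }


summary = summarize_likert([4, 5, 4, 3, 4])
print(summary)
```

Reporting the full distribution alongside the median or mean helps show whether annotators actually used the whole scale.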
Sliders: continuous values
Use a slider when the underlying quantity really is continuous, such as a confidence percentage or an emotion intensity:
annotation_schemes:
  - annotation_type: slider
    name: confidence
    description: "How confident are you in your label?"
    min: 0
    max: 100
    step: 1
    min_label: "Guessing"
    max_label: "Certain"
Continuous scales give resolution but lower agreement, because people don't share a fine-grained sense of "67 vs. 72". Bin the output if you need agreement.
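One way to bin slider output is with equal-width bins, sketched below. The function name `bin_slider` and the choice of five bins are assumptions for illustration; other binning schemes (for example, quantile-based) may suit your data better.

```python
def bin_slider(value, lo=0, hi=100, n_bins=5):
    """Map a slider value in [lo, hi] to a bin index in 1..n_bins.

    Hypothetical helper using equal-width bins.
    """
    if not lo <= value <= hi:
        raise ValueError(f"value must be in [{lo}, {hi}]")
    index = int((value - lo) * n_bins / (hi - lo)) + 1
    # The top endpoint (value == hi) belongs to the last bin.
    return min(index, n_bins)


# Two annotators who chose 67 and 72 land in the same bin.
print(bin_slider(67), bin_slider(72))
```

After binning, the values can be treated like Likert points when computing agreement.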
Biases to design around
- Acquiescence bias: a tendency to agree. Mix in reverse-worded items so agreement isn't the default. See acquiescence bias.
- Central tendency: clustering on the middle. Clear endpoint labels and, where appropriate, an even number of points push against it.
- Anchoring: the first few items set a reference. A short calibration set at the start helps.
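Two of these biases can be checked with simple post-processing. The sketch below reverse-scores reverse-worded items so all items point the same way, and measures how often an annotator picks the midpoint. The item names, the `_rev` suffix convention, and both function names are hypothetical.

```python
def reverse_score(rating, size=5):
    """Flip a rating on a 1..size scale (1 becomes size, and so on)."""
    return size + 1 - rating


def midpoint_share(ratings, size=5):
    """Fraction of ratings that sit exactly on the scale midpoint."""
    if size % 2 == 0:
        raise ValueError("an even-sized scale has no midpoint")
    if not ratings:
        raise ValueError("need at least one rating")
    midpoint = (size + 1) // 2
    return sum(1 for r in ratings if r == midpoint) / len(ratings)


# Hypothetical responses from one annotator; ids ending in "_rev"
# mark reverse-worded items (an assumed naming convention).
responses = {"helpful": 5, "unhelpful_rev": 4, "clear": 4, "confusing_rev": 2}

aligned = {
    item: reverse_score(rating) if item.endswith("_rev") else rating
    for item, rating in responses.items()
}
print(aligned)
print(midpoint_share([3, 3, 4, 3, 2]))
```

Here the annotator rated both "helpful" and "unhelpful" highly, so after alignment the pair disagrees (5 vs. 2), which is the kind of pattern acquiescence produces. A high midpoint share across many items is a hint of central tendency rather than proof of it.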
Beyond a single scale
- Rate many items on the same scale at once with multirate (e.g. each retrieved document). See RAG Evaluation.
- Score several weighted criteria with rubric_eval. See Rubric-Based LLM Evaluation.
- Audio quality ratings such as MOS use the same Likert mechanism; see Audio Annotation.
Further reading
- Choosing an Annotation Scheme
- Pairwise and Best–Worst Scaling: when comparisons beat ratings
- Inter-Annotator Agreement