# Rating Scales and Likert Design

Source: https://www.potatoannotator.com/docs/guides/rating-scales

**A rating scale captures *degree* (how positive, how fluent, how helpful) rather than a category. The two common forms are the discrete [Likert scale](https://en.wikipedia.org/wiki/Likert_scale) (e.g. 1–5) and the continuous [slider](https://en.wikipedia.org/wiki/Visual_analogue_scale).** Small design choices, such as the number of points or how they are labeled, change your data more than people expect.

## Likert: discrete points

Use a Likert scale when you want comparable, easy-to-aggregate ratings:

```yaml
annotation_schemes:
  - annotation_type: likert
    name: fluency
    description: "How fluent is this translation?"
    size: 5
    min_label: "Not fluent at all"
    max_label: "Perfectly fluent"
```

Design decisions that matter:

- **How many points?** Five is a safe default. Seven gives more resolution if annotators can use it. An even number removes the neutral midpoint and forces a lean: useful when "neutral" is a cop-out, risky when neutrality is real.
- **Label the ends, and ideally every point.** Labeled points are interpreted more consistently than bare numbers.
- **Keep the direction consistent** across all your scales so annotators don't flip them by habit.
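
As a rough illustration of why discrete points are easy to aggregate, the sketch below summarizes one item's ratings with a median and a per-point count. It assumes the ratings have already been exported as integers from 1 to `size`; the function name `summarize_likert` and that export shape are hypothetical and not part of Potato.

```python
from collections import Counter
from statistics import median


def summarize_likert(ratings, size=5):
    """Summarize one item's ratings on a 1..size Likert scale.

    Assumes ratings are integers in 1..size (a hypothetical export format).
    """
    if not ratings:
        raise ValueError("no ratings to summarize")
    if any(r < 1 or r > size for r in ratings):
        raise ValueError(f"ratings must be between 1 and {size}")
    counts = Counter(ratings)
    return {
        "n": len(ratings),
        # The median respects the ordinal nature of Likert points.
        "median": median(ratings),
        # One count per scale point, lowest point first.
        "distribution": [counts.get(p, 0) for p in range(1, size + 1)],
    }


summary = summarize_likert([4, 5, 4, 3, 4])
print(summary)
```

The distribution is worth keeping alongside the median: a pile-up on the midpoint or on one end is often the first visible sign of a scale design problem.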

## Sliders: continuous values

Use a `slider` when the underlying quantity really is continuous, such as a confidence percentage or an emotion intensity:

```yaml
annotation_schemes:
  - annotation_type: slider
    name: confidence
    description: "How confident are you in your label?"
    min: 0
    max: 100
    step: 1
    min_label: "Guessing"
    max_label: "Certain"
```

Continuous scales give finer resolution but lower agreement, because annotators don't share a fine-grained sense of "67 vs. 72". If you need to measure agreement, bin the output into a few ranges first.
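
One way to bin slider output is equal-width bucketing, sketched below as a post-processing step. The helper `bin_slider` is hypothetical; the defaults mirror the `min: 0` and `max: 100` from the config above, and the choice of five bins is an assumption rather than a recommendation from the tool.

```python
def bin_slider(value, lo=0, hi=100, n_bins=5):
    """Map a slider value in [lo, hi] to an equal-width bin index 1..n_bins."""
    if not lo <= value <= hi:
        raise ValueError(f"value must be between {lo} and {hi}")
    width = (hi - lo) / n_bins
    # The top edge (value == hi) would overflow, so clamp it into the last bin.
    return min(int((value - lo) / width) + 1, n_bins)


# Two annotators who answered 67 and 72 now agree on the same bin.
print(bin_slider(67), bin_slider(72))
```

Binned values can then be treated like Likert points when computing agreement, at the cost of the extra resolution the slider provided.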

## Biases to design around

- **Acquiescence bias**: a tendency to agree. Mix in reverse-worded items so agreement isn't the default. See [acquiescence bias](https://en.wikipedia.org/wiki/Acquiescence_bias).
- **Central tendency**: clustering on the middle. Clear endpoint labels and, where appropriate, an even number of points push against it.
- **Anchoring**: the first few items set a reference. A short calibration set at the start helps.
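
Reverse-worded items have to be flipped back before they are combined with the rest. A minimal sketch, assuming a 1-to-`size` scale and a hand-maintained set of reversed item names (both `reverse_score` and `align_responses` are hypothetical helpers, as are the item names in the example):

```python
def reverse_score(rating, size=5):
    """Flip a rating on a 1..size scale so that 1 becomes size and vice versa."""
    return size + 1 - rating


def align_responses(responses, reversed_items, size=5):
    """Return responses with reverse-worded items flipped to the common direction."""
    return {
        name: reverse_score(rating, size) if name in reversed_items else rating
        for name, rating in responses.items()
    }


# "confusing" is worded negatively, so a low rating there means a good outcome.
aligned = align_responses({"clear": 4, "confusing": 2}, reversed_items={"confusing"})
print(aligned)
```

After alignment, an annotator who simply agrees with everything shows up as inconsistent across the paired items, which makes acquiescence detectable.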

## Beyond a single scale

- Rate many items on the same scale at once with `multirate` (e.g. each retrieved document). See [RAG Evaluation](/docs/guides/rag-evaluation).
- Score several weighted criteria with `rubric_eval`. See [Rubric-Based LLM Evaluation](/docs/guides/rubric-based-llm-evaluation).
- Audio quality ratings such as [MOS](https://en.wikipedia.org/wiki/Mean_opinion_score) use the same Likert mechanism. See [Audio Annotation](/docs/guides/audio-annotation).
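
For the MOS case, the score for a clip is the arithmetic mean of its ratings. The sketch below assumes ratings arrive as `(clip_id, score)` pairs; that shape, the function name `mean_opinion_scores`, and the file names are illustrative assumptions, not an export format defined by the tool.

```python
from collections import defaultdict


def mean_opinion_scores(ratings):
    """Compute a mean opinion score per clip from (clip_id, score) pairs."""
    by_clip = defaultdict(list)
    for clip_id, score in ratings:
        by_clip[clip_id].append(score)
    # MOS is the plain arithmetic mean of the ratings for each clip.
    return {clip_id: sum(scores) / len(scores) for clip_id, scores in by_clip.items()}


scores = mean_opinion_scores(
    [("a.wav", 4), ("a.wav", 5), ("b.wav", 2), ("b.wav", 3), ("b.wav", 4)]
)
print(scores)
```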

## Further reading

- [Choosing an Annotation Scheme](/docs/guides/choosing-an-annotation-scheme)
- [Pairwise and Best–Worst Scaling](/docs/guides/pairwise-and-best-worst), for when comparisons beat ratings
- [Inter-Annotator Agreement](/docs/guides/inter-annotator-agreement)
