# Collecting RLHF and Preference Data

Source: https://www.potatoannotator.com/docs/guides/rlhf-preference-data

**Reinforcement learning from human feedback (RLHF) trains models to match human preferences. The core data is a set of human judgments comparing model outputs, most often answering the question "Which of these two responses is better?"** Collecting that data well is an annotation problem, and it is the kind of problem Potato is built for.

See [reinforcement learning from human feedback](https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback) for background.

## The standard recipe: pairwise preference

Show a prompt and two candidate responses; the annotator picks the better one. These judgments train a [reward model](https://en.wikipedia.org/wiki/Reward_model) that assigns a score to each output, and that reward model then provides the training signal for fine-tuning the policy model.

```yaml
annotation_schemes:
  - annotation_type: pairwise
    name: preference
    description: "Which response better follows the instruction and is more helpful and harmless?"
    mode: binary
    allow_tie: true
  - annotation_type: text
    name: rationale
    description: "One sentence on why you chose it."
    label_requirement:
      required: false
```

A short rationale is worth collecting: it lets you audit the preference data and find cases where annotators optimized the wrong thing (length, formatting) instead of quality.
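Many reward-model training setups consume (prompt, chosen, rejected) triples rather than raw "A/B/tie" labels. The sketch below converts pairwise judgments into that shape and carries the rationale along for auditing. The record layout (keys `prompt`, `response_a`, `response_b`, `preference`, `rationale`), the label values `"A"`, `"B"`, `"tie"`, and the helper names are hypothetical stand-ins, not Potato's actual export format, so adapt them to your annotation output. Ties are simply dropped here, which is one reasonable choice among several.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    rationale: Optional[str] = None


def to_preference_pairs(records):
    """Convert pairwise judgments into chosen/rejected pairs, skipping ties.

    Each record is a hypothetical dict with keys 'prompt', 'response_a',
    'response_b', 'preference' ("A", "B" or "tie") and optional 'rationale'.
    """
    pairs = []
    for rec in records:
        pref = rec["preference"]
        if pref == "tie":
            continue  # no chosen/rejected ordering; this sketch drops ties
        if pref == "A":
            chosen, rejected = rec["response_a"], rec["response_b"]
        elif pref == "B":
            chosen, rejected = rec["response_b"], rec["response_a"]
        else:
            raise ValueError(f"unexpected preference label: {pref!r}")
        # Keep the rationale so the pair can be audited later.
        pairs.append(PreferencePair(rec["prompt"], chosen, rejected, rec.get("rationale")))
    return pairs


# Toy records for illustration only.
records = [
    {"prompt": "Name a prime.", "response_a": "7", "response_b": "9",
     "preference": "A", "rationale": "9 is not prime."},
    {"prompt": "Say hi.", "response_a": "Hi!", "response_b": "Hello!", "preference": "tie"},
    {"prompt": "2+2?", "response_a": "5", "response_b": "4", "preference": "B"},
]
for pair in to_preference_pairs(records):
    print(pair)
```

Raising on an unknown label, rather than silently skipping it, makes export or schema mismatches show up immediately instead of quietly shrinking the dataset.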

See [Pairwise and Best–Worst Scaling](/docs/guides/pairwise-and-best-worst) for the comparison mechanics and [Pairwise Model Comparison](/docs/guides/pairwise-model-comparison) for evaluating models head-to-head.
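Potato collects the judgments; training the reward model happens elsewhere, and this guide does not prescribe an objective. A widely used choice in the RLHF literature, assumed here, is the Bradley-Terry model: the reward model outputs a scalar score for each response, the probability that the chosen response beats the rejected one is modeled as `sigmoid(r_chosen - r_rejected)`, and training minimizes the negative log of that probability. The function names are hypothetical, and the scores are plain floats standing in for a neural network's outputs.

```python
import math


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss for one pair: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(r_a: float, r_b: float) -> float:
    """Modeled probability that A is preferred over B, i.e. sigmoid(r_a - r_b)."""
    return math.exp(-reward_model_loss(r_a, r_b))


def mean_loss(scored_pairs):
    """Average loss over (r_chosen, r_rejected) score pairs."""
    return sum(reward_model_loss(c, r) for c, r in scored_pairs) / len(scored_pairs)


# Illustrative scores a reward model might assign; not real data.
scored_pairs = [(2.1, 0.4), (0.2, 1.3)]
print(mean_loss(scored_pairs))
```

In this toy batch the second pair, where the model scores the rejected response higher, contributes most of the loss; minimizing it pushes the model to widen the score gap in the direction annotators chose.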

## Multi-dimensional preferences

A single "better" judgment hides trade-offs: one response may be more helpful but less accurate than the other. To collect signal on *why* one output wins, rate each response on several criteria with a rubric:

```yaml
annotation_schemes:
  - annotation_type: rubric_eval
    name: quality
    description: "Rate the response on each dimension."
    scale_points: 5
    criteria:
      - {name: Helpfulness, description: "Does it actually answer the request?"}
      - {name: Harmlessness, description: "Is it safe and appropriate?"}
      - {name: Honesty, description: "Is it accurate and non-misleading?"}
```
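To see how per-criterion scores expose a trade-off, here is a sketch comparing two responses dimension by dimension. It assumes each annotator's rubric judgment can be read as a dict mapping criterion name to a 1-5 score (a hypothetical layout, not Potato's documented output) and reuses the criterion names from the config above; the helper names are illustrative.

```python
from statistics import mean

# Criterion names taken from the rubric_eval config; scores on a 1-5 scale.
CRITERIA = ["Helpfulness", "Harmlessness", "Honesty"]


def criterion_means(ratings):
    """Mean score per criterion; `ratings` holds one dict per annotator."""
    return {c: mean(r[c] for r in ratings) for c in CRITERIA}


def criterion_gaps(ratings_a, ratings_b):
    """Per-criterion mean difference (A minus B); positive values favor A."""
    means_a, means_b = criterion_means(ratings_a), criterion_means(ratings_b)
    return {c: means_a[c] - means_b[c] for c in CRITERIA}


# Toy ratings from two annotators per response.
ratings_a = [{"Helpfulness": 5, "Harmlessness": 4, "Honesty": 2},
             {"Helpfulness": 4, "Harmlessness": 4, "Honesty": 3}]
ratings_b = [{"Helpfulness": 3, "Harmlessness": 4, "Honesty": 5},
             {"Helpfulness": 3, "Harmlessness": 5, "Honesty": 4}]
print(criterion_gaps(ratings_a, ratings_b))
```

In this toy example response A is ahead on helpfulness but well behind on honesty, a trade-off that a single "which is better?" label would collapse into one bit.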

See [Rubric-Based LLM Evaluation](/docs/guides/rubric-based-llm-evaluation).

## Quality pitfalls specific to preference data

- **Length and style bias.** Annotators often prefer longer or more confident answers regardless of correctness. Name this bias explicitly in the guidelines and monitor for it, for example by checking how often the longer response wins.
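- **Position bias.** Annotators can favor whichever response appears in a particular slot. Randomize which response is shown as "A", and record the order so each label can be mapped back to the original responses.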
- **Calibration drift.** Re-share anchor examples periodically so standards don't drift across a long campaign.
- **Agreement.** Preference is subjective; have several annotators label an overlapping subset of items and track [agreement](/docs/guides/inter-annotator-agreement) on it.
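Several of these checks are easy to automate. The sketch below is a minimal illustration, not part of Potato: `present_pair` decides per item, reproducibly from a seed, whether to swap the display order; `unswap` maps a displayed label back to the original order; `longer_wins_rate` measures how often the longer response was chosen; and `cohen_kappa` computes Cohen's kappa for two annotators on the same items. All names, the `"A"`/`"B"`/`"tie"` labels, and character count as a length proxy are assumptions.

```python
import random
from collections import Counter


def present_pair(item_id, response_1, response_2, seed=0):
    """Decide which response is shown as "A", reproducibly per item."""
    rng = random.Random(f"{seed}:{item_id}")  # same seed and item -> same order
    swapped = rng.random() < 0.5
    if swapped:
        return response_2, response_1, swapped
    return response_1, response_2, swapped


def unswap(displayed_label, swapped):
    """Map a label given on the displayed order back to the original order."""
    if displayed_label == "tie" or not swapped:
        return displayed_label
    return "B" if displayed_label == "A" else "A"


def longer_wins_rate(pairs):
    """Share of (chosen, rejected) pairs of unequal length where the longer one won."""
    unequal = [(c, r) for c, r in pairs if len(c) != len(r)]
    if not unequal:
        return float("nan")
    return sum(len(c) > len(r) for c, r in unequal) / len(unequal)


def cohen_kappa(labels_1, labels_2):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_1)
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    expected = sum(counts_1[k] * counts_2[k] for k in counts_1) / (n * n)
    if expected == 1:
        return 1.0  # kappa is undefined here; this sketch returns 1.0 by convention
    return (observed - expected) / (1 - expected)


shown_a, shown_b, swapped = present_pair("item-42", "short answer", "a much longer answer")
print(shown_a, "|", shown_b, "| original label:", unswap("A", swapped))
print(cohen_kappa(["A", "A", "B", "tie"], ["A", "B", "B", "tie"]))
```

A high longer-wins rate is a prompt to read the rationales rather than proof of bias, since longer answers are sometimes genuinely better. Cohen's kappa covers exactly two annotators; for larger pools, see the agreement guide linked in the list.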

## Further reading

- [Pairwise and Best–Worst Scaling](/docs/guides/pairwise-and-best-worst)
- [Rubric-Based LLM Evaluation](/docs/guides/rubric-based-llm-evaluation)
- [Evaluating AI Agents](/docs/guides/evaluating-ai-agents)
