Collecting RLHF and Preference Data
How to collect human preference data for RLHF and model alignment with Potato: pairwise comparisons, rubric scoring, and justifications.
Reinforcement learning from human feedback (RLHF) trains models to match human preferences. The core data is human judgments comparing model outputs, most often "which of these two responses is better?". Collecting that data well is an annotation problem, and it is one Potato is built for.
See reinforcement learning from human feedback for background.
The standard recipe: pairwise preference
Show a prompt and two candidate responses; the annotator picks the better one. These judgments train a reward model that scores outputs, which then guides the policy model.
annotation_schemes:
  - annotation_type: pairwise
    name: preference
    description: "Which response better follows the instruction and is more helpful and harmless?"
    mode: binary
    allow_tie: true
  - annotation_type: text
    name: rationale
    description: "One sentence on why you chose it."
    label_requirement:
      required: false
A short rationale is worth collecting: it lets you audit the preference data and find cases where annotators optimized the wrong thing (length, formatting) instead of quality.
See Pairwise and Best–Worst Scaling for the comparison mechanics and Pairwise Model Comparison for evaluating models head-to-head.
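To make the link between pairwise judgments and a reward model concrete, here is a minimal sketch of the Bradley-Terry style loss commonly used for reward modeling. It is an illustration, not part of Potato: the function names and the "A" / "B" / "tie" labels are assumptions, and treating a tie as a 0.5 soft target is one possible convention (dropping ties is another).

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)), computed as -softplus(-x)."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def preference_loss(score_a: float, score_b: float, label: str) -> float:
    """Cross-entropy loss for one pairwise judgment.

    score_a / score_b are the reward model's scalar scores for the two
    responses. label is the annotator's choice: "A", "B", or "tie"
    (hypothetical label names; a tie is treated as a 0.5 soft target).
    """
    target = {"A": 1.0, "B": 0.0, "tie": 0.5}[label]
    margin = score_a - score_b
    # The model's probability that A is preferred is sigmoid(margin).
    return -(target * log_sigmoid(margin)
             + (1.0 - target) * log_sigmoid(-margin))


if __name__ == "__main__":
    # The loss is low when the reward model agrees with the annotator...
    print(preference_loss(2.0, 0.0, "A"))
    # ...and high when it disagrees.
    print(preference_loss(0.0, 2.0, "A"))
```

Minimizing this loss over many judgments pushes the reward model to score the preferred response higher, which is why noisy or biased preferences translate directly into a biased reward signal.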
Multi-dimensional preferences
A single "better" judgment hides trade-offs. To collect signal on why one output wins, score several criteria with a rubric:
annotation_schemes:
  - annotation_type: rubric_eval
    name: quality
    description: "Rate the response on each dimension."
    scale_points: 5
    criteria:
      - {name: Helpfulness, description: "Does it actually answer the request?"}
      - {name: Harmlessness, description: "Is it safe and appropriate?"}
      - {name: Honesty, description: "Is it accurate and non-misleading?"}
See Rubric-Based LLM Evaluation.
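As a rough illustration of how rubric scores expose trade-offs, the sketch below averages each criterion across annotators and reports which response wins on each dimension. The data shape (one dict of criterion scores per annotator) is an assumption made for the example, not Potato's export format.

```python
from statistics import mean

# Criterion names taken from the rubric config above.
CRITERIA = ("Helpfulness", "Harmlessness", "Honesty")


def mean_scores(ratings):
    """Average each criterion over a list of per-annotator score dicts."""
    return {c: mean(r[c] for r in ratings) for c in CRITERIA}


def per_criterion_winner(ratings_a, ratings_b):
    """Return 'A', 'B', or 'tie' for each criterion based on mean score."""
    a, b = mean_scores(ratings_a), mean_scores(ratings_b)
    winners = {}
    for c in CRITERIA:
        if a[c] > b[c]:
            winners[c] = "A"
        elif a[c] < b[c]:
            winners[c] = "B"
        else:
            winners[c] = "tie"
    return winners


if __name__ == "__main__":
    # Hypothetical ratings from two annotators for each response.
    ratings_a = [
        {"Helpfulness": 5, "Harmlessness": 3, "Honesty": 4},
        {"Helpfulness": 4, "Harmlessness": 3, "Honesty": 4},
    ]
    ratings_b = [
        {"Helpfulness": 3, "Harmlessness": 5, "Honesty": 4},
        {"Helpfulness": 3, "Harmlessness": 4, "Honesty": 4},
    ]
    print(per_criterion_winner(ratings_a, ratings_b))
```

In this made-up example, response A wins on helpfulness while B wins on harmlessness, a split that a single "better" judgment would collapse into one label.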
Quality pitfalls specific to preference data
- Length and style bias. Annotators often prefer longer or more confident answers regardless of correctness. Name this in the guidelines and watch for it.
- Position bias. Randomize which response is shown as "A".
- Calibration drift. Re-share anchor examples periodically so standards don't drift across a long campaign.
- Agreement. Preference is subjective; assign a share of items to more than one annotator and track inter-annotator agreement on that overlap.
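Several of the checks above are easy to script outside the annotation tool. The following is a minimal sketch, assuming you prepare items and analyze results yourself: the record fields (`A`, `B`, `choice`, `swapped`) and all function names are hypothetical, not a Potato format. It covers randomizing display position, measuring how often the longer response wins, and computing Cohen's kappa for two annotators on overlapping items.

```python
import random
from collections import Counter


def assign_positions(item_id, response_1, response_2, seed=0):
    """Decide, reproducibly per item, which response is displayed as "A"."""
    rng = random.Random(f"{seed}:{item_id}")
    swapped = rng.random() < 0.5
    shown_a, shown_b = (response_2, response_1) if swapped else (response_1, response_2)
    return {"id": item_id, "A": shown_a, "B": shown_b, "swapped": swapped}


def original_winner(choice, swapped):
    """Map a displayed choice ("A", "B", "tie") back to response 1 or 2."""
    if choice == "tie":
        return "tie"
    picked_first_shown = choice == "A"
    # If positions were swapped, the response shown as "A" is response_2.
    return 2 if picked_first_shown == swapped else 1


def longer_win_rate(records):
    """Share of decided judgments in which the longer response was chosen."""
    decided = [
        r for r in records
        if r["choice"] in ("A", "B") and len(r["A"]) != len(r["B"])
    ]
    if not decided:
        return None
    longer_wins = sum(
        (len(r["A"]) > len(r["B"])) == (r["choice"] == "A") for r in decided
    )
    return longer_wins / len(decided)


def cohen_kappa(labels_1, labels_2):
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(labels_1)
    observed = sum(x == y for x, y in zip(labels_1, labels_2)) / n
    c1, c2 = Counter(labels_1), Counter(labels_2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)
```

A longer-wins rate well above 0.5 is a prompt to read the rationales rather than proof of bias, since longer answers are sometimes better. Likewise, a low kappa can reflect real disagreement in preference, unclear guidelines, or both.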