Collecting RLHF and Preference Data

How to collect human preference data for RLHF and model alignment with Potato: pairwise comparisons, rubric scoring, and justifications.

Reinforcement learning from human feedback (RLHF) trains models to match human preferences. The core data is human judgments comparing model outputs, most often answers to the question "which of these two responses is better?" Collecting that data well is an annotation problem, and it is one Potato is built for.

See reinforcement learning from human feedback for background.

The standard recipe: pairwise preference

Show a prompt and two candidate responses; the annotator picks the better one. These judgments train a reward model that scores outputs, and those scores then guide optimization of the policy model.

yaml
annotation_schemes:
  - annotation_type: pairwise
    name: preference
    description: "Which response better follows the instruction and is more helpful and harmless?"
    mode: binary
    allow_tie: true
  - annotation_type: text
    name: rationale
    description: "One sentence on why you chose it."
    label_requirement:
      required: false

A short rationale is worth collecting: it lets you audit the preference data and find cases where annotators optimized the wrong thing (length, formatting) instead of quality.
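
One quick audit along these lines is to check how often the chosen response is also the longer one. The sketch below assumes a hypothetical export format (a list of dicts with `response_a`, `response_b`, and `choice` keys); Potato's actual output layout may differ, so adapt the field names to your data.

```python
def longer_choice_rate(records):
    """Fraction of decided judgments where the longer response was chosen.

    Assumes each record is a dict with hypothetical keys 'response_a',
    'response_b', and 'choice' ('A', 'B', or 'tie').
    """
    longer_wins = 0
    decided = 0
    for rec in records:
        len_a, len_b = len(rec["response_a"]), len(rec["response_b"])
        if rec["choice"] == "tie" or len_a == len_b:
            continue  # no winner, or no length difference to attribute
        decided += 1
        longer = "A" if len_a > len_b else "B"
        if rec["choice"] == longer:
            longer_wins += 1
    return longer_wins / decided if decided else None


# Illustrative records, not real annotation data.
records = [
    {"response_a": "Short answer.", "response_b": "A much longer, padded answer.", "choice": "B"},
    {"response_a": "A detailed and correct explanation.", "response_b": "Wrong.", "choice": "A"},
    {"response_a": "Concise and right.", "response_b": "Long but wrong, with filler text.", "choice": "A"},
    {"response_a": "Same", "response_b": "Tied!", "choice": "tie"},
]
rate = longer_choice_rate(records)
print(f"Longer response chosen in {rate:.0%} of decided pairs")
```

A rate well above 50% is not proof of bias, since longer answers can be better, but it flags the items whose rationales are most worth reading.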

See Pairwise and Best–Worst Scaling for the comparison mechanics and Pairwise Model Comparison for evaluating models head-to-head.
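
To make the reward-model step concrete: a common choice (an assumption here, not something Potato prescribes) is the Bradley-Terry formulation, in which the probability that response A beats B is the sigmoid of the difference between their reward scores. The sketch below computes the per-pair loss in plain Python. The `reward_a` and `reward_b` values stand in for a real reward model's outputs, and treating a tie as a 0.5 target is one convention among several.

```python
import math


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(reward_a, reward_b, label):
    """Bradley-Terry style cross-entropy loss for one comparison.

    label is the target probability that A is preferred:
    1.0 (A chosen), 0.0 (B chosen), or 0.5 (tie, an assumed convention).
    """
    diff = reward_a - reward_b
    # log sigmoid(x) = -softplus(-x)
    log_p_a = -_softplus(-diff)
    log_p_b = -_softplus(diff)
    return -(label * log_p_a + (1.0 - label) * log_p_b)


# The loss is small when the reward model already ranks the chosen
# response higher, and large when it ranks the pair the wrong way round.
agrees = preference_loss(2.0, 0.0, label=1.0)
disagrees = preference_loss(0.0, 2.0, label=1.0)
print(agrees, disagrees)
```

Minimizing this loss over all collected comparisons pushes the reward model to score preferred responses higher than rejected ones.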

Multi-dimensional preferences

A single "better" judgment hides trade-offs. To collect signal on why one output wins, score several criteria with a rubric:

yaml
annotation_schemes:
  - annotation_type: rubric_eval
    name: quality
    description: "Rate the response on each dimension."
    scale_points: 5
    criteria:
      - {name: Helpfulness, description: "Does it actually answer the request?"}
      - {name: Harmlessness, description: "Is it safe and appropriate?"}
      - {name: Honesty, description: "Is it accurate and non-misleading?"}

See Rubric-Based LLM Evaluation.
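
Once rubric scores are collected, a per-criterion summary shows where two responses trade off. The sketch below assumes the ratings have already been loaded into plain dicts keyed by the criterion names from the config above; the loading step and the record layout are hypothetical.

```python
from statistics import mean

CRITERIA = ["Helpfulness", "Harmlessness", "Honesty"]


def mean_scores(ratings):
    """Average each criterion over all annotators' ratings of one response."""
    return {c: mean(r[c] for r in ratings) for c in CRITERIA}


def compare(ratings_a, ratings_b):
    """Per-criterion difference (A minus B) in mean rubric score."""
    a, b = mean_scores(ratings_a), mean_scores(ratings_b)
    return {c: a[c] - b[c] for c in CRITERIA}


# Two annotators rate each response on the 5-point scale (made-up values).
ratings_a = [
    {"Helpfulness": 5, "Harmlessness": 3, "Honesty": 4},
    {"Helpfulness": 4, "Harmlessness": 3, "Honesty": 4},
]
ratings_b = [
    {"Helpfulness": 3, "Harmlessness": 5, "Honesty": 4},
    {"Helpfulness": 3, "Harmlessness": 5, "Honesty": 4},
]
diff = compare(ratings_a, ratings_b)
print(diff)
```

In this made-up example A scores higher on helpfulness but lower on harmlessness, which is exactly the kind of trade-off a single "better" label would hide.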

Quality pitfalls specific to preference data

  • Length and style bias. Annotators often prefer longer or more confident-sounding answers regardless of correctness. Name this in the guidelines and check how often the longer response wins.
  • Position bias. Annotators may favor whichever response appears first, so randomize which response is shown as "A".
  • Calibration drift. Re-share anchor examples periodically so standards stay consistent across a long campaign.
  • Agreement. Preference is subjective; assign a shared subset of items to multiple annotators and track inter-annotator agreement on it.
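
Two of these checks are easy to script. The sketch below shows (1) assigning display order per item with a seeded random generator and mapping the annotator's choice back to the underlying model, and (2) raw pairwise agreement on items labeled by more than one annotator. All names (`model_x`, `model_y`, the record layout) are hypothetical, and it is worth checking whether your Potato setup already randomizes display order before adding a step like this.

```python
import random
from itertools import combinations


def assign_positions(item_id, seed=0):
    """Decide, reproducibly per item, which model is displayed as 'A'."""
    rng = random.Random(f"{seed}:{item_id}")
    order = ["model_x", "model_y"]  # hypothetical model names
    rng.shuffle(order)
    return {"A": order[0], "B": order[1]}


def resolve_choice(positions, choice):
    """Map a displayed choice ('A', 'B', or 'tie') back to a model name."""
    return "tie" if choice == "tie" else positions[choice]


def pairwise_agreement(labels_by_item):
    """Share of annotator pairs that gave the same label, over all items.

    labels_by_item maps item id -> list of labels from different annotators.
    """
    agree = total = 0
    for labels in labels_by_item.values():
        for x, y in combinations(labels, 2):
            total += 1
            agree += (x == y)
    return agree / total if total else None


positions = assign_positions("item-17")
winner = resolve_choice(positions, "A")

# Resolved labels from three annotators on two overlapping items.
overlap = {
    "item-1": ["model_x", "model_x", "model_y"],
    "item-2": ["model_y", "model_y", "model_y"],
}
print(pairwise_agreement(overlap))
```

Raw agreement does not correct for chance, so a chance-corrected statistic such as Cohen's or Fleiss' kappa is a common next step.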

Further reading