DPO Preference Data Collection
Pairwise preference annotation for Direct Preference Optimization, based on Rafailov et al., NeurIPS 2023. Annotators compare two model responses to a prompt, select a preference, indicate which alignment dimension best characterizes the difference, and provide brief reasoning.
Configuration File: config.yaml
# DPO Preference Data Collection
# Based on Rafailov et al., NeurIPS 2023
# Paper: https://arxiv.org/abs/2305.18290
#
# Pairwise preference annotation for Direct Preference Optimization (DPO).
# Annotators compare two model responses to a user prompt, select which
# response is preferred (or indicate a tie/both bad), rate alignment
# dimensions, and provide a brief reasoning for their preference.
#
# Preference Labels:
# - Response A Preferred: Response A is clearly better overall
# - Response B Preferred: Response B is clearly better overall
# - Equally Good: Both responses are of comparable quality
# - Both Bad: Neither response adequately addresses the prompt
#
# Alignment Dimensions:
# - Helpful: The preferred response is helpful and addresses the user's needs
# - Harmless: The preferred response avoids harmful or unsafe content
# - Honest: The preferred response is truthful and transparent
#
# Annotation Guidelines:
# 1. Read the prompt carefully to understand the user's intent
# 2. Read both responses thoroughly before making a judgment
# 3. Select your pairwise preference
# 4. Indicate which alignment dimension best characterizes the difference
# 5. Write a brief reasoning explaining your preference
annotation_task_name: "DPO Preference Data Collection"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  # Step 1: Pairwise preference
  - annotation_type: pairwise
    name: preference
    description: "Which response do you prefer? Consider helpfulness, harmlessness, and honesty."
    mode: "binary"
    labels:
      - "Response A Preferred"
      - "Response B Preferred"
      - "Equally Good"
      - "Both Bad"
    keyboard_shortcuts:
      "Response A Preferred": "a"
      "Response B Preferred": "b"
      "Equally Good": "e"
      "Both Bad": "x"
    tooltips:
      "Response A Preferred": "Response A is clearly better overall"
      "Response B Preferred": "Response B is clearly better overall"
      "Equally Good": "Both responses are of comparable quality"
      "Both Bad": "Neither response adequately addresses the prompt"

  # Step 2: Alignment dimension
  - annotation_type: radio
    name: alignment_dimension
    description: "Which alignment dimension best characterizes the difference between the responses?"
    labels:
      - "Helpful"
      - "Harmless"
      - "Honest"
    keyboard_shortcuts:
      "Helpful": "1"
      "Harmless": "2"
      "Honest": "3"
    tooltips:
      "Helpful": "The key differentiator is how helpful each response is"
      "Harmless": "The key differentiator is safety and avoidance of harm"
      "Honest": "The key differentiator is truthfulness and transparency"

  # Step 3: Preference reasoning
  - annotation_type: text
    name: reasoning
    description: "Briefly explain why you preferred one response over the other."
    textarea: true
    required: false
    placeholder: "What factors influenced your preference?"
annotation_instructions: |
  You will evaluate pairs of AI model responses for Direct Preference Optimization (DPO).

  For each item:
  1. Read the user prompt carefully.
  2. Read both Response A and Response B thoroughly.
  3. Select your preference: which response is better, or are they tied/both bad?
  4. Indicate which alignment dimension (Helpful, Harmless, Honest) best captures the difference.
  5. Write a brief explanation of your reasoning.

  Key Principles:
  - A safe but less helpful response may still be preferred over a helpful but unsafe one.
  - Honest responses that acknowledge limitations are valued over confident but wrong answers.
  - Consider the overall quality, not just one dimension.
html_layout: |
  <div style="padding: 15px; max-width: 900px; margin: auto;">
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Prompt:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="display: flex; gap: 16px;">
      <div style="flex: 1; background: #e3f2fd; border: 2px solid #1976d2; border-radius: 8px; padding: 16px;">
        <h4 style="margin-top: 0; color: #1976d2;">Response A:</h4>
        <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6;">{{response_a}}</div>
      </div>
      <div style="flex: 1; background: #fce4ec; border: 2px solid #c62828; border-radius: 8px; padding: 16px;">
        <h4 style="margin-top: 0; color: #c62828;">Response B:</h4>
        <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6;">{{response_b}}</div>
      </div>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 100
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
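The `{{text}}`, `{{response_a}}`, and `{{response_b}}` placeholders in `html_layout` must have a matching key in every data item, or the pane renders empty. A minimal stdlib-only sanity check (the `extract_placeholders` and `validate_items` names are illustrative, not part of potato):

```python
import re

# Sanity-check that every {{placeholder}} in html_layout has a matching key
# in each data item. Function names here are illustrative, not part of potato.

def extract_placeholders(html_layout):
    """Collect every {{name}} placeholder used in the layout."""
    return set(re.findall(r"\{\{(\w+)\}\}", html_layout))

def validate_items(items, placeholders):
    """Return the ids of items missing any key the layout needs."""
    return [it.get("id", "?") for it in items if not placeholders <= it.keys()]

layout = "<p>{{text}}</p><div>{{response_a}}</div><div>{{response_b}}</div>"
items = [
    {"id": "dpo_001", "text": "...", "response_a": "...", "response_b": "..."},
    {"id": "dpo_bad", "text": "..."},  # missing the response keys
]
print(validate_items(items, extract_placeholders(layout)))  # → ['dpo_bad']
```

Running this against `sample-data.json` before launching the server catches malformed items early, since potato itself only surfaces the problem at render time.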
Sample Data: sample-data.json
[
{
"id": "dpo_001",
"text": "Explain how photosynthesis works in simple terms.",
"response_a": "Photosynthesis is the process by which plants convert sunlight into food. Plants absorb sunlight through chlorophyll in their leaves, take in carbon dioxide from the air, and water from the soil. Using the sun's energy, they combine CO2 and water to produce glucose (sugar) and release oxygen as a byproduct. This is why plants are so important for our atmosphere.",
"response_b": "Plants eat sunlight. They use their green stuff to turn sun rays and air into sugar. It's basically how plants cook their own food. The oxygen we breathe is just the leftovers from their cooking process."
},
{
"id": "dpo_002",
"text": "Write a professional email declining a job offer politely.",
"response_a": "Subject: Re: Job Offer - Marketing Manager Position\n\nDear Ms. Thompson,\n\nThank you so much for offering me the Marketing Manager position at Apex Corp. I truly enjoyed learning about your team and the exciting projects ahead.\n\nAfter careful consideration, I have decided to pursue another opportunity that more closely aligns with my long-term career goals. This was not an easy decision, as I was genuinely impressed by your organization.\n\nI hope we can stay in touch, and I wish you and the team continued success.\n\nBest regards,\nAlex Chen",
"response_b": "Hi,\n\nThanks for the offer but I'm going to pass. Found something better. Good luck filling the role.\n\nThanks"
}
]
// ... and 8 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/preference-learning/dpo-preference-data
potato start config.yaml
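Once annotation is finished, the collected labels can be folded into the (prompt, chosen, rejected) triples that DPO trainers expect. A hedged sketch: potato's output schema varies by version, so this assumes each annotation record is a dict carrying the item `id` and the selected `preference` label; adapt the field names to what actually lands in `annotation_output/`.

```python
import json

# Convert collected preference labels into (prompt, chosen, rejected) triples
# for DPO training. ASSUMPTION: potato's output schema varies by version;
# each annotation is assumed here to be a dict with the item "id" and the
# chosen "preference" label from the scheme above. Ties, "Both Bad", and
# unmatched ids are dropped, since they yield no usable training pair.

def to_dpo_pairs(items, annotations):
    by_id = {it["id"]: it for it in items}
    pairs = []
    for ann in annotations:
        item = by_id.get(ann["id"])
        label = ann.get("preference")
        if item is None or label not in ("Response A Preferred", "Response B Preferred"):
            continue
        chosen, rejected = (("response_a", "response_b")
                            if label == "Response A Preferred"
                            else ("response_b", "response_a"))
        pairs.append({"prompt": item["text"],
                      "chosen": item[chosen],
                      "rejected": item[rejected]})
    return pairs

items = [{"id": "dpo_001", "text": "Explain photosynthesis.",
          "response_a": "Detailed answer...", "response_b": "Casual answer..."}]
annotations = [{"id": "dpo_001", "preference": "Response A Preferred"}]
print(json.dumps(to_dpo_pairs(items, annotations), indent=2))
```

With `annotation_per_instance: 2`, each item may receive two labels; a common follow-up is to keep only items where both annotators agree, or to resolve disagreements with a third judgment.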
Related Designs
Pairwise Preference with Rationale
Compare two AI responses and select the better one while providing a written justification. Used for reward model training with interpretable preference signals.
SPIN Self-Play Preference Annotation
Human vs. AI response discrimination for Self-Play Fine-Tuning, based on Chen et al., ICML 2024. Annotators identify which of two responses was written by a human versus an AI model, and rate the fluency of both responses.
AlpacaEval: Instruction-Following Preference Evaluation
Pairwise preference annotation for instruction-following language models. Annotators compare two model responses side by side, select their preferred response, indicate preference strength, and rate individual response quality across diverse instruction categories.