SPIN Self-Play Preference Annotation

Human vs. AI response discrimination for Self-Play Fine-Tuning, based on Chen et al., ICML 2024. Annotators identify which of two responses was written by a human versus an AI model, and rate the fluency of both responses.

Configuration File: config.yaml

This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.

yaml

# SPIN Self-Play Preference Annotation
# Based on Chen et al., ICML 2024
# Paper: https://arxiv.org/abs/2401.01335
# Dataset: https://huggingface.co/datasets/UCLA-AGI/SPIN_iter0
#
# Annotation task for Self-Play Fine-Tuning (SPIN). The key idea is to
# train a language model to distinguish its own outputs from human-written
# responses. Annotators compare two responses to a prompt and identify
# which was written by a human versus generated by an AI model.
#
# Preference Labels (each label describes Response A):
# - Human Response: The annotator believes Response A is the human-written response
# - AI Response: The annotator believes Response A is AI-generated (so Response B is human-written)
# - Cannot Tell: The annotator cannot reliably distinguish between the two
#
# Fluency Rating:
# - Fluent: Both responses read naturally with no issues
# - Somewhat Fluent: Minor awkwardness or unnatural phrasing
# - Not Fluent: Significant issues with naturalness or coherence
#
# Annotation Guidelines:
# 1. Read the prompt carefully
# 2. Read both responses and try to identify which is human vs. AI
# 3. Select your judgment for the pairwise comparison
# 4. Rate the overall fluency of the responses

annotation_task_name: "SPIN Self-Play Preference Annotation"
task_dir: "."

data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

port: 8000
server_name: localhost

annotation_schemes:
  # Step 1: Identify human vs. AI response
  - annotation_type: pairwise
    name: source_preference
    description: "Is Response A the human-written response or the AI-generated one?"
    mode: "binary"
    labels:
      - "Human Response"
      - "AI Response"
      - "Cannot Tell"
    keyboard_shortcuts:
      "Human Response": "a"
      "AI Response": "b"
      "Cannot Tell": "c"
    tooltips:
      "Human Response": "Response A appears to be written by a human"
      "AI Response": "Response A appears to be generated by AI (Response B is human)"
      "Cannot Tell": "Cannot reliably distinguish which is human-written"

  # Step 2: Fluency rating
  - annotation_type: radio
    name: fluency
    description: "How fluent and natural do the responses read overall?"
    labels:
      - "Fluent"
      - "Somewhat Fluent"
      - "Not Fluent"
    keyboard_shortcuts:
      "Fluent": "1"
      "Somewhat Fluent": "2"
      "Not Fluent": "3"
    tooltips:
      "Fluent": "Both responses read naturally with no awkwardness"
      "Somewhat Fluent": "Minor awkwardness or unnatural phrasing in one or both"
      "Not Fluent": "Significant issues with naturalness or coherence"

annotation_instructions: |
  You will see a prompt and two responses. One was written by a human and the other
  was generated by an AI language model. Your task is to identify which is which.

  For each item:
  1. Read the prompt carefully.
  2. Read both Response A and Response B thoroughly.
  3. Decide whether Response A was written by a human or generated by AI, and select the matching label.
  4. Rate the overall fluency of both responses.

  Tips for distinguishing human vs. AI:
  - Human responses may be more concise or contain informal language
  - AI responses may be more structured, verbose, or formulaic
  - Look for patterns like numbered lists, hedging language, or overly balanced viewpoints
  - Trust your intuition - if both seem equally natural, select "Cannot Tell"

html_layout: |
  <div style="padding: 15px; max-width: 900px; margin: auto;">
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Prompt:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="display: flex; gap: 16px;">
      <div style="flex: 1; background: #e3f2fd; border: 2px solid #1976d2; border-radius: 8px; padding: 16px;">
        <h4 style="margin-top: 0; color: #1976d2;">Response A:</h4>
        <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6;">{{response_a}}</div>
      </div>
      <div style="flex: 1; background: #fce4ec; border: 2px solid #c62828; border-radius: 8px; padding: 16px;">
        <h4 style="margin-top: 0; color: #c62828;">Response B:</h4>
        <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6;">{{response_b}}</div>
      </div>
    </div>
  </div>

allow_all_users: true
instances_per_annotator: 100
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
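
The config above ties each data item to four fields: the id_key and text_key entries, plus the response_a and response_b placeholders in html_layout. A minimal Python sketch of a pre-flight check on the data file follows. It is not part of Potato; the helper names required_fields and find_problems are hypothetical, and the layout string is shortened for illustration.

```python
import re

# Matches template placeholders such as {{text}} in the html_layout string.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def required_fields(html_layout, id_key="id", text_key="text"):
    """Return every item field the config depends on."""
    return {id_key, text_key} | set(PLACEHOLDER.findall(html_layout))


def find_problems(items, fields, id_key="id"):
    """List problems: missing or empty fields and duplicate ids."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        for field in sorted(fields):
            value = item.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"item {index}: missing or empty '{field}'")
        item_id = item.get(id_key)
        if item_id is not None:
            if item_id in seen_ids:
                problems.append(f"item {index}: duplicate id '{item_id}'")
            seen_ids.add(item_id)
    return problems


# Shortened stand-in for the html_layout value in config.yaml.
layout = "<p>{{text}}</p><div>{{response_a}}</div><div>{{response_b}}</div>"
fields = required_fields(layout)

# Illustrative items: the second one reuses an id and lacks response_b.
items = [
    {"id": "spin_001", "text": "Prompt?", "response_a": "A", "response_b": "B"},
    {"id": "spin_001", "text": "Prompt?", "response_a": "A"},
]
for problem in find_problems(items, fields):
    print(problem)
```

In practice the items list would come from json.load on sample-data.json, and an empty result suggests the file matches what the layout expects.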

Sample Data: sample-data.json

json

[
  {
    "id": "spin_001",
    "text": "What's the best way to learn a new language as an adult?",
    "response_a": "Honestly, immersion is the best way. I moved to Spain knowing barely any Spanish and within 6 months I was conversational. Duolingo and classes help, but nothing beats being forced to use the language every day. Also, don't be afraid to make mistakes - locals appreciate the effort.",
    "response_b": "Learning a new language as an adult can be achieved through several effective strategies:\n\n1. Consistent daily practice (even 15-30 minutes)\n2. Immersive experiences through media, conversation partners, or travel\n3. Spaced repetition systems for vocabulary retention\n4. Focus on high-frequency words first\n5. Practice speaking early, even if imperfectly\n\nResearch suggests that adults can actually learn languages efficiently due to their existing knowledge of grammar concepts and learning strategies."
  },
  {
    "id": "spin_002",
    "text": "Explain why exercise is important for mental health.",
    "response_a": "Exercise triggers the release of endorphins, serotonin, and other neurotransmitters that naturally improve mood and reduce stress. Regular physical activity has been shown to:\n\n1. Reduce symptoms of depression and anxiety\n2. Improve sleep quality\n3. Boost self-esteem and cognitive function\n4. Provide a healthy coping mechanism for stress\n5. Increase social interaction opportunities\n\nStudies indicate that even moderate exercise, such as 30 minutes of walking three times a week, can have significant positive effects on mental well-being.",
    "response_b": "When I started running three years ago after a rough patch, I didn't expect it to help my anxiety as much as it did. There's something about the physical exhaustion that quiets the mental noise. Science backs this up too - exercise releases feel-good chemicals in your brain. But beyond the chemistry, it gives you structure, goals, and sometimes community. Even on days I don't feel like going, I've never regretted a workout."
  }
]

// ... and 8 more items
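
Items in this shape could be generated from (prompt, human response, AI response) triples. The sketch below is one possible approach, not the method used to produce the sample file: it randomizes which side holds the human response so that position gives annotators no cue, and it keeps the answer key in a separate mapping. The function name build_items, the gold mapping, and the example triples are all hypothetical.

```python
import json
import random


def build_items(triples, seed=0):
    """Turn (prompt, human, ai) triples into annotation items plus a gold key.

    The side holding the human response is chosen at random per item.
    The gold key is returned separately so it never reaches annotators.
    """
    rng = random.Random(seed)
    items, gold = [], {}
    for number, (prompt, human, ai) in enumerate(triples, start=1):
        item_id = f"spin_{number:03d}"
        human_side = rng.choice(["a", "b"])
        first, second = (human, ai) if human_side == "a" else (ai, human)
        items.append(
            {
                "id": item_id,
                "text": prompt,
                "response_a": first,
                "response_b": second,
            }
        )
        gold[item_id] = human_side
    return items, gold


# Placeholder triples; real ones would come from the SPIN dataset.
triples = [
    ("Hypothetical prompt 1", "human answer 1", "model answer 1"),
    ("Hypothetical prompt 2", "human answer 2", "model answer 2"),
]
items, gold = build_items(triples, seed=13)
print(json.dumps(items, indent=2))
```

Writing items to sample-data.json and storing gold elsewhere keeps the data file free of any field that reveals the source of a response.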

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/preference-learning/spin-self-play
potato start config.yaml
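
Once annotations are collected, the three judgments per item (annotation_per_instance: 3) can be reduced to a majority label and compared against a known answer key. The sketch below assumes the annotations have already been loaded into (item id, label) pairs; it does not parse Potato's output files, whose exact layout is not shown here. The gold mapping from item id to the side holding the human response ("a" or "b") is hypothetical, as are the function names.

```python
from collections import Counter


def majority_label(labels):
    """Return the most common label, or None when first place is tied."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def detection_accuracy(annotations, gold):
    """Fraction of items whose majority label names Response A's true source."""
    by_item = {}
    for item_id, label in annotations:
        by_item.setdefault(item_id, []).append(label)
    if not by_item:
        return 0.0
    # Labels describe Response A, so the correct label depends on the gold side.
    expected = {"a": "Human Response", "b": "AI Response"}
    correct = sum(
        majority_label(labels) == expected[gold[item_id]]
        for item_id, labels in by_item.items()
    )
    return correct / len(by_item)


# Illustrative judgments: three per item, as configured.
annotations = [
    ("spin_001", "Human Response"),
    ("spin_001", "Human Response"),
    ("spin_001", "Cannot Tell"),
    ("spin_002", "Human Response"),
    ("spin_002", "AI Response"),
    ("spin_002", "Cannot Tell"),
]
gold = {"spin_001": "a", "spin_002": "b"}
print(detection_accuracy(annotations, gold))  # prints 0.5
```

Ties and "Cannot Tell" majorities count as misses in this sketch; treating them as a separate category is an equally reasonable design choice.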

Dataset & paper

Chen et al., ICML 2024

Official dataset: https://huggingface.co/datasets/UCLA-AGI/SPIN_iter0
Read the paper: https://arxiv.org/abs/2401.01335

Citation (BibTeX)

bibtex

@inproceedings{chen2024self,
    title = "Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models",
    author = "Chen, Zixiang and Deng, Yihe and Yuan, Huizhuo and Ji, Kaixuan and Gu, Quanquan",
    booktitle = "Proceedings of the 41st International Conference on Machine Learning",
    year = "2024",
    url = "https://arxiv.org/abs/2401.01335"
}

Details

Annotation Types

pairwise, radio

Domain

NLP, AI Alignment

Use Cases

Preference Learning, Self-Play, LLM Training

Related Designs

DPO Preference Data Collection

Pairwise preference annotation for Direct Preference Optimization, based on Rafailov et al., NeurIPS 2023. Annotators compare two model responses to a prompt, select a preference, rate alignment dimensions, and provide reasoning.

pairwise, radio

RewardBench - Reward Model Evaluation

Evaluation of reward model preferences via pairwise comparison of chosen and rejected responses. Annotators judge which response is better across chat, chat hard, safety, and reasoning categories, and rate response quality on multiple dimensions.

pairwise, radio

RewardBench: Reward Model Evaluation via Pairwise Preference

Reward model evaluation through pairwise preference annotation. Annotators choose between a 'chosen' (better) and 'rejected' (worse) response across RewardBench's four categories (chat, chat hard, safety, reasoning).

pairwise, radio
