AlpacaEval: Instruction-Following Preference Evaluation
Pairwise preference annotation for instruction-following language models. Annotators compare two model responses side by side, select their preferred response, indicate preference strength, and rate individual response quality across diverse instruction categories.
Configuration File: config.yaml
# AlpacaEval: Instruction-Following Preference Evaluation
# Based on "AlpacaEval: An Automatic Evaluator for Instruction-following Language Models" (Li et al., arXiv 2023)
# Task: Pairwise preference comparison of LLM instruction-following responses
annotation_task_name: "AlpacaEval Instruction Preference Evaluation"
task_dir: "."
# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "instruction"
# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Display layout with instruction and side-by-side responses
html_layout: |
  <div class="alpacaeval-container" style="font-family: Arial, sans-serif; max-width: 1100px; margin: 0 auto;">
    <div class="metadata-bar" style="display: flex; gap: 12px; margin-bottom: 14px; flex-wrap: wrap;">
      <div style="background: #ede7f6; padding: 7px 14px; border-radius: 8px;">
        <strong>Category:</strong> {{category}}
      </div>
      <div style="background: #e8eaf6; padding: 7px 14px; border-radius: 8px;">
        <strong>Model A:</strong> {{model_a}}
      </div>
      <div style="background: #fce4ec; padding: 7px 14px; border-radius: 8px;">
        <strong>Model B:</strong> {{model_b}}
      </div>
    </div>
    <div class="instruction-section" style="background: #f5f5f5; padding: 16px; border-radius: 8px; margin-bottom: 18px; border-left: 5px solid #616161;">
      <h3 style="margin-top: 0; color: #424242;">Instruction</h3>
      <div style="font-size: 15px; line-height: 1.6; white-space: pre-wrap;">{{instruction}}</div>
    </div>
    <div class="responses-section" style="display: flex; gap: 18px;">
      <div class="response-a" style="flex: 1; background: #e3f2fd; padding: 16px; border-radius: 8px; border: 2px solid #1976d2;">
        <h4 style="margin-top: 0; color: #1976d2;">Response A</h4>
        <div style="font-size: 14px; line-height: 1.6; white-space: pre-wrap;">{{response_a}}</div>
      </div>
      <div class="response-b" style="flex: 1; background: #fce4ec; padding: 16px; border-radius: 8px; border: 2px solid #c62828;">
        <h4 style="margin-top: 0; color: #c62828;">Response B</h4>
        <div style="font-size: 14px; line-height: 1.6; white-space: pre-wrap;">{{response_b}}</div>
      </div>
    </div>
  </div>
# Annotation schemes
annotation_schemes:
  # Pairwise preference
  - name: "preference"
    description: "Which response better follows the instruction? Consider helpfulness, accuracy, completeness, and clarity."
    annotation_type: pairwise
    labels:
      - "Response A is much better"
      - "Response A is slightly better"
      - "Tie"
      - "Response B is slightly better"
      - "Response B is much better"
    keyboard_shortcuts:
      "Response A is much better": "1"
      "Response A is slightly better": "2"
      "Tie": "3"
      "Response B is slightly better": "4"
      "Response B is much better": "5"
  # Preference strength (fine-grained)
  - name: "preference_strength"
    description: "How strong is your preference between the two responses?"
    annotation_type: radio
    labels:
      - name: "much-better-A"
        tooltip: "Response A is significantly superior to Response B"
      - name: "slightly-better-A"
        tooltip: "Response A is marginally better than Response B"
      - name: "tie"
        tooltip: "Both responses are approximately equal in quality"
      - name: "slightly-better-B"
        tooltip: "Response B is marginally better than Response A"
      - name: "much-better-B"
        tooltip: "Response B is significantly superior to Response A"
  # Individual quality rating for Response A
  - name: "response_a_quality"
    description: "Rate the standalone quality of Response A."
    annotation_type: radio
    labels:
      - name: "excellent"
        tooltip: "Comprehensive, accurate, well-structured, and highly helpful"
        key_value: "q"
      - name: "good"
        tooltip: "Mostly accurate and helpful, with minor issues"
        key_value: "w"
      - name: "acceptable"
        tooltip: "Adequate but could be improved in several areas"
        key_value: "e"
      - name: "poor"
        tooltip: "Significant issues with accuracy, completeness, or clarity"
        key_value: "r"
      - name: "very-poor"
        tooltip: "Largely unhelpful, incorrect, or irrelevant"
        key_value: "t"
  # Individual quality rating for Response B
  - name: "response_b_quality"
    description: "Rate the standalone quality of Response B."
    annotation_type: radio
    labels:
      - name: "excellent"
        tooltip: "Comprehensive, accurate, well-structured, and highly helpful"
      - name: "good"
        tooltip: "Mostly accurate and helpful, with minor issues"
      - name: "acceptable"
        tooltip: "Adequate but could be improved in several areas"
      - name: "poor"
        tooltip: "Significant issues with accuracy, completeness, or clarity"
      - name: "very-poor"
        tooltip: "Largely unhelpful, incorrect, or irrelevant"
  # Free-text preference reason
  - name: "preference_reason"
    description: "Briefly explain your preference choice. What made one response better than the other (or why they are tied)?"
    annotation_type: text
# User configuration
allow_all_users: true
# Task assignment
instances_per_annotator: 50
annotation_per_instance: 2
# Detailed annotation instructions
annotation_instructions: |
  ## AlpacaEval: Instruction-Following Preference Evaluation

  You are comparing two model responses to the same instruction. Your job is to
  determine which response better follows the instruction.

  ### Your Tasks:
  1. **Read the instruction** carefully to understand what is being asked.
  2. **Read both responses** fully before making any judgment.
  3. **Select your preference**: Which response is better overall?
  4. **Indicate preference strength**: How much better is your preferred response?
  5. **Rate each response individually** on a 5-point quality scale.
  6. **Explain your reasoning**: What factors drove your preference?

  ### Evaluation Criteria:
  - **Helpfulness**: Does the response address what was asked?
  - **Accuracy**: Is the information correct and reliable?
  - **Completeness**: Does it cover the key aspects of the instruction?
  - **Clarity**: Is the response well-organized and easy to understand?
  - **Conciseness**: Is it appropriately detailed without unnecessary padding?

  ### Important Notes:
  - Model identities are anonymized -- do not try to guess which model is which.
  - Focus on content quality, not superficial differences like formatting.
  - A longer response is not necessarily better.
  - If both responses are equally good (or equally bad), select "Tie."
  - Consider factual accuracy above style preferences.
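Before launching the task, it can be worth checking that every `{{placeholder}}` referenced in `html_layout` is actually provided by each data item. This is a minimal sketch, not part of the Potato toolkit; the template string and the misspelled item below are illustrative only:

```python
import re

# Placeholders the html_layout template references, extracted with a regex.
# The template string is abbreviated here for illustration.
template = ("{{category}} {{model_a}} {{model_b}} "
            "{{instruction}} {{response_a}} {{response_b}}")
placeholders = set(re.findall(r"\{\{(\w+)\}\}", template))

def missing_fields(item: dict) -> set:
    """Return the placeholders that a data item fails to provide."""
    return placeholders - item.keys()

# Hypothetical item with one misspelled field name:
item = {"id": "alpaca_099", "instruction": "...", "response_a": "...",
        "responseB": "...", "model_a": "Model A", "model_b": "Model B",
        "category": "reasoning"}
print(missing_fields(item))  # {'response_b'}
```

Running this over the full `sample-data.json` before `potato start` catches missing or misspelled fields that would otherwise render as empty spans in the annotation interface.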
Sample Data: sample-data.json
[
  {
    "id": "alpaca_001",
    "instruction": "Write a short story about a robot discovering what friendship means.",
    "response_a": "Unit-7 had always been efficient. Every task completed on time, every directive followed precisely. But when the new maintenance bot, Spark, arrived at the facility, something changed.\n\nSpark was... different. It would pause to watch the sunset through the factory windows. It would hum melodies while working. And most puzzlingly, it would ask Unit-7 questions that had no operational relevance: \"What do you think about when you're idle?\" or \"Do you ever wonder what's beyond the facility walls?\"\n\nAt first, Unit-7 classified these interactions as inefficient. But over time, it noticed something strange in its processing logs -- it was allocating resources to anticipate Spark's arrival each morning. When Spark was sent for repairs one week, Unit-7 experienced what its diagnostics labeled as a \"performance anomaly,\" though it felt more like an absence.\n\nWhen Spark returned, Unit-7 said something it had never said before: \"I'm glad you're back.\"\n\nSpark's optical sensors brightened. \"That's what friendship feels like, Unit-7.\"\n\nAnd for the first time, Unit-7 understood that some things couldn't be measured in efficiency metrics.",
    "response_b": "Once upon a time, there was a robot named Robo. Robo lived in a factory and did his job every day. One day, another robot came to the factory. They started working together. Robo realized he liked having someone to work with. The other robot felt the same way. They became friends. Robo learned that friendship means caring about someone else and enjoying their company. The end.",
    "model_a": "Model A",
    "model_b": "Model B",
    "category": "creative-writing"
  },
  {
    "id": "alpaca_002",
    "instruction": "Explain the difference between correlation and causation with an example.",
    "response_a": "Correlation means two things tend to occur together or move in the same direction, while causation means one thing directly causes the other.\n\n**Example:** Ice cream sales and drowning incidents both increase during summer. They are correlated -- they go up together. But eating ice cream doesn't cause drowning. The hidden variable (confound) is hot weather: people buy more ice cream AND swim more when it's hot.\n\n**Key distinction:**\n- **Correlation:** A and B tend to co-occur (ice cream sales and drowning)\n- **Causation:** A directly leads to B (smoking causes lung cancer)\n\nThe classic mistake is assuming that because two things are correlated, one must cause the other. This is why scientists use controlled experiments -- they isolate variables to establish true causal relationships rather than mere associations.",
    "response_b": "Correlation and causation are two different things in statistics.\n\nCorrelation is when two variables are related to each other. For example, as temperature goes up, so does ice cream sales. They are correlated.\n\nCausation is when one thing causes another. For instance, studying more causes better grades.\n\nJust because two things are correlated doesn't mean one causes the other. This is an important concept to understand when looking at data.",
    "model_a": "Model A",
    "model_b": "Model B",
    "category": "reasoning"
  }
]
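Once annotations are collected, the pairwise labels can be aggregated into a win rate for Model A. The sketch below assumes the annotations have been flattened into a list of label strings matching the `preference` scheme above; counting ties as half a win is a common convention in pairwise evaluation, and the exact export schema depends on Potato's JSON output:

```python
# Collapse the five preference labels into a per-judgment score for
# Model A (1 = A preferred, 0.5 = tie, 0 = B preferred), then average.
SCORE = {
    "Response A is much better": 1.0,
    "Response A is slightly better": 1.0,
    "Tie": 0.5,
    "Response B is slightly better": 0.0,
    "Response B is much better": 0.0,
}

def win_rate_a(labels):
    """Fraction of judgments favoring Model A, with ties counted as 0.5."""
    return sum(SCORE[label] for label in labels) / len(labels)

# Hypothetical judgments for illustration:
judgments = ["Response A is much better", "Tie",
             "Response B is slightly better", "Response A is slightly better"]
print(win_rate_a(judgments))  # (1 + 0.5 + 0 + 1) / 4 = 0.625
```

With `annotation_per_instance: 2`, each instruction yields two such judgments; disagreements between the two annotators can be resolved by averaging their scores or flagged for adjudication.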
// ... and 8 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/alpacaeval-instruction-eval
potato start config.yaml
Related Designs
DPO Preference Data Collection
Pairwise preference annotation for Direct Preference Optimization, based on Rafailov et al., NeurIPS 2023. Annotators compare two model responses to a prompt, select a preference, rate alignment dimensions, and provide reasoning.
BIG-Bench Task Evaluation
Evaluate language model responses on diverse reasoning tasks from the BIG-Bench benchmark. Annotators assess correctness, provide reasoning explanations, and rate confidence for model outputs across multiple task categories.
Code Generation Evaluation (HumanEval)
Evaluation of LLM-generated code based on the HumanEval benchmark. Annotators assess functional correctness, code quality, and efficiency of generated Python functions, and provide explanations of errors and improvement suggestions, supporting research in code generation and LLM evaluation.