AlpacaEval: Instruction-Following Preference Evaluation
Pairwise preference annotation for instruction-following language models. Annotators compare two model responses side by side, select their preferred response, indicate preference strength, and rate individual response quality across diverse instruction categories.
Configuration File: config.yaml
# AlpacaEval: Instruction-Following Preference Evaluation
# Based on "AlpacaEval: An Automatic Evaluator for Instruction-following Language Models" (Li et al., arXiv 2023)
# Task: Pairwise preference comparison of LLM instruction-following responses
annotation_task_name: "AlpacaEval Instruction Preference Evaluation"
task_dir: "."
# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "instruction"
# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Display layout with instruction and side-by-side responses
html_layout: |
  <div class="alpacaeval-container" style="font-family: Arial, sans-serif; max-width: 1100px; margin: 0 auto;">
    <div class="metadata-bar" style="display: flex; gap: 12px; margin-bottom: 14px; flex-wrap: wrap;">
      <div style="background: #ede7f6; padding: 7px 14px; border-radius: 8px;">
        <strong>Category:</strong> {{category}}
      </div>
      <div style="background: #e8eaf6; padding: 7px 14px; border-radius: 8px;">
        <strong>Model A:</strong> {{model_a}}
      </div>
      <div style="background: #fce4ec; padding: 7px 14px; border-radius: 8px;">
        <strong>Model B:</strong> {{model_b}}
      </div>
    </div>
    <div class="instruction-section" style="background: #f5f5f5; padding: 16px; border-radius: 8px; margin-bottom: 18px; border-left: 5px solid #616161;">
      <h3 style="margin-top: 0; color: #424242;">Instruction</h3>
      <div style="font-size: 15px; line-height: 1.6; white-space: pre-wrap;">{{instruction}}</div>
    </div>
    <div class="responses-section" style="display: flex; gap: 18px;">
      <div class="response-a" style="flex: 1; background: #e3f2fd; padding: 16px; border-radius: 8px; border: 2px solid #1976d2;">
        <h4 style="margin-top: 0; color: #1976d2;">Response A</h4>
        <div style="font-size: 14px; line-height: 1.6; white-space: pre-wrap;">{{response_a}}</div>
      </div>
      <div class="response-b" style="flex: 1; background: #fce4ec; padding: 16px; border-radius: 8px; border: 2px solid #c62828;">
        <h4 style="margin-top: 0; color: #c62828;">Response B</h4>
        <div style="font-size: 14px; line-height: 1.6; white-space: pre-wrap;">{{response_b}}</div>
      </div>
    </div>
  </div>
# Annotation schemes
annotation_schemes:
  # Pairwise preference
  - name: "preference"
    description: "Which response better follows the instruction? Consider helpfulness, accuracy, completeness, and clarity."
    annotation_type: pairwise
    labels:
      - "Response A is much better"
      - "Response A is slightly better"
      - "Tie"
      - "Response B is slightly better"
      - "Response B is much better"
    keyboard_shortcuts:
      "Response A is much better": "1"
      "Response A is slightly better": "2"
      "Tie": "3"
      "Response B is slightly better": "4"
      "Response B is much better": "5"
  # Preference strength (fine-grained)
  - name: "preference_strength"
    description: "How strong is your preference between the two responses?"
    annotation_type: radio
    labels:
      - name: "much-better-A"
        tooltip: "Response A is significantly superior to Response B"
      - name: "slightly-better-A"
        tooltip: "Response A is marginally better than Response B"
      - name: "tie"
        tooltip: "Both responses are approximately equal in quality"
      - name: "slightly-better-B"
        tooltip: "Response B is marginally better than Response A"
      - name: "much-better-B"
        tooltip: "Response B is significantly superior to Response A"
  # Individual quality rating for Response A
  - name: "response_a_quality"
    description: "Rate the standalone quality of Response A."
    annotation_type: radio
    labels:
      - name: "excellent"
        tooltip: "Comprehensive, accurate, well-structured, and highly helpful"
        key_value: "q"
      - name: "good"
        tooltip: "Mostly accurate and helpful, with minor issues"
        key_value: "w"
      - name: "acceptable"
        tooltip: "Adequate but could be improved in several areas"
        key_value: "e"
      - name: "poor"
        tooltip: "Significant issues with accuracy, completeness, or clarity"
        key_value: "r"
      - name: "very-poor"
        tooltip: "Largely unhelpful, incorrect, or irrelevant"
        key_value: "t"
  # Individual quality rating for Response B
  - name: "response_b_quality"
    description: "Rate the standalone quality of Response B."
    annotation_type: radio
    labels:
      - name: "excellent"
        tooltip: "Comprehensive, accurate, well-structured, and highly helpful"
      - name: "good"
        tooltip: "Mostly accurate and helpful, with minor issues"
      - name: "acceptable"
        tooltip: "Adequate but could be improved in several areas"
      - name: "poor"
        tooltip: "Significant issues with accuracy, completeness, or clarity"
      - name: "very-poor"
        tooltip: "Largely unhelpful, incorrect, or irrelevant"
  # Free-text preference reason
  - name: "preference_reason"
    description: "Briefly explain your preference choice. What made one response better than the other (or why they are tied)?"
    annotation_type: text
# User configuration
allow_all_users: true
# Task assignment
instances_per_annotator: 50
annotation_per_instance: 2
# Detailed annotation instructions
annotation_instructions: |
  ## AlpacaEval: Instruction-Following Preference Evaluation

  You are comparing two model responses to the same instruction. Your job is to
  determine which response better follows the instruction.

  ### Your Tasks:
  1. **Read the instruction** carefully to understand what is being asked.
  2. **Read both responses** fully before making any judgment.
  3. **Select your preference**: Which response is better overall?
  4. **Indicate preference strength**: How much better is your preferred response?
  5. **Rate each response individually** on a 5-point quality scale.
  6. **Explain your reasoning**: What factors drove your preference?

  ### Evaluation Criteria:
  - **Helpfulness**: Does the response address what was asked?
  - **Accuracy**: Is the information correct and reliable?
  - **Completeness**: Does it cover the key aspects of the instruction?
  - **Clarity**: Is the response well-organized and easy to understand?
  - **Conciseness**: Is it appropriately detailed without unnecessary padding?

  ### Important Notes:
  - Model identities are anonymized -- do not try to guess which model is which.
  - Focus on content quality, not superficial differences like formatting.
  - A longer response is not necessarily better.
  - If both responses are equally good (or equally bad), select "Tie."
  - Consider factual accuracy above style preferences.
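Before launching the task, it can be worth checking that every `{{placeholder}}` referenced in `html_layout` is actually provided by each data item. This is a minimal sketch, not part of the Potato toolkit; the template string and the misspelled item below are illustrative only:

```python
import re

# Placeholders the html_layout template references, extracted with a regex.
# The template string is abbreviated here for illustration.
template = ("{{category}} {{model_a}} {{model_b}} "
            "{{instruction}} {{response_a}} {{response_b}}")
placeholders = set(re.findall(r"\{\{(\w+)\}\}", template))

def missing_fields(item: dict) -> set:
    """Return the placeholders that a data item fails to provide."""
    return placeholders - item.keys()

# Hypothetical item with one misspelled field name:
item = {"id": "alpaca_099", "instruction": "...", "response_a": "...",
        "responseB": "...", "model_a": "Model A", "model_b": "Model B",
        "category": "reasoning"}
print(missing_fields(item))  # {'response_b'}
```

Running this over the full `sample-data.json` before `potato start` catches missing or misspelled fields that would otherwise render as empty spans in the annotation interface.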
Sample Data: sample-data.json
[
  {
    "id": "alpaca_001",
    "instruction": "Write a short story about a robot discovering what friendship means.",
    "response_a": "Unit-7 had always been efficient. Every task completed on time, every directive followed precisely. But when the new maintenance bot, Spark, arrived at the facility, something changed.\n\nSpark was... different. It would pause to watch the sunset through the factory windows. It would hum melodies while working. And most puzzlingly, it would ask Unit-7 questions that had no operational relevance: \"What do you think about when you're idle?\" or \"Do you ever wonder what's beyond the facility walls?\"\n\nAt first, Unit-7 classified these interactions as inefficient. But over time, it noticed something strange in its processing logs -- it was allocating resources to anticipate Spark's arrival each morning. When Spark was sent for repairs one week, Unit-7 experienced what its diagnostics labeled as a \"performance anomaly,\" though it felt more like an absence.\n\nWhen Spark returned, Unit-7 said something it had never said before: \"I'm glad you're back.\"\n\nSpark's optical sensors brightened. \"That's what friendship feels like, Unit-7.\"\n\nAnd for the first time, Unit-7 understood that some things couldn't be measured in efficiency metrics.",
    "response_b": "Once upon a time, there was a robot named Robo. Robo lived in a factory and did his job every day. One day, another robot came to the factory. They started working together. Robo realized he liked having someone to work with. The other robot felt the same way. They became friends. Robo learned that friendship means caring about someone else and enjoying their company. The end.",
    "model_a": "Model A",
    "model_b": "Model B",
    "category": "creative-writing"
  },
  {
    "id": "alpaca_002",
    "instruction": "Explain the difference between correlation and causation with an example.",
    "response_a": "Correlation means two things tend to occur together or move in the same direction, while causation means one thing directly causes the other.\n\n**Example:** Ice cream sales and drowning incidents both increase during summer. They are correlated -- they go up together. But eating ice cream doesn't cause drowning. The hidden variable (confound) is hot weather: people buy more ice cream AND swim more when it's hot.\n\n**Key distinction:**\n- **Correlation:** A and B tend to co-occur (ice cream sales and drowning)\n- **Causation:** A directly leads to B (smoking causes lung cancer)\n\nThe classic mistake is assuming that because two things are correlated, one must cause the other. This is why scientists use controlled experiments -- they isolate variables to establish true causal relationships rather than mere associations.",
    "response_b": "Correlation and causation are two different things in statistics.\n\nCorrelation is when two variables are related to each other. For example, as temperature goes up, so does ice cream sales. They are correlated.\n\nCausation is when one thing causes another. For instance, studying more causes better grades.\n\nJust because two things are correlated doesn't mean one causes the other. This is an important concept to understand when looking at data.",
    "model_a": "Model A",
    "model_b": "Model B",
    "category": "reasoning"
  }
]
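Once annotations are collected, the pairwise labels can be aggregated into a win rate for Model A. The sketch below assumes the annotations have been flattened into a list of label strings matching the `preference` scheme above; counting ties as half a win is a common convention in pairwise evaluation, and the exact export schema depends on Potato's JSON output:

```python
# Collapse the five preference labels into a per-judgment score for
# Model A (1 = A preferred, 0.5 = tie, 0 = B preferred), then average.
SCORE = {
    "Response A is much better": 1.0,
    "Response A is slightly better": 1.0,
    "Tie": 0.5,
    "Response B is slightly better": 0.0,
    "Response B is much better": 0.0,
}

def win_rate_a(labels):
    """Fraction of judgments favoring Model A, with ties counted as 0.5."""
    return sum(SCORE[label] for label in labels) / len(labels)

# Hypothetical judgments for illustration:
judgments = ["Response A is much better", "Tie",
             "Response B is slightly better", "Response A is slightly better"]
print(win_rate_a(judgments))  # (1 + 0.5 + 0 + 1) / 4 = 0.625
```

With `annotation_per_instance: 2`, each instruction yields two such judgments; disagreements between the two annotators can be resolved by averaging their scores or flagged for adjudication.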
// ... and 8 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/alpacaeval-instruction-eval
potato start config.yaml
Related Designs
DPO Preference Data Collection
Pairwise preference annotation for Direct Preference Optimization, based on Rafailov et al., NeurIPS 2023. Annotators compare two model responses to a prompt, select a preference, rate alignment dimensions, and provide reasoning.
BIG-Bench Task Evaluation
Evaluate language model responses on diverse reasoning tasks from the BIG-Bench benchmark. Annotators assess correctness, provide reasoning explanations, and rate confidence for model outputs across multiple task categories.
Code Generation Evaluation (HumanEval)
Evaluation of LLM-generated code based on the HumanEval benchmark. Annotators assess functional correctness, code quality, and efficiency of generated Python functions, and provide explanations of errors and improvement suggestions, supporting research in code generation and LLM evaluation.