WildBench - LLM Evaluation on Real-World Tasks
Evaluation of LLM outputs on challenging real-world user queries from WildBench. Annotators compare two model responses via pairwise preference, rate overall quality on a Likert scale, and provide reasoning for their judgments.
Configuration File: config.yaml
# WildBench - LLM Evaluation on Real-World Tasks
# Based on Lin et al., COLM 2024
# Paper: https://arxiv.org/abs/2406.04770
# Dataset: https://huggingface.co/datasets/allenai/WildBench
#
# This task evaluates LLM outputs on challenging real-world user queries.
# Annotators compare two model responses via pairwise preference, rate
# overall quality on a Likert scale, and provide written reasoning.
#
# Evaluation Criteria:
# - Helpfulness: Does the response address the user's needs?
# - Accuracy: Is the information correct and reliable?
# - Depth: Does the response provide sufficient detail?
# - Clarity: Is the response well-organized and easy to follow?
#
# Annotation Guidelines:
# 1. Read the user query carefully to understand the intent
# 2. Read both model responses thoroughly
# 3. Select which response is better overall, or mark as tie
# 4. Rate the overall quality of the better response on a 1-5 scale
# 5. Provide a brief explanation of your reasoning
annotation_task_name: "WildBench LLM Evaluation"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  # Step 1: Pairwise comparison
  - annotation_type: pairwise
    name: preference
    description: "Which model response better addresses the user query?"
    mode: "binary"
    labels:
      - "Model A Better"
      - "Model B Better"
      - "Tie"
    keyboard_shortcuts:
      "Model A Better": "a"
      "Model B Better": "b"
      "Tie": "t"
    tooltips:
      "Model A Better": "Response A is clearly superior in addressing the query"
      "Model B Better": "Response B is clearly superior in addressing the query"
      "Tie": "Both responses are roughly equal in quality"
  # Step 2: Overall quality rating
  - annotation_type: likert
    name: overall_quality
    description: "Rate the overall quality of the responses on a 1-5 scale."
    min_label: "Very Poor"
    max_label: "Excellent"
    size: 5
  # Step 3: Written reasoning
  - annotation_type: text
    name: reasoning
    description: "Briefly explain why you preferred one response over the other."
    textarea: true
    required: false
    placeholder: "What factors influenced your preference judgment?"
annotation_instructions: |
  You will evaluate pairs of LLM responses to real-world user queries from WildBench.
  For each item:
  1. Read the user query carefully.
  2. Read both model responses (A and B) thoroughly.
  3. Select which response is better, or mark as Tie.
  4. Rate the overall quality on a 1-5 scale.
  5. Explain your reasoning briefly.
  Consider: helpfulness, accuracy, depth, clarity, and relevance to the query.
html_layout: |
  <div style="padding: 15px; max-width: 900px; margin: auto;">
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">User Query:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="display: flex; gap: 16px;">
      <div style="flex: 1; background: #e3f2fd; border: 2px solid #1976d2; border-radius: 8px; padding: 16px;">
        <h4 style="margin-top: 0; color: #1976d2;">Response A:</h4>
        <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6;">{{response_a}}</div>
      </div>
      <div style="flex: 1; background: #fce4ec; border: 2px solid #c62828; border-radius: 8px; padding: 16px;">
        <h4 style="margin-top: 0; color: #c62828;">Response B:</h4>
        <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6;">{{response_b}}</div>
      </div>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
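The config's `html_layout` references four fields per item: the two declared in `item_properties` (`id`, `text`) plus the `{{response_a}}` and `{{response_b}}` template slots. A quick sanity check before launching Potato can catch items that would render with empty panels. This is a hypothetical helper, not part of Potato itself; it only assumes the data-file layout shown in sample-data.json below.

```python
import json

# Keys the config relies on: id_key/text_key from item_properties,
# plus the {{response_a}}/{{response_b}} slots in html_layout.
REQUIRED_KEYS = {"id", "text", "response_a", "response_b"}

def validate_items(path):
    """Return (index, id, missing_keys) for every item lacking a required field."""
    with open(path) as f:
        items = json.load(f)
    problems = []
    for i, item in enumerate(items):
        missing = REQUIRED_KEYS - set(item)
        if missing:
            problems.append((i, item.get("id", "<no id>"), sorted(missing)))
    return problems
```

Running `validate_items("sample-data.json")` should return an empty list for a well-formed data file; any tuples it returns point at items that would break the side-by-side layout.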
Sample Data: sample-data.json
[
  {
    "id": "wildbench_001",
    "text": "I need to write a resignation letter for my current position as a software engineer. I want to keep it professional but also express gratitude for the opportunities I've had. Can you help me draft one?",
"response_a": "Subject: Resignation - [Your Name]\n\nDear [Manager's Name],\n\nI am writing to formally notify you of my resignation from my position as Software Engineer at [Company], effective [last working day, typically two weeks from now].\n\nI want to express my sincere gratitude for the opportunities I've had during my time here. Working on [specific project] was a highlight of my career, and I've grown significantly as an engineer thanks to the mentorship and collaborative environment.\n\nI am committed to ensuring a smooth transition. I will document my current projects and am happy to help train my replacement during my remaining time.\n\nThank you for everything.\n\nBest regards,\n[Your Name]",
"response_b": "Here's a resignation letter:\n\nDear Boss,\n\nI quit. My last day will be in two weeks.\n\nThanks for the job. I learned some stuff here.\n\nBye,\n[Name]"
},
{
"id": "wildbench_002",
"text": "Explain the difference between TCP and UDP protocols. When would you choose one over the other? Give real examples.",
"response_a": "TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) are both transport layer protocols, but they differ in key ways:\n\n**TCP:**\n- Connection-oriented: establishes a connection before data transfer (3-way handshake)\n- Reliable: guarantees delivery, ordering, and error-checking\n- Slower due to overhead\n- Use cases: web browsing (HTTP/HTTPS), email (SMTP), file transfer (FTP)\n\n**UDP:**\n- Connectionless: sends data without establishing a connection\n- Unreliable: no guarantee of delivery or ordering\n- Faster with less overhead\n- Use cases: video streaming, online gaming, DNS queries, VoIP\n\n**When to choose:**\n- Choose TCP when data integrity is critical (financial transactions, file downloads)\n- Choose UDP when speed matters more than reliability (live streaming, gaming where occasional packet loss is acceptable)",
"response_b": "TCP and UDP are networking protocols. TCP is more reliable and UDP is faster. TCP is used for websites and UDP is used for streaming. They both send data over the internet but in different ways. You should pick the one that fits your needs."
}
]
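With `annotation_per_instance: 2`, every item is judged by two annotators, so it is worth tallying overall preference counts and flagging items where the two judgments disagree. The sketch below is a hypothetical post-processing helper: the exact shape of Potato's JSON output varies, so `records` here stands in for the parsed output, assumed to be one dict per annotation carrying the item `id` and the label chosen for the `preference` scheme.

```python
from collections import Counter, defaultdict

def tally_preferences(records):
    """Count pairwise-preference labels and list items with annotator disagreement.

    Each record is assumed to look like
    {"id": "wildbench_001", "preference": "Model A Better"}.
    """
    totals = Counter(r["preference"] for r in records)
    by_item = defaultdict(list)
    for r in records:
        by_item[r["id"]].append(r["preference"])
    # An item "disagrees" when its two annotators chose different labels.
    disagreements = [item_id for item_id, labels in by_item.items()
                     if len(set(labels)) > 1]
    return totals, disagreements
```

Items flagged in `disagreements` are natural candidates for a third adjudicating annotation or a review pass.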
// ... and 8 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/wildbench-llm-eval
potato start config.yaml
Related Designs
MT-Bench: LLM Response Quality Evaluation
LLM response quality evaluation using pairwise comparison and absolute scoring. Annotators compare pairs of LLM responses and rate individual responses on a 1-10 scale across multiple turns.
Prometheus: Rubric-based LLM Evaluation
Fine-grained rubric-based evaluation of LLM outputs. Annotators score responses against detailed rubrics (1-5 scale) with specific criteria for each score level, and provide written feedback.
Unlearning Sensitive Content from LLMs
Evaluation of whether language models have successfully unlearned sensitive content, requiring annotators to assess model outputs for residual sensitive information leakage. Based on SemEval-2025 Task 4.