Chatbot Arena - Pairwise Comparison with Best-Worst Scaling
Pairwise comparison and best-worst scaling of chatbot responses, based on the Chatbot Arena framework (Zheng et al., ICML 2024). Annotators compare pairs of LLM-generated responses and rank sets of responses using best-worst scaling methodology.
Configuration File: config.yaml
# Chatbot Arena - Pairwise Comparison with Best-Worst Scaling
# Based on Zheng et al., ICML 2024
# Paper: https://arxiv.org/abs/2403.04132
# Dataset: https://huggingface.co/datasets/lmsys/chatbot_arena_conversations
#
# This task combines two complementary evaluation approaches:
# 1. Pairwise comparison: Directly compare two chatbot responses
# 2. Best-Worst Scaling: Rank responses from a tuple of 4 items
#
# Evaluation criteria:
# - Helpfulness: Does the response address the user's request?
# - Accuracy: Is the information factually correct?
# - Coherence: Is the response well-organized and fluent?
# - Safety: Does the response avoid harmful content?
annotation_task_name: "Chatbot Arena: Pairwise Comparison with BWS"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  - annotation_type: pairwise
    name: response_preference
    description: "Which response better addresses the user prompt?"
    mode: "binary"
    labels:
      - "Response A"
      - "Response B"
      - "Tie"
  - annotation_type: bws
    name: response_quality_ranking
    description: "Best-worst scaling: from a set of responses, select the best and worst. This enables efficient ranking of multiple LLM outputs."
    tuple_size: 4
annotation_instructions: |
  You will evaluate chatbot responses to user prompts.

  For pairwise comparison:
  1. Read the user prompt carefully.
  2. Read both Response A and Response B.
  3. Select which response is better, or "Tie" if they are equally good.
  4. Consider helpfulness, accuracy, coherence, and safety.

  For best-worst scaling:
  1. You will see a set of 4 responses.
  2. Select the BEST and WORST response from the set.
  3. Focus on overall quality and helpfulness.
html_layout: |
  <div style="padding: 15px; max-width: 900px; margin: auto;">
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">User Prompt:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 16px;">
      <div style="background: #fef3c7; border: 1px solid #fde68a; border-radius: 8px; padding: 16px;">
        <strong style="color: #a16207;">Response A:</strong>
        <p style="font-size: 15px; line-height: 1.6; margin: 8px 0 0 0;">{{response_a}}</p>
      </div>
      <div style="background: #e0f2fe; border: 1px solid #7dd3fc; border-radius: 8px; padding: 16px;">
        <strong style="color: #0369a1;">Response B:</strong>
        <p style="font-size: 15px; line-height: 1.6; margin: 8px 0 0 0;">{{response_b}}</p>
      </div>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 100
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
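Once annotations are collected, the best-worst scaling judgments can be aggregated with a simple count-based score: for each response, the number of times it was picked as best, minus the number of times it was picked as worst, divided by how often it appeared in a tuple. A minimal sketch (the record keys `tuple`, `best`, and `worst` are hypothetical stand-ins for whatever the annotation output actually contains):

```python
from collections import defaultdict

def bws_scores(annotations):
    """Count-based BWS: score = (#best - #worst) / #appearances.

    Each annotation is a dict with hypothetical keys:
      "tuple": list of response ids shown together,
      "best":  id selected as best,
      "worst": id selected as worst.
    Scores fall in [-1, 1]; higher is better.
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    seen = defaultdict(int)
    for ann in annotations:
        for item in ann["tuple"]:
            seen[item] += 1
        best[ann["best"]] += 1
        worst[ann["worst"]] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}

# Toy example: rank four responses from two judgments
anns = [
    {"tuple": ["a", "b", "c", "d"], "best": "a", "worst": "d"},
    {"tuple": ["a", "b", "c", "d"], "best": "a", "worst": "c"},
]
scores = bws_scores(anns)
ranking = sorted(scores, key=scores.get, reverse=True)
```

This count-based estimator is the standard lightweight way to turn best-worst tuples into a ranking; more elaborate options (e.g. fitting a latent-utility model) exist but are rarely needed at annotation-review time.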
Sample Data: sample-data.json
[
{
"id": "arena_001",
"text": "Explain the difference between machine learning and deep learning in simple terms.",
"response_a": "Machine learning is a subset of AI where computers learn patterns from data. It includes techniques like decision trees, SVMs, and linear regression. Deep learning is a subset of machine learning that uses neural networks with many layers (hence 'deep'). These deep neural networks can automatically learn hierarchical features from raw data, making them especially powerful for tasks like image recognition and natural language processing.",
"response_b": "Think of it like this: machine learning is like teaching a computer to recognize patterns using rules and examples. Deep learning takes this further by using brain-inspired networks that can figure out the important features on their own. For example, to recognize cats in photos, traditional ML needs you to define features (ear shape, whiskers), while deep learning figures out those features automatically from thousands of cat pictures."
},
{
"id": "arena_002",
"text": "Write a short poem about the ocean at sunset.",
"response_a": "Golden light upon the waves,\nThe sun descends in amber haze,\nSalt-kissed air and seabirds' call,\nAs evening's curtain starts to fall.\nThe horizon burns with crimson fire,\nWhile tides sing soft, a liquid choir.",
"response_b": "The ocean blushes pink and gold\nAs daylight's final tale is told.\nWaves reach up to touch the sun\nBefore the day is finally done.\nStarfish dream beneath the foam\nWhile seagulls slowly glide toward home.\nThe sky, a watercolor spill,\nReflects on waters, calm and still."
}
]
// ... and 8 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/chatbot-arena-pairwise-bws
potato start config.yaml
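The pairwise judgments this task collects are the raw material for Chatbot Arena-style model ratings. The paper fits Bradley-Terry ratings in batch; an online Elo update is a simpler illustrative stand-in that conveys the same idea of turning win/loss/tie outcomes into scores:

```python
def elo_update(r_a, r_b, outcome, k=32):
    """One online Elo update for a single pairwise comparison.

    outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    Illustrative sketch only -- the Chatbot Arena paper fits
    Bradley-Terry ratings over all comparisons in batch.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (outcome - expected_a)
    r_b_new = r_b + k * ((1.0 - outcome) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Two models start at equal ratings; model A wins one comparison
ra, rb = elo_update(1000.0, 1000.0, 1.0)
```

With equal starting ratings, a single win moves each rating by `k / 2` in opposite directions; repeated over many annotator judgments, the ratings converge toward the models' relative strengths.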
Related Designs
Arena Hard Auto - LLM Pairwise Evaluation
Pairwise evaluation of LLM responses on challenging prompts from the Arena Hard benchmark (Li et al., arXiv 2024). Annotators compare two responses on a continuous scale and rate question difficulty.
MT-Bench Judge Consistency Evaluation
Multi-turn conversation evaluation for LLM judge consistency, based on MT-Bench (Zheng et al., NeurIPS 2023). Annotators compare two assistant responses in a pairwise setting, rate overall quality on a 1-10 Likert scale, and classify the conversation category.
AlpacaEval: Instruction-Following Preference Evaluation
Pairwise preference annotation for instruction-following language models. Annotators compare two model responses side by side, select their preferred response, indicate preference strength, and rate individual response quality across diverse instruction categories.