Arena Hard Auto - LLM Pairwise Evaluation
Pairwise evaluation of LLM responses on challenging prompts from the Arena Hard benchmark (Li et al., arXiv 2024). Annotators compare two responses on a continuous scale and rate question difficulty.
Configuration File: config.yaml
# Arena Hard Auto - LLM Pairwise Evaluation
# Based on Li et al., arXiv 2024
# Paper: https://arxiv.org/abs/2406.11939
# Dataset: https://github.com/lm-sys/arena-hard-auto
#
# Pairwise evaluation of LLM responses on challenging prompts from the
# Arena Hard benchmark. Uses a continuous scale for nuanced preference
# judgments rather than simple binary choices. Annotators also rate the
# difficulty of each question.
#
# Pairwise Scale:
# - -3: Response A is much better
# - 0: Tie (both responses are equally good)
# - 3: Response B is much better
#
# Difficulty Rating:
# - 1 (Very Easy) to 5 (Very Hard)
#
# Annotation Guidelines:
# 1. Read the prompt carefully
# 2. Read both responses thoroughly
# 3. Use the scale to indicate your preference strength
# 4. Rate how difficult the prompt is to answer well
annotation_task_name: "Arena Hard Auto - LLM Pairwise Evaluation"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  # Step 1: Pairwise preference on a scale
  - annotation_type: pairwise
    name: preference
    description: "Compare the two responses. Use the scale to indicate how much better one is than the other."
    mode: "scale"
    scale:
      min: -3
      max: 3
      step: 1
      labels:
        "-3": "A much better"
        "0": "Tie"
        "3": "B much better"
  # Step 2: Difficulty rating
  - annotation_type: likert
    name: difficulty
    description: "How difficult is this prompt to answer well?"
    min_label: "Very Easy"
    max_label: "Very Hard"
    size: 5
annotation_instructions: |
  You will evaluate pairs of LLM responses on challenging prompts from the Arena Hard benchmark.
  For each item:
  1. Read the user prompt carefully to understand what is being asked.
  2. Read both Response A and Response B thoroughly.
  3. Use the continuous scale (-3 to +3) to indicate your preference:
     - -3: Response A is much better
     - -2: Response A is better
     - -1: Response A is slightly better
     -  0: Tie (both are equally good or equally bad)
     - +1: Response B is slightly better
     - +2: Response B is better
     - +3: Response B is much better
  4. Rate how difficult the prompt is to answer well (1 = Very Easy, 5 = Very Hard).
  Consider: accuracy, completeness, clarity, helpfulness, and relevance.
html_layout: |
  <div style="padding: 15px; max-width: 900px; margin: auto;">
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Prompt:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="display: flex; gap: 16px;">
      <div style="flex: 1; background: #e3f2fd; border: 2px solid #1976d2; border-radius: 8px; padding: 16px;">
        <h4 style="margin-top: 0; color: #1976d2;">Response A:</h4>
        <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6;">{{response_a}}</div>
      </div>
      <div style="flex: 1; background: #fce4ec; border: 2px solid #c62828; border-radius: 8px; padding: 16px;">
        <h4 style="margin-top: 0; color: #c62828;">Response B:</h4>
        <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6;">{{response_b}}</div>
      </div>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
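Once annotations are collected, the -3 to +3 preference scores can be rolled up into per-item averages and an overall A/B/tie tally. The sketch below is a minimal example of that aggregation; it assumes each annotation record is a dict with `id`, `preference`, and `difficulty` fields, which may differ from the exact JSON schema your potato version writes to `annotation_output/`, so adapt the field names as needed.

```python
from collections import defaultdict

def summarize(annotations):
    """Return (per-item mean preference, overall A/B/tie tally)."""
    by_item = defaultdict(list)
    for record in annotations:
        by_item[record["id"]].append(record["preference"])

    summary = {}
    tally = {"A": 0, "B": 0, "tie": 0}
    for item_id, scores in by_item.items():
        avg = sum(scores) / len(scores)
        summary[item_id] = avg
        # Negative scores favor Response A, positive favor Response B.
        if avg < 0:
            tally["A"] += 1
        elif avg > 0:
            tally["B"] += 1
        else:
            tally["tie"] += 1
    return summary, tally

# Hypothetical records from three annotators on item 1, one on item 2:
annotations = [
    {"id": "arena_hard_001", "preference": -2, "difficulty": 3},
    {"id": "arena_hard_001", "preference": -1, "difficulty": 4},
    {"id": "arena_hard_001", "preference": 0, "difficulty": 3},
    {"id": "arena_hard_002", "preference": 2, "difficulty": 2},
]
summary, tally = summarize(annotations)
print(summary)  # {'arena_hard_001': -1.0, 'arena_hard_002': 2.0}
print(tally)    # {'A': 1, 'B': 1, 'tie': 0}
```

With `annotation_per_instance: 3`, each item should yield three preference scores; averaging them before tallying keeps one noisy annotator from flipping an item's outcome on their own.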
Sample Data: sample-data.json
[
{
"id": "arena_hard_001",
"text": "Explain the difference between P and NP complexity classes. Can you give an intuitive example that a non-computer scientist would understand?",
"response_a": "P and NP are complexity classes in computer science that categorize problems by how hard they are to solve.\n\nP (Polynomial time): These are problems that can be solved efficiently. Think of sorting a deck of cards - there are well-known methods that work reasonably fast even for large decks.\n\nNP (Nondeterministic Polynomial time): These are problems where, if someone gives you a proposed solution, you can verify it quickly, but finding the solution from scratch might take an extremely long time.\n\nIntuitive example: Imagine a jigsaw puzzle. Checking if a completed puzzle is correct (verifying) is easy - just look at it. But actually solving a 10,000-piece puzzle from scratch takes much longer. The puzzle is in NP because verification is easy, but it might not be in P because solving it is hard.\n\nThe famous P vs NP question asks: Is every problem whose solution can be quickly verified also quickly solvable? Most computer scientists believe the answer is no, but nobody has proven it yet.",
"response_b": "P and NP are about how fast computers can solve problems.\n\nP means problems computers can solve quickly. NP means problems where you can check answers quickly but maybe can't find answers quickly.\n\nLike a Sudoku puzzle - checking if a filled-in Sudoku is correct is easy, but solving one from scratch is harder. Nobody knows if there's a fast way to solve all NP problems, and there's a million-dollar prize for proving it either way."
},
{
"id": "arena_hard_002",
"text": "Write a Python function to find all prime factors of a given number, and then analyze its time complexity.",
"response_a": "```python\ndef prime_factors(n):\n factors = []\n d = 2\n while d * d <= n:\n while n % d == 0:\n factors.append(d)\n n //= d\n d += 1\n if n > 1:\n factors.append(n)\n return factors\n```\n\nTime complexity analysis:\n- The outer loop runs up to sqrt(n) times, as we only check divisors up to sqrt(n)\n- The inner loop divides n by d each time, reducing n\n- Overall: O(sqrt(n)) in the worst case (when n is prime)\n- Space complexity: O(log n) for storing factors, since the number of prime factors is at most log2(n)\n\nExample: prime_factors(84) returns [2, 2, 3, 7]",
"response_b": "Here's a solution:\n\n```python\ndef prime_factors(n):\n factors = []\n # Check for factor 2\n while n % 2 == 0:\n factors.append(2)\n n //= 2\n # Check odd factors from 3 to sqrt(n)\n for i in range(3, int(n**0.5) + 1, 2):\n while n % i == 0:\n factors.append(i)\n n //= i\n if n > 1:\n factors.append(n)\n return factors\n```\n\nTime Complexity: O(sqrt(n))\n- We handle 2 separately, then only check odd numbers\n- This roughly halves the iterations compared to checking every number\n- The while loop inside reduces n, so in practice it's often faster than sqrt(n)\n- Best case: O(log n) when n is a power of 2\n- Worst case: O(sqrt(n)) when n is prime\n\nSpace: O(log n) for the factors list, since n has at most log2(n) prime factors.\n\nThe optimization of checking 2 first and then only odd numbers gives a constant factor improvement of ~2x over the naive approach."
}
]
// ... and 8 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/arena-hard-auto
potato start config.yaml
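Before launching, it can help to confirm that every item in sample-data.json carries the keys the design relies on: `id` and `text` from `item_properties`, plus `response_a` and `response_b`, which the `html_layout` template interpolates. The check below is a minimal sketch of that validation, not part of the design itself:

```python
# Keys assumed by this design: "id"/"text" come from item_properties,
# "response_a"/"response_b" are interpolated by the html_layout template.
REQUIRED_KEYS = {"id", "text", "response_a", "response_b"}

def check_items(items):
    """Return (item_id, missing_keys) pairs; an empty list means all items are OK."""
    problems = []
    for i, item in enumerate(items):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append((item.get("id", f"index {i}"), sorted(missing)))
    return problems

# Typical use against the real file:
#   import json
#   with open("sample-data.json") as f:
#       print(check_items(json.load(f)))
good = {"id": "arena_hard_001", "text": "...", "response_a": "...", "response_b": "..."}
bad = {"id": "broken_item", "text": "..."}
print(check_items([good, bad]))  # [('broken_item', ['response_a', 'response_b'])]
```

Running this once before `potato start` catches missing fields early, which otherwise surface as blank panels in the annotation interface.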
Related Designs
MT-Bench Judge Consistency Evaluation
Multi-turn conversation evaluation for LLM judge consistency, based on MT-Bench (Zheng et al., NeurIPS 2023). Annotators compare two assistant responses in a pairwise setting, rate overall quality on a 1-10 Likert scale, and classify the conversation category.
Chatbot Arena - Pairwise Comparison with Best-Worst Scaling
Pairwise comparison and best-worst scaling of chatbot responses, based on the Chatbot Arena framework (Zheng et al., ICML 2024). Annotators compare pairs of LLM-generated responses and rank sets of responses using best-worst scaling methodology.
MT-Bench: LLM Response Quality Evaluation
LLM response quality evaluation using pairwise comparison and absolute scoring. Annotators compare pairs of LLM responses and rate individual responses on a 1-10 scale across multiple turns.