Arena Hard Auto - LLM Pairwise Evaluation
Pairwise evaluation of LLM responses on challenging prompts from the Arena Hard benchmark (Li et al., arXiv 2024). Annotators compare two responses on a continuous scale and rate question difficulty.
Configuration File: config.yaml
# Arena Hard Auto - LLM Pairwise Evaluation
# Based on Li et al., arXiv 2024
# Paper: https://arxiv.org/abs/2406.11939
# Dataset: https://github.com/lm-sys/arena-hard-auto
#
# Pairwise evaluation of LLM responses on challenging prompts from the
# Arena Hard benchmark. Uses a continuous scale for nuanced preference
# judgments rather than simple binary choices. Annotators also rate the
# difficulty of each question.
#
# Pairwise Scale:
# - -3: Response A is much better
# - 0: Tie (both responses are equally good)
# - 3: Response B is much better
#
# Difficulty Rating:
# - 1 (Very Easy) to 5 (Very Hard)
#
# Annotation Guidelines:
# 1. Read the prompt carefully
# 2. Read both responses thoroughly
# 3. Use the scale to indicate your preference strength
# 4. Rate how difficult the prompt is to answer well
annotation_task_name: "Arena Hard Auto - LLM Pairwise Evaluation"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  # Step 1: Pairwise preference on a scale
  - annotation_type: pairwise
    name: preference
    description: "Compare the two responses. Use the scale to indicate how much better one is than the other."
    mode: "scale"
    scale:
      min: -3
      max: 3
      step: 1
      labels:
        "-3": "A much better"
        "0": "Tie"
        "3": "B much better"
  # Step 2: Difficulty rating
  - annotation_type: likert
    name: difficulty
    description: "How difficult is this prompt to answer well?"
    min_label: "Very Easy"
    max_label: "Very Hard"
    size: 5
annotation_instructions: |
  You will evaluate pairs of LLM responses on challenging prompts from the Arena Hard benchmark.
  For each item:
  1. Read the user prompt carefully to understand what is being asked.
  2. Read both Response A and Response B thoroughly.
  3. Use the continuous scale (-3 to +3) to indicate your preference:
     - -3: Response A is much better
     - -2: Response A is better
     - -1: Response A is slightly better
     -  0: Tie (both are equally good or equally bad)
     - +1: Response B is slightly better
     - +2: Response B is better
     - +3: Response B is much better
  4. Rate how difficult the prompt is to answer well (1 = Very Easy, 5 = Very Hard).
  Consider: accuracy, completeness, clarity, helpfulness, and relevance.
html_layout: |
  <div style="padding: 15px; max-width: 900px; margin: auto;">
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Prompt:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="display: flex; gap: 16px;">
      <div style="flex: 1; background: #e3f2fd; border: 2px solid #1976d2; border-radius: 8px; padding: 16px;">
        <h4 style="margin-top: 0; color: #1976d2;">Response A:</h4>
        <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6;">{{response_a}}</div>
      </div>
      <div style="flex: 1; background: #fce4ec; border: 2px solid #c62828; border-radius: 8px; padding: 16px;">
        <h4 style="margin-top: 0; color: #c62828;">Response B:</h4>
        <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6;">{{response_b}}</div>
      </div>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
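Once annotations are collected, the -3 to +3 preference scores can be rolled up into per-item averages and an overall A/B/tie tally. The sketch below is a minimal example of that aggregation; it assumes each annotation record is a dict with `id`, `preference`, and `difficulty` fields, which may differ from the exact JSON schema your potato version writes to `annotation_output/`, so adapt the field names as needed.

```python
from collections import defaultdict

def summarize(annotations):
    """Return (per-item mean preference, overall A/B/tie tally)."""
    by_item = defaultdict(list)
    for record in annotations:
        by_item[record["id"]].append(record["preference"])

    summary = {}
    tally = {"A": 0, "B": 0, "tie": 0}
    for item_id, scores in by_item.items():
        avg = sum(scores) / len(scores)
        summary[item_id] = avg
        # Negative scores favor Response A, positive favor Response B.
        if avg < 0:
            tally["A"] += 1
        elif avg > 0:
            tally["B"] += 1
        else:
            tally["tie"] += 1
    return summary, tally

# Hypothetical records from three annotators on item 1, one on item 2:
annotations = [
    {"id": "arena_hard_001", "preference": -2, "difficulty": 3},
    {"id": "arena_hard_001", "preference": -1, "difficulty": 4},
    {"id": "arena_hard_001", "preference": 0, "difficulty": 3},
    {"id": "arena_hard_002", "preference": 2, "difficulty": 2},
]
summary, tally = summarize(annotations)
print(summary)  # {'arena_hard_001': -1.0, 'arena_hard_002': 2.0}
print(tally)    # {'A': 1, 'B': 1, 'tie': 0}
```

With `annotation_per_instance: 3`, each item should yield three preference scores; averaging them before tallying keeps one noisy annotator from flipping an item's outcome on their own.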
Sample Data: sample-data.json
[
{
"id": "arena_hard_001",
"text": "Explain the difference between P and NP complexity classes. Can you give an intuitive example that a non-computer scientist would understand?",
"response_a": "P and NP are complexity classes in computer science that categorize problems by how hard they are to solve.\n\nP (Polynomial time): These are problems that can be solved efficiently. Think of sorting a deck of cards - there are well-known methods that work reasonably fast even for large decks.\n\nNP (Nondeterministic Polynomial time): These are problems where, if someone gives you a proposed solution, you can verify it quickly, but finding the solution from scratch might take an extremely long time.\n\nIntuitive example: Imagine a jigsaw puzzle. Checking if a completed puzzle is correct (verifying) is easy - just look at it. But actually solving a 10,000-piece puzzle from scratch takes much longer. The puzzle is in NP because verification is easy, but it might not be in P because solving it is hard.\n\nThe famous P vs NP question asks: Is every problem whose solution can be quickly verified also quickly solvable? Most computer scientists believe the answer is no, but nobody has proven it yet.",
"response_b": "P and NP are about how fast computers can solve problems.\n\nP means problems computers can solve quickly. NP means problems where you can check answers quickly but maybe can't find answers quickly.\n\nLike a Sudoku puzzle - checking if a filled-in Sudoku is correct is easy, but solving one from scratch is harder. Nobody knows if there's a fast way to solve all NP problems, and there's a million-dollar prize for proving it either way."
},
{
"id": "arena_hard_002",
"text": "Write a Python function to find all prime factors of a given number, and then analyze its time complexity.",
"response_a": "```python\ndef prime_factors(n):\n factors = []\n d = 2\n while d * d <= n:\n while n % d == 0:\n factors.append(d)\n n //= d\n d += 1\n if n > 1:\n factors.append(n)\n return factors\n```\n\nTime complexity analysis:\n- The outer loop runs up to sqrt(n) times, as we only check divisors up to sqrt(n)\n- The inner loop divides n by d each time, reducing n\n- Overall: O(sqrt(n)) in the worst case (when n is prime)\n- Space complexity: O(log n) for storing factors, since the number of prime factors is at most log2(n)\n\nExample: prime_factors(84) returns [2, 2, 3, 7]",
"response_b": "Here's a solution:\n\n```python\ndef prime_factors(n):\n factors = []\n # Check for factor 2\n while n % 2 == 0:\n factors.append(2)\n n //= 2\n # Check odd factors from 3 to sqrt(n)\n for i in range(3, int(n**0.5) + 1, 2):\n while n % i == 0:\n factors.append(i)\n n //= i\n if n > 1:\n factors.append(n)\n return factors\n```\n\nTime Complexity: O(sqrt(n))\n- We handle 2 separately, then only check odd numbers\n- This roughly halves the iterations compared to checking every number\n- The while loop inside reduces n, so in practice it's often faster than sqrt(n)\n- Best case: O(log n) when n is a power of 2\n- Worst case: O(sqrt(n)) when n is prime\n\nSpace: O(log n) for the factors list, since n has at most log2(n) prime factors.\n\nThe optimization of checking 2 first and then only odd numbers gives a constant factor improvement of ~2x over the naive approach."
}
]
// ... and 8 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/arena-hard-auto
potato start config.yaml
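Before launching, it can help to confirm that every item in sample-data.json carries the keys the design relies on: `id` and `text` from `item_properties`, plus `response_a` and `response_b`, which the `html_layout` template interpolates. The check below is a minimal sketch of that validation, not part of the design itself:

```python
# Keys assumed by this design: "id"/"text" come from item_properties,
# "response_a"/"response_b" are interpolated by the html_layout template.
REQUIRED_KEYS = {"id", "text", "response_a", "response_b"}

def check_items(items):
    """Return (item_id, missing_keys) pairs; an empty list means all items are OK."""
    problems = []
    for i, item in enumerate(items):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append((item.get("id", f"index {i}"), sorted(missing)))
    return problems

# Typical use against the real file:
#   import json
#   with open("sample-data.json") as f:
#       print(check_items(json.load(f)))
good = {"id": "arena_hard_001", "text": "...", "response_a": "...", "response_b": "..."}
bad = {"id": "broken_item", "text": "..."}
print(check_items([good, bad]))  # [('broken_item', ['response_a', 'response_b'])]
```

Running this once before `potato start` catches missing fields early, which otherwise surface as blank panels in the annotation interface.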
Related Designs
MT-Bench Judge Consistency Evaluation
Multi-turn conversation evaluation for LLM judge consistency, based on MT-Bench (Zheng et al., NeurIPS 2023). Annotators compare two assistant responses in a pairwise setting, rate overall quality on a 1-10 Likert scale, and classify the conversation category.
Chatbot Arena - Pairwise Comparison with Best-Worst Scaling
Pairwise comparison and best-worst scaling of chatbot responses, based on the Chatbot Arena framework (Zheng et al., ICML 2024). Annotators compare pairs of LLM-generated responses and rank sets of responses using best-worst scaling methodology.
MT-Bench: LLM Response Quality Evaluation
LLM response quality evaluation using pairwise comparison and absolute scoring. Annotators compare pairs of LLM responses and rate individual responses on a 1-10 scale across multiple turns.