WildBench - LLM Evaluation on Real-World Tasks
Evaluation of LLM outputs on challenging real-world user queries from WildBench. Annotators compare two model responses via pairwise preference, rate overall quality on a Likert scale, and provide reasoning for their judgments.
Configuration File: config.yaml
# WildBench - LLM Evaluation on Real-World Tasks
# Based on Lin et al., COLM 2024
# Paper: https://arxiv.org/abs/2406.04770
# Dataset: https://huggingface.co/datasets/allenai/WildBench
#
# This task evaluates LLM outputs on challenging real-world user queries.
# Annotators compare two model responses via pairwise preference, rate
# overall quality on a Likert scale, and provide written reasoning.
#
# Evaluation Criteria:
# - Helpfulness: Does the response address the user's needs?
# - Accuracy: Is the information correct and reliable?
# - Depth: Does the response provide sufficient detail?
# - Clarity: Is the response well-organized and easy to follow?
#
# Annotation Guidelines:
# 1. Read the user query carefully to understand the intent
# 2. Read both model responses thoroughly
# 3. Select which response is better overall, or mark as tie
# 4. Rate the overall quality of the better response on a 1-5 scale
# 5. Provide a brief explanation of your reasoning
annotation_task_name: "WildBench LLM Evaluation"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  # Step 1: Pairwise comparison
  - annotation_type: pairwise
    name: preference
    description: "Which model response better addresses the user query?"
    mode: "binary"
    labels:
      - "Model A Better"
      - "Model B Better"
      - "Tie"
    keyboard_shortcuts:
      "Model A Better": "a"
      "Model B Better": "b"
      "Tie": "t"
    tooltips:
      "Model A Better": "Response A is clearly superior in addressing the query"
      "Model B Better": "Response B is clearly superior in addressing the query"
      "Tie": "Both responses are roughly equal in quality"
  # Step 2: Overall quality rating
  - annotation_type: likert
    name: overall_quality
    description: "Rate the overall quality of the responses on a 1-5 scale."
    min_label: "Very Poor"
    max_label: "Excellent"
    size: 5
  # Step 3: Written reasoning
  - annotation_type: text
    name: reasoning
    description: "Briefly explain why you preferred one response over the other."
    textarea: true
    required: false
    placeholder: "What factors influenced your preference judgment?"
annotation_instructions: |
  You will evaluate pairs of LLM responses to real-world user queries from WildBench.
  For each item:
  1. Read the user query carefully.
  2. Read both model responses (A and B) thoroughly.
  3. Select which response is better, or mark as Tie.
  4. Rate the overall quality on a 1-5 scale.
  5. Explain your reasoning briefly.
  Consider: helpfulness, accuracy, depth, clarity, and relevance to the query.
html_layout: |
  <div style="padding: 15px; max-width: 900px; margin: auto;">
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">User Query:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="display: flex; gap: 16px;">
      <div style="flex: 1; background: #e3f2fd; border: 2px solid #1976d2; border-radius: 8px; padding: 16px;">
        <h4 style="margin-top: 0; color: #1976d2;">Response A:</h4>
        <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6;">{{response_a}}</div>
      </div>
      <div style="flex: 1; background: #fce4ec; border: 2px solid #c62828; border-radius: 8px; padding: 16px;">
        <h4 style="margin-top: 0; color: #c62828;">Response B:</h4>
        <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6;">{{response_b}}</div>
      </div>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
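The config's `html_layout` references four fields per item: the two declared in `item_properties` (`id`, `text`) plus the `{{response_a}}` and `{{response_b}}` template slots. A quick sanity check before launching Potato can catch items that would render with empty panels. This is a hypothetical helper, not part of Potato itself; it only assumes the data-file layout shown in sample-data.json below.

```python
import json

# Keys the config relies on: id_key/text_key from item_properties,
# plus the {{response_a}}/{{response_b}} slots in html_layout.
REQUIRED_KEYS = {"id", "text", "response_a", "response_b"}

def validate_items(path):
    """Return (index, id, missing_keys) for every item lacking a required field."""
    with open(path) as f:
        items = json.load(f)
    problems = []
    for i, item in enumerate(items):
        missing = REQUIRED_KEYS - set(item)
        if missing:
            problems.append((i, item.get("id", "<no id>"), sorted(missing)))
    return problems
```

Running `validate_items("sample-data.json")` should return an empty list for a well-formed data file; any tuples it returns point at items that would break the side-by-side layout.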
Sample Data: sample-data.json
[
  {
    "id": "wildbench_001",
    "text": "I need to write a resignation letter for my current position as a software engineer. I want to keep it professional but also express gratitude for the opportunities I've had. Can you help me draft one?",
"response_a": "Subject: Resignation - [Your Name]\n\nDear [Manager's Name],\n\nI am writing to formally notify you of my resignation from my position as Software Engineer at [Company], effective [last working day, typically two weeks from now].\n\nI want to express my sincere gratitude for the opportunities I've had during my time here. Working on [specific project] was a highlight of my career, and I've grown significantly as an engineer thanks to the mentorship and collaborative environment.\n\nI am committed to ensuring a smooth transition. I will document my current projects and am happy to help train my replacement during my remaining time.\n\nThank you for everything.\n\nBest regards,\n[Your Name]",
"response_b": "Here's a resignation letter:\n\nDear Boss,\n\nI quit. My last day will be in two weeks.\n\nThanks for the job. I learned some stuff here.\n\nBye,\n[Name]"
},
{
"id": "wildbench_002",
"text": "Explain the difference between TCP and UDP protocols. When would you choose one over the other? Give real examples.",
"response_a": "TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) are both transport layer protocols, but they differ in key ways:\n\n**TCP:**\n- Connection-oriented: establishes a connection before data transfer (3-way handshake)\n- Reliable: guarantees delivery, ordering, and error-checking\n- Slower due to overhead\n- Use cases: web browsing (HTTP/HTTPS), email (SMTP), file transfer (FTP)\n\n**UDP:**\n- Connectionless: sends data without establishing a connection\n- Unreliable: no guarantee of delivery or ordering\n- Faster with less overhead\n- Use cases: video streaming, online gaming, DNS queries, VoIP\n\n**When to choose:**\n- Choose TCP when data integrity is critical (financial transactions, file downloads)\n- Choose UDP when speed matters more than reliability (live streaming, gaming where occasional packet loss is acceptable)",
"response_b": "TCP and UDP are networking protocols. TCP is more reliable and UDP is faster. TCP is used for websites and UDP is used for streaming. They both send data over the internet but in different ways. You should pick the one that fits your needs."
}
]
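With `annotation_per_instance: 2`, every item is judged by two annotators, so it is worth tallying overall preference counts and flagging items where the two judgments disagree. The sketch below is a hypothetical post-processing helper: the exact shape of Potato's JSON output varies, so `records` here stands in for the parsed output, assumed to be one dict per annotation carrying the item `id` and the label chosen for the `preference` scheme.

```python
from collections import Counter, defaultdict

def tally_preferences(records):
    """Count pairwise-preference labels and list items with annotator disagreement.

    Each record is assumed to look like
    {"id": "wildbench_001", "preference": "Model A Better"}.
    """
    totals = Counter(r["preference"] for r in records)
    by_item = defaultdict(list)
    for r in records:
        by_item[r["id"]].append(r["preference"])
    # An item "disagrees" when its two annotators chose different labels.
    disagreements = [item_id for item_id, labels in by_item.items()
                     if len(set(labels)) > 1]
    return totals, disagreements
```

Items flagged in `disagreements` are natural candidates for a third adjudicating annotation or a review pass.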
// ... and 8 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/wildbench-llm-eval
potato start config.yaml
Related Designs
MT-Bench: LLM Response Quality Evaluation
LLM response quality evaluation using pairwise comparison and absolute scoring. Annotators compare pairs of LLM responses and rate individual responses on a 1-10 scale across multiple turns.
Prometheus: Rubric-based LLM Evaluation
Fine-grained rubric-based evaluation of LLM outputs. Annotators score responses against detailed rubrics (1-5 scale) with specific criteria for each score level, and provide written feedback.
Unlearning Sensitive Content from LLMs
Evaluation of whether language models have successfully unlearned sensitive content, requiring annotators to assess model outputs for residual sensitive information leakage. Based on SemEval-2025 Task 4.