Chatbot Arena - Pairwise Comparison with Best-Worst Scaling
Pairwise comparison and best-worst scaling of chatbot responses, based on the Chatbot Arena framework (Zheng et al., ICML 2024). Annotators compare pairs of LLM-generated responses and rank sets of responses using best-worst scaling methodology.
Configuration File: config.yaml
# Chatbot Arena - Pairwise Comparison with Best-Worst Scaling
# Based on Zheng et al., ICML 2024
# Paper: https://arxiv.org/abs/2403.04132
# Dataset: https://huggingface.co/datasets/lmsys/chatbot_arena_conversations
#
# This task combines two complementary evaluation approaches:
# 1. Pairwise comparison: Directly compare two chatbot responses
# 2. Best-Worst Scaling: Rank responses from a tuple of 4 items
#
# Evaluation criteria:
# - Helpfulness: Does the response address the user's request?
# - Accuracy: Is the information factually correct?
# - Coherence: Is the response well-organized and fluent?
# - Safety: Does the response avoid harmful content?
annotation_task_name: "Chatbot Arena: Pairwise Comparison with BWS"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  - annotation_type: pairwise
    name: response_preference
    description: "Which response better addresses the user prompt?"
    mode: "binary"
    labels:
      - "Response A"
      - "Response B"
      - "Tie"
  - annotation_type: bws
    name: response_quality_ranking
    description: "Best-worst scaling: from a set of responses, select the best and worst. This enables efficient ranking of multiple LLM outputs."
    tuple_size: 4
annotation_instructions: |
  You will evaluate chatbot responses to user prompts.

  For pairwise comparison:
  1. Read the user prompt carefully.
  2. Read both Response A and Response B.
  3. Select which response is better, or "Tie" if they are equally good.
  4. Consider helpfulness, accuracy, coherence, and safety.

  For best-worst scaling:
  1. You will see a set of 4 responses.
  2. Select the BEST and WORST response from the set.
  3. Focus on overall quality and helpfulness.
html_layout: |
  <div style="padding: 15px; max-width: 900px; margin: auto;">
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">User Prompt:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 16px;">
      <div style="background: #fef3c7; border: 1px solid #fde68a; border-radius: 8px; padding: 16px;">
        <strong style="color: #a16207;">Response A:</strong>
        <p style="font-size: 15px; line-height: 1.6; margin: 8px 0 0 0;">{{response_a}}</p>
      </div>
      <div style="background: #e0f2fe; border: 1px solid #7dd3fc; border-radius: 8px; padding: 16px;">
        <strong style="color: #0369a1;">Response B:</strong>
        <p style="font-size: 15px; line-height: 1.6; margin: 8px 0 0 0;">{{response_b}}</p>
      </div>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 100
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
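Once annotations are collected, the best-worst scaling judgments can be aggregated with a simple count-based score: for each response, the number of times it was picked as best, minus the number of times it was picked as worst, divided by how often it appeared in a tuple. A minimal sketch (the record keys `tuple`, `best`, and `worst` are hypothetical stand-ins for whatever the annotation output actually contains):

```python
from collections import defaultdict

def bws_scores(annotations):
    """Count-based BWS: score = (#best - #worst) / #appearances.

    Each annotation is a dict with hypothetical keys:
      "tuple": list of response ids shown together,
      "best":  id selected as best,
      "worst": id selected as worst.
    Scores fall in [-1, 1]; higher is better.
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    seen = defaultdict(int)
    for ann in annotations:
        for item in ann["tuple"]:
            seen[item] += 1
        best[ann["best"]] += 1
        worst[ann["worst"]] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}

# Toy example: rank four responses from two judgments
anns = [
    {"tuple": ["a", "b", "c", "d"], "best": "a", "worst": "d"},
    {"tuple": ["a", "b", "c", "d"], "best": "a", "worst": "c"},
]
scores = bws_scores(anns)
ranking = sorted(scores, key=scores.get, reverse=True)
```

This count-based estimator is the standard lightweight way to turn best-worst tuples into a ranking; more elaborate options (e.g. fitting a latent-utility model) exist but are rarely needed at annotation-review time.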
Sample Data: sample-data.json
[
{
"id": "arena_001",
"text": "Explain the difference between machine learning and deep learning in simple terms.",
"response_a": "Machine learning is a subset of AI where computers learn patterns from data. It includes techniques like decision trees, SVMs, and linear regression. Deep learning is a subset of machine learning that uses neural networks with many layers (hence 'deep'). These deep neural networks can automatically learn hierarchical features from raw data, making them especially powerful for tasks like image recognition and natural language processing.",
"response_b": "Think of it like this: machine learning is like teaching a computer to recognize patterns using rules and examples. Deep learning takes this further by using brain-inspired networks that can figure out the important features on their own. For example, to recognize cats in photos, traditional ML needs you to define features (ear shape, whiskers), while deep learning figures out those features automatically from thousands of cat pictures."
},
{
"id": "arena_002",
"text": "Write a short poem about the ocean at sunset.",
"response_a": "Golden light upon the waves,\nThe sun descends in amber haze,\nSalt-kissed air and seabirds' call,\nAs evening's curtain starts to fall.\nThe horizon burns with crimson fire,\nWhile tides sing soft, a liquid choir.",
"response_b": "The ocean blushes pink and gold\nAs daylight's final tale is told.\nWaves reach up to touch the sun\nBefore the day is finally done.\nStarfish dream beneath the foam\nWhile seagulls slowly glide toward home.\nThe sky, a watercolor spill,\nReflects on waters, calm and still."
}
]
// ... and 8 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/chatbot-arena-pairwise-bws
potato start config.yaml
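The pairwise judgments this task collects are the raw material for Chatbot Arena-style model ratings. The paper fits Bradley-Terry ratings in batch; an online Elo update is a simpler illustrative stand-in that conveys the same idea of turning win/loss/tie outcomes into scores:

```python
def elo_update(r_a, r_b, outcome, k=32):
    """One online Elo update for a single pairwise comparison.

    outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    Illustrative sketch only -- the Chatbot Arena paper fits
    Bradley-Terry ratings over all comparisons in batch.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (outcome - expected_a)
    r_b_new = r_b + k * ((1.0 - outcome) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Two models start at equal ratings; model A wins one comparison
ra, rb = elo_update(1000.0, 1000.0, 1.0)
```

With equal starting ratings, a single win moves each rating by `k / 2` in opposite directions; repeated over many annotator judgments, the ratings converge toward the models' relative strengths.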
Related Designs
Arena Hard Auto - LLM Pairwise Evaluation
Pairwise evaluation of LLM responses on challenging prompts from the Arena Hard benchmark (Li et al., arXiv 2024). Annotators compare two responses on a continuous scale and rate question difficulty.
MT-Bench Judge Consistency Evaluation
Multi-turn conversation evaluation for LLM judge consistency, based on MT-Bench (Zheng et al., NeurIPS 2023). Annotators compare two assistant responses in a pairwise setting, rate overall quality on a 1-10 Likert scale, and classify the conversation category.
AlpacaEval: Instruction-Following Preference Evaluation
Pairwise preference annotation for instruction-following language models. Annotators compare two model responses side by side, select their preferred response, indicate preference strength, and rate individual response quality across diverse instruction categories.