intermediate, preference

Chatbot Arena - Pairwise Comparison with Best-Worst Scaling

Pairwise comparison and best-worst scaling (BWS) of chatbot responses, based on the Chatbot Arena framework (Chiang et al., ICML 2024). Annotators compare pairs of LLM-generated responses and, for sets of four responses, select the best and the worst.
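As a rough illustration of how best-worst judgments can be turned into a ranking, the sketch below uses the common counting estimate: an item's score is the number of times it was chosen best, minus the number of times it was chosen worst, divided by the number of tuples it appeared in. The judgment records and the `items`/`best`/`worst` field names are hypothetical and are not the tool's output format.

```python
from collections import Counter

def bws_scores(judgments):
    """Counting-based best-worst scores: (best - worst) / appearances."""
    best, worst, seen = Counter(), Counter(), Counter()
    for j in judgments:
        seen.update(j["items"])   # every item in the tuple was shown once
        best[j["best"]] += 1
        worst[j["worst"]] += 1
    return {item: (best[item] - worst[item]) / n for item, n in seen.items()}

# Hypothetical judgments over one 4-item tuple, seen by three annotators.
judgments = [
    {"items": ["m1", "m2", "m3", "m4"], "best": "m1", "worst": "m4"},
    {"items": ["m1", "m2", "m3", "m4"], "best": "m2", "worst": "m4"},
    {"items": ["m1", "m2", "m3", "m4"], "best": "m1", "worst": "m3"},
]
scores = bws_scores(judgments)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Scores fall between -1 (always worst) and 1 (always best), so they can be compared across items that appeared in different numbers of tuples.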


Configuration File: config.yaml

# Chatbot Arena - Pairwise Comparison with Best-Worst Scaling
# Based on Chiang et al., ICML 2024
# Paper: https://arxiv.org/abs/2403.04132
# Dataset: https://huggingface.co/datasets/lmsys/chatbot_arena_conversations
#
# This task combines two complementary evaluation approaches:
# 1. Pairwise comparison: Directly compare two chatbot responses
# 2. Best-Worst Scaling: Rank responses from a tuple of 4 items
#
# Evaluation criteria:
# - Helpfulness: Does the response address the user's request?
# - Accuracy: Is the information factually correct?
# - Coherence: Is the response well-organized and fluent?
# - Safety: Does the response avoid harmful content?

annotation_task_name: "Chatbot Arena: Pairwise Comparison with BWS"
task_dir: "."

data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

port: 8000
server_name: localhost

annotation_schemes:
  - annotation_type: pairwise
    name: response_preference
    description: "Which response better addresses the user prompt?"
    mode: "binary"
    labels:
      - "Response A"
      - "Response B"
      - "Tie"

  - annotation_type: bws
    name: response_quality_ranking
    description: "Best-worst scaling: From a set of responses, select the best and worst. This enables efficient ranking of multiple LLM outputs."
    tuple_size: 4

annotation_instructions: |
  You will evaluate chatbot responses to user prompts.

  For pairwise comparison:
  1. Read the user prompt carefully.
  2. Read both Response A and Response B.
  3. Select which response is better, or "Tie" if they are equally good.
  4. Consider helpfulness, accuracy, coherence, and safety.

  For best-worst scaling:
  1. You will see a set of 4 responses.
  2. Select the BEST and WORST response from the set.
  3. Focus on overall quality and helpfulness.

html_layout: |
  <div style="padding: 15px; max-width: 900px; margin: auto;">
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">User Prompt:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 16px;">
      <div style="background: #fef3c7; border: 1px solid #fde68a; border-radius: 8px; padding: 16px;">
        <strong style="color: #a16207;">Response A:</strong>
        <p style="font-size: 15px; line-height: 1.6; margin: 8px 0 0 0;">{{response_a}}</p>
      </div>
      <div style="background: #e0f2fe; border: 1px solid #7dd3fc; border-radius: 8px; padding: 16px;">
        <strong style="color: #0369a1;">Response B:</strong>
        <p style="font-size: 15px; line-height: 1.6; margin: 8px 0 0 0;">{{response_b}}</p>
      </div>
    </div>
  </div>

allow_all_users: true
instances_per_annotator: 100
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
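Because the config collects three annotations per instance (`annotation_per_instance: 3`), the pairwise labels usually need to be merged into one preference per item. A minimal sketch, assuming the votes have already been loaded into a dictionary keyed by item id (the `votes_by_item` structure is hypothetical, not the tool's output format), is a strict majority vote that falls back to "Tie" when no label wins:

```python
from collections import Counter

def majority_label(votes):
    """Return the label with a strict majority, or "Tie" if there is none."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else "Tie"

# Hypothetical votes from three annotators per item.
votes_by_item = {
    "arena_001": ["Response A", "Response B", "Response A"],
    "arena_002": ["Response A", "Response B", "Tie"],
}
consensus = {item: majority_label(v) for item, v in votes_by_item.items()}
```

Treating a three-way split as a tie is one design choice among several; dropping such items or sending them to an extra annotator would be equally reasonable.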

Sample Data: sample-data.json

[
  {
    "id": "arena_001",
    "text": "Explain the difference between machine learning and deep learning in simple terms.",
    "response_a": "Machine learning is a subset of AI where computers learn patterns from data. It includes techniques like decision trees, SVMs, and linear regression. Deep learning is a subset of machine learning that uses neural networks with many layers (hence 'deep'). These deep neural networks can automatically learn hierarchical features from raw data, making them especially powerful for tasks like image recognition and natural language processing.",
    "response_b": "Think of it like this: machine learning is like teaching a computer to recognize patterns using rules and examples. Deep learning takes this further by using brain-inspired networks that can figure out the important features on their own. For example, to recognize cats in photos, traditional ML needs you to define features (ear shape, whiskers), while deep learning figures out those features automatically from thousands of cat pictures."
  },
  {
    "id": "arena_002",
    "text": "Write a short poem about the ocean at sunset.",
    "response_a": "Golden light upon the waves,\nThe sun descends in amber haze,\nSalt-kissed air and seabirds' call,\nAs evening's curtain starts to fall.\nThe horizon burns with crimson fire,\nWhile tides sing soft, a liquid choir.",
    "response_b": "The ocean blushes pink and gold\nAs daylight's final tale is told.\nWaves reach up to touch the sun\nBefore the day is finally done.\nStarfish dream beneath the foam\nWhile seagulls slowly glide toward home.\nThe sky, a watercolor spill,\nReflects on waters, calm and still."
  }
]

// ... and 8 more items
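Each item needs the fields the config refers to: `id` and `text` (from `item_properties`) and `response_a` and `response_b` (the placeholders in `html_layout`). The sketch below is a hypothetical pre-flight check, not part of the tool, that reports missing fields and duplicate ids before the server is started. The inline sample is a shortened stand-in for the real file.

```python
import json

REQUIRED_KEYS = {"id", "text", "response_a", "response_b"}

def check_items(items):
    """Return a list of human-readable problems found in the data items."""
    problems = []
    seen_ids = set()
    for i, item in enumerate(items):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append(f"item {i}: missing {sorted(missing)}")
        if item.get("id") in seen_ids:
            problems.append(f"item {i}: duplicate id {item['id']!r}")
        seen_ids.add(item.get("id"))
    return problems

# Shortened stand-in data; the second item is deliberately broken.
sample = json.loads("""[
  {"id": "arena_001", "text": "p", "response_a": "a", "response_b": "b"},
  {"id": "arena_001", "text": "p"}
]""")
problems = check_items(sample)
```

To check the real file, replace the inline string with `json.load(open("sample-data.json"))`.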

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/chatbot-arena-pairwise-bws
potato start config.yaml

Details

Annotation Types

bws, pairwise

Domain

NLP, AI Evaluation

Use Cases

LLM Evaluation, Preference Learning, Chatbot Comparison

Tags

chatbot-arena, pairwise, bws, llm-evaluation, rlhf, icml2024
