MT-Bench Judge Consistency Evaluation
Multi-turn conversation evaluation for LLM judge consistency, based on MT-Bench (Zheng et al., NeurIPS 2023). Annotators compare two assistant responses in a pairwise setting, rate overall quality on a 1-10 Likert scale, and classify the conversation category.
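The judge-consistency measurement this task supports can be sketched as a comparison between human majority preferences and an automated judge's verdicts. The sketch below is illustrative, not part of Potato or MT-Bench; the `human_votes` and `judge_verdicts` structures and the agreement metric are assumptions about how collected labels might be compared.

```python
# Sketch: comparing an automated judge's pairwise verdicts against human
# majority votes. Data structures here are hypothetical examples.
from collections import Counter

def majority_label(votes):
    """Most common pairwise label; an exact split counts as a 'Tie'."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "Tie"
    return counts[0][0]

def agreement_rate(human_votes, judge_verdicts):
    """Fraction of items where the judge matches the human majority."""
    matches = sum(
        1 for item_id, votes in human_votes.items()
        if majority_label(votes) == judge_verdicts.get(item_id)
    )
    return matches / len(human_votes)

human_votes = {
    "mtb_001": ["Assistant A", "Assistant A", "Tie"],
    "mtb_002": ["Assistant B", "Assistant B", "Assistant B"],
}
judge_verdicts = {"mtb_001": "Assistant A", "mtb_002": "Assistant A"}
print(agreement_rate(human_votes, judge_verdicts))  # 0.5
```

Simple agreement rate is a starting point; chance-corrected statistics such as Cohen's kappa are commonly reported alongside it.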
Configuration File: config.yaml
# MT-Bench Judge Consistency Evaluation
# Based on Zheng et al., NeurIPS 2023
# Paper: https://arxiv.org/abs/2306.05685
# Dataset: https://huggingface.co/datasets/lmsys/mt_bench_human_judgments
#
# Multi-turn conversation evaluation for measuring LLM judge consistency.
# Annotators compare two assistant responses via pairwise preference, rate
# overall quality on a 1-10 scale, and classify the conversation category.
# This data helps calibrate automated LLM judges against human judgments.
#
# Pairwise Labels:
# - Assistant A: Response A is better
# - Assistant B: Response B is better
# - Tie: Both responses are equally good
#
# Categories (MT-Bench taxonomy):
# - Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, Humanities
#
# Annotation Guidelines:
# 1. Read the conversation context carefully
# 2. Compare both assistant responses
# 3. Select your preference or indicate a tie
# 4. Rate the overall quality of the conversation on a 1-10 scale
# 5. Classify the conversation category
annotation_task_name: "MT-Bench Judge Consistency Evaluation"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  # Step 1: Pairwise preference
  - annotation_type: pairwise
    name: preference
    description: "Which assistant response is better?"
    mode: "binary"
    labels:
      - "Assistant A"
      - "Assistant B"
      - "Tie"
    keyboard_shortcuts:
      "Assistant A": "a"
      "Assistant B": "b"
      "Tie": "t"
    tooltips:
      "Assistant A": "Response A is clearly better overall"
      "Assistant B": "Response B is clearly better overall"
      "Tie": "Both responses are of comparable quality"
  # Step 2: Quality rating on 1-10 scale
  - annotation_type: likert
    name: quality_rating
    description: "Rate the overall quality of the responses on a 1-10 scale."
    min_label: "1"
    max_label: "10"
    size: 10
  # Step 3: Category classification
  - annotation_type: radio
    name: category
    description: "What category does this conversation belong to?"
    labels:
      - "Writing"
      - "Roleplay"
      - "Reasoning"
      - "Math"
      - "Coding"
      - "Extraction"
      - "STEM"
      - "Humanities"
    keyboard_shortcuts:
      "Writing": "1"
      "Roleplay": "2"
      "Reasoning": "3"
      "Math": "4"
      "Coding": "5"
      "Extraction": "6"
      "STEM": "7"
      "Humanities": "8"
    tooltips:
      "Writing": "Creative writing, editing, summarization"
      "Roleplay": "Role-playing scenarios, character simulation"
      "Reasoning": "Logical and analytical reasoning tasks"
      "Math": "Mathematical problems and calculations"
      "Coding": "Programming and software development tasks"
      "Extraction": "Information extraction and data parsing"
      "STEM": "Science, technology, engineering topics"
      "Humanities": "History, philosophy, social sciences, arts"
annotation_instructions: |
  You will evaluate pairs of LLM assistant responses from the MT-Bench benchmark.

  For each item:
  1. Read the conversation context carefully.
  2. Compare Response A and Response B thoroughly.
  3. Select which assistant is better, or indicate a Tie.
  4. Rate the overall conversation quality on a 1-10 scale.
  5. Classify the conversation category.

  Rating Scale (1-10):
  - 1-2: Very poor, fails to address the query
  - 3-4: Below average, significant issues
  - 5-6: Average, adequate but room for improvement
  - 7-8: Good, addresses the query well with minor issues
  - 9-10: Excellent, comprehensive and high-quality response
html_layout: |
  <div style="padding: 15px; max-width: 900px; margin: auto;">
    <div style="background: #e8eaf6; padding: 8px 15px; border-radius: 8px; margin-bottom: 16px;">
      <strong>Category:</strong> {{category}}
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Conversation:</strong>
      <div style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0; white-space: pre-wrap;">{{text}}</div>
    </div>
    <div style="display: flex; gap: 16px;">
      <div style="flex: 1; background: #e3f2fd; border: 2px solid #1976d2; border-radius: 8px; padding: 16px;">
        <h4 style="margin-top: 0; color: #1976d2;">Response A:</h4>
        <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6;">{{response_a}}</div>
      </div>
      <div style="flex: 1; background: #fce4ec; border: 2px solid #c62828; border-radius: 8px; padding: 16px;">
        <h4 style="margin-top: 0; color: #c62828;">Response B:</h4>
        <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6;">{{response_b}}</div>
      </div>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
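Since the `html_layout` template above references `{{text}}`, `{{response_a}}`, `{{response_b}}`, and `{{category}}`, each data item must supply those fields. A minimal pre-flight check could look like the following; the `validate_items` helper is an illustrative sketch, not part of Potato.

```python
# Sketch: checking that each data item provides the fields the html_layout
# template references. Field and category names come from the config above;
# the checker itself is a hypothetical helper.
REQUIRED_FIELDS = {"id", "text", "response_a", "response_b", "category"}
CATEGORIES = {"Writing", "Roleplay", "Reasoning", "Math",
              "Coding", "Extraction", "STEM", "Humanities"}

def validate_items(items):
    """Return (item_id, problem) pairs; an empty list means every item passes."""
    problems = []
    for item in items:
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            problems.append((item.get("id", "?"),
                             "missing fields: " + ", ".join(sorted(missing))))
        elif item["category"] not in CATEGORIES:
            problems.append((item["id"], "unknown category: " + item["category"]))
    return problems

items = [
    {"id": "mtb_001", "text": "...", "response_a": "...",
     "response_b": "...", "category": "Writing"},
    {"id": "mtb_099", "text": "...", "response_a": "...", "category": "Writing"},
]
print(validate_items(items))  # [('mtb_099', 'missing fields: response_b')]
```

Running such a check before `potato start` catches items that would otherwise render with blank template slots.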
Sample Data: sample-data.json
[
  {
    "id": "mtb_001",
    "text": "Write a creative short story about a time-traveling scientist who accidentally changes a small detail in the past.",
    "response_a": "Dr. Elena Vasquez triple-checked her temporal coordinates before activating the displacement field. June 14, 1962 - a quiet Tuesday in Cambridge, Massachusetts. She materialized in an alley behind a coffee shop, adjusted her period-appropriate clothing, and stepped onto the sidewalk.\n\nHer mission was simple: observe the daily routine of a young graduate student who would later publish a groundbreaking paper on chaos theory. Pure observation, no interaction.\n\nBut as she crossed the street, she stumbled on an uneven cobblestone. A man behind her - the very student she'd come to observe - paused to pick up the notebook she'd dropped. Their eyes met. He smiled.\n\n\"Interesting binding on this notebook,\" he said, examining the spiral design that wouldn't be commercially available for another three years.\n\nBack in her own time, Elena found the world largely unchanged - except for one thing. Spiral-bound notebooks had been invented in 1962 instead of 1965. The graduate student, inspired by a stranger's unusual notebook, had sketched the design and shown it to a friend in manufacturing.\n\nA small change. Insignificant, really. But Elena noticed something else in the historical record: that three-year head start in notebook technology had given one particular student access to better note-taking tools slightly earlier, leading them to organize their research differently, which led to a paper being published six months ahead of schedule.\n\nThat paper had cited a source that was later retracted.\n\nThe butterfly had flapped its wings.",
    "response_b": "Professor James went back in time to 1985. He accidentally stepped on a flower. When he returned to the present, everything looked the same except now everyone's favorite color was blue instead of red. He tried to go back and fix it but his time machine was broken. The end.",
    "category": "Writing"
  },
  {
    "id": "mtb_002",
    "text": "If a store offers 30% off on a $250 jacket and then an additional 15% off at checkout, what is the final price? Show your work.",
    "response_a": "Let me calculate step by step:\n\n1. Original price: $250\n2. First discount (30% off): $250 x 0.30 = $75 off\n   Price after first discount: $250 - $75 = $175\n3. Second discount (15% off the discounted price): $175 x 0.15 = $26.25 off\n   Final price: $175 - $26.25 = $148.75\n\nThe final price is $148.75.\n\nNote: The discounts are applied sequentially, not combined. A 30% + 15% discount is NOT the same as a 45% discount. If it were 45% off: $250 x 0.55 = $137.50, which is lower than $148.75.",
    "response_b": "Step 1: Apply the 30% discount\n$250 * (1 - 0.30) = $250 * 0.70 = $175.00\n\nStep 2: Apply the additional 15% discount to the reduced price\n$175 * (1 - 0.15) = $175 * 0.85 = $148.75\n\nFinal price: $148.75\n\nAlternatively, you can calculate this in one step by multiplying the discount factors:\n$250 * 0.70 * 0.85 = $148.75\n\nThis is equivalent to a total discount of about 40.5% (not 45%, since the second discount applies to the already-reduced price).",
    "category": "Math"
  }
]
// ... and 8 more items

Get This Design
Clone or download from the repository.

Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/mt-bench-judge-consistency
potato start config.yaml
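With `annotation_per_instance: 3`, each item collects three judgments that need to be reduced to a consensus record. The sketch below assumes a flat per-annotation record keyed by the scheme names `preference` and `quality_rating`; the exact JSON layout Potato writes to `annotation_output/` may differ.

```python
# Sketch: aggregating three annotations per instance into a consensus.
# The input record format is an assumption, not Potato's documented output.
from collections import Counter
from statistics import mean

def aggregate(annotations):
    """Group annotations by item id, then take the majority preference
    (falling back to 'Tie' on a split vote) and the mean quality rating."""
    by_id = {}
    for ann in annotations:
        by_id.setdefault(ann["id"], []).append(ann)
    results = {}
    for item_id, anns in by_id.items():
        prefs = Counter(a["preference"] for a in anns).most_common()
        split = len(prefs) > 1 and prefs[0][1] == prefs[1][1]
        results[item_id] = {
            "preference": "Tie" if split else prefs[0][0],
            "mean_quality": round(mean(a["quality_rating"] for a in anns), 2),
        }
    return results

annotations = [
    {"id": "mtb_001", "preference": "Assistant A", "quality_rating": 8},
    {"id": "mtb_001", "preference": "Assistant A", "quality_rating": 7},
    {"id": "mtb_001", "preference": "Tie", "quality_rating": 9},
]
print(aggregate(annotations))
# {'mtb_001': {'preference': 'Assistant A', 'mean_quality': 8.0}}
```

An odd annotator count per instance (here, three) keeps two-way majority votes from splitting, though three-way splits across A/B/Tie remain possible.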
Related Designs
Arena Hard Auto - LLM Pairwise Evaluation
Pairwise evaluation of LLM responses on challenging prompts from the Arena Hard benchmark (Li et al., arXiv 2024). Annotators compare two responses on a continuous scale and rate question difficulty.
AnnoMI Counselling Dialogue Annotation
Annotation of motivational interviewing counselling dialogues based on the AnnoMI dataset. Annotators label therapist and client utterances for MI techniques (open questions, reflections, affirmations) and client change talk (sustain talk, change talk), with quality ratings for therapeutic interactions.
Argument Reasoning Comprehension (ARCT)
Identify implicit warrants in arguments. Based on Habernal et al., NAACL 2018 / SemEval 2018 Task 12. Given a claim and premise, choose the correct warrant that connects them.