MT-Bench: LLM Response Quality Evaluation
LLM response quality evaluation using pairwise comparison and absolute scoring. Annotators compare pairs of LLM responses and rate individual responses on a 1-10 scale across multiple turns.
Configuration File: config.yaml
# MT-Bench: LLM Response Quality Evaluation
# Based on "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (Zheng et al., NeurIPS 2023)
# Task: Pairwise comparison and absolute scoring of LLM responses
annotation_task_name: "MT-Bench LLM Evaluation"
task_dir: "."
# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Display layout showing question and both responses side-by-side
html_layout: |
  <div class="mtbench-container">
    <div class="metadata-bar" style="display: flex; gap: 15px; margin-bottom: 15px;">
      <div style="background: #e8eaf6; padding: 8px 15px; border-radius: 8px;">
        <strong>Category:</strong> {{category}}
      </div>
      <div style="background: #f3e5f5; padding: 8px 15px; border-radius: 8px;">
        <strong>Turn:</strong> {{turn_number}}
      </div>
    </div>
    <div class="question-section" style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin-bottom: 20px;">
      <h3 style="margin-top: 0;">Question:</h3>
      <div class="question-text" style="font-size: 16px;">{{text}}</div>
    </div>
    <div class="responses-section" style="display: flex; gap: 20px;">
      <div class="response-a" style="flex: 1; background: #e3f2fd; padding: 15px; border-radius: 8px; border: 2px solid #1976d2;">
        <h4 style="margin-top: 0; color: #1976d2;">Response A:</h4>
        <div class="response-text" style="white-space: pre-wrap; font-size: 14px;">{{response_a}}</div>
      </div>
      <div class="response-b" style="flex: 1; background: #fce4ec; padding: 15px; border-radius: 8px; border: 2px solid #c62828;">
        <h4 style="margin-top: 0; color: #c62828;">Response B:</h4>
        <div class="response-text" style="white-space: pre-wrap; font-size: 14px;">{{response_b}}</div>
      </div>
    </div>
  </div>
# Annotation schemes
annotation_schemes:
  # Pairwise preference
  - name: "preference"
    description: "Which response is better overall? Consider helpfulness, accuracy, depth, and clarity."
    annotation_type: pairwise
    labels:
      - "Response A is much better"
      - "Response A is slightly better"
      - "Tie"
      - "Response B is slightly better"
      - "Response B is much better"
    keyboard_shortcuts:
      "Response A is much better": "1"
      "Response A is slightly better": "2"
      "Tie": "3"
      "Response B is slightly better": "4"
      "Response B is much better": "5"
  # Absolute score for Response A
  - name: "score_response_a"
    description: "Rate Response A on a 1-10 scale for overall quality."
    annotation_type: likert
    size: 10
    min_label: "1 - Terrible"
    max_label: "10 - Perfect"
  # Absolute score for Response B
  - name: "score_response_b"
    description: "Rate Response B on a 1-10 scale for overall quality."
    annotation_type: likert
    size: 10
    min_label: "1 - Terrible"
    max_label: "10 - Perfect"
# User configuration
allow_all_users: true
# Task assignment
instances_per_annotator: 80
annotation_per_instance: 3
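One easy mistake with a layout-driven config like this is shipping data items that lack a field the `html_layout` interpolates, which renders as a blank in the annotation UI. Below is a minimal pre-flight check; the template field names are copied inline from the config above rather than parsed out of `config.yaml`, and the two embedded items are stand-ins for your real data file:

```python
import json

# Fields the html_layout template interpolates, plus the configured id/text keys.
ID_KEY, TEXT_KEY = "id", "text"
TEMPLATE_FIELDS = {ID_KEY, TEXT_KEY, "response_a", "response_b",
                   "category", "turn_number"}

def missing_fields(items):
    """Map each item id to the template fields it lacks (empty dict = all good)."""
    problems = {}
    for item in items:
        missing = sorted(TEMPLATE_FIELDS - item.keys())
        if missing:
            problems[item.get(ID_KEY, "<no id>")] = missing
    return problems

# Stand-in for json.load(open("sample-data.json")): one complete item,
# one deliberately broken item.
sample = json.loads("""[
  {"id": "mt_001", "text": "...", "response_a": "...", "response_b": "...",
   "category": "Writing", "turn_number": "1"},
  {"id": "mt_999", "text": "...", "response_a": "..."}
]""")
print(missing_fields(sample))  # {'mt_999': ['category', 'response_b', 'turn_number']}
```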
Sample Data: sample-data.json
[
  {
    "id": "mt_001",
    "text": "Compose a short poem about the beauty of mathematics.",
    "response_a": "In numbers' dance, a truth unfolds,\nWhere logic weaves its threads of gold.\nFrom Euler's grace to pi's embrace,\nInfinity finds its resting place.\n\nEquations bloom like garden flowers,\nRevealing nature's hidden powers.\nIn every proof, a story told,\nOf beauty that will never grow old.",
    "response_b": "Math is beautiful because it has patterns.\nNumbers are interesting.\nGeometry has nice shapes.\nCalculus helps us understand change.\nMath is used in many fields.\nIt is important for science.",
    "category": "Writing",
    "turn_number": "1"
  },
  {
    "id": "mt_002",
    "text": "Explain the concept of opportunity cost in economics and give a real-world example.",
    "response_a": "Opportunity cost is the value of the next best alternative you give up when making a choice. It's a fundamental concept in economics that helps us understand the true cost of decisions.\n\nReal-world example: Say you have $50,000 saved and you're deciding between investing it in stocks or using it as a down payment on a rental property. If you choose stocks, the opportunity cost is the rental income and property appreciation you would have gained. If you choose the property, the opportunity cost is the potential stock market returns.\n\nThis concept reminds us that every choice has a hidden cost - what we give up by not choosing the alternative.",
    "response_b": "Opportunity cost refers to the potential benefit an individual, investor, or business misses out on when choosing one alternative over another. Because every resource (time, money, etc.) can be put to alternative uses, every action, choice, or decision has an associated opportunity cost.\n\nExample: Imagine you have a free Saturday. You could either work overtime for $200 or attend a friend's concert. If you go to the concert, your opportunity cost is the $200 you could have earned. Conversely, if you work, your opportunity cost is the enjoyment and social connection from the concert.\n\nImportantly, opportunity cost isn't always monetary - it includes intangible benefits like happiness, relationships, and personal growth.",
    "category": "Reasoning",
    "turn_number": "1"
  }
]
// ... and 8 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/mtbench-llm-evaluation
potato start config.yaml
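With `annotation_per_instance: 3`, each item collects three independent judgments that you will want to aggregate afterwards. The exact layout of the files under `annotation_output/` depends on your Potato version, so the loading step is omitted here; this sketch assumes you have flattened the output into one record per (item, annotator) pair, keyed by the scheme names defined above:

```python
from collections import Counter
from statistics import mean

def aggregate(records):
    """Collapse per-annotator records into a per-item summary:
    majority pairwise preference plus mean Likert scores."""
    by_item = {}
    for r in records:
        by_item.setdefault(r["id"], []).append(r)
    summary = {}
    for item_id, anns in by_item.items():
        prefs = Counter(a["preference"] for a in anns)
        summary[item_id] = {
            "majority_preference": prefs.most_common(1)[0][0],
            "mean_score_a": mean(a["score_response_a"] for a in anns),
            "mean_score_b": mean(a["score_response_b"] for a in anns),
        }
    return summary

# Hypothetical flattened records for a single item with three annotators.
records = [
    {"id": "mt_001", "annotator": "u1", "preference": "Response A is much better",
     "score_response_a": 9, "score_response_b": 4},
    {"id": "mt_001", "annotator": "u2", "preference": "Response A is much better",
     "score_response_a": 8, "score_response_b": 5},
    {"id": "mt_001", "annotator": "u3", "preference": "Response A is slightly better",
     "score_response_a": 8, "score_response_b": 6},
]
print(aggregate(records)["mt_001"]["majority_preference"])  # Response A is much better
```

For publication-grade analysis you would also report inter-annotator agreement (e.g. Fleiss' kappa over the five preference labels) rather than majority vote alone.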
Found an issue or want to improve this design? Open an issue on the repository.

Related Designs
WildBench - LLM Evaluation on Real-World Tasks
Evaluation of LLM outputs on challenging real-world user queries from WildBench. Annotators compare two model responses via pairwise preference, rate overall quality on a Likert scale, and provide reasoning for their judgments.
MT-Bench Judge Consistency Evaluation
Multi-turn conversation evaluation for LLM judge consistency, based on MT-Bench (Zheng et al., NeurIPS 2023). Annotators compare two assistant responses in a pairwise setting, rate overall quality on a 1-10 Likert scale, and classify the conversation category.
Arena Hard Auto - LLM Pairwise Evaluation
Pairwise evaluation of LLM responses on challenging prompts from the Arena Hard benchmark (Li et al., arXiv 2024). Annotators compare two responses on a continuous scale and rate question difficulty.