
MT-Bench: LLM Response Quality Evaluation

This design evaluates LLM response quality using both pairwise comparison and absolute scoring: annotators choose which of two LLM responses is better and rate each response individually on a 1-10 scale across multiple conversation turns.
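Pairwise preference labels like the ones this design collects are typically aggregated into per-model win rates (the same first step Chatbot Arena uses before Elo-style ratings). A minimal sketch, assuming the five label strings defined in the config below and collapsing "slightly" and "much" better into a plain win (a finer scheme, e.g. 0.75/0.25 for slight preferences, would be an equally valid choice):

```python
# Map each preference label from the config to a score for Response A.
# "Slightly better" and "much better" are collapsed here for simplicity.
LABEL_TO_A_SCORE = {
    "Response A is much better": 1.0,
    "Response A is slightly better": 1.0,
    "Tie": 0.5,
    "Response B is slightly better": 0.0,
    "Response B is much better": 0.0,
}

def win_rate_a(preference_labels):
    """Fraction of comparisons won by Response A, counting ties as half."""
    if not preference_labels:
        return 0.0
    total = sum(LABEL_TO_A_SCORE[label] for label in preference_labels)
    return total / len(preference_labels)

labels = [
    "Response A is much better",
    "Tie",
    "Response B is slightly better",
    "Response A is slightly better",
]
print(win_rate_a(labels))  # 0.625
```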


Configuration File (config.yaml)

# MT-Bench: LLM Response Quality Evaluation
# Based on "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (Zheng et al., NeurIPS 2023)
# Task: Pairwise comparison and absolute scoring of LLM responses

annotation_task_name: "MT-Bench LLM Evaluation"
task_dir: "."

# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

# Display layout showing question and both responses side-by-side
html_layout: |
  <div class="mtbench-container">
    <div class="metadata-bar" style="display: flex; gap: 15px; margin-bottom: 15px;">
      <div style="background: #e8eaf6; padding: 8px 15px; border-radius: 8px;">
        <strong>Category:</strong> {{category}}
      </div>
      <div style="background: #f3e5f5; padding: 8px 15px; border-radius: 8px;">
        <strong>Turn:</strong> {{turn_number}}
      </div>
    </div>
    <div class="question-section" style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin-bottom: 20px;">
      <h3 style="margin-top: 0;">Question:</h3>
      <div class="question-text" style="font-size: 16px;">{{text}}</div>
    </div>
    <div class="responses-section" style="display: flex; gap: 20px;">
      <div class="response-a" style="flex: 1; background: #e3f2fd; padding: 15px; border-radius: 8px; border: 2px solid #1976d2;">
        <h4 style="margin-top: 0; color: #1976d2;">Response A:</h4>
        <div class="response-text" style="white-space: pre-wrap; font-size: 14px;">{{response_a}}</div>
      </div>
      <div class="response-b" style="flex: 1; background: #fce4ec; padding: 15px; border-radius: 8px; border: 2px solid #c62828;">
        <h4 style="margin-top: 0; color: #c62828;">Response B:</h4>
        <div class="response-text" style="white-space: pre-wrap; font-size: 14px;">{{response_b}}</div>
      </div>
    </div>
  </div>

# Annotation schemes
annotation_schemes:
  # Pairwise preference
  - name: "preference"
    description: "Which response is better overall? Consider helpfulness, accuracy, depth, and clarity."
    annotation_type: pairwise
    labels:
      - "Response A is much better"
      - "Response A is slightly better"
      - "Tie"
      - "Response B is slightly better"
      - "Response B is much better"
    keyboard_shortcuts:
      "Response A is much better": "1"
      "Response A is slightly better": "2"
      "Tie": "3"
      "Response B is slightly better": "4"
      "Response B is much better": "5"

  # Absolute score for Response A
  - name: "score_response_a"
    description: "Rate Response A on a 1-10 scale for overall quality."
    annotation_type: likert
    size: 10
    min_label: "1 - Terrible"
    max_label: "10 - Perfect"

  # Absolute score for Response B
  - name: "score_response_b"
    description: "Rate Response B on a 1-10 scale for overall quality."
    annotation_type: likert
    size: 10
    min_label: "1 - Terrible"
    max_label: "10 - Perfect"

# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 80
annotation_per_instance: 3
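With annotation_per_instance: 3 and instances_per_annotator: 80, the minimum number of annotators needed follows from simple arithmetic: the total annotation count divided by each annotator's cap, with a floor of 3 because every item needs that many distinct annotators. A sketch of that calculation (the function name is illustrative, not part of Potato):

```python
import math

def min_annotators(num_items, annotations_per_instance=3,
                   instances_per_annotator=80):
    """Minimum annotators so every item gets the required number of
    independent annotations, given each annotator's instance cap."""
    total_annotations = num_items * annotations_per_instance
    # Each item needs distinct annotators, so at least
    # annotations_per_instance people are required regardless of load.
    return max(annotations_per_instance,
               math.ceil(total_annotations / instances_per_annotator))

# This showcase ships 10 items: 10 * 3 = 30 annotations fit within one
# annotator's 80-instance budget, but 3 distinct annotators are needed.
print(min_annotators(10))   # 3
print(min_annotators(500))  # ceil(1500 / 80) = 19
```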

Sample Data (sample-data.json)

[
  {
    "id": "mt_001",
    "text": "Compose a short poem about the beauty of mathematics.",
    "response_a": "In numbers' dance, a truth unfolds,\nWhere logic weaves its threads of gold.\nFrom Euler's grace to pi's embrace,\nInfinity finds its resting place.\n\nEquations bloom like garden flowers,\nRevealing nature's hidden powers.\nIn every proof, a story told,\nOf beauty that will never grow old.",
    "response_b": "Math is beautiful because it has patterns.\nNumbers are interesting.\nGeometry has nice shapes.\nCalculus helps us understand change.\nMath is used in many fields.\nIt is important for science.",
    "category": "Writing",
    "turn_number": "1"
  },
  {
    "id": "mt_002",
    "text": "Explain the concept of opportunity cost in economics and give a real-world example.",
    "response_a": "Opportunity cost is the value of the next best alternative you give up when making a choice. It's a fundamental concept in economics that helps us understand the true cost of decisions.\n\nReal-world example: Say you have $50,000 saved and you're deciding between investing it in stocks or using it as a down payment on a rental property. If you choose stocks, the opportunity cost is the rental income and property appreciation you would have gained. If you choose the property, the opportunity cost is the potential stock market returns.\n\nThis concept reminds us that every choice has a hidden cost - what we give up by not choosing the alternative.",
    "response_b": "Opportunity cost refers to the potential benefit an individual, investor, or business misses out on when choosing one alternative over another. Because every resource (time, money, etc.) can be put to alternative uses, every action, choice, or decision has an associated opportunity cost.\n\nExample: Imagine you have a free Saturday. You could either work overtime for $200 or attend a friend's concert. If you go to the concert, your opportunity cost is the $200 you could have earned. Conversely, if you work, your opportunity cost is the enjoyment and social connection from the concert.\n\nImportantly, opportunity cost isn't always monetary - it includes intangible benefits like happiness, relationships, and personal growth.",
    "category": "Reasoning",
    "turn_number": "1"
  }
]

// ... and 8 more items
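Because the html_layout template above renders {{category}}, {{turn_number}}, {{response_a}}, and {{response_b}} in addition to the configured id_key and text_key, every data item must carry all six fields or annotators will see blank panels. A small validation sketch (not part of Potato; `validate_items` is a hypothetical helper):

```python
# Fields referenced by config.yaml: id_key/text_key plus every
# {{placeholder}} used in html_layout.
REQUIRED_KEYS = {"id", "text", "response_a", "response_b",
                 "category", "turn_number"}

def validate_items(items):
    """Return a list of (item_id, missing_keys) for malformed items."""
    problems = []
    for i, item in enumerate(items):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append((item.get("id", f"index {i}"), sorted(missing)))
    return problems

# Against the shipped file you would json.load("sample-data.json");
# here two inline items demonstrate the check.
good = {"id": "mt_001", "text": "q", "response_a": "a",
        "response_b": "b", "category": "Writing", "turn_number": "1"}
bad = {"id": "mt_999", "text": "q"}
print(validate_items([good, bad]))
# [('mt_999', ['category', 'response_a', 'response_b', 'turn_number'])]
```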

Get This Design

View on GitHub

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/mtbench-llm-evaluation
potato start config.yaml

Details

Annotation Types

pairwise, likert

Domain

NLP, LLM Evaluation

Use Cases

LLM Benchmarking, Model Comparison, Quality Assessment

Tags

llm-evaluation, mt-bench, pairwise, chatbot-arena, response-quality, multi-turn

Found an issue or want to improve this design?

Open an Issue