MT-Bench: LLM Response Quality Evaluation
LLM response quality evaluation using pairwise comparison and absolute scoring. Annotators compare pairs of LLM responses and rate individual responses on a 1-10 scale across multiple turns.
Configuration File: config.yaml
# MT-Bench: LLM Response Quality Evaluation
# Based on "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (Zheng et al., NeurIPS 2023)
# Task: Pairwise comparison and absolute scoring of LLM responses
annotation_task_name: "MT-Bench LLM Evaluation"
task_dir: "."
# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Display layout showing question and both responses side-by-side
html_layout: |
  <div class="mtbench-container">
    <div class="metadata-bar" style="display: flex; gap: 15px; margin-bottom: 15px;">
      <div style="background: #e8eaf6; padding: 8px 15px; border-radius: 8px;">
        <strong>Category:</strong> {{category}}
      </div>
      <div style="background: #f3e5f5; padding: 8px 15px; border-radius: 8px;">
        <strong>Turn:</strong> {{turn_number}}
      </div>
    </div>
    <div class="question-section" style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin-bottom: 20px;">
      <h3 style="margin-top: 0;">Question:</h3>
      <div class="question-text" style="font-size: 16px;">{{text}}</div>
    </div>
    <div class="responses-section" style="display: flex; gap: 20px;">
      <div class="response-a" style="flex: 1; background: #e3f2fd; padding: 15px; border-radius: 8px; border: 2px solid #1976d2;">
        <h4 style="margin-top: 0; color: #1976d2;">Response A:</h4>
        <div class="response-text" style="white-space: pre-wrap; font-size: 14px;">{{response_a}}</div>
      </div>
      <div class="response-b" style="flex: 1; background: #fce4ec; padding: 15px; border-radius: 8px; border: 2px solid #c62828;">
        <h4 style="margin-top: 0; color: #c62828;">Response B:</h4>
        <div class="response-text" style="white-space: pre-wrap; font-size: 14px;">{{response_b}}</div>
      </div>
    </div>
  </div>
# Annotation schemes
annotation_schemes:
  # Pairwise preference
  - name: "preference"
    description: "Which response is better overall? Consider helpfulness, accuracy, depth, and clarity."
    annotation_type: pairwise
    labels:
      - "Response A is much better"
      - "Response A is slightly better"
      - "Tie"
      - "Response B is slightly better"
      - "Response B is much better"
    keyboard_shortcuts:
      "Response A is much better": "1"
      "Response A is slightly better": "2"
      "Tie": "3"
      "Response B is slightly better": "4"
      "Response B is much better": "5"
  # Absolute score for Response A
  - name: "score_response_a"
    description: "Rate Response A on a 1-10 scale for overall quality."
    annotation_type: likert
    size: 10
    min_label: "1 - Terrible"
    max_label: "10 - Perfect"
  # Absolute score for Response B
  - name: "score_response_b"
    description: "Rate Response B on a 1-10 scale for overall quality."
    annotation_type: likert
    size: 10
    min_label: "1 - Terrible"
    max_label: "10 - Perfect"
# User configuration
allow_all_users: true
# Task assignment
instances_per_annotator: 80
annotation_per_instance: 3
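One easy mistake with a layout-driven config like this is shipping data items that lack a field the `html_layout` interpolates, which renders as a blank in the annotation UI. Below is a minimal pre-flight check; the template field names are copied inline from the config above rather than parsed out of `config.yaml`, and the two embedded items are stand-ins for your real data file:

```python
import json

# Fields the html_layout template interpolates, plus the configured id/text keys.
ID_KEY, TEXT_KEY = "id", "text"
TEMPLATE_FIELDS = {ID_KEY, TEXT_KEY, "response_a", "response_b",
                   "category", "turn_number"}

def missing_fields(items):
    """Map each item id to the template fields it lacks (empty dict = all good)."""
    problems = {}
    for item in items:
        missing = sorted(TEMPLATE_FIELDS - item.keys())
        if missing:
            problems[item.get(ID_KEY, "<no id>")] = missing
    return problems

# Stand-in for json.load(open("sample-data.json")): one complete item,
# one deliberately broken item.
sample = json.loads("""[
  {"id": "mt_001", "text": "...", "response_a": "...", "response_b": "...",
   "category": "Writing", "turn_number": "1"},
  {"id": "mt_999", "text": "...", "response_a": "..."}
]""")
print(missing_fields(sample))  # {'mt_999': ['category', 'response_b', 'turn_number']}
```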
Sample Data: sample-data.json
[
  {
    "id": "mt_001",
    "text": "Compose a short poem about the beauty of mathematics.",
    "response_a": "In numbers' dance, a truth unfolds,\nWhere logic weaves its threads of gold.\nFrom Euler's grace to pi's embrace,\nInfinity finds its resting place.\n\nEquations bloom like garden flowers,\nRevealing nature's hidden powers.\nIn every proof, a story told,\nOf beauty that will never grow old.",
    "response_b": "Math is beautiful because it has patterns.\nNumbers are interesting.\nGeometry has nice shapes.\nCalculus helps us understand change.\nMath is used in many fields.\nIt is important for science.",
    "category": "Writing",
    "turn_number": "1"
  },
  {
    "id": "mt_002",
    "text": "Explain the concept of opportunity cost in economics and give a real-world example.",
    "response_a": "Opportunity cost is the value of the next best alternative you give up when making a choice. It's a fundamental concept in economics that helps us understand the true cost of decisions.\n\nReal-world example: Say you have $50,000 saved and you're deciding between investing it in stocks or using it as a down payment on a rental property. If you choose stocks, the opportunity cost is the rental income and property appreciation you would have gained. If you choose the property, the opportunity cost is the potential stock market returns.\n\nThis concept reminds us that every choice has a hidden cost - what we give up by not choosing the alternative.",
    "response_b": "Opportunity cost refers to the potential benefit an individual, investor, or business misses out on when choosing one alternative over another. Because every resource (time, money, etc.) can be put to alternative uses, every action, choice, or decision has an associated opportunity cost.\n\nExample: Imagine you have a free Saturday. You could either work overtime for $200 or attend a friend's concert. If you go to the concert, your opportunity cost is the $200 you could have earned. Conversely, if you work, your opportunity cost is the enjoyment and social connection from the concert.\n\nImportantly, opportunity cost isn't always monetary - it includes intangible benefits like happiness, relationships, and personal growth.",
    "category": "Reasoning",
    "turn_number": "1"
  }
]
// ... and 8 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/mtbench-llm-evaluation
potato start config.yaml
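With `annotation_per_instance: 3`, each item collects three independent judgments that you will want to aggregate afterwards. The exact layout of the files under `annotation_output/` depends on your Potato version, so the loading step is omitted here; this sketch assumes you have flattened the output into one record per (item, annotator) pair, keyed by the scheme names defined above:

```python
from collections import Counter
from statistics import mean

def aggregate(records):
    """Collapse per-annotator records into a per-item summary:
    majority pairwise preference plus mean Likert scores."""
    by_item = {}
    for r in records:
        by_item.setdefault(r["id"], []).append(r)
    summary = {}
    for item_id, anns in by_item.items():
        prefs = Counter(a["preference"] for a in anns)
        summary[item_id] = {
            "majority_preference": prefs.most_common(1)[0][0],
            "mean_score_a": mean(a["score_response_a"] for a in anns),
            "mean_score_b": mean(a["score_response_b"] for a in anns),
        }
    return summary

# Hypothetical flattened records for a single item with three annotators.
records = [
    {"id": "mt_001", "annotator": "u1", "preference": "Response A is much better",
     "score_response_a": 9, "score_response_b": 4},
    {"id": "mt_001", "annotator": "u2", "preference": "Response A is much better",
     "score_response_a": 8, "score_response_b": 5},
    {"id": "mt_001", "annotator": "u3", "preference": "Response A is slightly better",
     "score_response_a": 8, "score_response_b": 6},
]
print(aggregate(records)["mt_001"]["majority_preference"])  # Response A is much better
```

For publication-grade analysis you would also report inter-annotator agreement (e.g. Fleiss' kappa over the five preference labels) rather than majority vote alone.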
Found an issue or want to improve this design? Open an issue on the repository.

Related Designs
WildBench - LLM Evaluation on Real-World Tasks
Evaluation of LLM outputs on challenging real-world user queries from WildBench. Annotators compare two model responses via pairwise preference, rate overall quality on a Likert scale, and provide reasoning for their judgments.
MT-Bench Judge Consistency Evaluation
Multi-turn conversation evaluation for LLM judge consistency, based on MT-Bench (Zheng et al., NeurIPS 2023). Annotators compare two assistant responses in a pairwise setting, rate overall quality on a 1-10 Likert scale, and classify the conversation category.
Arena Hard Auto - LLM Pairwise Evaluation
Pairwise evaluation of LLM responses on challenging prompts from the Arena Hard benchmark (Li et al., arXiv 2024). Annotators compare two responses on a continuous scale and rate question difficulty.