MMLU Knowledge Evaluation
Multiple-choice knowledge evaluation across diverse academic subjects, based on the Massive Multitask Language Understanding benchmark (Hendrycks et al., ICLR 2021). Annotators select the correct answer from four options and may optionally explain their reasoning.
Configuration File: config.yaml
# MMLU Knowledge Evaluation
# Based on Hendrycks et al., ICLR 2021
# Paper: https://arxiv.org/abs/2009.03300
# Dataset: https://huggingface.co/datasets/cais/mmlu
#
# Multiple-choice knowledge evaluation based on the Massive Multitask
# Language Understanding benchmark. Each question covers one of 57
# academic subjects spanning STEM, humanities, social sciences, and more.
# Annotators select the correct answer and provide an explanation.
#
# Answer Options:
# - A, B, C, D: Four possible answers; exactly one is correct
#
# Annotation Guidelines:
# 1. Read the question carefully
# 2. Consider all four answer options before selecting
# 3. Choose the single best answer
# 4. Provide a brief explanation of your reasoning
annotation_task_name: "MMLU Knowledge Evaluation"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  # Step 1: Select the correct answer
  - annotation_type: radio
    name: answer
    description: "Select the correct answer to this question."
    labels:
      - "A"
      - "B"
      - "C"
      - "D"
    keyboard_shortcuts:
      "A": "1"
      "B": "2"
      "C": "3"
      "D": "4"
    tooltips:
      "A": "Select option A"
      "B": "Select option B"
      "C": "Select option C"
      "D": "Select option D"
  # Step 2: Explanation
  - annotation_type: text
    name: explanation
    description: "Briefly explain why you chose this answer."
    textarea: true
    required: false
    placeholder: "Explain your reasoning..."
annotation_instructions: |
  You will answer multiple-choice knowledge questions from the MMLU benchmark.

  For each item:
  1. Read the question and note the subject area.
  2. Read all four answer options (A, B, C, D) carefully.
  3. Select the single correct answer.
  4. Optionally, provide a brief explanation of your reasoning.

  Tips:
  - Questions span many subjects; use your best knowledge.
  - Eliminate clearly wrong options first to narrow your choice.
  - If unsure, make your best guess rather than skipping.
html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="background: #e8eaf6; padding: 8px 15px; border-radius: 8px; margin-bottom: 16px;">
      <strong>Subject:</strong> {{subject}}
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Question:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 10px;">
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">A:</strong> {{option_a}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">B:</strong> {{option_b}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">C:</strong> {{option_c}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">D:</strong> {{option_d}}
      </div>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
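The html_layout above references item fields through {{...}} placeholders, so every data item must carry matching keys. A quick way to catch mismatches before launching is to compare the placeholders in the layout against an item's keys. This is a minimal sketch; check_placeholders is a hypothetical helper, not part of Potato:

```python
import re

def check_placeholders(html_layout: str, item: dict) -> list:
    """Return placeholder names used in the layout but missing from a data item."""
    placeholders = set(re.findall(r"\{\{(\w+)\}\}", html_layout))
    return sorted(placeholders - item.keys())

layout = "<p>{{text}}</p> <div>A: {{option_a}}</div> <div>Subject: {{subject}}</div>"
item = {"id": "mmlu_001", "text": "...", "option_a": "...", "subject": "Biology"}
print(check_placeholders(layout, item))  # → []
print(check_placeholders(layout, {"text": "..."}))  # → ['option_a', 'subject']
```

Running this over every item in sample-data.json before starting the server avoids discovering a missing option_c field halfway through annotation.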
Sample Data: sample-data.json
[
  {
    "id": "mmlu_001",
    "text": "What is the primary function of mitochondria in eukaryotic cells?",
    "option_a": "Protein synthesis",
    "option_b": "ATP production through cellular respiration",
    "option_c": "DNA replication",
    "option_d": "Lipid storage",
    "subject": "Biology"
  },
  {
    "id": "mmlu_002",
    "text": "In economics, what does GDP stand for?",
    "option_a": "General Domestic Product",
    "option_b": "Gross Domestic Product",
    "option_c": "Global Development Program",
    "option_d": "Gross Development Percentage",
    "subject": "Economics"
  }
]
// ... and 8 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/mmlu-knowledge-eval
potato start config.yaml
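With annotation_per_instance: 2, each question is labeled by two annotators, so a natural first sanity check on the results is raw agreement on the answer field. The sketch below assumes the per-annotator output under annotation_output/ has been flattened into records with id, annotator, and answer keys; that flattening and the answer_agreement helper are assumptions, not Potato's documented output schema:

```python
from collections import defaultdict
from itertools import combinations

def answer_agreement(records):
    """Fraction of multiply-annotated items where all annotators chose the same answer."""
    by_item = defaultdict(list)
    for r in records:
        by_item[r["id"]].append(r["answer"])
    # Only items with at least two annotations can be scored for agreement.
    multi = [answers for answers in by_item.values() if len(answers) >= 2]
    if not multi:
        return 0.0
    agree = sum(1 for answers in multi
                if all(a == b for a, b in combinations(answers, 2)))
    return agree / len(multi)

records = [
    {"id": "mmlu_001", "annotator": "a1", "answer": "B"},
    {"id": "mmlu_001", "annotator": "a2", "answer": "B"},
    {"id": "mmlu_002", "annotator": "a1", "answer": "B"},
    {"id": "mmlu_002", "annotator": "a2", "answer": "A"},
]
print(answer_agreement(records))  # → 0.5
```

For knowledge questions with a single gold answer, low agreement usually flags ambiguous wording or a mislabeled key rather than genuine annotator disagreement.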
Found an issue or want to improve this design?
Open an Issue
Related Designs
BIG-Bench Task Evaluation
Evaluate language model responses on diverse reasoning tasks from the BIG-Bench benchmark. Annotators assess correctness, provide reasoning explanations, and rate confidence for model outputs across multiple task categories.
Bias Benchmark for QA (BBQ)
Annotate question-answering examples designed to probe social biases. Based on BBQ (Parrish et al., Findings of ACL 2022). Annotators select the correct answer given a context, assess the direction of bias in the question, categorize the type of bias, and explain their reasoning.
Code Generation Evaluation (HumanEval)
Evaluation of LLM-generated code based on the HumanEval benchmark. Annotators assess functional correctness, code quality, and efficiency of generated Python functions, and provide explanations of errors and improvement suggestions, supporting research in code generation and LLM evaluation.