MMMU: Massive Multi-discipline Multimodal Understanding
Multi-discipline multimodal QA requiring college-level understanding. Annotators answer multiple-choice questions that require interpreting images (charts, diagrams, photos) along with text across 30 subjects spanning STEM, humanities, social sciences, and more.
Configuration file: config.yaml
# MMMU: Massive Multi-discipline Multimodal Understanding
# Based on Yue et al., CVPR 2024
# Paper: https://openaccess.thecvf.com/content/CVPR2024/papers/Yue_MMMU_CVPR_2024_paper.pdf
#
# Multi-discipline multimodal QA benchmark requiring college-level subject
# knowledge and reasoning over images (charts, diagrams, photos) and text
# across 30 subjects including STEM, humanities, and social sciences.
#
# Annotation Guidelines:
# 1. Read the question carefully and examine the associated image
# 2. Consider the subject area and apply relevant domain knowledge
# 3. Select the best answer choice from the provided options
# 4. Use the image content to inform your answer — many questions
# cannot be answered from text alone
annotation_task_name: "MMMU: Multimodal Understanding"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
annotation_schemes:
  # Step 1: Select the correct answer
  - annotation_type: radio
    name: answer
    description: "Select the best answer to the question based on the image and text."
    labels:
      - "A"
      - "B"
      - "C"
      - "D"
    keyboard_shortcuts:
      "A": "a"
      "B": "b"
      "C": "c"
      "D": "d"
    tooltips:
      "A": "Select option A"
      "B": "Select option B"
      "C": "Select option C"
      "D": "Select option D"
  # Step 2: Rate difficulty
  - annotation_type: radio
    name: difficulty
    description: "How difficult was this question?"
    labels:
      - "Easy"
      - "Medium"
      - "Hard"
    tooltips:
      "Easy": "Answer is straightforward with basic subject knowledge"
      "Medium": "Requires moderate reasoning or domain expertise"
      "Hard": "Requires deep expertise or multi-step reasoning"
  # Step 3: Confidence
  - annotation_type: radio
    name: confidence
    description: "How confident are you in your answer?"
    labels:
      - "Very confident"
      - "Somewhat confident"
      - "Not confident"
html_layout: |
  <div style="margin-bottom: 15px; padding: 10px; background: #f0f4f8; border-radius: 6px;">
    <strong>Subject:</strong> {{subject}} — <strong>Subfield:</strong> {{subfield}}
  </div>
  <div style="text-align: center; margin-bottom: 15px;">
    <img src="{{image_url}}" style="max-width: 100%; max-height: 500px; border: 1px solid #ddd; border-radius: 4px;" />
  </div>
  <div style="font-size: 16px; line-height: 1.6; margin-bottom: 15px;">
    <strong>Question:</strong> {{text}}
  </div>
  <div style="padding: 10px; background: #fafafa; border-radius: 6px; line-height: 1.8;">
    <strong>Options:</strong><br/>
    {{options}}
  </div>
allow_all_users: true
instances_per_annotator: 100
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
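Every placeholder in html_layout ({{subject}}, {{image_url}}, {{options}}, and so on) must exist as a field on each item in the data file, or the rendered page shows blanks. A minimal pre-flight check, written as an illustrative sketch rather than a Potato feature; the key set simply mirrors the fields this config's template references:

```python
import json

# Fields referenced by html_layout, plus id_key/text_key from item_properties.
REQUIRED_KEYS = {"id", "text", "image_url", "options", "subject", "subfield"}

def missing_keys(items):
    """Map each item's id to the sorted list of template fields it lacks."""
    return {
        item.get("id", "<no id>"): sorted(REQUIRED_KEYS - item.keys())
        for item in items
        if REQUIRED_KEYS - item.keys()
    }

# Stand-in for loading sample-data.json before launching the task.
items = json.loads('[{"id": "mmmu_999", "text": "placeholder question"}]')
print(missing_keys(items))
# → {'mmmu_999': ['image_url', 'options', 'subfield', 'subject']}
```

Running this against the real sample-data.json before `potato start` catches missing fields early instead of mid-annotation.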
Sample data: sample-data.json
[
  {
    "id": "mmmu_001",
    "text": "A patient presents with the ECG tracing shown in the image. Which of the following is the most likely diagnosis?",
    "image_url": "https://example.com/mmmu/ecg_tracing_001.png",
    "options": "A) Atrial fibrillation\nB) Ventricular tachycardia\nC) Second-degree AV block (Mobitz Type I)\nD) Normal sinus rhythm",
    "subject": "Clinical Medicine",
    "subfield": "Cardiology"
  },
  {
    "id": "mmmu_002",
    "text": "Based on the circuit diagram shown, what is the total resistance between terminals A and B?",
    "image_url": "https://example.com/mmmu/circuit_diagram_002.png",
    "options": "A) 10 ohms\nB) 15 ohms\nC) 20 ohms\nD) 25 ohms",
    "subject": "Electrical Engineering",
    "subfield": "Circuit Analysis"
  }
]
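With annotation_per_instance set to 3, each question collects three independent `answer` labels, which can then be resolved by majority vote. A small sketch of that aggregation step, assuming you have already extracted the per-instance label lists from Potato's JSON output (the exact output schema is not shown here):

```python
from collections import Counter

def resolve(labels):
    """Majority label and agreement fraction for one instance's answers.

    `labels` is the list of `answer` values collected for one instance
    (three per instance in this config). Ties fall to whichever label
    appeared first, so instances with low agreement deserve manual review.
    """
    top, n = Counter(labels).most_common(1)[0]
    return top, n / len(labels)

print(resolve(["A", "A", "C"]))  # two of three annotators agree on "A"
```

Instances where the agreement fraction stays at 1/3 (all three annotators disagree) are good candidates for adjudication or for revisiting the question's difficulty rating.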
// ... and 8 more items
Get this design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/multimodal/mmmu-multimodal-understanding
potato start config.yaml
Found an issue or want to improve this design?
Submit an Issue
Related designs
ScienceQA Multimodal Reasoning
Multimodal science question answering with chain-of-thought reasoning, based on ScienceQA (Lu et al., NeurIPS 2022). Annotators answer multiple-choice science questions that may include images, provide chain-of-thought explanations, and categorize the science domain.
ADMIRE - Multimodal Idiomaticity Recognition
Multimodal idiomaticity detection task requiring annotators to identify whether expressions are used idiomatically or literally, with supporting cue analysis. Based on SemEval-2025 Task 1 (ADMIRE).
CHART-Infographics: Chart and Infographic Analysis
Chart and infographic analysis with structured extraction. Annotators identify chart elements (axes, legends, data points, titles) with bounding boxes, classify chart types, and extract data values. Supports structured understanding of visual data representations.