MMBench Multimodal Evaluation
Multimodal evaluation benchmark combining image understanding with multiple-choice questions, based on MMBench (Liu et al., ECCV 2024). Annotators answer image-based questions, provide explanations, and tag the required perception or reasoning skills.
Configuration file: config.yaml
# MMBench Multimodal Evaluation
# Based on Liu et al., ECCV 2024
# Paper: https://arxiv.org/abs/2307.06281
# Dataset: https://github.com/open-compass/MMBench
#
# Multimodal evaluation benchmark combining image understanding with
# multiple-choice questions. Tests a variety of visual perception and
# reasoning abilities. Annotators view an image, answer a multiple-choice
# question, explain their reasoning, and tag which skills are required.
#
# Answer Options:
# - A, B, C, D: Four possible answers; exactly one is correct
#
# Skill Tags (select all that apply):
# - Visual Perception: Identifying objects, colors, shapes
# - Spatial Reasoning: Understanding spatial relationships and layouts
# - OCR: Reading text in images
# - Object Recognition: Identifying specific objects or entities
# - Scene Understanding: Comprehending the overall scene or context
# - Knowledge: Requiring external knowledge beyond what's visible
#
# Annotation Guidelines:
# 1. Examine the image carefully
# 2. Read the question and all four options
# 3. Select the correct answer
# 4. Explain your reasoning
# 5. Tag which visual/reasoning skills are needed
annotation_task_name: "MMBench Multimodal Evaluation"
task_dir: "."
data_files:
- sample-data.json
item_properties:
id_key: "id"
text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
# Step 1: Select the correct answer
- annotation_type: radio
name: answer
description: "Based on the image, select the correct answer."
labels:
- "A"
- "B"
- "C"
- "D"
keyboard_shortcuts:
"A": "1"
"B": "2"
"C": "3"
"D": "4"
tooltips:
"A": "Select option A"
"B": "Select option B"
"C": "Select option C"
"D": "Select option D"
# Step 2: Explanation
- annotation_type: text
name: explanation
description: "Briefly explain your reasoning for the selected answer."
textarea: true
required: false
placeholder: "Why did you choose this answer?"
# Step 3: Required skills
- annotation_type: multiselect
name: required_skills
description: "Which visual or reasoning skills are needed to answer this question? Select all that apply."
labels:
- "Visual Perception"
- "Spatial Reasoning"
- "OCR"
- "Object Recognition"
- "Scene Understanding"
- "Knowledge"
tooltips:
"Visual Perception": "Identifying basic visual attributes like colors, shapes, sizes"
"Spatial Reasoning": "Understanding spatial relationships, positions, and layouts"
"OCR": "Reading or recognizing text visible in the image"
"Object Recognition": "Identifying specific objects, animals, or entities"
"Scene Understanding": "Comprehending the overall scene, context, or activity"
"Knowledge": "Requiring external knowledge beyond what is visible in the image"
annotation_instructions: |
You will evaluate multimodal questions from the MMBench benchmark.
For each item:
1. Examine the image carefully before reading the question.
2. Read the question and all four answer options (A, B, C, D).
3. Select the single correct answer based on the image.
4. Briefly explain your reasoning.
5. Tag which skills are required to answer this question.
Tips:
- Pay close attention to details in the image.
- Some questions require reading text in the image (OCR).
- Some questions require world knowledge beyond what's visible.
html_layout: |
<div style="padding: 15px; max-width: 800px; margin: auto;">
<div style="text-align: center; margin-bottom: 16px;">
<img src="{{image_url}}" style="max-width: 100%; max-height: 500px; border: 1px solid #ddd; border-radius: 8px;" />
</div>
<div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
<strong style="color: #0369a1;">Question:</strong>
<p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
</div>
<div style="display: grid; grid-template-columns: 1fr 1fr; gap: 10px;">
<div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
<strong style="color: #475569;">A:</strong> {{option_a}}
</div>
<div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
<strong style="color: #475569;">B:</strong> {{option_b}}
</div>
<div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
<strong style="color: #475569;">C:</strong> {{option_c}}
</div>
<div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
<strong style="color: #475569;">D:</strong> {{option_d}}
</div>
</div>
</div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
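Before launching the task, it can be worth sanity-checking the `annotation_schemes` block: each scheme needs a unique `name`, and choice-based schemes (`radio`, `multiselect`) need a non-empty `labels` list. The sketch below is a minimal, hypothetical validator (not part of Potato itself); the dict literals mirror the three schemes defined above.

```python
# Hypothetical sanity check for the annotation_schemes block above.
# The dict literals mirror the config in this design; the validator
# itself is an illustration, not a Potato API.

REQUIRED_KEYS = {"annotation_type", "name", "description"}

def validate_schemes(schemes):
    """Return a list of human-readable problems (empty list = OK)."""
    problems = []
    names = [s.get("name") for s in schemes]
    if len(names) != len(set(names)):
        problems.append("duplicate scheme names")
    for s in schemes:
        missing = REQUIRED_KEYS - s.keys()
        if missing:
            problems.append(f"{s.get('name', '?')}: missing {sorted(missing)}")
        # choice-based schemes need a non-empty label list
        if s.get("annotation_type") in ("radio", "multiselect") and not s.get("labels"):
            problems.append(f"{s.get('name', '?')}: no labels")
    return problems

schemes = [
    {"annotation_type": "radio", "name": "answer",
     "description": "Based on the image, select the correct answer.",
     "labels": ["A", "B", "C", "D"]},
    {"annotation_type": "text", "name": "explanation",
     "description": "Briefly explain your reasoning for the selected answer."},
    {"annotation_type": "multiselect", "name": "required_skills",
     "description": "Which visual or reasoning skills are needed?",
     "labels": ["Visual Perception", "Spatial Reasoning", "OCR",
                "Object Recognition", "Scene Understanding", "Knowledge"]},
]

print(validate_schemes(schemes))  # → []
```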
Sample data: sample-data.json
[
{
"id": "mmb_001",
"text": "How many red apples are visible on the table?",
"image_url": "https://example.com/mmbench/image_001.jpg",
"option_a": "Two",
"option_b": "Three",
"option_c": "Four",
"option_d": "Five"
},
{
"id": "mmb_002",
"text": "What is the person in the image doing?",
"image_url": "https://example.com/mmbench/image_002.jpg",
"option_a": "Reading a book",
"option_b": "Cooking a meal",
"option_c": "Playing a guitar",
"option_d": "Writing on a whiteboard"
}
]
// ... and 8 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/multimodal/mmbench-multimodal-eval
potato start config.yaml
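Before starting the server, you can verify that every item in sample-data.json supplies each `{{placeholder}}` the `html_layout` references. The check below is an illustrative sketch; the layout snippet and the first data item are copied from this design.

```python
import json
import re

# Extract the {{placeholder}} names used by the html_layout
# (abbreviated here to the fields it actually interpolates).
layout = """
<img src="{{image_url}}" />
<p>{{text}}</p>
<div>A: {{option_a}}</div><div>B: {{option_b}}</div>
<div>C: {{option_c}}</div><div>D: {{option_d}}</div>
"""
placeholders = set(re.findall(r"\{\{(\w+)\}\}", layout))

# In practice you would read sample-data.json; the first item is
# inlined here so the check is self-contained.
items = json.loads("""
[
  {"id": "mmb_001",
   "text": "How many red apples are visible on the table?",
   "image_url": "https://example.com/mmbench/image_001.jpg",
   "option_a": "Two", "option_b": "Three",
   "option_c": "Four", "option_d": "Five"}
]
""")

for item in items:
    missing = placeholders - item.keys()
    assert not missing, f"{item['id']} is missing {sorted(missing)}"

print("all items OK")
```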
Found an issue or want to improve this design? Submit an issue.
Related Designs
CUB-200-2011 Fine-Grained Bird Classification
Fine-grained visual categorization of 200 bird species (Wah et al., 2011). Annotate bird images with species labels, part locations, and attribute annotations.
FLAIR: French Land Cover from Aerospace Imagery
Land use and land cover classification from high-resolution aerial imagery. Annotators classify the primary land use category of aerial image patches and identify any secondary land uses present. Based on the FLAIR dataset from the French National Institute of Geographic and Forest Information (IGN).
iWildCam Wildlife Detection & Classification
Camera trap image classification for wildlife monitoring (Beery et al., CVPR 2019). Classify wildlife species from camera trap images across diverse ecosystems worldwide.