MVBench Video Understanding
A comprehensive video understanding benchmark based on MVBench (Li et al., arXiv 2023). Annotators answer multiple-choice questions about a video, explain their reasoning, and mark the relevant video segments. Tests temporal perception, action recognition, and state change detection.
Configuration File: config.yaml
# MVBench Video Understanding
# Based on Li et al., arXiv 2023
# Paper: https://arxiv.org/abs/2311.17005
# Dataset: https://github.com/OpenGVLab/Ask-Anything
#
# Comprehensive video understanding benchmark testing temporal perception,
# action recognition, state change detection, and other dynamic video
# understanding capabilities. Annotators watch a video, answer a
# multiple-choice question, provide reasoning, and annotate relevant
# video segments.
#
# Task Types:
# - Action Sequence: Order of events in the video
# - Action Prediction: What happens next
# - Action Antonym: Identify opposite actions
# - State Change: Track object state changes
# - Object Existence: Whether objects appear/disappear
# - Counting: Count objects or events
# - Scene Transition: Identify scene changes
#
# Annotation Guidelines:
# 1. Watch the entire video before answering
# 2. Read the question and all four options carefully
# 3. Select the correct answer
# 4. Provide reasoning for your answer
# 5. Mark relevant video segments on the timeline
annotation_task_name: "MVBench Video Understanding"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  # Step 1: Select the answer
  - annotation_type: radio
    name: answer
    description: "Select the correct answer based on the video."
    labels:
      - "A"
      - "B"
      - "C"
      - "D"
    keyboard_shortcuts:
      "A": "1"
      "B": "2"
      "C": "3"
      "D": "4"
    tooltips:
      "A": "Select option A"
      "B": "Select option B"
      "C": "Select option C"
      "D": "Select option D"
  # Step 2: Reasoning
  - annotation_type: text
    name: reasoning
    description: "Explain your reasoning. Reference specific moments or events in the video."
    textarea: true
    required: false
    placeholder: "Why did you choose this answer? Reference specific video moments..."
  # Step 3: Video segment annotation
  - annotation_type: video_annotation
    name: relevant_segments
    description: "Mark the video segments most relevant to answering the question."
    mode: segment
    labels:
      - name: "Action"
        color: "#4CAF50"
        key_value: "a"
      - name: "State Change"
        color: "#FF9800"
        key_value: "s"
      - name: "Temporal Event"
        color: "#2196F3"
        key_value: "t"
annotation_instructions: |
  You will answer questions about video content from the MVBench benchmark.
  For each item:
  1. Watch the entire video at least once before answering.
  2. Read the question and all four options (A-D) carefully.
  3. Select the single correct answer.
  4. Explain your reasoning, referencing specific video moments.
  5. On the video timeline, mark segments relevant to the question.
  Segment Labels:
  - Action: A specific action or activity being performed
  - State Change: An object or scene changing state (e.g., door opening, light turning on)
  - Temporal Event: A time-specific event relevant to the question
html_layout: |
  <div style="padding: 15px; max-width: 900px; margin: auto;">
    <div style="background: #e8eaf6; padding: 8px 15px; border-radius: 8px; margin-bottom: 16px;">
      <strong>Task Type:</strong> {{task_type}}
    </div>
    <div style="background: #212121; padding: 12px; border-radius: 8px; margin-bottom: 16px; text-align: center;">
      <video controls width="100%" style="max-height: 480px; border-radius: 4px;">
        <source src="{{video_url}}" type="video/mp4">
        Your browser does not support the video tag.
      </video>
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Question:</strong>
      <p style="font-size: 18px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 10px;">
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">A:</strong> {{option_a}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">B:</strong> {{option_b}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">C:</strong> {{option_c}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">D:</strong> {{option_d}}
      </div>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
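Before launching, it can be worth checking that every item in the data files carries the fields the html_layout template substitutes. A minimal sketch, assuming the field list below (inferred from item_properties and the {{...}} placeholders in the config; this helper is not part of Potato itself):

```python
import json

# "id" and "text" come from item_properties; the remaining keys are
# the {{...}} placeholders substituted into html_layout.
REQUIRED_KEYS = {
    "id", "text", "video_url",
    "option_a", "option_b", "option_c", "option_d",
    "task_type",
}

def validate_items(path):
    """Return a list of (item_id, missing_keys) problems; empty means clean."""
    with open(path) as f:
        items = json.load(f)
    problems = []
    for i, item in enumerate(items):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append((item.get("id", f"index {i}"), sorted(missing)))
    return problems

if __name__ == "__main__":
    for item_id, missing in validate_items("sample-data.json"):
        print(f"{item_id}: missing {missing}")
```

Running this against each file listed under data_files before `potato start` avoids templates rendering with blank options mid-annotation.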
Sample Data: sample-data.json
[
  {
    "id": "mvb_001",
    "text": "What action does the person perform first in the video?",
    "video_url": "https://example.com/mvbench/video_001.mp4",
    "option_a": "Picks up a cup from the table",
    "option_b": "Opens the refrigerator door",
    "option_c": "Turns on the faucet",
    "option_d": "Sits down on a chair",
    "task_type": "Action Sequence"
  },
  {
    "id": "mvb_002",
    "text": "What will the dog most likely do next based on its behavior in the video?",
    "video_url": "https://example.com/mvbench/video_002.mp4",
    "option_a": "Chase the ball",
    "option_b": "Lie down and sleep",
    "option_c": "Bark at the stranger",
    "option_d": "Jump into the water",
    "task_type": "Action Prediction"
  }
]
// ... and 8 more items

Get This Design
Clone or download from the repository.
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/video-qa/mvbench-video-understanding
potato start config.yaml
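Because each item is shown to three annotators (annotation_per_instance: 3), downstream analysis typically aggregates the radio answers by majority vote. A minimal sketch, assuming the exported JSON has already been flattened to one record per (annotator, item) pair with "id" and "answer" fields; Potato's actual output layout may differ:

```python
from collections import Counter

def majority_vote(records):
    """Map each item id to (winning answer, agreement ratio).

    records: iterable of {"id": ..., "answer": "A".."D"} dicts,
    one per (annotator, item) pair.
    """
    by_item = {}
    for rec in records:
        by_item.setdefault(rec["id"], []).append(rec["answer"])
    result = {}
    for item_id, answers in by_item.items():
        # most_common(1) gives the modal answer and its vote count
        answer, count = Counter(answers).most_common(1)[0]
        result[item_id] = (answer, count / len(answers))
    return result
```

Items with agreement below 1.0 (a 2-1 split among three annotators) are good candidates for adjudication or an extra annotation pass.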
Found an issue or want to improve this design? Open an Issue

Related Designs
Ego4D: Egocentric Video Episodic Memory Annotation
Annotate egocentric (first-person) video for episodic memory tasks including activity segmentation, hand state tracking, natural language query generation, and scene narration. Supports temporal segment annotation with multiple label tiers for the Ego4D benchmark.
NExT-QA - Temporal and Causal Video Question Answering
Temporal and causal video question answering requiring reasoning about actions, events, and their relationships over time. Based on the NExT-QA dataset (Xiao et al., CVPR 2021), annotators answer multiple-choice questions about video content with an emphasis on temporal and causal understanding.
VSTAR Video-grounded Dialogue
Video-grounded dialogue annotation. Annotators watch videos and answer questions requiring situated understanding, write dialogue turns grounded in specific video moments, and mark relevant temporal segments.