MVBench Video Understanding
A comprehensive video understanding benchmark based on MVBench (Li et al., arXiv 2023). Annotators answer multiple-choice questions about a video, explain their reasoning, and mark the relevant video segments. Tests temporal perception, action recognition, and state change detection.
Configuration File: config.yaml
# MVBench Video Understanding
# Based on Li et al., arXiv 2023
# Paper: https://arxiv.org/abs/2311.17005
# Dataset: https://github.com/OpenGVLab/Ask-Anything
#
# Comprehensive video understanding benchmark testing temporal perception,
# action recognition, state change detection, and other dynamic video
# understanding capabilities. Annotators watch a video, answer a
# multiple-choice question, provide reasoning, and annotate relevant
# video segments.
#
# Task Types:
# - Action Sequence: Order of events in the video
# - Action Prediction: What happens next
# - Action Antonym: Identify opposite actions
# - State Change: Track object state changes
# - Object Existence: Whether objects appear/disappear
# - Counting: Count objects or events
# - Scene Transition: Identify scene changes
#
# Annotation Guidelines:
# 1. Watch the entire video before answering
# 2. Read the question and all four options carefully
# 3. Select the correct answer
# 4. Provide reasoning for your answer
# 5. Mark relevant video segments on the timeline
annotation_task_name: "MVBench Video Understanding"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  # Step 1: Select the answer
  - annotation_type: radio
    name: answer
    description: "Select the correct answer based on the video."
    labels:
      - "A"
      - "B"
      - "C"
      - "D"
    keyboard_shortcuts:
      "A": "1"
      "B": "2"
      "C": "3"
      "D": "4"
    tooltips:
      "A": "Select option A"
      "B": "Select option B"
      "C": "Select option C"
      "D": "Select option D"
  # Step 2: Reasoning
  - annotation_type: text
    name: reasoning
    description: "Explain your reasoning. Reference specific moments or events in the video."
    textarea: true
    required: false
    placeholder: "Why did you choose this answer? Reference specific video moments..."
  # Step 3: Video segment annotation
  - annotation_type: video_annotation
    name: relevant_segments
    description: "Mark the video segments most relevant to answering the question."
    mode: segment
    labels:
      - name: "Action"
        color: "#4CAF50"
        key_value: "a"
      - name: "State Change"
        color: "#FF9800"
        key_value: "s"
      - name: "Temporal Event"
        color: "#2196F3"
        key_value: "t"
annotation_instructions: |
  You will answer questions about video content from the MVBench benchmark.
  For each item:
  1. Watch the entire video at least once before answering.
  2. Read the question and all four options (A-D) carefully.
  3. Select the single correct answer.
  4. Explain your reasoning, referencing specific video moments.
  5. On the video timeline, mark segments relevant to the question.
  Segment Labels:
  - Action: A specific action or activity being performed
  - State Change: An object or scene changing state (e.g., door opening, light turning on)
  - Temporal Event: A time-specific event relevant to the question
html_layout: |
  <div style="padding: 15px; max-width: 900px; margin: auto;">
    <div style="background: #e8eaf6; padding: 8px 15px; border-radius: 8px; margin-bottom: 16px;">
      <strong>Task Type:</strong> {{task_type}}
    </div>
    <div style="background: #212121; padding: 12px; border-radius: 8px; margin-bottom: 16px; text-align: center;">
      <video controls width="100%" style="max-height: 480px; border-radius: 4px;">
        <source src="{{video_url}}" type="video/mp4">
        Your browser does not support the video tag.
      </video>
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Question:</strong>
      <p style="font-size: 18px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 10px;">
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">A:</strong> {{option_a}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">B:</strong> {{option_b}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">C:</strong> {{option_c}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">D:</strong> {{option_d}}
      </div>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
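Before launching, it can be worth checking that every item in the data files carries the fields the html_layout template substitutes. A minimal sketch, assuming the field list below (inferred from item_properties and the {{...}} placeholders in the config; this helper is not part of Potato itself):

```python
import json

# "id" and "text" come from item_properties; the remaining keys are
# the {{...}} placeholders substituted into html_layout.
REQUIRED_KEYS = {
    "id", "text", "video_url",
    "option_a", "option_b", "option_c", "option_d",
    "task_type",
}

def validate_items(path):
    """Return a list of (item_id, missing_keys) problems; empty means clean."""
    with open(path) as f:
        items = json.load(f)
    problems = []
    for i, item in enumerate(items):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append((item.get("id", f"index {i}"), sorted(missing)))
    return problems

if __name__ == "__main__":
    for item_id, missing in validate_items("sample-data.json"):
        print(f"{item_id}: missing {missing}")
```

Running this against each file listed under data_files before `potato start` avoids templates rendering with blank options mid-annotation.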
Sample Data: sample-data.json
[
  {
    "id": "mvb_001",
    "text": "What action does the person perform first in the video?",
    "video_url": "https://example.com/mvbench/video_001.mp4",
    "option_a": "Picks up a cup from the table",
    "option_b": "Opens the refrigerator door",
    "option_c": "Turns on the faucet",
    "option_d": "Sits down on a chair",
    "task_type": "Action Sequence"
  },
  {
    "id": "mvb_002",
    "text": "What will the dog most likely do next based on its behavior in the video?",
    "video_url": "https://example.com/mvbench/video_002.mp4",
    "option_a": "Chase the ball",
    "option_b": "Lie down and sleep",
    "option_c": "Bark at the stranger",
    "option_d": "Jump into the water",
    "task_type": "Action Prediction"
  }
]
// ... and 8 more items

Get This Design
Clone or download from the repository.
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/video-qa/mvbench-video-understanding
potato start config.yaml
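Because each item is shown to three annotators (annotation_per_instance: 3), downstream analysis typically aggregates the radio answers by majority vote. A minimal sketch, assuming the exported JSON has already been flattened to one record per (annotator, item) pair with "id" and "answer" fields; Potato's actual output layout may differ:

```python
from collections import Counter

def majority_vote(records):
    """Map each item id to (winning answer, agreement ratio).

    records: iterable of {"id": ..., "answer": "A".."D"} dicts,
    one per (annotator, item) pair.
    """
    by_item = {}
    for rec in records:
        by_item.setdefault(rec["id"], []).append(rec["answer"])
    result = {}
    for item_id, answers in by_item.items():
        # most_common(1) gives the modal answer and its vote count
        answer, count = Counter(answers).most_common(1)[0]
        result[item_id] = (answer, count / len(answers))
    return result
```

Items with agreement below 1.0 (a 2-1 split among three annotators) are good candidates for adjudication or an extra annotation pass.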
Found an issue or want to improve this design? Open an Issue

Related Designs
Ego4D: Egocentric Video Episodic Memory Annotation
Annotate egocentric (first-person) video for episodic memory tasks including activity segmentation, hand state tracking, natural language query generation, and scene narration. Supports temporal segment annotation with multiple label tiers for the Ego4D benchmark.
NExT-QA - Temporal and Causal Video Question Answering
Temporal and causal video question answering requiring reasoning about actions, events, and their relationships over time. Based on the NExT-QA dataset (Xiao et al., CVPR 2021), annotators answer multiple-choice questions about video content with an emphasis on temporal and causal understanding.
VSTAR Video-grounded Dialogue
Video-grounded dialogue annotation. Annotators watch videos and answer questions requiring situated understanding, write dialogue turns grounded in specific video moments, and mark relevant temporal segments.