Intermediate · Video

MVBench Video Understanding

Comprehensive video understanding benchmark with multiple-choice questions, video segment annotation, and reasoning, based on MVBench (Li et al., arXiv 2023). Tests temporal perception, action recognition, and state change detection in videos.

[Interactive preview: video player with a timeline for dragging out and labeling temporal segments (e.g., Walk/Run/Stand actions, Outdoor/Indoor scenes)]

Configuration File: config.yaml

# MVBench Video Understanding
# Based on Li et al., arXiv 2023
# Paper: https://arxiv.org/abs/2311.17005
# Dataset: https://github.com/OpenGVLab/Ask-Anything
#
# Comprehensive video understanding benchmark testing temporal perception,
# action recognition, state change detection, and other dynamic video
# understanding capabilities. Annotators watch a video, answer a
# multiple-choice question, provide reasoning, and annotate relevant
# video segments.
#
# Task Types:
# - Action Sequence: Order of events in the video
# - Action Prediction: What happens next
# - Action Antonym: Identify opposite actions
# - State Change: Track object state changes
# - Object Existence: Whether objects appear/disappear
# - Counting: Count objects or events
# - Scene Transition: Identify scene changes
#
# Annotation Guidelines:
# 1. Watch the entire video before answering
# 2. Read the question and all four options carefully
# 3. Select the correct answer
# 4. Provide reasoning for your answer
# 5. Mark relevant video segments on the timeline

annotation_task_name: "MVBench Video Understanding"
task_dir: "."

data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

port: 8000
server_name: localhost

annotation_schemes:
  # Step 1: Select the answer
  - annotation_type: radio
    name: answer
    description: "Select the correct answer based on the video."
    labels:
      - "A"
      - "B"
      - "C"
      - "D"
    keyboard_shortcuts:
      "A": "1"
      "B": "2"
      "C": "3"
      "D": "4"
    tooltips:
      "A": "Select option A"
      "B": "Select option B"
      "C": "Select option C"
      "D": "Select option D"

  # Step 2: Reasoning
  - annotation_type: text
    name: reasoning
    description: "Explain your reasoning. Reference specific moments or events in the video."
    textarea: true
    required: false
    placeholder: "Why did you choose this answer? Reference specific video moments..."

  # Step 3: Video segment annotation
  - annotation_type: video_annotation
    name: relevant_segments
    description: "Mark the video segments most relevant to answering the question."
    mode: segment
    labels:
      - name: "Action"
        color: "#4CAF50"
        key_value: "a"
      - name: "State Change"
        color: "#FF9800"
        key_value: "s"
      - name: "Temporal Event"
        color: "#2196F3"
        key_value: "t"

annotation_instructions: |
  You will answer questions about video content from the MVBench benchmark.

  For each item:
  1. Watch the entire video at least once before answering.
  2. Read the question and all four options (A-D) carefully.
  3. Select the single correct answer.
  4. Explain your reasoning, referencing specific video moments.
  5. On the video timeline, mark segments relevant to the question.

  Segment Labels:
  - Action: A specific action or activity being performed
  - State Change: An object or scene changing state (e.g., door opening, light turning on)
  - Temporal Event: A time-specific event relevant to the question

html_layout: |
  <div style="padding: 15px; max-width: 900px; margin: auto;">
    <div style="background: #e8eaf6; padding: 8px 15px; border-radius: 8px; margin-bottom: 16px;">
      <strong>Task Type:</strong> {{task_type}}
    </div>
    <div style="background: #212121; padding: 12px; border-radius: 8px; margin-bottom: 16px; text-align: center;">
      <video controls width="100%" style="max-height: 480px; border-radius: 4px;">
        <source src="{{video_url}}" type="video/mp4">
        Your browser does not support the video tag.
      </video>
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Question:</strong>
      <p style="font-size: 18px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 10px;">
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">A:</strong> {{option_a}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">B:</strong> {{option_b}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">C:</strong> {{option_c}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">D:</strong> {{option_d}}
      </div>
    </div>
  </div>

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
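The deployment settings above imply a minimum annotator head count: each instance needs 3 annotations, and each annotator sees at most 50 instances. A minimal sketch of that arithmetic (plain Python, not part of the Potato config; it assumes each of an instance's 3 annotations must come from a distinct annotator):

```python
import math

def min_annotators(num_items: int, per_instance: int, per_annotator: int) -> int:
    # Each instance needs `per_instance` distinct annotators, so at least
    # that many people are required regardless of dataset size. Total work
    # (num_items * per_instance labels, at most per_annotator labels each)
    # gives the other lower bound.
    return max(per_instance, math.ceil(num_items * per_instance / per_annotator))

# With this config (3 annotations per instance, 50 instances per annotator),
# the 10-item sample dataset needs at least 3 annotators.
print(min_annotators(10, 3, 50))  # -> 3
```

Scaling up, a 100-item dataset under the same settings would need at least 6 annotators, since 300 labels divided across 50-label workloads requires 6 people.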

Sample Data: sample-data.json

[
  {
    "id": "mvb_001",
    "text": "What action does the person perform first in the video?",
    "video_url": "https://example.com/mvbench/video_001.mp4",
    "option_a": "Picks up a cup from the table",
    "option_b": "Opens the refrigerator door",
    "option_c": "Turns on the faucet",
    "option_d": "Sits down on a chair",
    "task_type": "Action Sequence"
  },
  {
    "id": "mvb_002",
    "text": "What will the dog most likely do next based on its behavior in the video?",
    "video_url": "https://example.com/mvbench/video_002.mp4",
    "option_a": "Chase the ball",
    "option_b": "Lie down and sleep",
    "option_c": "Bark at the stranger",
    "option_d": "Jump into the water",
    "task_type": "Action Prediction"
  }
]

// ... and 8 more items
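Each data item must supply every `{{field}}` placeholder that `html_layout` references (`video_url`, `text`, `option_a`–`option_d`, `task_type`), or the rendered task will have gaps. A minimal pre-launch sanity check (a hypothetical helper, not part of Potato) that extracts placeholder names from a layout string and verifies an item supplies them all:

```python
import re

def placeholders(layout: str) -> set[str]:
    """Extract the {{field}} names referenced by an html_layout template."""
    return set(re.findall(r"\{\{(\w+)\}\}", layout))

def missing_fields(item: dict, layout: str) -> list[str]:
    """Fields the layout references but the data item lacks."""
    return sorted(placeholders(layout) - item.keys())

# Stand-in layout mirroring the placeholders used in config.yaml above.
layout = (
    '<video src="{{video_url}}"></video><p>{{text}}</p>'
    '<span>{{option_a}} {{option_b}} {{option_c}} {{option_d}}</span>'
    '<em>{{task_type}}</em>'
)

item = {
    "id": "mvb_001",
    "text": "What action does the person perform first in the video?",
    "video_url": "https://example.com/mvbench/video_001.mp4",
    "option_a": "Picks up a cup from the table",
    "option_b": "Opens the refrigerator door",
    "option_c": "Turns on the faucet",
    "option_d": "Sits down on a chair",
    "task_type": "Action Sequence",
}

print(missing_fields(item, layout))  # -> []
```

Running this over every record in sample-data.json before `potato start` catches missing keys early rather than mid-annotation.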

Get This Design

View on GitHub

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/video-qa/mvbench-video-understanding
potato start config.yaml

Details

Annotation Types

radio · text · video_annotation

Domain

Video Understanding · Multimodal

Use Cases

Video QA · Temporal Reasoning · VLM Benchmarking

Tags

mvbench · video-qa · temporal-reasoning · video-understanding · arxiv2023

Found an issue or want to improve this design?

Open an Issue