intermediate, video

NExT-QA - Temporal and Causal Video Question Answering

This task covers temporal and causal video question answering, which requires reasoning about actions, events, and their relationships over time. It is based on the NExT-QA dataset (Xiao et al., CVPR 2021): annotators answer five-option multiple-choice questions about video content, with an emphasis on temporal and causal understanding.

Configuration File: config.yaml

# NExT-QA - Temporal and Causal Video Question Answering
# Based on Xiao et al., CVPR 2021
# Paper: https://arxiv.org/abs/2105.08276
# Dataset: https://github.com/doc-doc/NExT-QA
#
# Video QA task focusing on temporal and causal reasoning about events in videos.
# Annotators watch a video, answer a multiple-choice question (5 options),
# provide reasoning for their answer, and mark relevant video segments.
#
# Question Types:
# - Temporal: "What happened before/after X?" or "When did X happen?"
# - Causal: "Why did X happen?" or "What caused X?"
# - Descriptive: "What is X doing?" or "How many X are there?"
#
# Annotation Guidelines:
# 1. Watch the entire video before answering
# 2. Read all five answer options carefully
# 3. Select the best answer (A through E)
# 4. Write a brief explanation of your reasoning
# 5. Mark the video segment(s) most relevant to answering the question

annotation_task_name: "NExT-QA - Temporal Video QA"
task_dir: "."

data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

port: 8000
server_name: localhost

annotation_schemes:
  # Step 1: Select the answer
  - annotation_type: radio
    name: answer_choice
    description: "Select the best answer to the question based on the video."
    labels:
      - "A"
      - "B"
      - "C"
      - "D"
      - "E"
    keyboard_shortcuts:
      "A": "1"
      "B": "2"
      "C": "3"
      "D": "4"
      "E": "5"
    tooltips:
      "A": "Select option A"
      "B": "Select option B"
      "C": "Select option C"
      "D": "Select option D"
      "E": "Select option E"

  # Step 2: Provide reasoning
  - annotation_type: text
    name: reasoning
    description: "Briefly explain why you chose this answer. Reference specific events or moments in the video."
    textarea: true
    required: false
    placeholder: "Explain your reasoning..."

  # Step 3: Mark relevant video segments
  - annotation_type: video_annotation
    name: relevant_segments
    description: "Mark the video segments most relevant to answering the question."
    mode: segment
    labels:
      - name: "Relevant Segment"
        color: "#4CAF50"
        key_value: "r"
      - name: "Key Action"
        color: "#FF9800"
        key_value: "k"
      - name: "Background"
        color: "#9E9E9E"
        key_value: "b"

annotation_instructions: |
  You will watch a video and answer a multiple-choice question about it.

  For each item:
  1. Watch the entire video at least once before answering.
  2. Read the question and all five answer options (A-E).
  3. Select the single best answer.
  4. Write a brief explanation referencing specific video events.
  5. On the video timeline, mark segments relevant to the question.

  Question types you may encounter:
  - Temporal: about the order or timing of events
  - Causal: about why something happened or what caused it
  - Descriptive: about what is happening in the video

html_layout: |
  <div style="padding: 15px; max-width: 900px; margin: auto;">
    <div style="background: #212121; padding: 12px; border-radius: 8px; margin-bottom: 16px; text-align: center;">
      <video controls width="100%" style="max-height: 480px; border-radius: 4px;">
        <source src="{{video_url}}" type="video/mp4">
        Your browser does not support the video tag.
      </video>
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Question:</strong>
      <p style="font-size: 18px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 10px; margin-bottom: 16px;">
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">A:</strong> {{option_a}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">B:</strong> {{option_b}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">C:</strong> {{option_c}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">D:</strong> {{option_d}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px; grid-column: 1 / -1;">
        <strong style="color: #475569;">E:</strong> {{option_e}}
      </div>
    </div>
  </div>

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
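
The last block of settings determines how much annotation effort a dataset needs. The sketch below works through that arithmetic under two assumptions that the config does not state explicitly: `instances_per_annotator` caps how many items one person labels, and no annotator labels the same item twice. The function name `min_annotators` is hypothetical and is not part of Potato.

```python
import math


def min_annotators(num_items, per_instance=3, per_annotator=50):
    """Lower bound on annotators needed to fully label num_items.

    Assumes per_annotator is a hard cap on items per person and that
    each item must be labeled by per_instance distinct annotators.
    """
    if num_items == 0:
        return 0
    total_labels = num_items * per_instance
    # Enough people to supply all labels, and never fewer than the
    # number of distinct annotators each item requires.
    return max(per_instance, math.ceil(total_labels / per_annotator))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, "items ->", min_annotators(n), "annotators")
```

With the values in this config, the 10 sample items need at least 3 annotators (the per-item requirement dominates), while 1000 items need at least 60.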

Sample Data: sample-data.json

[
  {
    "id": "nextqa_001",
    "text": "Why did the woman pick up the toy from the floor?",
    "video_url": "https://example.com/nextqa/video_001.mp4",
    "option_a": "The baby threw it off the table",
    "option_b": "She wanted to clean the room",
    "option_c": "The dog brought it inside",
    "option_d": "She was going to give it to the child",
    "option_e": "It fell from the shelf by itself"
  },
  {
    "id": "nextqa_002",
    "text": "What happened right after the man finished stirring the pot?",
    "video_url": "https://example.com/nextqa/video_002.mp4",
    "option_a": "He turned off the stove",
    "option_b": "He added more seasoning",
    "option_c": "He tasted the food with a spoon",
    "option_d": "He served the food onto plates",
    "option_e": "He left the kitchen"
  }
]

// ... and 8 more items
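
Each item has to supply every field the config refers to: the `id` and `text` keys from `item_properties`, plus each `{{...}}` placeholder in `html_layout`. The following is a minimal sketch of a pre-flight check, assuming each placeholder is filled from the item field of the same name. `LAYOUT` is an abbreviated stand-in for the real layout, and `layout_placeholders` and `missing_fields` are hypothetical helpers, not Potato functions.

```python
import re

# Keys named by id_key and text_key in config.yaml.
CONFIG_KEYS = {"id", "text"}

# Abbreviated stand-in for html_layout; only the placeholders matter here.
LAYOUT = """
<source src="{{video_url}}" type="video/mp4">
<p>{{text}}</p>
A: {{option_a}} B: {{option_b}} C: {{option_c}}
D: {{option_d}} E: {{option_e}}
"""


def layout_placeholders(layout):
    """Return the set of names that appear as {{name}} in the layout."""
    return set(re.findall(r"\{\{\s*(\w+)\s*\}\}", layout))


def missing_fields(item, layout=LAYOUT):
    """Return a sorted list of required fields that are absent or empty."""
    needed = CONFIG_KEYS | layout_placeholders(layout)
    return sorted(key for key in needed if not item.get(key))


if __name__ == "__main__":
    item = {
        "id": "nextqa_001",
        "text": "Why did the woman pick up the toy from the floor?",
        "video_url": "https://example.com/nextqa/video_001.mp4",
        "option_a": "The baby threw it off the table",
        "option_b": "She wanted to clean the room",
        "option_c": "The dog brought it inside",
        "option_d": "She was going to give it to the child",
    }  # option_e deliberately left out
    print(missing_fields(item))
```

Running a check like this over the whole data file before starting the server can catch items that would otherwise render with an empty answer option.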

Get This Design

Clone or download from the GitHub repository.

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/video-qa/nextqa-temporal
potato start config.yaml
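
Because the config sets `annotation_per_instance: 3`, each item ends up with up to three `answer_choice` labels that usually need to be reconciled. The sketch below shows one common option, a simple majority vote. It assumes the labels have already been collected into a plain dict keyed by item id; that structure and the function name `majority_answer` are hypothetical, and reading Potato's actual files under `annotation_output/` is not shown here.

```python
from collections import Counter


def majority_answer(choices):
    """Return the most common label, or None for a tie or empty input."""
    top_two = Counter(choices).most_common(2)
    if not top_two:
        return None
    if len(top_two) == 2 and top_two[0][1] == top_two[1][1]:
        return None  # no single winner, e.g. a three-way split
    return top_two[0][0]


if __name__ == "__main__":
    # Hypothetical, already-parsed labels: item id -> answer_choice values.
    labels_by_item = {
        "nextqa_001": ["A", "A", "D"],
        "nextqa_002": ["C", "B", "E"],
    }
    for item_id, choices in labels_by_item.items():
        print(item_id, majority_answer(choices))
```

Items that come back as `None` have no majority and are natural candidates for adjudication or an extra annotation pass.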

Details

Annotation Types

radio, text, video_annotation

Domain

Video Understanding, NLP

Use Cases

Video Question Answering, Temporal Reasoning, Causal Reasoning

Tags

nextqa, video-qa, temporal, causal, reasoning, cvpr2021, multiple-choice
