NExT-QA - Temporal and Causal Video Question Answering
Temporal and causal video question answering that requires reasoning about actions, events, and their relationships over time. In this design, based on the NExT-QA dataset (Xiao et al., CVPR 2021), annotators answer multiple-choice questions about video content, with an emphasis on temporal and causal understanding.
Configuration File: config.yaml
# NExT-QA - Temporal and Causal Video Question Answering
# Based on Xiao et al., CVPR 2021
# Paper: https://arxiv.org/abs/2105.08276
# Dataset: https://github.com/doc-doc/NExT-QA
#
# Video QA task focusing on temporal and causal reasoning about events in videos.
# Annotators watch a video, answer a multiple-choice question (5 options),
# provide reasoning for their answer, and mark relevant video segments.
#
# Question Types:
# - Temporal: "What happened before/after X?" or "When did X happen?"
# - Causal: "Why did X happen?" or "What caused X?"
# - Descriptive: "What is X doing?" or "How many X are there?"
#
# Annotation Guidelines:
# 1. Watch the entire video before answering
# 2. Read all five answer options carefully
# 3. Select the best answer (A through E)
# 4. Write a brief explanation of your reasoning
# 5. Mark the video segment(s) most relevant to answering the question
annotation_task_name: "NExT-QA - Temporal Video QA"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  # Step 1: Select the answer
  - annotation_type: radio
    name: answer_choice
    description: "Select the best answer to the question based on the video."
    labels:
      - "A"
      - "B"
      - "C"
      - "D"
      - "E"
    keyboard_shortcuts:
      "A": "1"
      "B": "2"
      "C": "3"
      "D": "4"
      "E": "5"
    tooltips:
      "A": "Select option A"
      "B": "Select option B"
      "C": "Select option C"
      "D": "Select option D"
      "E": "Select option E"
  # Step 2: Provide reasoning
  - annotation_type: text
    name: reasoning
    description: "Briefly explain why you chose this answer. Reference specific events or moments in the video."
    textarea: true
    required: false
    placeholder: "Explain your reasoning..."
  # Step 3: Mark relevant video segments
  - annotation_type: video_annotation
    name: relevant_segments
    description: "Mark the video segments most relevant to answering the question."
    mode: segment
    labels:
      - name: "Relevant Segment"
        color: "#4CAF50"
        key_value: "r"
      - name: "Key Action"
        color: "#FF9800"
        key_value: "k"
      - name: "Background"
        color: "#9E9E9E"
        key_value: "b"
annotation_instructions: |
  You will watch a video and answer a multiple-choice question about it.
  For each item:
  1. Watch the entire video at least once before answering.
  2. Read the question and all five answer options (A-E).
  3. Select the single best answer.
  4. Write a brief explanation referencing specific video events.
  5. On the video timeline, mark segments relevant to the question.
  Question types you may encounter:
  - Temporal: about the order or timing of events
  - Causal: about why something happened or what caused it
  - Descriptive: about what is happening in the video
html_layout: |
  <div style="padding: 15px; max-width: 900px; margin: auto;">
    <div style="background: #212121; padding: 12px; border-radius: 8px; margin-bottom: 16px; text-align: center;">
      <video controls width="100%" style="max-height: 480px; border-radius: 4px;">
        <source src="{{video_url}}" type="video/mp4">
        Your browser does not support the video tag.
      </video>
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Question:</strong>
      <p style="font-size: 18px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 10px; margin-bottom: 16px;">
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">A:</strong> {{option_a}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">B:</strong> {{option_b}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">C:</strong> {{option_c}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">D:</strong> {{option_d}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px; grid-column: 1 / -1;">
        <strong style="color: #475569;">E:</strong> {{option_e}}
      </div>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
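Because `annotation_per_instance: 3` collects three answers per question, a common post-processing step is a majority vote over the `answer_choice` labels. The exact schema Potato writes to `annotation_output/` varies by version, so the sketch below assumes a simplified record shape (`{"id": ..., "answer_choice": ...}`) as a stand-in; the function name `majority_answer` and the record layout are illustrative, not part of Potato's API:

```python
from collections import Counter

def majority_answer(records):
    """Majority-vote answer_choice across annotators for each item.

    `records` is assumed to be a flat list of dicts with "id" and
    "answer_choice" keys -- a simplified stand-in for whatever your
    Potato version actually writes to annotation_output/.
    """
    by_item = {}
    for rec in records:
        by_item.setdefault(rec["id"], []).append(rec["answer_choice"])
    result = {}
    for item_id, choices in by_item.items():
        top, count = Counter(choices).most_common(1)[0]
        # With annotation_per_instance: 3, require at least 2 of 3 to agree;
        # items with no majority are flagged as None for adjudication.
        result[item_id] = top if count >= 2 else None
    return result

records = [
    {"id": "nextqa_001", "answer_choice": "A"},
    {"id": "nextqa_001", "answer_choice": "A"},
    {"id": "nextqa_001", "answer_choice": "D"},
    {"id": "nextqa_002", "answer_choice": "C"},
    {"id": "nextqa_002", "answer_choice": "D"},
    {"id": "nextqa_002", "answer_choice": "E"},
]
print(majority_answer(records))  # {'nextqa_001': 'A', 'nextqa_002': None}
```

Items that come back `None` (no two annotators agreed) are good candidates for a second annotation round or expert adjudication.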
Sample Data: sample-data.json
[
  {
    "id": "nextqa_001",
    "text": "Why did the woman pick up the toy from the floor?",
    "video_url": "https://example.com/nextqa/video_001.mp4",
    "option_a": "The baby threw it off the table",
    "option_b": "She wanted to clean the room",
    "option_c": "The dog brought it inside",
    "option_d": "She was going to give it to the child",
    "option_e": "It fell from the shelf by itself"
  },
  {
    "id": "nextqa_002",
    "text": "What happened right after the man finished stirring the pot?",
    "video_url": "https://example.com/nextqa/video_002.mp4",
    "option_a": "He turned off the stove",
    "option_b": "He added more seasoning",
    "option_c": "He tasted the food with a spoon",
    "option_d": "He served the food onto plates",
    "option_e": "He left the kitchen"
  }
]
// ... and 8 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/video-qa/nextqa-temporal
potato start config.yaml
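The `html_layout` in config.yaml pulls its values from the matching keys in sample-data.json via `{{...}}` placeholders. Before starting the server, you can sanity-check that wiring with a minimal stand-in renderer; this is not Potato's actual templating engine, just a sketch of the same substitution (the helper `fill_template` is hypothetical):

```python
import re

def fill_template(template: str, item: dict) -> str:
    """Substitute {{key}} placeholders with values from an item dict.

    Illustrative helper only -- Potato does its own rendering. Unknown
    keys are left untouched so missing data fields are easy to spot.
    """
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: str(item.get(m.group(1), m.group(0))),
        template,
    )

item = {
    "text": "Why did the woman pick up the toy from the floor?",
    "option_a": "The baby threw it off the table",
}
preview = fill_template("<p>{{text}}</p><b>A:</b> {{option_a}} {{option_b}}", item)
print(preview)
```

Running this over every item in sample-data.json and grepping the output for leftover `{{` is a quick way to catch items missing an `option_*` or `video_url` field.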
Related Designs
Ego4D: Egocentric Video Episodic Memory Annotation
Annotate egocentric (first-person) video for episodic memory tasks including activity segmentation, hand state tracking, natural language query generation, and scene narration. Supports temporal segment annotation with multiple label tiers for the Ego4D benchmark.
MVBench Video Understanding
Comprehensive video understanding benchmark with multiple-choice questions, video segment annotation, and reasoning, based on MVBench (Li et al., arXiv 2023). Tests temporal perception, action recognition, and state change detection in videos.
VSTAR Video-grounded Dialogue
Video-grounded dialogue annotation. Annotators watch videos and answer questions requiring situated understanding, write dialogue turns grounded in specific video moments, and mark relevant temporal segments.