NExT-QA - Temporal and Causal Video Question Answering
Temporal and causal video question answering that requires reasoning about actions, events, and their relationships over time. In this design, based on the NExT-QA dataset (Xiao et al., CVPR 2021), annotators answer multiple-choice questions about video content, with an emphasis on temporal and causal understanding.
Configuration File: config.yaml
# NExT-QA - Temporal and Causal Video Question Answering
# Based on Xiao et al., CVPR 2021
# Paper: https://arxiv.org/abs/2105.08276
# Dataset: https://github.com/doc-doc/NExT-QA
#
# Video QA task focusing on temporal and causal reasoning about events in videos.
# Annotators watch a video, answer a multiple-choice question (5 options),
# provide reasoning for their answer, and mark relevant video segments.
#
# Question Types:
# - Temporal: "What happened before/after X?" or "When did X happen?"
# - Causal: "Why did X happen?" or "What caused X?"
# - Descriptive: "What is X doing?" or "How many X are there?"
#
# Annotation Guidelines:
# 1. Watch the entire video before answering
# 2. Read all five answer options carefully
# 3. Select the best answer (A through E)
# 4. Write a brief explanation of your reasoning
# 5. Mark the video segment(s) most relevant to answering the question
annotation_task_name: "NExT-QA - Temporal Video QA"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  # Step 1: Select the answer
  - annotation_type: radio
    name: answer_choice
    description: "Select the best answer to the question based on the video."
    labels:
      - "A"
      - "B"
      - "C"
      - "D"
      - "E"
    keyboard_shortcuts:
      "A": "1"
      "B": "2"
      "C": "3"
      "D": "4"
      "E": "5"
    tooltips:
      "A": "Select option A"
      "B": "Select option B"
      "C": "Select option C"
      "D": "Select option D"
      "E": "Select option E"
  # Step 2: Provide reasoning
  - annotation_type: text
    name: reasoning
    description: "Briefly explain why you chose this answer. Reference specific events or moments in the video."
    textarea: true
    required: false
    placeholder: "Explain your reasoning..."
  # Step 3: Mark relevant video segments
  - annotation_type: video_annotation
    name: relevant_segments
    description: "Mark the video segments most relevant to answering the question."
    mode: segment
    labels:
      - name: "Relevant Segment"
        color: "#4CAF50"
        key_value: "r"
      - name: "Key Action"
        color: "#FF9800"
        key_value: "k"
      - name: "Background"
        color: "#9E9E9E"
        key_value: "b"
annotation_instructions: |
  You will watch a video and answer a multiple-choice question about it.
  For each item:
  1. Watch the entire video at least once before answering.
  2. Read the question and all five answer options (A-E).
  3. Select the single best answer.
  4. Write a brief explanation referencing specific video events.
  5. On the video timeline, mark segments relevant to the question.
  Question types you may encounter:
  - Temporal: about the order or timing of events
  - Causal: about why something happened or what caused it
  - Descriptive: about what is happening in the video
html_layout: |
  <div style="padding: 15px; max-width: 900px; margin: auto;">
    <div style="background: #212121; padding: 12px; border-radius: 8px; margin-bottom: 16px; text-align: center;">
      <video controls width="100%" style="max-height: 480px; border-radius: 4px;">
        <source src="{{video_url}}" type="video/mp4">
        Your browser does not support the video tag.
      </video>
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Question:</strong>
      <p style="font-size: 18px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 10px; margin-bottom: 16px;">
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">A:</strong> {{option_a}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">B:</strong> {{option_b}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">C:</strong> {{option_c}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">D:</strong> {{option_d}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px; grid-column: 1 / -1;">
        <strong style="color: #475569;">E:</strong> {{option_e}}
      </div>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
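Because `annotation_per_instance: 3` collects three answers per question, a common post-processing step is a majority vote over the `answer_choice` labels. The exact schema Potato writes to `annotation_output/` varies by version, so the sketch below assumes a simplified record shape (`{"id": ..., "answer_choice": ...}`) as a stand-in; the function name `majority_answer` and the record layout are illustrative, not part of Potato's API:

```python
from collections import Counter

def majority_answer(records):
    """Majority-vote answer_choice across annotators for each item.

    `records` is assumed to be a flat list of dicts with "id" and
    "answer_choice" keys -- a simplified stand-in for whatever your
    Potato version actually writes to annotation_output/.
    """
    by_item = {}
    for rec in records:
        by_item.setdefault(rec["id"], []).append(rec["answer_choice"])
    result = {}
    for item_id, choices in by_item.items():
        top, count = Counter(choices).most_common(1)[0]
        # With annotation_per_instance: 3, require at least 2 of 3 to agree;
        # items with no majority are flagged as None for adjudication.
        result[item_id] = top if count >= 2 else None
    return result

records = [
    {"id": "nextqa_001", "answer_choice": "A"},
    {"id": "nextqa_001", "answer_choice": "A"},
    {"id": "nextqa_001", "answer_choice": "D"},
    {"id": "nextqa_002", "answer_choice": "C"},
    {"id": "nextqa_002", "answer_choice": "D"},
    {"id": "nextqa_002", "answer_choice": "E"},
]
print(majority_answer(records))  # {'nextqa_001': 'A', 'nextqa_002': None}
```

Items that come back `None` (no two annotators agreed) are good candidates for a second annotation round or expert adjudication.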
Sample Data: sample-data.json
[
  {
    "id": "nextqa_001",
    "text": "Why did the woman pick up the toy from the floor?",
    "video_url": "https://example.com/nextqa/video_001.mp4",
    "option_a": "The baby threw it off the table",
    "option_b": "She wanted to clean the room",
    "option_c": "The dog brought it inside",
    "option_d": "She was going to give it to the child",
    "option_e": "It fell from the shelf by itself"
  },
  {
    "id": "nextqa_002",
    "text": "What happened right after the man finished stirring the pot?",
    "video_url": "https://example.com/nextqa/video_002.mp4",
    "option_a": "He turned off the stove",
    "option_b": "He added more seasoning",
    "option_c": "He tasted the food with a spoon",
    "option_d": "He served the food onto plates",
    "option_e": "He left the kitchen"
  }
]
// ... and 8 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/video-qa/nextqa-temporal
potato start config.yaml
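The `html_layout` in config.yaml pulls its values from the matching keys in sample-data.json via `{{...}}` placeholders. Before starting the server, you can sanity-check that wiring with a minimal stand-in renderer; this is not Potato's actual templating engine, just a sketch of the same substitution (the helper `fill_template` is hypothetical):

```python
import re

def fill_template(template: str, item: dict) -> str:
    """Substitute {{key}} placeholders with values from an item dict.

    Illustrative helper only -- Potato does its own rendering. Unknown
    keys are left untouched so missing data fields are easy to spot.
    """
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: str(item.get(m.group(1), m.group(0))),
        template,
    )

item = {
    "text": "Why did the woman pick up the toy from the floor?",
    "option_a": "The baby threw it off the table",
}
preview = fill_template("<p>{{text}}</p><b>A:</b> {{option_a}} {{option_b}}", item)
print(preview)
```

Running this over every item in sample-data.json and grepping the output for leftover `{{` is a quick way to catch items missing an `option_*` or `video_url` field.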
Related Designs
Ego4D: Egocentric Video Episodic Memory Annotation
Annotate egocentric (first-person) video for episodic memory tasks including activity segmentation, hand state tracking, natural language query generation, and scene narration. Supports temporal segment annotation with multiple label tiers for the Ego4D benchmark.
MVBench Video Understanding
Comprehensive video understanding benchmark with multiple-choice questions, video segment annotation, and reasoning, based on MVBench (Li et al., arXiv 2023). Tests temporal perception, action recognition, and state change detection in videos.
VSTAR Video-grounded Dialogue
Video-grounded dialogue annotation. Annotators watch videos and answer questions requiring situated understanding, write dialogue turns grounded in specific video moments, and mark relevant temporal segments.