DiDeMo Moment Retrieval
Localizing natural language descriptions to specific video moments. Given a text query, annotators identify the corresponding temporal segment in the video.
Configuration File: config.yaml
# DiDeMo Moment Retrieval Configuration
# Based on Hendricks et al., ICCV 2017
# Task: Localize text descriptions to video moments
annotation_task_name: "DiDeMo Moment Retrieval"
task_dir: "."
data_files:
- data.json
item_properties:
id_key: "id"
text_key: "video_url"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
<div class="moment-retrieval">
<div class="query-section" style="background: #e3f2fd; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
<h3 style="margin-top: 0;">🔍 Find this moment:</h3>
<div class="query-text" style="font-size: 18px; font-weight: bold;">{{query}}</div>
</div>
</div>
annotation_schemes:
- name: "moment_segment"
description: |
Mark the temporal segment that corresponds to the text description.
Select the 5-second segment(s) that best match the query.
annotation_type: "video_annotation"
mode: "segment"
labels:
- name: "matches_query"
color: "#22C55E"
key_value: "m"
zoom_enabled: true
playback_rate_control: true
frame_stepping: true
show_timecode: true
- name: "confidence"
description: "How confident are you that this is the correct moment?"
annotation_type: radio
labels:
- "Very confident - exact match"
- "Confident - good match"
- "Somewhat confident - partial match"
- "Not confident - best guess"
- name: "ambiguity"
description: "Is the query ambiguous or could match multiple moments?"
annotation_type: radio
labels:
- "Clear - only one possible match"
- "Slightly ambiguous - 1-2 alternatives"
- "Very ambiguous - multiple valid matches"
allow_all_users: true
instances_per_annotator: 60
annotation_per_instance: 3
annotation_instructions: |
## Video Moment Retrieval Task
Given a text description, find the matching moment in the video.
### How to Annotate:
1. Read the query carefully
2. Watch the video to understand the content
3. Mark the segment that BEST matches the description
4. Rate your confidence
### Guidelines:
- Videos are ~30 seconds, divided into 5-second segments
- Select the segment(s) that match the query
- If multiple segments match, select all of them
- If no segment matches well, select the closest one
### Example Queries:
- "The dog runs across the yard"
- "Someone opens a door"
- "A person laughs at something"
### Tips:
- Focus on the action/event described
- Consider synonyms (e.g., "runs" = "sprints")
- The query may not describe every detail visible
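Since the guidelines describe each ~30-second video as a sequence of 5-second segments, a small helper can map a selected segment index back to its time span. This is an illustrative sketch, not part of the Potato config or codebase; the function name and the 0-based indexing convention are assumptions:

```python
def segment_bounds(index, segment_len=5.0, duration=30.0):
    """Return (start, end) in seconds for a 0-based 5-second segment index.

    Assumes the DiDeMo convention of fixed-length segments; the last
    segment is clipped to the video duration.
    """
    start = index * segment_len
    if start >= duration:
        raise ValueError("segment index past end of video")
    end = min(start + segment_len, duration)
    return (start, end)

# A 30-second video has six 5-second segments, indices 0-5.
print(segment_bounds(3))  # (15.0, 20.0)
```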
Sample Data: sample-data.json
[
{
"id": "didemo_001",
"video_url": "https://example.com/videos/flickr_video_001.mp4",
"query": "A dog jumps into the water",
"duration_seconds": 30
},
{
"id": "didemo_002",
"video_url": "https://example.com/videos/flickr_video_002.mp4",
"query": "Someone picks up a child",
"duration_seconds": 30
}
]
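Before launching the task, it can help to check that every item in the data file carries the fields the config and layout reference (`id`, `video_url`, `query`, `duration_seconds`). A minimal validation sketch, assuming the data file follows the sample format above:

```python
import json

# Keys referenced by the config (id_key, text_key) and the HTML layout ({{query}}).
REQUIRED_KEYS = {"id", "video_url", "query", "duration_seconds"}

def validate_items(path="data.json"):
    """Raise ValueError if any item is missing a required key; return item count."""
    with open(path) as f:
        items = json.load(f)
    for item in items:
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"item {item.get('id', '?')} missing keys: {missing}")
    return len(items)
```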
// ... and 1 more item

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/temporal-grounding/didemo-moment-retrieval
potato start config.yaml
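With `annotation_per_instance: 3`, each video is labeled by three annotators, and agreement between their segment selections is commonly measured with temporal IoU in moment-retrieval work. The helper below is an illustration for post-processing the annotation output, not a feature of Potato itself:

```python
def temporal_iou(a, b):
    """Intersection-over-union between two (start, end) spans in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# Two annotators overlap on 5 of 15 covered seconds.
print(temporal_iou((5, 15), (10, 20)))  # ~0.33
```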
Related Designs
NExT-QA - Temporal and Causal Video Question Answering
Temporal and causal video question answering requiring reasoning about actions, events, and their relationships over time. Based on the NExT-QA dataset (Xiao et al., CVPR 2021), annotators answer multiple-choice questions about video content with an emphasis on temporal and causal understanding.
Scene Boundary Detection
Identify scene boundaries in documentary and narrative videos. Annotators mark transitions between semantically coherent scenes based on visual, audio, and narrative cues.
VSTAR Video-grounded Dialogue
Video-grounded dialogue annotation. Annotators watch videos and answer questions requiring situated understanding, write dialogue turns grounded in specific video moments, and mark relevant temporal segments.