DiDeMo Moment Retrieval
Localizing natural language descriptions to specific video moments. Given a text query, annotators identify the corresponding temporal segment in the video.
Configuration File: config.yaml
# DiDeMo Moment Retrieval Configuration
# Based on Hendricks et al., ICCV 2017
# Task: Localize text descriptions to video moments
annotation_task_name: "DiDeMo Moment Retrieval"
task_dir: "."
data_files:
- data.json
item_properties:
id_key: "id"
text_key: "video_url"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
<div class="moment-retrieval">
<div class="query-section" style="background: #e3f2fd; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
<h3 style="margin-top: 0;">🔍 Find this moment:</h3>
<div class="query-text" style="font-size: 18px; font-weight: bold;">{{query}}</div>
</div>
</div>
annotation_schemes:
- name: "moment_segment"
description: |
Mark the temporal segment that corresponds to the text description.
Select the 5-second segment(s) that best match the query.
annotation_type: "video_annotation"
mode: "segment"
labels:
- name: "matches_query"
color: "#22C55E"
key_value: "m"
zoom_enabled: true
playback_rate_control: true
frame_stepping: true
show_timecode: true
- name: "confidence"
description: "How confident are you that this is the correct moment?"
annotation_type: radio
labels:
- "Very confident - exact match"
- "Confident - good match"
- "Somewhat confident - partial match"
- "Not confident - best guess"
- name: "ambiguity"
description: "Is the query ambiguous or could match multiple moments?"
annotation_type: radio
labels:
- "Clear - only one possible match"
- "Slightly ambiguous - 1-2 alternatives"
- "Very ambiguous - multiple valid matches"
allow_all_users: true
instances_per_annotator: 60
annotation_per_instance: 3
annotation_instructions: |
## Video Moment Retrieval Task
Given a text description, find the matching moment in the video.
### How to Annotate:
1. Read the query carefully
2. Watch the video to understand the content
3. Mark the segment that BEST matches the description
4. Rate your confidence
### Guidelines:
- Videos are ~30 seconds, divided into 5-second segments
- Select the segment(s) that match the query
- If multiple segments match, select all of them
- If no segment matches well, select the closest one
### Example Queries:
- "The dog runs across the yard"
- "Someone opens a door"
- "A person laughs at something"
### Tips:
- Focus on the action/event described
- Consider synonyms (e.g., "runs" = "sprints")
- The query may not describe every detail visible
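Since the guidelines describe each ~30-second video as a sequence of 5-second segments, a small helper can map a selected segment index back to its time span. This is an illustrative sketch, not part of the Potato config or codebase; the function name and the 0-based indexing convention are assumptions:

```python
def segment_bounds(index, segment_len=5.0, duration=30.0):
    """Return (start, end) in seconds for a 0-based 5-second segment index.

    Assumes the DiDeMo convention of fixed-length segments; the last
    segment is clipped to the video duration.
    """
    start = index * segment_len
    if start >= duration:
        raise ValueError("segment index past end of video")
    end = min(start + segment_len, duration)
    return (start, end)

# A 30-second video has six 5-second segments, indices 0-5.
print(segment_bounds(3))  # (15.0, 20.0)
```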
Sample Data: sample-data.json
[
{
"id": "didemo_001",
"video_url": "https://example.com/videos/flickr_video_001.mp4",
"query": "A dog jumps into the water",
"duration_seconds": 30
},
{
"id": "didemo_002",
"video_url": "https://example.com/videos/flickr_video_002.mp4",
"query": "Someone picks up a child",
"duration_seconds": 30
}
]
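Before launching the task, it can help to check that every item in the data file carries the fields the config and layout reference (`id`, `video_url`, `query`, `duration_seconds`). A minimal validation sketch, assuming the data file follows the sample format above:

```python
import json

# Keys referenced by the config (id_key, text_key) and the HTML layout ({{query}}).
REQUIRED_KEYS = {"id", "video_url", "query", "duration_seconds"}

def validate_items(path="data.json"):
    """Raise ValueError if any item is missing a required key; return item count."""
    with open(path) as f:
        items = json.load(f)
    for item in items:
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"item {item.get('id', '?')} missing keys: {missing}")
    return len(items)
```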
// ... and 1 more item

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/temporal-grounding/didemo-moment-retrieval
potato start config.yaml
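With `annotation_per_instance: 3`, each video is labeled by three annotators, and agreement between their segment selections is commonly measured with temporal IoU in moment-retrieval work. The helper below is an illustration for post-processing the annotation output, not a feature of Potato itself:

```python
def temporal_iou(a, b):
    """Intersection-over-union between two (start, end) spans in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# Two annotators overlap on 5 of 15 covered seconds.
print(temporal_iou((5, 15), (10, 20)))  # ~0.33
```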
Related Designs
NExT-QA - Temporal and Causal Video Question Answering
Temporal and causal video question answering requiring reasoning about actions, events, and their relationships over time. Based on the NExT-QA dataset (Xiao et al., CVPR 2021), annotators answer multiple-choice questions about video content with an emphasis on temporal and causal understanding.
Scene Boundary Detection
Identify scene boundaries in documentary and narrative videos. Annotators mark transitions between semantically coherent scenes based on visual, audio, and narrative cues.
VSTAR Video-grounded Dialogue
Video-grounded dialogue annotation. Annotators watch videos and answer questions requiring situated understanding, write dialogue turns grounded in specific video moments, and mark relevant temporal segments.