Charades-STA Temporal Grounding

Ground natural language descriptions to video segments. Given a sentence describing an action, identify the exact temporal boundaries where that action occurs.

Configuration File: config.yaml

This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.

yaml

# Charades-STA Temporal Grounding Configuration
# Based on Gao et al., ICCV 2017
# Task: Ground language descriptions to video segments

annotation_task_name: "Charades-STA Temporal Grounding"
task_dir: "."

data_files:
  - data.json
item_properties:
  id_key: "id"
  text_key: "video_url"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

annotation_schemes:
  - name: "grounded_segment"
    description: |
      Mark the EXACT temporal segment where the described action occurs.
      Read the query carefully and find where it happens in the video.
    annotation_type: "video_annotation"
    mode: "segment"
    labels:
      - name: "grounded_moment"
        color: "#22C55E"
        key_value: "g"
    frame_stepping: true
    show_timecode: true
    playback_rate_control: true
    video_fps: 24

  - name: "grounding_confidence"
    description: "How confident are you in the temporal boundaries?"
    annotation_type: radio
    labels:
      - "Very confident - clear start and end"
      - "Confident - boundaries are reasonably clear"
      - "Somewhat confident - boundaries are approximate"
      - "Not confident - hard to determine exact timing"

  - name: "query_ambiguity"
    description: "Is the language query ambiguous?"
    annotation_type: radio
    labels:
      - "Clear - unambiguous description"
      - "Slightly ambiguous - minor interpretation needed"
      - "Ambiguous - multiple interpretations possible"
      - "Very ambiguous - unclear what to look for"

  - name: "action_visible"
    description: "Is the described action visible in the video?"
    annotation_type: radio
    labels:
      - "Fully visible - entire action shown"
      - "Partially visible - action partly shown"
      - "Barely visible - hard to see"
      - "Not visible - action not in video"

allow_all_users: true
instances_per_annotator: 60
annotation_per_instance: 2

annotation_instructions: |
  ## Charades-STA Temporal Grounding

  Ground natural language descriptions to video segments.

  ### Task:
  Given a sentence describing an action, mark the EXACT segment where that action occurs.

  ### Example:
  - Query: "Person opens a door"
  - Your task: Find and mark the segment where a person opens a door

  ### Guidelines:
  - Read the query BEFORE watching the video
  - Mark from action START to action END
  - Include preparation if it's part of the action
  - The segment should be as tight as possible

  ### Boundary Rules:
  - START: When the person begins the action (not before)
  - END: When the action is complete (not after)
  - If the action repeats, mark only the FIRST occurrence

  ### Common issues:
  - Actions may be brief (1-2 seconds)
  - Multiple similar actions may occur
  - Some queries may not have a match (select "Not visible - action not in video" for the visibility question)

  ### Tips:
  - Use frame stepping for precise boundaries
  - The action should match the ENTIRE query, not just part
  - Pay attention to object mentions ("opens a door" vs "opens a window")
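
Because the config collects two annotations per instance (annotation_per_instance: 2) and sets video_fps: 24, one way to compare the two annotators' segments is temporal intersection-over-union (IoU), optionally after snapping boundaries to the frame grid. The sketch below is an illustration only and is not part of Potato: the names snap_to_frame and temporal_iou are hypothetical, and segments are assumed to be plain (start, end) pairs in seconds rather than Potato's actual output format.

```python
from typing import Tuple

# Assumed representation: (start_seconds, end_seconds)
Segment = Tuple[float, float]

VIDEO_FPS = 24  # mirrors video_fps in config.yaml


def snap_to_frame(t: float, fps: int = VIDEO_FPS) -> float:
    """Round a timestamp in seconds to the nearest frame boundary."""
    return round(t * fps) / fps


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two intervals; 0.0 when they do not overlap."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    annotator_1 = (2.0, 6.0)  # hypothetical segment from the first annotator
    annotator_2 = (3.0, 7.0)  # hypothetical segment from the second annotator
    # Overlap is 3 s, union is 5 s
    print(temporal_iou(annotator_1, annotator_2))  # 0.6
```

Which IoU value counts as agreement is a project decision; 0.5 is a commonly used cutoff in temporal grounding evaluation.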

Sample Data: sample-data.json

json

[
  {
    "id": "charades_sta_001",
    "video_url": "https://example.com/videos/charades_home_001.mp4",
    "query": "Person opens a door",
    "duration": 30
  },
  {
    "id": "charades_sta_002",
    "video_url": "https://example.com/videos/charades_home_002.mp4",
    "query": "Person sits down on a couch",
    "duration": 25
  }
]
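
The config reads each item's id and video_url through id_key and text_key, so a quick sanity check on the data file before starting the server can catch missing fields. The sketch below assumes the file is a JSON array shaped like the sample above; validate_items is a hypothetical helper, not a Potato API, and treating query as required is an assumption based on the task description.

```python
import json
from typing import Any, Dict, List

# "id" and "video_url" come from item_properties; "query" is assumed required
REQUIRED_KEYS = ("id", "video_url", "query")


def validate_items(items: List[Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; empty if none were found."""
    problems: List[str] = []
    seen_ids = set()
    for i, item in enumerate(items):
        for key in REQUIRED_KEYS:
            if not item.get(key):
                problems.append(f"item {i}: missing or empty '{key}'")
        item_id = item.get("id")
        if item_id is not None:
            if item_id in seen_ids:
                problems.append(f"item {i}: duplicate id '{item_id}'")
            seen_ids.add(item_id)
    return problems


if __name__ == "__main__":
    sample = json.loads("""
    [
      {"id": "charades_sta_001",
       "video_url": "https://example.com/videos/charades_home_001.mp4",
       "query": "Person opens a door", "duration": 30},
      {"id": "charades_sta_002",
       "video_url": "https://example.com/videos/charades_home_002.mp4",
       "query": "Person sits down on a couch", "duration": 25}
    ]
    """)
    print(validate_items(sample))  # []
```

To check a file on disk instead, load it with json.load and pass the resulting list to validate_items.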

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/temporal-grounding/charades-sta-grounding
potato start config.yaml

Dataset & paper

Gao et al., ICCV 2017

Citation (BibTeX)

bibtex

@inproceedings{gao2017tall,
  title={TALL: Temporal activity localization via language query},
  author={Gao, Jiyang and Sun, Chen and Yang, Zhenheng and Nevatia, Ram},
  booktitle={IEEE International Conference on Computer Vision},
  pages={5267--5275},
  year={2017}
}

Details

Annotation Types

radio, video_annotation

Domain

Computer Vision, Video-Language, Temporal Grounding

Use Cases

Temporal Grounding, Video-Text Alignment, Moment Retrieval

Related Designs

HowTo100M Instructional Video Annotation

Annotate instructional video clips with step descriptions and visual grounding. Link narrated instructions to visual actions for video-language understanding.

radio, text

YouCook2 Dataset: Cooking Video Recipe-Step Annotation

YouCook2 contains 2,000 cooking videos across 89 recipes (176 hours), each segmented into recipe steps with temporal boundaries and imperative captions (AAAI 2018). Dataset and paper links plus a Potato config for procedural video annotation.