Intermediate · Video
Charades-STA Temporal Grounding
Ground natural language descriptions to video segments. Given a sentence describing an action, identify the exact temporal boundaries where that action occurs.
Configuration file: config.yaml
# Charades-STA Temporal Grounding Configuration
# Based on Gao et al., ICCV 2017
# Task: Ground language descriptions to video segments
annotation_task_name: "Charades-STA Temporal Grounding"
task_dir: "."
data_files:
  - data.json

item_properties:
  id_key: "id"
  text_key: "video_url"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

annotation_schemes:
  - name: "grounded_segment"
    description: |
      Mark the EXACT temporal segment where the described action occurs.
      Read the query carefully and find where it happens in the video.
    annotation_type: "video_annotation"
    mode: "segment"
    labels:
      - name: "grounded_moment"
        color: "#22C55E"
        key_value: "g"
    frame_stepping: true
    show_timecode: true
    playback_rate_control: true
    video_fps: 24

  - name: "grounding_confidence"
    description: "How confident are you in the temporal boundaries?"
    annotation_type: radio
    labels:
      - "Very confident - clear start and end"
      - "Confident - boundaries are reasonably clear"
      - "Somewhat confident - boundaries are approximate"
      - "Not confident - hard to determine exact timing"

  - name: "query_ambiguity"
    description: "Is the language query ambiguous?"
    annotation_type: radio
    labels:
      - "Clear - unambiguous description"
      - "Slightly ambiguous - minor interpretation needed"
      - "Ambiguous - multiple interpretations possible"
      - "Very ambiguous - unclear what to look for"

  - name: "action_visible"
    description: "Is the described action visible in the video?"
    annotation_type: radio
    labels:
      - "Fully visible - entire action shown"
      - "Partially visible - action partly shown"
      - "Barely visible - hard to see"
      - "Not visible - action not in video"

allow_all_users: true
instances_per_annotator: 60
annotation_per_instance: 2

annotation_instructions: |
  ## Charades-STA Temporal Grounding
  Ground natural language descriptions to video segments.
  ### Task:
  Given a sentence describing an action, mark the EXACT segment where that action occurs.
  ### Example:
  - Query: "Person opens a door"
  - Your task: Find and mark the segment where a person opens a door
  ### Guidelines:
  - Read the query BEFORE watching the video
  - Mark from action START to action END
  - Include preparation if it's part of the action
  - The segment should be as tight as possible
  ### Boundary Rules:
  - START: When the person begins the action (not before)
  - END: When the action is complete (not after)
  - If the action repeats, mark only the FIRST occurrence
  ### Common issues:
  - Actions may be brief (1-2 seconds)
  - Multiple similar actions may occur
  - Some queries may not have a match (mark "not visible")
  ### Tips:
  - Use frame stepping for precise boundaries
  - The action should match the ENTIRE query, not just part
  - Pay attention to object mentions ("opens a door" vs "opens a window")
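Potato looks up each item by the keys declared under `item_properties`, so every record in the data file must carry `id` and `video_url`. A minimal pre-flight check is sketched below in plain Python (`validate_items` is a hypothetical helper, not part of Potato); note that `config.yaml` lists `data.json` under `data_files`, so the sample file would need to be renamed or added there before launch.

```python
import json

# Keys declared under item_properties in config.yaml
ID_KEY = "id"
TEXT_KEY = "video_url"

def validate_items(items):
    """Return (index, missing_key) pairs for items lacking a required key."""
    problems = []
    for i, item in enumerate(items):
        for key in (ID_KEY, TEXT_KEY):
            if key not in item:
                problems.append((i, key))
    return problems

# Demo on an inline copy of one sample record
sample = json.loads("""[
  {"id": "charades_sta_001",
   "video_url": "https://example.com/videos/charades_home_001.mp4",
   "query": "Person opens a door",
   "duration": 30}
]""")
print(validate_items(sample))  # -> []
```

An empty result means every record has the keys the config expects; any `(index, key)` pair points at a record that would fail to render in the annotation UI.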
Sample data: sample-data.json
[
  {
    "id": "charades_sta_001",
    "video_url": "https://example.com/videos/charades_home_001.mp4",
    "query": "Person opens a door",
    "duration": 30
  },
  {
    "id": "charades_sta_002",
    "video_url": "https://example.com/videos/charades_home_002.mp4",
    "query": "Person sits down on a couch",
    "duration": 25
  }
]
Get this design
View on GitHub
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/temporal-grounding/charades-sta-grounding
potato start config.yaml
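With `annotation_per_instance: 2`, each query is grounded by two annotators, and temporal IoU (the standard Charades-STA evaluation metric) is a natural way to check how well their segments agree. A minimal sketch, with segment times in seconds and hypothetical example values:

```python
def temporal_iou(a, b):
    """IoU of two (start, end) segments in seconds; 0.0 if disjoint."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# Two annotators marked "Person opens a door" on a 30 s clip:
ann1 = (4.0, 9.0)   # hypothetical annotator-1 segment
ann2 = (5.0, 10.0)  # hypothetical annotator-2 segment
print(temporal_iou(ann1, ann2))  # 4 / 6 ≈ 0.667
```

Pairs with low IoU (e.g. below 0.5, a common Charades-STA threshold) are good candidates for adjudication before exporting final segment boundaries.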
Details
Annotation types
radio, video_annotation
Domains
Computer Vision, Video-Language, Temporal Grounding
Use cases
Temporal Grounding, Video-Text Alignment, Moment Retrieval
Tags
video, grounding, temporal, language, localization, charades
Found an issue or want to improve this design?
Create an issue
Related designs
HowTo100M Instructional Video Annotation
Annotate instructional video clips with step descriptions and visual grounding. Link narrated instructions to visual actions for video-language understanding.
radio, text
YouCook2 Recipe Step Annotation
Annotate cooking videos with recipe step boundaries and descriptions. Segment instructional cooking content into distinct procedural steps.
radio, text
DiDeMo Moment Retrieval
Localize natural language descriptions to specific video moments. Given a text query, annotators identify the corresponding temporal segment in the video.
radio, video_annotation