HowTo100M Instructional Video Annotation
Annotate instructional video clips with step descriptions and visual grounding. Link narrated instructions to visual actions for video-language understanding.
Configuration file: config.yaml
# HowTo100M Instructional Video Annotation Configuration
# Based on Miech et al., ICCV 2019
# Task: Annotate instructional steps and visual grounding
annotation_task_name: "HowTo100M Instructional Video Annotation"
task_dir: "."
data_files:
  - data.json
item_properties:
  id_key: "id"
  text_key: "video_url"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
annotation_schemes:
  - name: "step_segments"
    description: |
      Mark the temporal boundaries of each INSTRUCTIONAL STEP.
      A step is one distinct action or instruction being demonstrated.
    annotation_type: "video_annotation"
    mode: "segment"
    labels:
      - name: "step"
        color: "#22C55E"
        key_value: "s"
      - name: "intro_outro"
        color: "#94A3B8"
        key_value: "i"
    frame_stepping: true
    show_timecode: true
    playback_rate_control: true
    video_fps: 30
  - name: "step_description"
    description: "Describe what is being done in this step (imperative form):"
    annotation_type: text
  - name: "visual_alignment"
    description: "How well does the visual content match what's being said?"
    annotation_type: radio
    labels:
      - "Perfect - visual shows exactly what's narrated"
      - "Good - visual mostly matches narration"
      - "Partial - some mismatch between visual and audio"
      - "Poor - visual doesn't match narration"
      - "No narration in this segment"
  - name: "task_category"
    description: "What category of task is this video?"
    annotation_type: radio
    labels:
      - "Cooking/Food"
      - "Home Repair/DIY"
      - "Crafts/Art"
      - "Beauty/Personal Care"
      - "Fitness/Exercise"
      - "Technology/Software"
      - "Gardening/Outdoor"
      - "Other"
  - name: "step_clarity"
    description: "How clear is this instructional step?"
    annotation_type: radio
    labels:
      - "Very clear - easy to follow"
      - "Clear - understandable"
      - "Somewhat clear - some confusion"
      - "Unclear - hard to follow"
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
annotation_instructions: |
  ## HowTo100M Instructional Video Annotation
  Annotate instructional/tutorial video clips with step boundaries and descriptions.
  ### Task:
  1. Mark the temporal boundaries of each instructional step
  2. Write a brief description of what's being demonstrated
  3. Rate how well the visual matches the narration
  ### What is a step?
  - One distinct action or instruction
  - "Add the flour", "Stir until combined", "Press the button"
  - NOT background talk or transitions
  ### Step descriptions:
  - Use imperative form: "Mix the ingredients" not "The person mixes"
  - Be concise: 3-10 words typically
  - Focus on the ACTION being demonstrated
  ### Visual-Narration Alignment:
  - Perfect: Narrator says "crack the egg" and we see egg cracking
  - Partial: Narrator says "add salt" but we see general cooking
  - Poor: Narrator talks about something not shown
  ### Guidelines:
  - Some clips may have no clear instructional content
  - Mark intro/outro segments separately
  - Narration may be noisy (auto-generated ASR)
  ### Tips:
  - Watch with audio to understand the instruction
  - Steps may overlap with narration timing
  - Focus on what would help someone learn the task
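Because the configuration requests two annotations per instance, with step segments marked at 30 fps, one possible sanity check is to compare the step boundaries two annotators drew for the same clip. The following is a minimal Python sketch and not part of the showcase: the (start, end) tuple representation, the function names to_frame and temporal_iou, and the example timestamps are all assumptions for illustration, and the format Potato actually writes to annotation_output/ may differ.

```python
from typing import Tuple

# Hypothetical representation of one step segment: (start_seconds, end_seconds).
Segment = Tuple[float, float]


def to_frame(seconds: float, fps: int = 30) -> int:
    """Map a timestamp to the nearest frame index (fps mirrors video_fps: 30)."""
    return round(seconds * fps)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals; 0.0 if they do not overlap."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Made-up boundaries from two annotators for the same "step" segment.
annotator_1 = (4.0, 12.0)
annotator_2 = (6.0, 14.0)

print(to_frame(annotator_1[0]), to_frame(annotator_1[1]))  # 120 360
print(temporal_iou(annotator_1, annotator_2))              # 0.6
```

A low overlap between two annotators on the same step may point to an ambiguous step boundary rather than an annotator error, so such a score is better treated as a prompt for review than as a pass/fail threshold.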
Sample data: sample-data.json
[
  {
    "id": "howto_001",
    "video_url": "https://example.com/videos/howto_cooking.mp4",
    "category": "cooking",
    "narration": "First, we're going to add the flour to the bowl",
    "duration": 60
  },
  {
    "id": "howto_002",
    "video_url": "https://example.com/videos/howto_repair.mp4",
    "category": "home_repair",
    "narration": "Now take your screwdriver and remove these screws",
    "duration": 45
  }
]
Get this design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/instructional/howto100m-instructional
potato start config.yaml
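Before starting the server, it may be worth checking that every record in the data file carries the fields named under item_properties (id and video_url). The following is a minimal sketch using only the Python standard library. The function name find_invalid_records is hypothetical, and the inline records mirror the sample data shown on this page; in practice the list would be read from the data file instead.

```python
import json

# Keys taken from item_properties in config.yaml (id_key and text_key).
REQUIRED_KEYS = ("id", "video_url")


def find_invalid_records(records):
    """Return (index, missing_keys) pairs for records lacking a required key."""
    problems = []
    for index, record in enumerate(records):
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((index, missing))
    return problems


# Inline copy of the sample records, trimmed to the fields being checked.
sample = json.loads("""
[
  {"id": "howto_001", "video_url": "https://example.com/videos/howto_cooking.mp4"},
  {"id": "howto_002", "video_url": "https://example.com/videos/howto_repair.mp4"}
]
""")

print(find_invalid_records(sample))  # []
```

An empty list means every record has both keys. The check does not verify that the video URLs are reachable.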
Related designs
YouCook2 Recipe Step Annotation
Annotate cooking videos with recipe step boundaries and descriptions. Segment instructional cooking content into distinct procedural steps.
VSTAR Video-grounded Dialogue
Video-grounded dialogue annotation. Annotators watch videos and answer questions requiring situated understanding, write dialogue turns grounded in specific video moments, and mark relevant temporal segments.
Charades-STA Temporal Grounding
Ground natural language descriptions to video segments. Given a sentence describing an action, identify the exact temporal boundaries where that action occurs.