HowTo100M Instructional Video Annotation

Annotate instructional video clips with step descriptions and visual grounding. Link narrated instructions to visual actions for video-language understanding.

Configuration File: config.yaml

yaml

# HowTo100M Instructional Video Annotation Configuration
# Based on Miech et al., ICCV 2019
# Task: Annotate instructional steps and visual grounding

annotation_task_name: "HowTo100M Instructional Video Annotation"
task_dir: "."

data_files:
  - data.json
item_properties:
  id_key: "id"
  text_key: "video_url"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

annotation_schemes:
  - name: "step_segments"
    description: |
      Mark the temporal boundaries of each INSTRUCTIONAL STEP.
      A step is one distinct action or instruction being demonstrated.
    annotation_type: "video_annotation"
    mode: "segment"
    labels:
      - name: "step"
        color: "#22C55E"
        key_value: "s"
      - name: "intro_outro"
        color: "#94A3B8"
        key_value: "i"
    frame_stepping: true
    show_timecode: true
    playback_rate_control: true
    video_fps: 30

  - name: "step_description"
    description: "Describe what is being done in this step (imperative form):"
    annotation_type: text

  - name: "visual_alignment"
    description: "How well does the visual content match what's being said?"
    annotation_type: radio
    labels:
      - "Perfect - visual shows exactly what's narrated"
      - "Good - visual mostly matches narration"
      - "Partial - some mismatch between visual and audio"
      - "Poor - visual doesn't match narration"
      - "No narration in this segment"

  - name: "task_category"
    description: "What category of task is this video?"
    annotation_type: radio
    labels:
      - "Cooking/Food"
      - "Home Repair/DIY"
      - "Crafts/Art"
      - "Beauty/Personal Care"
      - "Fitness/Exercise"
      - "Technology/Software"
      - "Gardening/Outdoor"
      - "Other"

  - name: "step_clarity"
    description: "How clear is this instructional step?"
    annotation_type: radio
    labels:
      - "Very clear - easy to follow"
      - "Clear - understandable"
      - "Somewhat clear - some confusion"
      - "Unclear - hard to follow"

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2

annotation_instructions: |
  ## HowTo100M Instructional Video Annotation

  Annotate instructional/tutorial video clips with step boundaries and descriptions.

  ### Task:
  1. Mark the temporal boundaries of each instructional step
  2. Write a brief description of what's being demonstrated
  3. Rate how well the visual matches the narration
  4. Select the task category of the video
  5. Rate how clear the instructional step is

  ### What is a step?
  - One distinct action or instruction
  - "Add the flour", "Stir until combined", "Press the button"
  - NOT background talk or transitions

  ### Step descriptions:
  - Use imperative form: "Mix the ingredients" not "The person mixes"
  - Be concise: 3-10 words typically
  - Focus on the ACTION being demonstrated

  ### Visual-Narration Alignment:
  - Perfect: Narrator says "crack the egg" and we see egg cracking
  - Partial: Narrator says "add salt" but we see general cooking
  - Poor: Narrator talks about something not shown

  ### Guidelines:
  - Some clips may have no clear instructional content
  - Mark intro/outro segments separately
  - Narration may be noisy (auto-generated ASR)

  ### Tips:
  - Watch with audio to understand the instruction
  - Steps may overlap with narration timing
  - Focus on what would help someone learn the task
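
The configuration above enables frame stepping and timecode display at `video_fps: 30`. As a rough illustration of what that frame rate implies for segment boundaries, the sketch below converts between frame indices and timecodes. The helper names and the `MM:SS:FF` format are assumptions made for this example and are not taken from the annotation tool itself.

```python
# Assumption: 30 frames per second, mirroring video_fps in the config above.
VIDEO_FPS = 30


def frame_to_timecode(frame: int, fps: int = VIDEO_FPS) -> str:
    """Format a zero-based frame index as MM:SS:FF (minutes, seconds, frames).

    The MM:SS:FF layout is an illustrative choice, not the tool's display format.
    """
    if frame < 0:
        raise ValueError("frame index must be non-negative")
    total_seconds, frames = divmod(frame, fps)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes:02d}:{seconds:02d}:{frames:02d}"


def seconds_to_frame(seconds: float, fps: int = VIDEO_FPS) -> int:
    """Return the index of the frame on screen at the given time (truncating)."""
    if seconds < 0:
        raise ValueError("time must be non-negative")
    return int(seconds * fps)


# A step boundary at 12.5 s falls on frame 375 at 30 fps.
boundary_frame = seconds_to_frame(12.5)
print(boundary_frame, frame_to_timecode(boundary_frame))
```

At 30 fps a single frame step moves the boundary by about 33 ms, which is a useful lower bound on how precisely two annotators can be expected to agree on a step boundary.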

Sample Data: sample-data.json

json

[
  {
    "id": "howto_001",
    "video_url": "https://example.com/videos/howto_cooking.mp4",
    "category": "cooking",
    "narration": "First, we're going to add the flour to the bowl",
    "duration": 60
  },
  {
    "id": "howto_002",
    "video_url": "https://example.com/videos/howto_repair.mp4",
    "category": "home_repair",
    "narration": "Now take your screwdriver and remove these screws",
    "duration": 45
  }
]
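
Since the configuration reads each item's identifier from `id` and its video from `video_url` (the `item_properties` block), it can help to check a data file for those keys before starting an annotation run. The following is a minimal sketch using only the standard library; `find_problems` is a hypothetical helper written for this example, and the sample items are copied inline so the block runs on its own.

```python
import json

# Key names mirror item_properties in the config; the helper is hypothetical.
ID_KEY = "id"
TEXT_KEY = "video_url"

SAMPLE = """
[
  {"id": "howto_001",
   "video_url": "https://example.com/videos/howto_cooking.mp4",
   "category": "cooking",
   "narration": "First, we're going to add the flour to the bowl",
   "duration": 60},
  {"id": "howto_002",
   "video_url": "https://example.com/videos/howto_repair.mp4",
   "category": "home_repair",
   "narration": "Now take your screwdriver and remove these screws",
   "duration": 45}
]
"""


def find_problems(items, id_key=ID_KEY, text_key=TEXT_KEY):
    """Return human-readable problems: missing/empty keys and duplicate ids."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        for key in (id_key, text_key):
            if not item.get(key):
                problems.append(f"item {index}: missing or empty '{key}'")
        item_id = item.get(id_key)
        if item_id and item_id in seen_ids:
            problems.append(f"item {index}: duplicate id '{item_id}'")
        elif item_id:
            seen_ids.add(item_id)
    return problems


items = json.loads(SAMPLE)
problems = find_problems(items)
print(problems if problems else "all items have the required keys")
```

The extra fields in the sample (`category`, `narration`, `duration`) are not referenced by `item_properties`, so this check deliberately ignores them.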

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/instructional/howto100m-instructional
potato start config.yaml

Details

Annotation Types

radio, text, video_annotation

Domain

Computer Vision, Video-Language, Instructional Video

Use Cases

Video-Text Alignment, Step Recognition, Procedural Understanding

Related Designs

YouCook2 Recipe Step Annotation

Annotate cooking videos with recipe step boundaries and descriptions. Segment instructional cooking content into distinct procedural steps.

radio, text

VSTAR Video-grounded Dialogue

Video-grounded dialogue annotation. Annotators watch videos and answer questions requiring situated understanding, write dialogue turns grounded in specific video moments, and mark relevant temporal segments.

video_annotation, text

Charades-STA Temporal Grounding

Ground natural language descriptions to video segments. Given a sentence describing an action, identify the exact temporal boundaries where that action occurs.

radio, video_annotation