YouCook2 Recipe Step Annotation
Annotate cooking videos with recipe step boundaries and descriptions. Segment instructional cooking content into distinct procedural steps.
Configuration File: config.yaml
# YouCook2 Recipe Step Annotation Configuration
# Based on Zhou et al., AAAI 2018
# Task: Segment cooking videos into recipe steps with descriptions
annotation_task_name: "YouCook2 Recipe Step Annotation"
task_dir: "."
data_files:
  - data.json
item_properties:
  id_key: "id"
  text_key: "video_url"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
annotation_schemes:
  - name: "recipe_steps"
    description: |
      Mark the temporal boundaries of each RECIPE STEP.
      A step is one distinct cooking action or procedure.
    annotation_type: "video_annotation"
    mode: "segment"
    labels:
      - name: "recipe_step"
        color: "#22C55E"
        key_value: "r"
    frame_stepping: true
    show_timecode: true
    playback_rate_control: true
    video_fps: 30
  - name: "step_description"
    description: "Describe this recipe step in one sentence:"
    annotation_type: text
  - name: "step_type"
    description: "What type of cooking action is this step?"
    annotation_type: radio
    labels:
      - "Preparation (cutting, measuring, gathering)"
      - "Cooking (heating, frying, boiling)"
      - "Mixing (combining, stirring, blending)"
      - "Seasoning (adding spices, salt, sauce)"
      - "Plating (arranging, serving, garnishing)"
      - "Other"
  - name: "ingredients_visible"
    description: "Are the main ingredients clearly visible?"
    annotation_type: radio
    labels:
      - "Yes - all ingredients visible"
      - "Partially - some ingredients visible"
      - "No - ingredients not clearly shown"
  - name: "step_difficulty"
    description: "How difficult is this cooking step?"
    annotation_type: radio
    labels:
      - "Easy - basic technique"
      - "Moderate - some skill required"
      - "Difficult - advanced technique"
allow_all_users: true
instances_per_annotator: 40
annotation_per_instance: 2
annotation_instructions: |
  ## YouCook2 Recipe Step Annotation
  Segment cooking videos into distinct recipe steps and describe each.
  ### What is a Recipe Step?
  - One distinct cooking action
  - Has clear beginning and end
  - Can be described in one sentence
  ### Example Steps:
  - "Dice the onions into small pieces"
  - "Add olive oil to the heated pan"
  - "Stir the mixture until smooth"
  - "Bake in the oven for 20 minutes"
  ### Step Description Guidelines:
  - Use imperative form ("Add..." not "Adding...")
  - Include key ingredients/tools mentioned
  - Be specific but concise (5-15 words)
  - Don't include timing unless essential
  ### Boundary Rules:
  - START: When the cook begins the action
  - END: When the action is complete
  - Brief pauses within an action = same step
  - Talking without action = exclude if possible
  ### NOT separate steps:
  - Repeated actions (stirring multiple times = one step)
  - Camera angle changes during same action
  - Brief interruptions
  ### Tips:
  - Watch the whole clip first to understand the recipe
  - Typical recipes have 5-15 major steps
  - Focus on actions, not commentary
  - Some steps may overlap with narration timing
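The step description guidelines in the instructions (imperative form, 5-15 words) could be spot-checked after annotation with a small script. The following is a minimal sketch and not part of Potato or this design: the function name `check_step_description`, the gerund heuristic, and the `IMPERATIVE_ING` exception list are all assumptions made for illustration.

```python
import re

# Word-count bounds taken from the guidelines ("5-15 words").
MIN_WORDS, MAX_WORDS = 5, 15

# Assumption: a first word ending in "-ing" usually signals a gerund
# ("Adding salt"). A few imperative verbs also end in "-ing", so they are
# exempted here. This list is illustrative, not exhaustive.
IMPERATIVE_ING = {"bring", "wring", "string", "sling", "fling"}


def check_step_description(text: str) -> list[str]:
    """Return the guideline problems found in one step description."""
    words = re.findall(r"[A-Za-z0-9'-]+", text)
    if not words:
        return ["description is empty"]

    problems = []
    if not MIN_WORDS <= len(words) <= MAX_WORDS:
        problems.append(
            f"expected {MIN_WORDS}-{MAX_WORDS} words, found {len(words)}"
        )

    first = words[0].lower()
    if first.endswith("ing") and first not in IMPERATIVE_ING:
        problems.append(
            f"first word '{words[0]}' looks like a gerund; use imperative form"
        )
    return problems


if __name__ == "__main__":
    for description in ("Dice the onions into small pieces", "Adding salt"):
        print(description, "->", check_step_description(description))
```

A check like this only flags candidates for review. Whether a description names the key ingredients or includes unnecessary timing still needs a human reader.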
Sample Data: sample-data.json
[
{
"id": "youcook_001",
"video_url": "https://example.com/videos/cooking_pasta.mp4",
"recipe": "Pasta Carbonara",
"duration": 300
},
{
"id": "youcook_002",
"video_url": "https://example.com/videos/cooking_salad.mp4",
"recipe": "Caesar Salad",
"duration": 180
}
]
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/instructional/youcook2-instructional
potato start config.yaml
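Before starting the server with your own videos, it may help to confirm that every record in the data file carries the keys named under item_properties in config.yaml (`id` and `video_url`). The sketch below is a hypothetical helper, not part of Potato. The function name `find_item_problems` is an assumption, and the records are inlined from the sample data so the example runs on its own.

```python
import json

# Key names mirror item_properties in config.yaml (id_key and text_key).
ID_KEY = "id"
TEXT_KEY = "video_url"


def find_item_problems(items: list) -> list[str]:
    """Report records that lack an id or video URL, or that reuse an id."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        for key in (ID_KEY, TEXT_KEY):
            if not item.get(key):
                problems.append(f"item {index}: missing '{key}'")
        item_id = item.get(ID_KEY)
        if item_id in seen_ids:
            problems.append(f"item {index}: duplicate id '{item_id}'")
        elif item_id:
            seen_ids.add(item_id)
    return problems


# The two records from the sample data, inlined for a self-contained run.
sample = json.loads("""
[
  {"id": "youcook_001",
   "video_url": "https://example.com/videos/cooking_pasta.mp4",
   "recipe": "Pasta Carbonara", "duration": 300},
  {"id": "youcook_002",
   "video_url": "https://example.com/videos/cooking_salad.mp4",
   "recipe": "Caesar Salad", "duration": 180}
]
""")

print(find_item_problems(sample))  # prints []
```

To check a real file, replace the inlined string with `json.load(open(path))` for whichever file `data_files` points to.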
Related Designs
HowTo100M Instructional Video Annotation
Annotate instructional video clips with step descriptions and visual grounding. Link narrated instructions to visual actions for video-language understanding.
VSTAR Video-grounded Dialogue
Video-grounded dialogue annotation. Annotators watch videos and answer questions requiring situated understanding, write dialogue turns grounded in specific video moments, and mark relevant temporal segments.
Charades-STA Temporal Grounding
Ground natural language descriptions to video segments. Given a sentence describing an action, identify the exact temporal boundaries where that action occurs.