YouCook2 Dataset: Cooking Video Recipe-Step Annotation

YouCook2 (Zhou et al., AAAI 2018) contains 2,000 cooking videos across 89 recipes (176 hours), each segmented into recipe steps with temporal boundaries and imperative captions. This page provides dataset and paper references plus a Potato config for procedural video annotation.

About this dataset

YouCook2 is one of the largest task-oriented instructional video datasets, introduced by Zhou, Xu, and Corso at AAAI 2018. It collects unconstrained cooking videos from YouTube and annotates the procedural structure of each recipe.

The dataset has 2,000 videos spanning 89 distinct recipes and 176 total hours, with an average length of 5.26 minutes. Each video is segmented into recipe steps (7.7 on average) marked with start and end timestamps, and every step carries an imperative English description such as "add olive oil to the heated pan."

Because it pairs temporal boundaries with natural-language descriptions, YouCook2 is a standard benchmark for procedure segmentation, dense video captioning, and video-language grounding. Each video was labeled by two annotators, one for the main pass and one for verification.
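
Both procedure-segmentation scoring and comparison of two annotators' boundaries come down to measuring how much two time spans overlap. As a minimal sketch (the function name and the example timestamps are illustrative and not taken from the dataset), temporal intersection-over-union for two (start, end) segments in seconds could be computed like this:

```python
def temporal_iou(seg_a, seg_b):
    """Temporal IoU of two (start, end) segments given in seconds."""
    start_a, end_a = seg_a
    start_b, end_b = seg_b
    if end_a < start_a or end_b < start_b:
        raise ValueError("segment end must not precede its start")
    # Overlap is zero when the segments do not intersect.
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical boundaries for the same step from two annotators.
main_pass = (12.0, 30.0)
verification = (15.0, 33.0)
print(round(temporal_iou(main_pass, verification), 3))
```

A value near 1.0 means the two spans almost coincide, and 0.0 means they do not overlap at all.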

The Potato config below reproduces the recipe-step annotation task: segment-mode video annotation for step boundaries, a free-text field for the step description, and radio schemes for step type, ingredient visibility, and difficulty. Use it to re-annotate cooking videos or to build a similar procedural dataset in another domain.

Released: AAAI 2018
Videos: 2,000 cooking videos
Recipes: 89 distinct recipes
Total length: 176 hours (avg 5.26 min/video)
Steps per video: 7.7 on average
Per-step labels: Temporal boundary + imperative caption
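
The headline figures can be cross-checked with simple arithmetic. The sketch below derives the mean video length implied by the rounded totals and estimates the overall number of step segments. Both results are back-of-the-envelope values computed from the numbers listed above, not figures reported by the source.

```python
# Figures as listed above (rounded).
NUM_VIDEOS = 2000
TOTAL_HOURS = 176
AVG_STEPS_PER_VIDEO = 7.7

# Mean length implied by the rounded totals, in minutes.
implied_avg_minutes = TOTAL_HOURS * 60 / NUM_VIDEOS

# Rough estimate of how many step segments the dataset contains.
estimated_total_steps = round(NUM_VIDEOS * AVG_STEPS_PER_VIDEO)

print(f"implied average length: {implied_avg_minutes:.2f} min")
print(f"estimated step count: {estimated_total_steps}")
```

The implied mean of 5.28 minutes agrees with the listed 5.26 minutes to within about 0.4%.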

Configuration File: config.yaml

This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.

yaml

# YouCook2 Recipe Step Annotation Configuration
# Based on Zhou et al., AAAI 2018
# Task: Segment cooking videos into recipe steps with descriptions

annotation_task_name: "YouCook2 Recipe Step Annotation"
task_dir: "."

data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "video_url"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

annotation_schemes:
  - name: "recipe_steps"
    description: |
      Mark the temporal boundaries of each RECIPE STEP.
      A step is one distinct cooking action or procedure.
    annotation_type: "video_annotation"
    mode: "segment"
    labels:
      - name: "recipe_step"
        color: "#22C55E"
        key_value: "r"
    frame_stepping: true
    show_timecode: true
    playback_rate_control: true
    video_fps: 30

  - name: "step_description"
    description: "Describe this recipe step in one sentence:"
    annotation_type: text

  - name: "step_type"
    description: "What type of cooking action is this step?"
    annotation_type: radio
    labels:
      - "Preparation (cutting, measuring, gathering)"
      - "Cooking (heating, frying, boiling)"
      - "Mixing (combining, stirring, blending)"
      - "Seasoning (adding spices, salt, sauce)"
      - "Plating (arranging, serving, garnishing)"
      - "Other"

  - name: "ingredients_visible"
    description: "Are the main ingredients clearly visible?"
    annotation_type: radio
    labels:
      - "Yes - all ingredients visible"
      - "Partially - some ingredients visible"
      - "No - ingredients not clearly shown"

  - name: "step_difficulty"
    description: "How difficult is this cooking step?"
    annotation_type: radio
    labels:
      - "Easy - basic technique"
      - "Moderate - some skill required"
      - "Difficult - advanced technique"

allow_all_users: true
instances_per_annotator: 40
annotation_per_instance: 2

annotation_instructions: |
  ## YouCook2 Recipe Step Annotation

  Segment cooking videos into distinct recipe steps and describe each.

  ### What is a Recipe Step?
  - One distinct cooking action
  - Has clear beginning and end
  - Can be described in one sentence

  ### Example Steps:
  - "Dice the onions into small pieces"
  - "Add olive oil to the heated pan"
  - "Stir the mixture until smooth"
  - "Bake in the oven for 20 minutes"

  ### Step Description Guidelines:
  - Use imperative form ("Add..." not "Adding...")
  - Include key ingredients/tools mentioned
  - Be specific but concise (5-15 words)
  - Don't include timing unless essential

  ### Boundary Rules:
  - START: When the cook begins the action
  - END: When the action is complete
  - Brief pauses within an action = same step
  - Talking without action = exclude if possible

  ### NOT separate steps:
  - Repeated actions (stirring multiple times = one step)
  - Camera angle changes during same action
  - Brief interruptions

  ### Tips:
  - Watch the whole clip first to understand the recipe
  - Typical recipes have 5-15 major steps
  - Focus on actions, not commentary
  - Some steps may overlap with narration timing
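
Some of the description and boundary rules above lend themselves to automated pre-checks before human review. The following is a rough sketch and not part of Potato: `check_description` and `merge_brief_pauses` are hypothetical helpers, the "-ing" test is only a crude heuristic for non-imperative phrasing, and the 2-second pause threshold is an assumed value.

```python
def check_description(text, min_words=5, max_words=15):
    """Return a list of guideline warnings for one step description."""
    words = text.split()
    if not words:
        return ["description is empty"]
    warnings = []
    if not min_words <= len(words) <= max_words:
        warnings.append(f"expected {min_words}-{max_words} words, got {len(words)}")
    # Crude heuristic: a leading "-ing" word suggests "Adding..." not "Add...".
    if words[0].lower().endswith("ing"):
        warnings.append("first word ends in '-ing'; use imperative form")
    return warnings


def merge_brief_pauses(segments, max_gap=2.0):
    """Merge (start, end) segments whose gap is at most max_gap seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            # Gap is short enough: extend the previous segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(check_description("Add olive oil to the heated pan"))
print(check_description("Adding oil"))
print(merge_brief_pauses([(10.0, 18.0), (19.0, 25.0), (40.0, 52.0)]))
```

Both helpers are best used to flag items for review rather than to change annotations automatically. The "-ing" check misfires on imperatives such as "Bring", and two distinct steps can legitimately follow each other with almost no gap.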

Sample Data: sample-data.json

json

[
  {
    "id": "youcook_001",
    "video_url": "https://example.com/videos/cooking_pasta.mp4",
    "recipe": "Pasta Carbonara",
    "duration": 300
  },
  {
    "id": "youcook_002",
    "video_url": "https://example.com/videos/cooking_salad.mp4",
    "recipe": "Caesar Salad",
    "duration": 180
  }
]
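
Before launching a task, it can help to confirm that every record carries the keys named under `item_properties` in the config. Here is a minimal sketch using only the standard library. `validate_items` is a hypothetical helper, and the inline records stand in for the contents of sample-data.json.

```python
import json

REQUIRED_KEYS = ("id", "video_url")  # id_key and text_key from config.yaml


def validate_items(items):
    """Return a list of problems found in a list of item records."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        for key in REQUIRED_KEYS:
            if not item.get(key):
                problems.append(f"item {index}: missing '{key}'")
        item_id = item.get("id")
        if item_id is not None:
            if item_id in seen_ids:
                problems.append(f"item {index}: duplicate id '{item_id}'")
            seen_ids.add(item_id)
    return problems


# Inline stand-in for the records stored in sample-data.json.
items = json.loads("""
[
  {"id": "youcook_001", "video_url": "https://example.com/videos/cooking_pasta.mp4"},
  {"id": "youcook_002", "video_url": "https://example.com/videos/cooking_salad.mp4"}
]
""")
print(validate_items(items))
```

An empty list means no problems were found. Extra fields such as "recipe" and "duration" are ignored by this check.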

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/instructional/youcook2-instructional
potato start config.yaml

Dataset & paper

Zhou et al., AAAI 2018

Citation (BibTeX)

bibtex

@inproceedings{zhou2018towards,
  title={Towards automatic learning of procedures from web instructional videos},
  author={Zhou, Luowei and Xu, Chenliang and Corso, Jason J},
  booktitle={AAAI Conference on Artificial Intelligence},
  volume={32},
  number={1},
  year={2018}
}

Details

Annotation Types: radio, text, video_annotation
Domain: Computer Vision, Video-Language, Instructional Video
Use Cases: Dense Video Captioning, Recipe Understanding, Procedural Learning

Related Designs

HowTo100M Instructional Video Annotation

Annotate instructional video clips with step descriptions and visual grounding. Link narrated instructions to visual actions for video-language understanding.

radio, text

VSTAR Video-grounded Dialogue

Video-grounded dialogue annotation. Annotators watch videos and answer questions requiring situated understanding, write dialogue turns grounded in specific video moments, and mark relevant temporal segments.

video_annotation, text

Charades-STA Temporal Grounding

Ground natural language descriptions to video segments. Given a sentence describing an action, identify the exact temporal boundaries where that action occurs.

radio, video_annotation