Breakfast Actions Segmentation

Fine-grained temporal action segmentation of breakfast preparation activities. Annotators label sequences of cooking actions like 'take cup', 'pour milk', 'stir'.

Configuration File: config.yaml

This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.

yaml

# Breakfast Actions Segmentation Configuration
# Based on Kuehne et al., IJCV 2014
# Task: Fine-grained temporal segmentation of breakfast preparation

annotation_task_name: "Breakfast Actions Segmentation"
task_dir: "."

data_files:
  - data.json
item_properties:
  id_key: "id"
  text_key: "video_url"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

annotation_schemes:
  - name: "breakfast_actions"
    description: |
      Segment the video into fine-grained cooking actions.
      Mark each atomic action from start to finish.
    annotation_type: "video_annotation"
    mode: "segment"
    labels:
      # Object manipulation
      - name: "take"
        color: "#3B82F6"
        key_value: "t"
      - name: "put"
        color: "#1D4ED8"
        key_value: "p"

      # Pouring actions
      - name: "pour"
        color: "#22C55E"
        key_value: "o"
      - name: "spoon"
        color: "#16A34A"
        key_value: "s"

      # Mixing actions
      - name: "stir"
        color: "#F97316"
        key_value: "r"
      - name: "crack"
        color: "#EA580C"
        key_value: "c"

      # Cutting actions
      - name: "cut"
        color: "#EF4444"
        key_value: "u"
      - name: "peel"
        color: "#DC2626"
        key_value: "l"

      # Cooking actions
      - name: "fry"
        color: "#8B5CF6"
        key_value: "f"
      - name: "butter"
        color: "#A855F7"
        key_value: "b"

      # Other
      - name: "squeeze"
        color: "#EC4899"
        key_value: "q"
      - name: "background"
        color: "#6B7280"
        key_value: "g"

    zoom_enabled: true
    playback_rate_control: true
    frame_stepping: true
    timeline_height: 90

  - name: "object_involved"
    description: "What object is involved in this action?"
    annotation_type: "text"
    placeholder: "e.g., cup, egg, pan, butter, cereal"

allow_all_users: true
instances_per_annotator: 25
annotation_per_instance: 2

annotation_instructions: |
  ## Breakfast Actions Segmentation Task

  Segment cooking videos into atomic actions.

  ### Action Vocabulary:
  - **take**: Pick up an object
  - **put**: Put down an object
  - **pour**: Pour liquid/granules
  - **spoon**: Scoop with spoon
  - **stir**: Mix with stirring motion
  - **crack**: Crack open (eggs)
  - **cut**: Cut with knife
  - **peel**: Remove outer layer
  - **fry**: Cook in pan
  - **butter**: Spread butter
  - **squeeze**: Squeeze (juice)
  - **background**: Non-action segments

  ### Guidelines:
  - Segment ALL frames (no gaps)
  - Each segment = one atomic action
  - Note the object involved
  - Actions can repeat multiple times
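
The "Segment ALL frames (no gaps)" guideline can be checked mechanically once annotations are exported. The sketch below is a minimal illustration, not part of Potato: it assumes each segment has been converted to a dictionary with `start`, `end` (seconds), and `label` keys, which is a hypothetical shape and may not match Potato's actual output format. The function name `find_coverage_problems` is likewise made up for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical segment shape (an assumption, not Potato's documented output):
# {"start": float, "end": float, "label": str}
Segment = Dict[str, object]
Span = Tuple[float, float]


def find_coverage_problems(
    segments: List[Segment], duration: float, tol: float = 1e-6
) -> Tuple[List[Span], List[Span]]:
    """Return (gaps, overlaps) as lists of (start, end) spans in seconds."""
    gaps: List[Span] = []
    overlaps: List[Span] = []
    cursor = 0.0  # end of the timeline covered so far

    for seg in sorted(segments, key=lambda s: float(s["start"])):
        start, end = float(seg["start"]), float(seg["end"])
        if start > cursor + tol:
            # Nothing labeled between the covered region and this segment.
            gaps.append((cursor, start))
        elif start < cursor - tol:
            # This segment begins before the previous coverage ended.
            overlaps.append((start, min(cursor, end)))
        cursor = max(cursor, end)

    if cursor < duration - tol:
        # Video tail left unlabeled.
        gaps.append((cursor, duration))
    return gaps, overlaps


if __name__ == "__main__":
    example = [
        {"start": 0.0, "end": 2.5, "label": "take"},
        {"start": 2.5, "end": 6.0, "label": "pour"},
        {"start": 7.0, "end": 10.0, "label": "background"},
    ]
    gaps, overlaps = find_coverage_problems(example, duration=10.0)
    print("gaps:", gaps)          # gaps: [(6.0, 7.0)]
    print("overlaps:", overlaps)  # overlaps: []
```

The `duration` argument could come from the `duration_seconds` field of each data item. Gaps would normally be filled with the "background" label rather than left unlabeled.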

Sample Data: sample-data.json

json

[
  {
    "id": "breakfast_001",
    "video_url": "https://example.com/videos/making_cereal.mp4",
    "activity": "cereal",
    "duration_seconds": 120
  },
  {
    "id": "breakfast_002",
    "video_url": "https://example.com/videos/making_pancakes.mp4",
    "activity": "pancake",
    "duration_seconds": 300
  }
]

// ... and 1 more item
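
Before starting the server, it can help to confirm that every item carries the fields named under `item_properties` in the config (`id` and `video_url`). The following is a minimal pre-check sketch, independent of Potato itself. The helper name `validate_items` is hypothetical, and the requirement that ids be unique is an assumption made here rather than something stated by the config.

```python
import json
from typing import Dict, List

# Keys taken from item_properties in config.yaml (id_key and text_key).
REQUIRED_KEYS = ("id", "video_url")


def validate_items(items: List[Dict[str, object]]) -> List[str]:
    """Return human-readable problems; an empty list means no issues were found."""
    problems: List[str] = []
    seen_ids = set()
    for index, item in enumerate(items):
        for key in REQUIRED_KEYS:
            if not item.get(key):
                problems.append(f"item {index}: missing or empty '{key}'")
        item_id = item.get("id")
        if item_id in seen_ids:
            # Assumption: ids are expected to be unique across the file.
            problems.append(f"item {index}: duplicate id '{item_id}'")
        elif item_id:
            seen_ids.add(item_id)
    return problems


if __name__ == "__main__":
    # The two items shown in the sample data above.
    sample = json.loads("""
    [
      {"id": "breakfast_001",
       "video_url": "https://example.com/videos/making_cereal.mp4",
       "activity": "cereal", "duration_seconds": 120},
      {"id": "breakfast_002",
       "video_url": "https://example.com/videos/making_pancakes.mp4",
       "activity": "pancake", "duration_seconds": 300}
    ]
    """)
    print(validate_items(sample))  # []
```

To check a file on disk instead, the items could be loaded with `json.load(open(path))` and passed to the same function.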

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/action-recognition/breakfast-actions
potato start config.yaml

Dataset & paper

Kuehne et al., IJCV 2014

Citation (BibTeX)

bibtex

@article{kuehne2014language,
  title={The language of actions: Recovering the syntax and semantics of goal-directed human activities},
  author={Kuehne, Hilde and Arslan, Ali and Serre, Thomas},
  journal={International Journal of Computer Vision},
  volume={116},
  number={3},
  pages={259--276},
  year={2014}
}

Details

Annotation Types

text, video_annotation

Domain

Computer Vision, Activity Recognition

Use Cases

Action Segmentation, Cooking Recognition, Procedure Learning

Related Designs

EPIC-KITCHENS Egocentric Action Annotation

Annotate fine-grained actions in egocentric kitchen videos with verb-noun pairs. Identify cooking actions from a first-person perspective.

radio, text

How2Sign Sign Language Multi-Tier Annotation

Multi-tier ELAN-style annotation of continuous American Sign Language videos. Annotators segment sign glosses, mark mouthing patterns, classify sign handedness, and provide English translations aligned to video timelines. Based on the How2Sign large-scale multimodal ASL dataset.

video_annotation, radio

ActivityNet Captions: Dense Video Captioning Dataset

ActivityNet Captions pairs 20k untrimmed videos with 100k temporally localized sentence descriptions for dense-captioning research. This Potato config reproduces the segment-and-describe workflow.

video_annotation, text