EPIC-KITCHENS Egocentric Action Annotation
Annotate fine-grained actions in egocentric kitchen videos with verb-noun pairs. Identify cooking actions from a first-person perspective.
Configuration File: config.yaml
# EPIC-KITCHENS Egocentric Action Annotation Configuration
# Based on Damen et al., ECCV 2018
# Task: Annotate verb-noun action pairs in egocentric kitchen videos

annotation_task_name: "EPIC-KITCHENS Egocentric Action Annotation"
task_dir: "."
data_files:
  - data.json
item_properties:
  id_key: "id"
  text_key: "video_url"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

annotation_schemes:
  - name: "action_segments"
    description: |
      Mark the temporal boundaries of each distinct ACTION.
      An action starts when hands begin moving toward an object
      and ends when the interaction is complete.
    annotation_type: "video_annotation"
    mode: "segment"
    labels:
      - name: "action"
        color: "#22C55E"
        key_value: "a"
    frame_stepping: true
    show_timecode: true
    playback_rate_control: true
    video_fps: 60

  - name: "verb"
    description: "What VERB describes this action?"
    annotation_type: radio
    labels:
      - "take"
      - "put"
      - "open"
      - "close"
      - "wash"
      - "cut"
      - "mix"
      - "pour"
      - "turn-on"
      - "turn-off"
      - "move"
      - "remove"
      - "other"

  - name: "noun"
    description: "What NOUN/OBJECT is being interacted with?"
    annotation_type: radio
    labels:
      - "pan"
      - "plate"
      - "knife"
      - "spoon"
      - "cup"
      - "bowl"
      - "fridge"
      - "tap"
      - "drawer"
      - "cupboard"
      - "food item"
      - "container"
      - "other"

  - name: "verb_free_text"
    description: "If 'other' verb, specify:"
    annotation_type: text

  - name: "noun_free_text"
    description: "If 'other' noun, specify the object:"
    annotation_type: text

  - name: "visibility"
    description: "How visible is the action?"
    annotation_type: radio
    labels:
      - "Fully visible - clear view of hands and object"
      - "Partially visible - some occlusion"
      - "Mostly occluded - hard to see"

allow_all_users: true
instances_per_annotator: 30
annotation_per_instance: 2

annotation_instructions: |
  ## EPIC-KITCHENS Egocentric Action Annotation

  Annotate cooking actions from first-person (egocentric) video.

  ### Task:
  1. Mark the temporal boundaries of each action
  2. Label the VERB (what is being done)
  3. Label the NOUN (what object is involved)

  ### What counts as an action?
  - Any intentional interaction with an object
  - Starts when hands begin reaching/moving
  - Ends when the interaction is complete

  ### Common verb-noun pairs:
  - "take pan", "put plate", "open fridge"
  - "wash spoon", "cut vegetable", "pour water"
  - "turn-on tap", "close drawer", "mix bowl"

  ### Guidelines:
  - One action = one verb + one noun
  - If multiple objects, annotate the PRIMARY one
  - Mark ALL actions, even brief ones
  - Use free text for objects not in the list

  ### Egocentric video tips:
  - Hands often occlude objects - do your best
  - Fast movements may need frame-stepping
  - Camera shake is normal in egocentric video
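The guidelines above imply a consistency rule: each segment carries exactly one verb and one noun, and the free-text fields are required precisely when "other" is selected. A minimal sketch of such a check, assuming a hypothetical flat annotation record (this is not Potato's actual output schema):

```python
# Hypothetical validator for one verb-noun annotation record.
# The record layout (flat dict with "verb", "noun", free-text and
# "start"/"end" fields) is an assumption for illustration only.

VERBS = {"take", "put", "open", "close", "wash", "cut", "mix", "pour",
         "turn-on", "turn-off", "move", "remove", "other"}
NOUNS = {"pan", "plate", "knife", "spoon", "cup", "bowl", "fridge",
         "tap", "drawer", "cupboard", "food item", "container", "other"}

def check_record(rec):
    """Return a list of problems found in a single annotation record."""
    problems = []
    if rec.get("verb") not in VERBS:
        problems.append("unknown verb: %r" % rec.get("verb"))
    if rec.get("noun") not in NOUNS:
        problems.append("unknown noun: %r" % rec.get("noun"))
    # Free text is required exactly when "other" is chosen.
    if rec.get("verb") == "other" and not rec.get("verb_free_text"):
        problems.append("verb is 'other' but verb_free_text is empty")
    if rec.get("noun") == "other" and not rec.get("noun_free_text"):
        problems.append("noun is 'other' but noun_free_text is empty")
    # Segment boundaries must be ordered (start before end).
    if rec.get("start", 0) >= rec.get("end", 0):
        problems.append("segment start must precede end")
    return problems

ok = check_record({"verb": "open", "noun": "fridge",
                   "start": 1.2, "end": 3.4})
bad = check_record({"verb": "other", "noun": "pan",
                    "start": 2.0, "end": 1.0})
```

Running a check like this over exported annotations before adjudication catches the most common slips (missing free text, inverted segment boundaries) early.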
Sample Data: sample-data.json
[
  {
    "id": "epic_001",
    "video_url": "https://example.com/videos/kitchen_egocentric_001.mp4",
    "participant": "P01",
    "kitchen": "kitchen_01",
    "duration": 30
  },
  {
    "id": "epic_002",
    "video_url": "https://example.com/videos/kitchen_egocentric_002.mp4",
    "participant": "P01",
    "kitchen": "kitchen_01",
    "duration": 45
  }
]

Get This Design
Clone or download from the repository.

Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/action-recognition/epic-kitchens-egocentric
potato start config.yaml
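Before launching, it is worth confirming that every item in data.json exposes the keys named in `item_properties` (`id_key` and `text_key`) and that ids are unique. A small sketch using the two sample items above, inlined since the URLs are placeholders:

```python
import json

# The two sample items from sample-data.json, inlined so the snippet
# runs without the file; in practice you would json.load() data.json.
sample = json.loads("""
[
  {"id": "epic_001",
   "video_url": "https://example.com/videos/kitchen_egocentric_001.mp4",
   "participant": "P01", "kitchen": "kitchen_01", "duration": 30},
  {"id": "epic_002",
   "video_url": "https://example.com/videos/kitchen_egocentric_002.mp4",
   "participant": "P01", "kitchen": "kitchen_01", "duration": 45}
]
""")

ID_KEY, TEXT_KEY = "id", "video_url"  # must match item_properties in config.yaml

# Items missing either required key, and the list of ids for a uniqueness check.
missing = [item for item in sample
           if ID_KEY not in item or TEXT_KEY not in item]
ids = [item[ID_KEY] for item in sample]

assert not missing, "every item needs both %s and %s" % (ID_KEY, TEXT_KEY)
assert len(ids) == len(set(ids)), "item ids must be unique"
```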
Related Designs
How2Sign Sign Language Multi-Tier Annotation
Multi-tier ELAN-style annotation of continuous American Sign Language videos. Annotators segment sign glosses, mark mouthing patterns, classify sign handedness, and provide English translations aligned to video timelines. Based on the How2Sign large-scale multimodal ASL dataset.
Breakfast Actions Segmentation
Fine-grained temporal action segmentation of breakfast preparation activities. Annotators label sequences of cooking actions like 'take cup', 'pour milk', 'stir'.
Ego4D: Egocentric Video Episodic Memory Annotation
Annotate egocentric (first-person) video for episodic memory tasks including activity segmentation, hand state tracking, natural language query generation, and scene narration. Supports temporal segment annotation with multiple label tiers for the Ego4D benchmark.