EPIC-KITCHENS Egocentric Action Annotation
Annotate fine-grained actions in egocentric kitchen videos with verb-noun pairs. Identify cooking actions from a first-person perspective.
Configuration File: config.yaml
# EPIC-KITCHENS Egocentric Action Annotation Configuration
# Based on Damen et al., ECCV 2018
# Task: Annotate verb-noun action pairs in egocentric kitchen videos

annotation_task_name: "EPIC-KITCHENS Egocentric Action Annotation"
task_dir: "."
data_files:
  - data.json
item_properties:
  id_key: "id"
  text_key: "video_url"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

annotation_schemes:
  - name: "action_segments"
    description: |
      Mark the temporal boundaries of each distinct ACTION.
      An action starts when hands begin moving toward an object
      and ends when the interaction is complete.
    annotation_type: "video_annotation"
    mode: "segment"
    labels:
      - name: "action"
        color: "#22C55E"
        key_value: "a"
    frame_stepping: true
    show_timecode: true
    playback_rate_control: true
    video_fps: 60

  - name: "verb"
    description: "What VERB describes this action?"
    annotation_type: radio
    labels:
      - "take"
      - "put"
      - "open"
      - "close"
      - "wash"
      - "cut"
      - "mix"
      - "pour"
      - "turn-on"
      - "turn-off"
      - "move"
      - "remove"
      - "other"

  - name: "noun"
    description: "What NOUN/OBJECT is being interacted with?"
    annotation_type: radio
    labels:
      - "pan"
      - "plate"
      - "knife"
      - "spoon"
      - "cup"
      - "bowl"
      - "fridge"
      - "tap"
      - "drawer"
      - "cupboard"
      - "food item"
      - "container"
      - "other"

  - name: "verb_free_text"
    description: "If 'other' verb, specify:"
    annotation_type: text

  - name: "noun_free_text"
    description: "If 'other' noun, specify the object:"
    annotation_type: text

  - name: "visibility"
    description: "How visible is the action?"
    annotation_type: radio
    labels:
      - "Fully visible - clear view of hands and object"
      - "Partially visible - some occlusion"
      - "Mostly occluded - hard to see"

allow_all_users: true
instances_per_annotator: 30
annotation_per_instance: 2

annotation_instructions: |
  ## EPIC-KITCHENS Egocentric Action Annotation

  Annotate cooking actions from first-person (egocentric) video.

  ### Task:
  1. Mark the temporal boundaries of each action
  2. Label the VERB (what is being done)
  3. Label the NOUN (what object is involved)

  ### What counts as an action?
  - Any intentional interaction with an object
  - Starts when hands begin reaching/moving
  - Ends when the interaction is complete

  ### Common verb-noun pairs:
  - "take pan", "put plate", "open fridge"
  - "wash spoon", "cut vegetable", "pour water"
  - "turn-on tap", "close drawer", "mix bowl"

  ### Guidelines:
  - One action = one verb + one noun
  - If multiple objects, annotate the PRIMARY one
  - Mark ALL actions, even brief ones
  - Use free text for objects not in the list

  ### Egocentric video tips:
  - Hands often occlude objects - do your best
  - Fast movements may need frame-stepping
  - Camera shake is normal in egocentric video
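The guidelines above imply a consistency rule: each segment carries exactly one verb and one noun, and the free-text fields are required precisely when "other" is selected. A minimal sketch of such a check, assuming a hypothetical flat annotation record (this is not Potato's actual output schema):

```python
# Hypothetical validator for one verb-noun annotation record.
# The record layout (flat dict with "verb", "noun", free-text and
# "start"/"end" fields) is an assumption for illustration only.

VERBS = {"take", "put", "open", "close", "wash", "cut", "mix", "pour",
         "turn-on", "turn-off", "move", "remove", "other"}
NOUNS = {"pan", "plate", "knife", "spoon", "cup", "bowl", "fridge",
         "tap", "drawer", "cupboard", "food item", "container", "other"}

def check_record(rec):
    """Return a list of problems found in a single annotation record."""
    problems = []
    if rec.get("verb") not in VERBS:
        problems.append("unknown verb: %r" % rec.get("verb"))
    if rec.get("noun") not in NOUNS:
        problems.append("unknown noun: %r" % rec.get("noun"))
    # Free text is required exactly when "other" is chosen.
    if rec.get("verb") == "other" and not rec.get("verb_free_text"):
        problems.append("verb is 'other' but verb_free_text is empty")
    if rec.get("noun") == "other" and not rec.get("noun_free_text"):
        problems.append("noun is 'other' but noun_free_text is empty")
    # Segment boundaries must be ordered (start before end).
    if rec.get("start", 0) >= rec.get("end", 0):
        problems.append("segment start must precede end")
    return problems

ok = check_record({"verb": "open", "noun": "fridge",
                   "start": 1.2, "end": 3.4})
bad = check_record({"verb": "other", "noun": "pan",
                    "start": 2.0, "end": 1.0})
```

Running a check like this over exported annotations before adjudication catches the most common slips (missing free text, inverted segment boundaries) early.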
Sample Data: sample-data.json
[
  {
    "id": "epic_001",
    "video_url": "https://example.com/videos/kitchen_egocentric_001.mp4",
    "participant": "P01",
    "kitchen": "kitchen_01",
    "duration": 30
  },
  {
    "id": "epic_002",
    "video_url": "https://example.com/videos/kitchen_egocentric_002.mp4",
    "participant": "P01",
    "kitchen": "kitchen_01",
    "duration": 45
  }
]

Get This Design
Clone or download from the repository.

Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/action-recognition/epic-kitchens-egocentric
potato start config.yaml
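Before launching, it is worth confirming that every item in data.json exposes the keys named in `item_properties` (`id_key` and `text_key`) and that ids are unique. A small sketch using the two sample items above, inlined since the URLs are placeholders:

```python
import json

# The two sample items from sample-data.json, inlined so the snippet
# runs without the file; in practice you would json.load() data.json.
sample = json.loads("""
[
  {"id": "epic_001",
   "video_url": "https://example.com/videos/kitchen_egocentric_001.mp4",
   "participant": "P01", "kitchen": "kitchen_01", "duration": 30},
  {"id": "epic_002",
   "video_url": "https://example.com/videos/kitchen_egocentric_002.mp4",
   "participant": "P01", "kitchen": "kitchen_01", "duration": 45}
]
""")

ID_KEY, TEXT_KEY = "id", "video_url"  # must match item_properties in config.yaml

# Items missing either required key, and the list of ids for a uniqueness check.
missing = [item for item in sample
           if ID_KEY not in item or TEXT_KEY not in item]
ids = [item[ID_KEY] for item in sample]

assert not missing, "every item needs both %s and %s" % (ID_KEY, TEXT_KEY)
assert len(ids) == len(set(ids)), "item ids must be unique"
```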
Related Designs
How2Sign Sign Language Multi-Tier Annotation
Multi-tier ELAN-style annotation of continuous American Sign Language videos. Annotators segment sign glosses, mark mouthing patterns, classify sign handedness, and provide English translations aligned to video timelines. Based on the How2Sign large-scale multimodal ASL dataset.
Breakfast Actions Segmentation
Fine-grained temporal action segmentation of breakfast preparation activities. Annotators label sequences of cooking actions like 'take cup', 'pour milk', 'stir'.
Ego4D: Egocentric Video Episodic Memory Annotation
Annotate egocentric (first-person) video for episodic memory tasks including activity segmentation, hand state tracking, natural language query generation, and scene narration. Supports temporal segment annotation with multiple label tiers for the Ego4D benchmark.