Ego4D: Egocentric Video Episodic Memory Annotation
Annotate egocentric (first-person) video for episodic memory tasks including activity segmentation, hand state tracking, natural language query generation, and scene narration. Supports temporal segment annotation with multiple label tiers for the Ego4D benchmark.
Configuration File: config.yaml
# Ego4D: Egocentric Video Episodic Memory Annotation
# Based on "Ego4D: Around the World in 3,000 Hours of Egocentric Video" (Grauman et al., CVPR 2022)
# Task: Annotate egocentric video with activity segments, hand states, queries, and narrations
annotation_task_name: "Ego4D Episodic Memory Annotation"
task_dir: "."
# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "video_url"
# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Display layout for egocentric video
html_layout: |
  <div class="ego4d-container" style="font-family: Arial, sans-serif; max-width: 960px; margin: 0 auto;">
    <div class="metadata-bar" style="display: flex; gap: 12px; margin-bottom: 14px; flex-wrap: wrap;">
      <div style="background: #e8f5e9; padding: 7px 14px; border-radius: 8px;">
        <strong>Participant:</strong> {{participant_id}}
      </div>
      <div style="background: #e3f2fd; padding: 7px 14px; border-radius: 8px;">
        <strong>Scenario:</strong> {{scenario}}
      </div>
      <div style="background: #fff3e0; padding: 7px 14px; border-radius: 8px;">
        <strong>Location:</strong> {{location_type}}
      </div>
      <div style="background: #f3e5f5; padding: 7px 14px; border-radius: 8px;">
        <strong>Duration:</strong> {{duration_seconds}}s
      </div>
      <div style="background: #e0f2f1; padding: 7px 14px; border-radius: 8px;">
        <strong>Country:</strong> {{country}}
      </div>
    </div>
    <div class="video-section" style="background: #212121; padding: 12px; border-radius: 8px; margin-bottom: 16px; text-align: center;">
      <video controls width="100%" style="max-height: 540px; border-radius: 4px;">
        <source src="{{video_url}}" type="video/mp4">
        Your browser does not support the video tag.
      </video>
      <p style="color: #bdbdbd; font-size: 12px; margin-top: 8px;">Egocentric (first-person) view</p>
    </div>
  </div>
# Annotation schemes
annotation_schemes:
  # Activity segment annotation on the video timeline
  - name: "activity_segments"
    description: "Segment the video into activity types. Mark the temporal boundaries of each distinct activity."
    annotation_type: video_annotation
    mode: segment
    labels:
      - name: "active-interaction"
        color: "#4CAF50"
        key_value: "a"
      - name: "passive-observation"
        color: "#2196F3"
        key_value: "o"
      - name: "locomotion"
        color: "#FF9800"
        key_value: "l"
      - name: "communication"
        color: "#9C27B0"
        key_value: "c"
      - name: "object-manipulation"
        color: "#F44336"
        key_value: "m"
      - name: "tool-use"
        color: "#795548"
        key_value: "t"
      - name: "food-preparation"
        color: "#FF5722"
        key_value: "k"
      - name: "cleaning"
        color: "#607D8B"
        key_value: "n"
    frame_stepping: true
    show_timecode: true
    playback_rate_control: true
    video_fps: 30

  # Hand state annotation tier
  - name: "hand_state_tier"
    description: "Track the state of the participant's hands throughout the video."
    annotation_type: video_annotation
    mode: segment
    labels:
      - name: "both-hands-free"
        color: "#81C784"
      - name: "left-hand-holding"
        color: "#64B5F6"
      - name: "right-hand-holding"
        color: "#FFB74D"
      - name: "both-hands-holding"
        color: "#E57373"
      - name: "hand-occluded"
        color: "#BDBDBD"
    frame_stepping: true
    show_timecode: true
    playback_rate_control: true
    video_fps: 30

  # Natural language query generation
  - name: "natural_language_query"
    description: "Write an episodic memory query that could be answered by this video segment (e.g., 'Where did I put my keys?', 'When did I last use the knife?')."
    annotation_type: text

  # Video narration
  - name: "narration"
    description: "Describe the activity shown in the video in present tense (e.g., 'The person opens a drawer and takes out a spoon')."
    annotation_type: text

  # Scene type classification
  - name: "scene_type"
    description: "What type of scene/environment is shown in the video?"
    annotation_type: radio
    labels:
      - name: "indoor-home"
        tooltip: "Kitchen, living room, bedroom, bathroom, garage, etc."
        key_value: "1"
      - name: "indoor-workplace"
        tooltip: "Office, workshop, lab, factory, etc."
        key_value: "2"
      - name: "indoor-commercial"
        tooltip: "Store, restaurant, mall, gym, etc."
        key_value: "3"
      - name: "outdoor-urban"
        tooltip: "Street, parking lot, sidewalk, park in a city"
        key_value: "4"
      - name: "outdoor-nature"
        tooltip: "Garden, trail, forest, beach, farm"
        key_value: "5"
      - name: "vehicle"
        tooltip: "Inside a car, bus, train, bicycle POV"
        key_value: "6"
# User configuration
allow_all_users: true
# Task assignment
instances_per_annotator: 30
annotation_per_instance: 2
# Detailed annotation instructions
annotation_instructions: |
  ## Ego4D Episodic Memory Annotation

  You are annotating egocentric (first-person) video captured from a head-mounted
  camera. The goal is to create rich annotations for episodic memory research.

  ### Your Tasks:

  1. **Activity Segmentation**: Mark temporal segments for each distinct activity.
     - Use the video timeline to mark the start and end of each activity.
     - Activities may overlap or transition smoothly.
     - Choose the most specific label that applies.
  2. **Hand State Tracking**: Track what the participant's hands are doing.
     - Mark a new segment wherever the hand state changes.
     - If hands are not visible, use "hand-occluded."
  3. **Natural Language Query**: Write a question that this video could answer.
     - Frame it as an episodic memory question (first person).
     - Examples: "Where did I leave the spatula?", "What did I do after washing my hands?"
     - The query should be answerable from the video content.
  4. **Narration**: Describe the activity in present tense.
     - Be specific about objects and actions.
     - Example: "The person picks up a red mug from the counter and pours coffee."
  5. **Scene Type**: Classify the environment shown in the video.

  ### Tips:

  - Watch the entire video before annotating.
  - Use frame stepping (arrow keys) for precise segment boundaries.
  - Egocentric videos may have rapid head movements -- focus on what the hands are doing.
  - If an activity is unclear, choose the closest label and note it in the narration.
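Because the two video tiers use `mode: segment` with `video_fps: 30` and frame stepping enabled, downstream code often needs to map time-based segment boundaries onto frame indices. The sketch below shows that conversion; the `(start_sec, end_sec)` inputs are hypothetical field names, since Potato's exact export schema may differ by version.

# Convert a time-based segment to frame indices at the configured video_fps.
# NOTE: the (start_sec, end_sec) inputs are illustrative; check the field
# names in your Potato version's exported annotations before relying on this.
def segment_to_frames(start_sec: float, end_sec: float, fps: int = 30) -> range:
    """Map a [start_sec, end_sec) segment to a half-open range of frame indices."""
    return range(round(start_sec * fps), round(end_sec * fps))

seg = segment_to_frames(12.4, 17.9)  # example: a 5.5 s "object-manipulation" segment
print(f"frames {seg.start}-{seg.stop - 1} ({len(seg)} frames at 30 fps)")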
Sample Data: sample-data.json
[
  {
    "id": "ego4d_001",
    "video_url": "https://example.com/ego4d/video_001_cooking_pasta.mp4",
    "participant_id": "P001",
    "scenario": "cooking",
    "location_type": "indoor-home",
    "duration_seconds": 180,
    "country": "United States"
  },
  {
    "id": "ego4d_002",
    "video_url": "https://example.com/ego4d/video_002_woodworking.mp4",
    "participant_id": "P014",
    "scenario": "crafts",
    "location_type": "indoor-workplace",
    "duration_seconds": 240,
    "country": "Italy"
  }
]
// ... and 6 more items
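Before launching the task, it can help to verify that every item supplies the fields the `html_layout` interpolates ({{participant_id}}, {{scenario}}, and so on), since a missing key would leave a blank in the metadata bar. A minimal standard-library sketch; the field list is copied by hand from the layout above, not read from the config automatically:

# Sanity-check sample-data.json against the placeholders used in html_layout.
import json

REQUIRED_FIELDS = [
    "id", "video_url", "participant_id", "scenario",
    "location_type", "duration_seconds", "country",
]

with open("sample-data.json") as f:
    items = json.load(f)

for item in items:
    missing = [k for k in REQUIRED_FIELDS if k not in item]
    if missing:
        print(f"{item.get('id', '<no id>')}: missing {missing}")
    if not str(item.get("video_url", "")).endswith(".mp4"):
        print(f"{item.get('id')}: video_url is not an .mp4 file")

print(f"Checked {len(items)} items.")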
Get This Design

Clone or download from the repository.

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/ego4d-episodic-memory
potato start config.yaml
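Once annotators have worked through the task, results land under `annotation_output/` as JSON (per `output_annotation_dir` and `output_annotation_format` in the config). The exact directory layout and record schema depend on the Potato version, so treat the following as a sketch: it simply walks the output tree and counts whatever JSON records it finds.

# Tally annotation records written under annotation_output/.
# ASSUMPTION: Potato writes per-annotator JSON/JSONL files somewhere under
# output_annotation_dir; the layout varies by version, so this sketch just
# globs the tree rather than assuming specific file names.
import json
from pathlib import Path

count = 0
for path in Path("annotation_output").rglob("*.json*"):
    with open(path) as f:
        if path.suffix == ".jsonl":
            records = [json.loads(line) for line in f if line.strip()]
        else:
            data = json.load(f)
            records = data if isinstance(data, list) else [data]
    count += len(records)
    print(f"{path}: {len(records)} record(s)")

print(f"Total: {count} annotation record(s)")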
Found an issue or want to improve this design? Open an issue on the repository.

Related Designs
MVBench Video Understanding
Comprehensive video understanding benchmark with multiple-choice questions, video segment annotation, and reasoning, based on MVBench (Li et al., arXiv 2023). Tests temporal perception, action recognition, and state change detection in videos.
NExT-QA - Temporal and Causal Video Question Answering
Temporal and causal video question answering requiring reasoning about actions, events, and their relationships over time. Based on the NExT-QA dataset (Xiao et al., CVPR 2021), annotators answer multiple-choice questions about video content with an emphasis on temporal and causal understanding.
VSTAR Video-grounded Dialogue
Video-grounded dialogue annotation. Annotators watch videos and answer questions requiring situated understanding, write dialogue turns grounded in specific video moments, and mark relevant temporal segments.