Ego4D: Egocentric Video Episodic Memory Annotation
Annotate egocentric (first-person) video for episodic memory tasks including activity segmentation, hand state tracking, natural language query generation, and scene narration. Supports temporal segment annotation with multiple label tiers for the Ego4D benchmark.
Configuration File: config.yaml
# Ego4D: Egocentric Video Episodic Memory Annotation
# Based on "Ego4D: Around the World in 3,000 Hours of Egocentric Video" (Grauman et al., CVPR 2022)
# Task: Annotate egocentric video with activity segments, hand states, queries, and narrations
annotation_task_name: "Ego4D Episodic Memory Annotation"
task_dir: "."
# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "video_url"
# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Display layout for egocentric video
html_layout: |
  <div class="ego4d-container" style="font-family: Arial, sans-serif; max-width: 960px; margin: 0 auto;">
    <div class="metadata-bar" style="display: flex; gap: 12px; margin-bottom: 14px; flex-wrap: wrap;">
      <div style="background: #e8f5e9; padding: 7px 14px; border-radius: 8px;">
        <strong>Participant:</strong> {{participant_id}}
      </div>
      <div style="background: #e3f2fd; padding: 7px 14px; border-radius: 8px;">
        <strong>Scenario:</strong> {{scenario}}
      </div>
      <div style="background: #fff3e0; padding: 7px 14px; border-radius: 8px;">
        <strong>Location:</strong> {{location_type}}
      </div>
      <div style="background: #f3e5f5; padding: 7px 14px; border-radius: 8px;">
        <strong>Duration:</strong> {{duration_seconds}}s
      </div>
      <div style="background: #e0f2f1; padding: 7px 14px; border-radius: 8px;">
        <strong>Country:</strong> {{country}}
      </div>
    </div>
    <div class="video-section" style="background: #212121; padding: 12px; border-radius: 8px; margin-bottom: 16px; text-align: center;">
      <video controls width="100%" style="max-height: 540px; border-radius: 4px;">
        <source src="{{video_url}}" type="video/mp4">
        Your browser does not support the video tag.
      </video>
      <p style="color: #bdbdbd; font-size: 12px; margin-top: 8px;">Egocentric (first-person) view</p>
    </div>
  </div>
# Annotation schemes
annotation_schemes:
  # Activity segment annotation on the video timeline
  - name: "activity_segments"
    description: "Segment the video into activity types. Mark the temporal boundaries of each distinct activity."
    annotation_type: video_annotation
    mode: segment
    labels:
      - name: "active-interaction"
        color: "#4CAF50"
        key_value: "a"
      - name: "passive-observation"
        color: "#2196F3"
        key_value: "o"
      - name: "locomotion"
        color: "#FF9800"
        key_value: "l"
      - name: "communication"
        color: "#9C27B0"
        key_value: "c"
      - name: "object-manipulation"
        color: "#F44336"
        key_value: "m"
      - name: "tool-use"
        color: "#795548"
        key_value: "t"
      - name: "food-preparation"
        color: "#FF5722"
        key_value: "k"
      - name: "cleaning"
        color: "#607D8B"
        key_value: "n"
    frame_stepping: true
    show_timecode: true
    playback_rate_control: true
    video_fps: 30

  # Hand state annotation tier
  - name: "hand_state_tier"
    description: "Track the state of the participant's hands throughout the video."
    annotation_type: video_annotation
    mode: segment
    labels:
      - name: "both-hands-free"
        color: "#81C784"
      - name: "left-hand-holding"
        color: "#64B5F6"
      - name: "right-hand-holding"
        color: "#FFB74D"
      - name: "both-hands-holding"
        color: "#E57373"
      - name: "hand-occluded"
        color: "#BDBDBD"
    frame_stepping: true
    show_timecode: true
    playback_rate_control: true
    video_fps: 30

  # Natural language query generation
  - name: "natural_language_query"
    description: "Write an episodic memory query that could be answered by this video segment (e.g., 'Where did I put my keys?', 'When did I last use the knife?')."
    annotation_type: text

  # Video narration
  - name: "narration"
    description: "Describe the activity shown in the video in present tense (e.g., 'The person opens a drawer and takes out a spoon')."
    annotation_type: text

  # Scene type classification
  - name: "scene_type"
    description: "What type of scene/environment is shown in the video?"
    annotation_type: radio
    labels:
      - name: "indoor-home"
        tooltip: "Kitchen, living room, bedroom, bathroom, garage, etc."
        key_value: "1"
      - name: "indoor-workplace"
        tooltip: "Office, workshop, lab, factory, etc."
        key_value: "2"
      - name: "indoor-commercial"
        tooltip: "Store, restaurant, mall, gym, etc."
        key_value: "3"
      - name: "outdoor-urban"
        tooltip: "Street, parking lot, sidewalk, park in a city"
        key_value: "4"
      - name: "outdoor-nature"
        tooltip: "Garden, trail, forest, beach, farm"
        key_value: "5"
      - name: "vehicle"
        tooltip: "Inside a car, bus, train, bicycle POV"
        key_value: "6"
# User configuration
allow_all_users: true
# Task assignment
instances_per_annotator: 30
annotation_per_instance: 2
# Detailed annotation instructions
annotation_instructions: |
  ## Ego4D Episodic Memory Annotation

  You are annotating egocentric (first-person) video captured from a head-mounted
  camera. The goal is to create rich annotations for episodic memory research.

  ### Your Tasks:

  1. **Activity Segmentation**: Mark temporal segments for each distinct activity.
     - Use the video timeline to mark the start and end of each activity.
     - Activities may overlap or transition smoothly.
     - Choose the most specific label that applies.
  2. **Hand State Tracking**: Track what the participant's hands are doing.
     - Mark a new segment wherever the hand state changes.
     - If hands are not visible, use "hand-occluded."
  3. **Natural Language Query**: Write a question that this video could answer.
     - Frame it as an episodic memory question (first person).
     - Examples: "Where did I leave the spatula?", "What did I do after washing my hands?"
     - The query should be answerable from the video content.
  4. **Narration**: Describe the activity in present tense.
     - Be specific about objects and actions.
     - Example: "The person picks up a red mug from the counter and pours coffee."
  5. **Scene Type**: Classify the environment shown in the video.

  ### Tips:

  - Watch the entire video before annotating.
  - Use frame stepping (arrow keys) for precise segment boundaries.
  - Egocentric videos may have rapid head movements -- focus on what the hands are doing.
  - If an activity is unclear, choose the closest label and note it in the narration.
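Because the two video tiers use `mode: segment` with `video_fps: 30` and frame stepping enabled, downstream code often needs to map time-based segment boundaries onto frame indices. The sketch below shows that conversion; the `(start_sec, end_sec)` inputs are hypothetical field names, since Potato's exact export schema may differ by version.

# Convert a time-based segment to frame indices at the configured video_fps.
# NOTE: the (start_sec, end_sec) inputs are illustrative; check the field
# names in your Potato version's exported annotations before relying on this.
def segment_to_frames(start_sec: float, end_sec: float, fps: int = 30) -> range:
    """Map a [start_sec, end_sec) segment to a half-open range of frame indices."""
    return range(round(start_sec * fps), round(end_sec * fps))

seg = segment_to_frames(12.4, 17.9)  # example: a 5.5 s "object-manipulation" segment
print(f"frames {seg.start}-{seg.stop - 1} ({len(seg)} frames at 30 fps)")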
Sample Data: sample-data.json
[
  {
    "id": "ego4d_001",
    "video_url": "https://example.com/ego4d/video_001_cooking_pasta.mp4",
    "participant_id": "P001",
    "scenario": "cooking",
    "location_type": "indoor-home",
    "duration_seconds": 180,
    "country": "United States"
  },
  {
    "id": "ego4d_002",
    "video_url": "https://example.com/ego4d/video_002_woodworking.mp4",
    "participant_id": "P014",
    "scenario": "crafts",
    "location_type": "indoor-workplace",
    "duration_seconds": 240,
    "country": "Italy"
  }
]
// ... and 6 more items
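Before launching the task, it can help to verify that every item supplies the fields the `html_layout` interpolates ({{participant_id}}, {{scenario}}, and so on), since a missing key would leave a blank in the metadata bar. A minimal standard-library sketch; the field list is copied by hand from the layout above, not read from the config automatically:

# Sanity-check sample-data.json against the placeholders used in html_layout.
import json

REQUIRED_FIELDS = [
    "id", "video_url", "participant_id", "scenario",
    "location_type", "duration_seconds", "country",
]

with open("sample-data.json") as f:
    items = json.load(f)

for item in items:
    missing = [k for k in REQUIRED_FIELDS if k not in item]
    if missing:
        print(f"{item.get('id', '<no id>')}: missing {missing}")
    if not str(item.get("video_url", "")).endswith(".mp4"):
        print(f"{item.get('id')}: video_url is not an .mp4 file")

print(f"Checked {len(items)} items.")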
Get This Design

Clone or download from the repository.

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/ego4d-episodic-memory
potato start config.yaml
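Once annotators have worked through the task, results land under `annotation_output/` as JSON (per `output_annotation_dir` and `output_annotation_format` in the config). The exact directory layout and record schema depend on the Potato version, so treat the following as a sketch: it simply walks the output tree and counts whatever JSON records it finds.

# Tally annotation records written under annotation_output/.
# ASSUMPTION: Potato writes per-annotator JSON/JSONL files somewhere under
# output_annotation_dir; the layout varies by version, so this sketch just
# globs the tree rather than assuming specific file names.
import json
from pathlib import Path

count = 0
for path in Path("annotation_output").rglob("*.json*"):
    with open(path) as f:
        if path.suffix == ".jsonl":
            records = [json.loads(line) for line in f if line.strip()]
        else:
            data = json.load(f)
            records = data if isinstance(data, list) else [data]
    count += len(records)
    print(f"{path}: {len(records)} record(s)")

print(f"Total: {count} annotation record(s)")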
Found an issue or want to improve this design? Open an issue on the repository.

Related Designs
MVBench Video Understanding
Comprehensive video understanding benchmark with multiple-choice questions, video segment annotation, and reasoning, based on MVBench (Li et al., arXiv 2023). Tests temporal perception, action recognition, and state change detection in videos.
NExT-QA - Temporal and Causal Video Question Answering
Temporal and causal video question answering requiring reasoning about actions, events, and their relationships over time. Based on the NExT-QA dataset (Xiao et al., CVPR 2021), annotators answer multiple-choice questions about video content with an emphasis on temporal and causal understanding.
VSTAR Video-grounded Dialogue
Video-grounded dialogue annotation. Annotators watch videos and answer questions requiring situated understanding, write dialogue turns grounded in specific video moments, and mark relevant temporal segments.