
Ego4D: Egocentric Video Episodic Memory Annotation

Annotate egocentric (first-person) video for episodic memory tasks including activity segmentation, hand state tracking, natural language query generation, and scene narration. Supports temporal segment annotation with multiple label tiers for the Ego4D benchmark.

[Interactive preview: video timeline with drag-to-create temporal segments across tiered tracks (Action: Walk / Run / Stand; Scene: Outdoor / Indoor), with a frame counter (847 / 3200) and timecode display (01:12 - 01:28)]

Configuration File: config.yaml

# Ego4D: Egocentric Video Episodic Memory Annotation
# Based on "Ego4D: Around the World in 3,000 Hours of Egocentric Video" (Grauman et al., CVPR 2022)
# Task: Annotate egocentric video with activity segments, hand states, queries, and narrations

annotation_task_name: "Ego4D Episodic Memory Annotation"
task_dir: "."

# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "video_url"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

# Display layout for egocentric video
html_layout: |
  <div class="ego4d-container" style="font-family: Arial, sans-serif; max-width: 960px; margin: 0 auto;">
    <div class="metadata-bar" style="display: flex; gap: 12px; margin-bottom: 14px; flex-wrap: wrap;">
      <div style="background: #e8f5e9; padding: 7px 14px; border-radius: 8px;">
        <strong>Participant:</strong> {{participant_id}}
      </div>
      <div style="background: #e3f2fd; padding: 7px 14px; border-radius: 8px;">
        <strong>Scenario:</strong> {{scenario}}
      </div>
      <div style="background: #fff3e0; padding: 7px 14px; border-radius: 8px;">
        <strong>Location:</strong> {{location_type}}
      </div>
      <div style="background: #f3e5f5; padding: 7px 14px; border-radius: 8px;">
        <strong>Duration:</strong> {{duration_seconds}}s
      </div>
      <div style="background: #e0f2f1; padding: 7px 14px; border-radius: 8px;">
        <strong>Country:</strong> {{country}}
      </div>
    </div>
    <div class="video-section" style="background: #212121; padding: 12px; border-radius: 8px; margin-bottom: 16px; text-align: center;">
      <video controls width="100%" style="max-height: 540px; border-radius: 4px;">
        <source src="{{video_url}}" type="video/mp4">
        Your browser does not support the video tag.
      </video>
      <p style="color: #bdbdbd; font-size: 12px; margin-top: 8px;">Egocentric (first-person) view</p>
    </div>
  </div>

# Annotation schemes
annotation_schemes:
  # Activity segment annotation on the video timeline
  - name: "activity_segments"
    description: "Segment the video into activity types. Mark the temporal boundaries of each distinct activity."
    annotation_type: video_annotation
    mode: segment
    labels:
      - name: "active-interaction"
        color: "#4CAF50"
        key_value: "a"
      - name: "passive-observation"
        color: "#2196F3"
        key_value: "o"
      - name: "locomotion"
        color: "#FF9800"
        key_value: "l"
      - name: "communication"
        color: "#9C27B0"
        key_value: "c"
      - name: "object-manipulation"
        color: "#F44336"
        key_value: "m"
      - name: "tool-use"
        color: "#795548"
        key_value: "t"
      - name: "food-preparation"
        color: "#FF5722"
        key_value: "k"
      - name: "cleaning"
        color: "#607D8B"
        key_value: "n"
    frame_stepping: true
    show_timecode: true
    playback_rate_control: true
    video_fps: 30

  # Hand state annotation tier
  - name: "hand_state_tier"
    description: "Track the state of the participant's hands throughout the video."
    annotation_type: video_annotation
    mode: segment
    labels:
      - name: "both-hands-free"
        color: "#81C784"
      - name: "left-hand-holding"
        color: "#64B5F6"
      - name: "right-hand-holding"
        color: "#FFB74D"
      - name: "both-hands-holding"
        color: "#E57373"
      - name: "hand-occluded"
        color: "#BDBDBD"
    frame_stepping: true
    show_timecode: true
    playback_rate_control: true
    video_fps: 30

  # Natural language query generation
  - name: "natural_language_query"
    description: "Write an episodic memory query that could be answered by this video segment (e.g., 'Where did I put my keys?', 'When did I last use the knife?')."
    annotation_type: text

  # Video narration
  - name: "narration"
    description: "Describe the activity shown in the video in present tense (e.g., 'The person opens a drawer and takes out a spoon')."
    annotation_type: text

  # Scene type classification
  - name: "scene_type"
    description: "What type of scene/environment is shown in the video?"
    annotation_type: radio
    labels:
      - name: "indoor-home"
        tooltip: "Kitchen, living room, bedroom, bathroom, garage, etc."
        key_value: "1"
      - name: "indoor-workplace"
        tooltip: "Office, workshop, lab, factory, etc."
        key_value: "2"
      - name: "indoor-commercial"
        tooltip: "Store, restaurant, mall, gym, etc."
        key_value: "3"
      - name: "outdoor-urban"
        tooltip: "Street, parking lot, sidewalk, park in a city"
        key_value: "4"
      - name: "outdoor-nature"
        tooltip: "Garden, trail, forest, beach, farm"
        key_value: "5"
      - name: "vehicle"
        tooltip: "Inside a car, bus, train, bicycle POV"
        key_value: "6"

# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 30
annotation_per_instance: 2

# Detailed annotation instructions
annotation_instructions: |
  ## Ego4D Episodic Memory Annotation

  You are annotating egocentric (first-person) video captured from a head-mounted
  camera. The goal is to create rich annotations for episodic memory research.

  ### Your Tasks:

  1. **Activity Segmentation**: Mark temporal segments for each distinct activity.
     - Use the video timeline to mark start and end of each activity.
     - Activities may overlap or transition smoothly.
     - Choose the most specific label that applies.

  2. **Hand State Tracking**: Track what the participant's hands are doing.
     - Mark segments where hand state changes.
     - If hands are not visible, use "hand-occluded."

  3. **Natural Language Query**: Write a question that this video could answer.
     - Frame it as an episodic memory question (first person).
     - Examples: "Where did I leave the spatula?", "What did I do after washing my hands?"
     - The query should be answerable from the video content.

  4. **Narration**: Describe the activity in present tense.
     - Be specific about objects and actions.
     - Example: "The person picks up a red mug from the counter and pours coffee."

  5. **Scene Type**: Classify the environment shown in the video.

  ### Tips:
  - Watch the entire video before annotating.
  - Use frame stepping (arrow keys) for precise segment boundaries.
  - Egocentric videos may have rapid head movements; focus on what the hands are doing.
  - If an activity is unclear, choose the closest label and note it in the narration.
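Both video tiers enable frame_stepping and show_timecode with video_fps: 30, so segment boundaries may be handled either as frame indices or as timecodes downstream. A minimal conversion sketch, assuming an MM:SS.mmm timecode format (the exact format Potato emits is not specified here):

```python
def frame_to_timecode(frame, fps=30):
    """Convert a 0-based frame index to an MM:SS.mmm timecode string."""
    seconds = frame / fps
    minutes = int(seconds // 60)
    return f"{minutes:02d}:{seconds % 60:06.3f}"

def timecode_to_frame(timecode, fps=30):
    """Convert an MM:SS(.mmm) timecode string back to the nearest frame index."""
    minutes, seconds = timecode.split(":")
    return round((int(minutes) * 60 + float(seconds)) * fps)

# The preview's "Frame 847" at 30 fps:
print(frame_to_timecode(847))          # 00:28.233
print(timecode_to_frame("00:28.233"))  # 847
```

Rounding to the nearest frame in timecode_to_frame keeps the round trip lossless at 30 fps, which matters when annotators use frame stepping to set precise boundaries.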

Sample Data: sample-data.json

[
  {
    "id": "ego4d_001",
    "video_url": "https://example.com/ego4d/video_001_cooking_pasta.mp4",
    "participant_id": "P001",
    "scenario": "cooking",
    "location_type": "indoor-home",
    "duration_seconds": 180,
    "country": "United States"
  },
  {
    "id": "ego4d_002",
    "video_url": "https://example.com/ego4d/video_002_woodworking.mp4",
    "participant_id": "P014",
    "scenario": "crafts",
    "location_type": "indoor-workplace",
    "duration_seconds": 240,
    "country": "Italy"
  }
]

// ... and 6 more items
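Each data item must carry the id_key and text_key fields named in item_properties, plus every field the {{...}} placeholders in html_layout reference; a missing field would leave a blank slot in the metadata bar. A pre-flight check (field names taken from the config above; the function and script are illustrative, not part of Potato):

```python
import json

# Fields required by item_properties (id, video_url) plus the
# {{...}} placeholders used in the html_layout above.
REQUIRED = {"id", "video_url", "participant_id", "scenario",
            "location_type", "duration_seconds", "country"}

def missing_fields(items):
    """Map each item's id to a sorted list of required fields it lacks."""
    report = {}
    for item in items:
        missing = REQUIRED - item.keys()
        if missing:
            report[item.get("id", "<no id>")] = sorted(missing)
    return report

if __name__ == "__main__":
    with open("sample-data.json") as f:
        print(missing_fields(json.load(f)))  # {} means all items are complete
```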

Get This Design

View on GitHub

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/ego4d-episodic-memory
potato start config.yaml
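Because annotation_per_instance is set to 2, each video's segments come from two annotators, and a common way to compare them is temporal intersection-over-union. A sketch under assumed representations (segments as (start, end, label) tuples; this is not Potato's exact output schema):

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) intervals (seconds or frames)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_segments(segs_a, segs_b, iou_threshold=0.5):
    """Greedily match same-label segments across two annotators.

    Returns (index_in_a, index_in_b, iou) triples for pairs whose
    IoU meets the threshold; unmatched segments are disagreements.
    """
    matched, used = [], set()
    for i, (s1, e1, lab1) in enumerate(segs_a):
        best, best_iou = None, iou_threshold
        for j, (s2, e2, lab2) in enumerate(segs_b):
            if j in used or lab1 != lab2:
                continue
            iou = temporal_iou((s1, e1), (s2, e2))
            if iou >= best_iou:
                best, best_iou = j, iou
        if best is not None:
            used.add(best)
            matched.append((i, best, best_iou))
    return matched
```

A 0.5 IoU threshold is a conventional starting point for temporal localization tasks; stricter thresholds reward tighter boundary agreement.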

Details

Annotation Types

video_annotation, text, radio

Domain

Egocentric Vision, Episodic Memory, Video Understanding

Use Cases

Activity Recognition, Temporal Segmentation, Video Narration

Tags

egocentric, episodic-memory, ego4d, cvpr2022, first-person-video, temporal-grounding

Found an issue or want to improve this design?

Open an Issue