intermediate, video

Charades Indoor Activity Segmentation

Multi-label temporal activity segmentation in indoor home videos. Annotators identify action instances using compositional verb-object labels (e.g., 'opening door', 'sitting on chair') with precise temporal boundaries.

About this dataset

"Hollywood in Homes" introduced Charades, a dataset for recognizing everyday human activities in video, published by Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta at ECCV 2016. The task is to find each action in a short home video and mark where it begins and ends, using verb-object labels such as "opening a door" or "sitting on a chair." The authors built it to study activities as they occur in ordinary homes rather than in curated web clips.

The data comes from 267 people who wrote scripts, filmed themselves acting out those scripts in their own homes, and then annotated the footage. Annotation covers several layers: free-text descriptions of what happens, temporal intervals marking when each action occurs, and labels for the objects involved.

Charades contains 9,848 videos with an average length of about 30 seconds, spanning 157 action classes and 46 object classes. It includes 66,500 temporally localized action intervals, 27,847 video descriptions, and 41,104 object labels. Training videos were labeled by 4 workers each, and test videos by 8 workers through consensus labeling.
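
To put these totals in per-video terms, the short sketch below divides each count by the number of videos. It uses only the figures quoted above; the names `COUNTS` and `per_video_averages` are illustrative and not part of the dataset or of Potato.

```python
# Totals as quoted in the paragraph above.
NUM_VIDEOS = 9_848
COUNTS = {
    "action_intervals": 66_500,
    "descriptions": 27_847,
    "object_labels": 41_104,
}


def per_video_averages(counts, num_videos):
    """Divide each total by the number of videos, rounded to two decimals."""
    return {name: round(total / num_videos, 2) for name, total in counts.items()}


averages = per_video_averages(COUNTS, NUM_VIDEOS)
for name, value in averages.items():
    print(f"{name}: {value} per video")
```

This works out to roughly 6.75 action intervals, 2.83 descriptions, and 4.17 object labels per video, which gives a sense of how many segments an annotator might expect to mark in a typical 30-second clip.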

The Potato config below reproduces this task with a video_annotation scheme in segment mode, offering a curated subset of verb-object action labels that annotators attach to start and end timestamps on the timeline. It works as a template for temporal action segmentation on your own indoor-activity clips.
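
As a rough illustration of what a segment-mode annotation amounts to, the sketch below models each annotation as a label plus start and end times in seconds. The `Segment` class and both helper functions are hypothetical: they are not Potato's output format or API, only one way to reason about overlapping multi-label segments and to compare two annotators' boundaries with temporal intersection-over-union.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One labeled interval on the video timeline (times in seconds)."""
    label: str
    start: float
    end: float

    def __post_init__(self):
        if self.end <= self.start:
            raise ValueError(f"end must be after start: {self}")


def overlap_seconds(a, b):
    """Length of the time span covered by both segments (0.0 if disjoint)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def temporal_iou(a, b):
    """Intersection-over-union of two intervals, between 0.0 and 1.0."""
    inter = overlap_seconds(a, b)
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union


# Two different actions that co-occur, which a multi-label task allows.
sitting = Segment("sitting_on_sofa", 4.0, 20.0)
holding = Segment("holding_phone", 10.0, 26.0)
print(overlap_seconds(sitting, holding))  # 10.0
print(round(temporal_iou(sitting, holding), 3))
```

The same `temporal_iou` function could be applied to two annotators' segments for the same label to get a simple boundary-agreement score, though any threshold for "agreement" would be a project-specific choice.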

Videos: 9,848
Average length: ~30 seconds
Action classes: 157
Object classes: 46
Temporal action intervals: 66,500
Video descriptions: 27,847

Configuration File: config.yaml

This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.

yaml

# Charades Indoor Activity Segmentation Configuration
# Based on Sigurdsson et al., ECCV 2016
# Task: Multi-label activity segmentation with compositional verb-object actions

annotation_task_name: "Charades Activity Segmentation"
task_dir: "."

# Data configuration
data_files:
  - data.json
item_properties:
  id_key: "id"
  text_key: "video_url"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

# Annotation schemes
annotation_schemes:
  - name: "indoor_activities"
    description: |
      Mark all activity instances in the indoor home video.
      Multiple activities may occur simultaneously or sequentially.
      Use verb-object format labels (e.g., "opening door", "sitting on chair").
    annotation_type: "video_annotation"
    mode: "segment"
    labels:
      # Door interactions
      - name: "opening_door"
        color: "#3B82F6"
        key_value: "1"
      - name: "closing_door"
        color: "#1D4ED8"
        key_value: "2"

      # Window interactions
      - name: "opening_window"
        color: "#06B6D4"
      - name: "closing_window"
        color: "#0891B2"

      # Sitting/Standing
      - name: "sitting_on_chair"
        color: "#22C55E"
        key_value: "3"
      - name: "sitting_on_sofa"
        color: "#16A34A"
        key_value: "4"
      - name: "standing_up"
        color: "#84CC16"
        key_value: "5"

      # Object manipulation
      - name: "holding_book"
        color: "#A855F7"
      - name: "putting_down_book"
        color: "#9333EA"
      - name: "holding_phone"
        color: "#D946EF"
      - name: "putting_down_phone"
        color: "#C026D3"

      # Household items
      - name: "opening_refrigerator"
        color: "#F97316"
      - name: "closing_refrigerator"
        color: "#EA580C"
      - name: "drinking_from_cup"
        color: "#EF4444"
        key_value: "6"
      - name: "putting_down_cup"
        color: "#DC2626"

      # Blanket/Pillow
      - name: "taking_blanket"
        color: "#EC4899"
      - name: "putting_blanket"
        color: "#DB2777"
      - name: "holding_pillow"
        color: "#F472B6"

      # TV/Electronics
      - name: "watching_tv"
        color: "#6366F1"
        key_value: "7"
      - name: "turning_on_tv"
        color: "#4F46E5"
      - name: "turning_off_tv"
        color: "#4338CA"

      # Walking
      - name: "walking"
        color: "#F59E0B"
        key_value: "8"
      - name: "running"
        color: "#D97706"

      # Light switches
      - name: "turning_on_light"
        color: "#FACC15"
      - name: "turning_off_light"
        color: "#EAB308"

    zoom_enabled: true
    playback_rate_control: true
    frame_stepping: true
    show_timecode: true
    timeline_height: 100
    video_fps: 24

# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 50
annotation_per_instance: 2

# Instructions
annotation_instructions: |
  ## Charades Activity Segmentation Task

  Your goal is to identify all activities in short indoor home videos.

  ### Video Characteristics:
  - Duration: ~30 seconds each
  - Setting: Indoor home environments
  - Content: Person(s) performing daily activities
  - Multiple activities often occur in sequence

  ### Annotation Format:
  Activities use **verb + object** composition:
  - "opening door" (not just "opening")
  - "sitting on chair" (not just "sitting")
  - "drinking from cup" (not just "drinking")

  ### How to Annotate:
  1. Watch the entire video first
  2. Replay and mark each activity:
     - Select the activity label
     - Mark START when action begins
     - Mark END when action completes
  3. Activities can OVERLAP (e.g., "holding phone" while "sitting on sofa")

  ### Boundary Guidelines:
  - **Start**: First intentional movement toward the action
  - **End**: Action is complete (door fully open, seated, etc.)
  - Include the full action, not just the peak moment

  ### Common Activity Categories:
  - **Door/Window**: opening, closing
  - **Furniture**: sitting on chair/sofa, standing up
  - **Objects**: holding, putting down (book, phone, cup)
  - **Appliances**: refrigerator, TV, lights
  - **Movement**: walking, running

  ### Tips:
  - Multiple activities can happen simultaneously
  - "Holding" actions continue until the object is put down
  - Don't annotate activities that happen off-screen
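
The task-assignment values in the config above imply a minimum team size. The sketch below is a back-of-the-envelope estimate, assuming that `instances_per_annotator` caps how many items one person labels and that `annotation_per_instance` is the number of different people who must label each item. Both readings are assumptions based on the key names, and `min_annotators` is a hypothetical helper, not a Potato function.

```python
import math


def min_annotators(num_items, per_item, per_annotator):
    """Lower bound on annotators, assuming nobody labels the same item twice."""
    if num_items == 0:
        return 0
    total_annotations = num_items * per_item
    by_capacity = math.ceil(total_annotations / per_annotator)
    # Each item needs `per_item` distinct people, so that is also a floor.
    return max(per_item, by_capacity)


# annotation_per_instance: 2 and instances_per_annotator: 50, as in the config.
print(min_annotators(5, 2, 50))      # 2
print(min_annotators(9848, 2, 50))   # 394
```

Under these assumptions a five-item sample still needs two people, because one person cannot supply both annotations for an item, while double-annotating all 9,848 Charades videos would take at least 394 annotators at 50 videos each.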

Sample Data: sample-data.json

json

[
  {
    "id": "charades_001",
    "video_url": "https://example.com/videos/living_room_001.mp4",
    "duration_seconds": 30,
    "scene": "living_room",
    "script": "Person enters, sits on sofa, picks up book, reads",
    "expected_actions": [
      "walking",
      "sitting_on_sofa",
      "holding_book"
    ]
  },
  {
    "id": "charades_002",
    "video_url": "https://example.com/videos/kitchen_001.mp4",
    "duration_seconds": 28,
    "scene": "kitchen",
    "script": "Person opens refrigerator, takes out drink, closes refrigerator, drinks",
    "expected_actions": [
      "opening_refrigerator",
      "closing_refrigerator",
      "drinking_from_cup"
    ]
  }
]

(3 more items not shown)
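
Before loading your own clips, it can help to check that every item carries the keys the config expects and that any `expected_actions` use label names defined in the annotation scheme. The sketch below is one way to do that with the standard library. `validate_items` is a hypothetical helper, the label set is copied by hand from the config above, and the sample items are trimmed to the fields being checked.

```python
import json

# Keys named under item_properties in the config.
ID_KEY = "id"
VIDEO_KEY = "video_url"

# Label names copied from the annotation scheme in the config.
KNOWN_LABELS = {
    "opening_door", "closing_door", "opening_window", "closing_window",
    "sitting_on_chair", "sitting_on_sofa", "standing_up",
    "holding_book", "putting_down_book", "holding_phone", "putting_down_phone",
    "opening_refrigerator", "closing_refrigerator",
    "drinking_from_cup", "putting_down_cup",
    "taking_blanket", "putting_blanket", "holding_pillow",
    "watching_tv", "turning_on_tv", "turning_off_tv",
    "walking", "running", "turning_on_light", "turning_off_light",
}


def validate_items(items, known_labels):
    """Return a list of readable problems; an empty list means no issues found."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        for key in (ID_KEY, VIDEO_KEY):
            if not item.get(key):
                problems.append(f"item {index}: missing '{key}'")
        item_id = item.get(ID_KEY)
        if item_id is not None:
            if item_id in seen_ids:
                problems.append(f"item {index}: duplicate id '{item_id}'")
            seen_ids.add(item_id)
        for label in item.get("expected_actions", []):
            if label not in known_labels:
                problems.append(f"item {index}: unknown label '{label}'")
    return problems


sample = json.loads("""
[
  {"id": "charades_001",
   "video_url": "https://example.com/videos/living_room_001.mp4",
   "expected_actions": ["walking", "sitting_on_sofa", "holding_book"]},
  {"id": "charades_002",
   "video_url": "https://example.com/videos/kitchen_001.mp4",
   "expected_actions": ["opening_refrigerator", "closing_refrigerator",
                        "drinking_from_cup"]}
]
""")
print(validate_items(sample, KNOWN_LABELS))  # []
```

In practice you would read the items from your data file with `json.load` instead of an inline string, and keep `KNOWN_LABELS` in sync with whatever labels your own config defines.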

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/action-recognition/charades-activity-segmentation
potato start config.yaml

Dataset & paper

Sigurdsson et al., ECCV 2016

Citation (BibTeX)

bibtex

@inproceedings{sigurdsson2016hollywood,
    title = {Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding},
    author = {Sigurdsson, Gunnar A. and Varol, G{\"u}l and Wang, Xiaolong and Farhadi, Ali and Laptev, Ivan and Gupta, Abhinav},
    booktitle = {European Conference on Computer Vision (ECCV)},
    pages = {510--526},
    year = {2016},
    organization = {Springer}
}

Details

Annotation Types

video_annotation

Domain

Computer Vision, Video Understanding

Use Cases

Activity Recognition, Action Segmentation, Indoor Scene Understanding

Related Designs

ActivityNet Captions: Dense Video Captioning Dataset

ActivityNet Captions pairs 20k untrimmed videos with 100k temporally localized sentence descriptions for dense-captioning research. This Potato config reproduces the segment-and-describe workflow.

video_annotation, text

ActivityNet: Temporal Action Localization Benchmark

ActivityNet is a large-scale video benchmark for human activity understanding, with untrimmed YouTube videos labeled across 200 daily activities. This Potato config reproduces its temporal localization annotation: marking start and end times of activity instances.