intermediate, video

ActivityNet: Temporal Action Localization Benchmark

ActivityNet is a large-scale video benchmark for human activity understanding, with untrimmed YouTube videos labeled across 200 daily activities. This Potato config reproduces its temporal localization annotation: marking start and end times of activity instances.

About this dataset

ActivityNet was introduced by Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles at CVPR 2015 as a large-scale benchmark for human activity understanding in untrimmed video. The widely used release 1.3 is the standard dataset for the temporal action localization and detection task.

ActivityNet v1.3 covers 200 daily activity classes, such as walking the dog, long jump, and vacuuming floor, in videos drawn from YouTube. It holds 19,994 videos totaling around 648 hours, split roughly 50/25/25 across training (10,024), validation (4,926), and testing (5,044).

The temporal localization task asks annotators to find each activity instance in an untrimmed video, mark its precise start and end timestamps, and assign one of the 200 class labels. Videos average about 1.5 activity instances each, and because they are untrimmed, most also contain background segments alongside the labeled activity.
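
To make the split between labeled instances and background concrete, here is a minimal Python sketch. It is not part of ActivityNet's or Potato's tooling: the `Segment` record and the `background_seconds` helper are hypothetical names introduced for illustration, and the timestamps are made-up values. The function adds up the time covered by labeled instances, counting overlapping instances only once, and returns what is left over.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One labeled activity instance (hypothetical structure, times in seconds)."""
    start: float
    end: float
    label: str


def background_seconds(duration: float, segments: list[Segment]) -> float:
    """Return the seconds of video not covered by any labeled segment."""
    covered = 0.0
    cursor = 0.0  # right edge of the region already counted
    for seg in sorted(segments, key=lambda s: s.start):
        start = max(seg.start, cursor, 0.0)  # skip time already counted
        end = min(seg.end, duration)         # clip to the video length
        if end > start:
            covered += end - start
            cursor = end
    return duration - covered


# Illustrative values only: a 180-second video with two overlapping instances.
example = [
    Segment(20.0, 95.0, "playing_basketball"),
    Segment(90.0, 140.0, "playing_basketball"),
]
print(background_seconds(180.0, example))  # 60.0
```

In this example the two instances together cover 120 seconds, so the remaining 60 seconds are background that an annotator would leave unlabeled.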

The Potato config below reproduces the temporal localization workflow: an annotator watches an untrimmed video, marks the boundaries of each activity instance, and labels it with one of the 200 activity classes.

Activity classes: 200
Videos: 19,994
Total video: ~648 hours
Train / val / test: 10,024 / 4,926 / 5,044
Avg instances per video: ~1.5
Source: YouTube (untrimmed)
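
The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted on this page; the variable names are illustrative. It confirms that the three splits add up to the full video count, shows what "roughly 50/25/25" means in practice, and derives the mean video length implied by the approximate 648-hour total.

```python
# Figures quoted in the statistics above.
splits = {"train": 10_024, "val": 4_926, "test": 5_044}
total_videos = 19_994
total_hours = 648  # approximate

# The three splits should account for every video.
assert sum(splits.values()) == total_videos

# Share of the dataset in each split, as a percentage.
shares = {name: round(100 * n / total_videos, 1) for name, n in splits.items()}

# Mean video length implied by ~648 hours over 19,994 videos.
mean_seconds = total_hours * 3600 / total_videos

print(shares)  # {'train': 50.1, 'val': 24.6, 'test': 25.2}
print(f"{mean_seconds:.0f} seconds per video on average")  # 117 seconds
```

Since the hour total is itself rounded, the mean of just under two minutes per video should be read as an estimate.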

Configuration File: config.yaml

This Potato config reproduces the annotation task using 16 example labels in five categories rather than the full 200-class list. Save it as config.yaml and run potato start config.yaml to try it.

yaml

# ActivityNet Temporal Localization Configuration
# Based on Heilbron et al., CVPR 2015
# Task: Localize activity instances with start/end times in untrimmed videos

annotation_task_name: "ActivityNet Temporal Localization"
task_dir: "."

# Data configuration
data_files:
  - data.json
item_properties:
  id_key: "id"
  text_key: "video_url"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

# Annotation schemes
annotation_schemes:
  - name: "activity_segments"
    description: |
      Mark the temporal boundaries of each activity instance in the video.
      Draw segments from when the activity STARTS to when it ENDS.
    annotation_type: "video_annotation"
    mode: "segment"
    labels:
      # Sports activities
      - name: "playing_basketball"
        color: "#F97316"
        key_value: "1"
      - name: "playing_soccer"
        color: "#22C55E"
        key_value: "2"
      - name: "swimming"
        color: "#3B82F6"
        key_value: "3"
      - name: "running"
        color: "#EF4444"
        key_value: "4"
      - name: "gymnastics"
        color: "#A855F7"
        key_value: "5"

      # Household activities
      - name: "cooking"
        color: "#EC4899"
        key_value: "6"
      - name: "cleaning"
        color: "#06B6D4"
        key_value: "7"
      - name: "gardening"
        color: "#84CC16"
        key_value: "8"

      # Personal care
      - name: "brushing_teeth"
        color: "#14B8A6"
        key_value: "9"
      - name: "doing_makeup"
        color: "#F472B6"
        key_value: "0"

      # Music/Performance
      - name: "playing_guitar"
        color: "#8B5CF6"
      - name: "playing_piano"
        color: "#6366F1"
      - name: "singing"
        color: "#D946EF"

      # Outdoor activities
      - name: "hiking"
        color: "#65A30D"
      - name: "fishing"
        color: "#0891B2"
      - name: "camping"
        color: "#059669"

    zoom_enabled: true
    playback_rate_control: true
    frame_stepping: true
    show_timecode: true
    timeline_height: 80
    video_fps: 30

# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 40
annotation_per_instance: 2

# Instructions
annotation_instructions: |
  ## ActivityNet Temporal Localization Task

  Your goal is to identify and localize activity instances in untrimmed videos.

  ### What is Temporal Localization?
  - Finding the precise START and END times of activities
  - Videos may contain multiple activities or none at all
  - Activities may overlap or occur sequentially

  ### How to Annotate:
  1. Watch the video to understand its content
  2. For each activity you identify:
     - Select the activity type from the labels
     - Mark the START time (when activity begins)
     - Mark the END time (when activity ends)

  ### Defining Boundaries:
  - **Start**: First frame where the activity is clearly happening
  - **End**: Last frame where the activity is still happening
  - Include preparation if it's part of the activity
  - Exclude unrelated pauses or interruptions

  ### Activity Categories:

  **Sports:** basketball, soccer, swimming, running, gymnastics
  **Household:** cooking, cleaning, gardening
  **Personal Care:** brushing teeth, doing makeup
  **Music:** playing guitar, playing piano, singing
  **Outdoor:** hiking, fishing, camping

  ### Tips:
  - Use slow playback for precise boundaries
  - Zoom the timeline for long videos
  - One video may have multiple instances of the same activity
  - If unsure about boundaries, mark your best estimate
  - Skip segments that don't match any activity class
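
The task assignment settings in the config imply a minimum annotator pool, which can be estimated with a short calculation. The sketch below is an assumption-laden reading of the key names, not a description of Potato's scheduler: it treats `instances_per_annotator` as a cap on how many items one person sees and `annotation_per_instance` as the number of distinct people who must label each item. The helper name `min_annotators` is hypothetical.

```python
import math


def min_annotators(num_items: int, per_annotator: int, per_item: int) -> int:
    """Smallest annotator pool that can cover every item.

    Assumes per_annotator caps how many items one person sees and
    per_item is how many distinct people must label each item.
    """
    if num_items == 0:
        return 0
    total_assignments = num_items * per_item
    by_capacity = math.ceil(total_assignments / per_annotator)
    # One person cannot supply two of the labels for the same item.
    return max(by_capacity, per_item)


# Values from the config: 40 items per annotator, 2 annotations per item.
print(min_annotators(5, 40, 2))       # the five sample items -> 2
print(min_annotators(19_994, 40, 2))  # every ActivityNet v1.3 video -> 1000
```

For a small pilot the per-item requirement dominates, while at full-dataset scale the per-annotator cap does.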

Sample Data: sample-data.json

json

[
  {
    "id": "anet_001",
    "video_url": "https://example.com/videos/basketball_practice.mp4",
    "duration_seconds": 180,
    "source": "youtube",
    "expected_activity": "playing_basketball",
    "description": "Amateur basketball practice session in a gym"
  },
  {
    "id": "anet_002",
    "video_url": "https://example.com/videos/cooking_tutorial.mp4",
    "duration_seconds": 420,
    "source": "youtube",
    "expected_activity": "cooking",
    "description": "Home cooking tutorial - making pasta from scratch"
  }
]

// ... and 3 more items
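
Before starting the server, it can help to confirm that every record carries the two fields the config's `item_properties` section refers to (`id` and `video_url`). The following is a minimal sketch using only the Python standard library; `check_items` is a hypothetical helper, not a Potato command, and the records are parsed from an inline string so the example runs without any file on disk. In practice the same function could be applied to whichever file `data_files` points at.

```python
import json

# Field names taken from item_properties in the config.
ID_KEY = "id"
TEXT_KEY = "video_url"


def check_items(items: list[dict]) -> list[str]:
    """Return a list of human-readable problems; empty means the data looks usable."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        for key in (ID_KEY, TEXT_KEY):
            if not item.get(key):
                problems.append(f"item {index}: missing or empty '{key}'")
        item_id = item.get(ID_KEY)
        if item_id in seen_ids:
            problems.append(f"item {index}: duplicate id '{item_id}'")
        elif item_id:
            seen_ids.add(item_id)
    return problems


# Two records shaped like the sample above, trimmed to the required fields.
sample = json.loads("""
[
  {"id": "anet_001", "video_url": "https://example.com/videos/basketball_practice.mp4"},
  {"id": "anet_002", "video_url": "https://example.com/videos/cooking_tutorial.mp4"}
]
""")
print(check_items(sample))  # []
```

Extra fields such as `duration_seconds` or `description` are ignored by this check, since the config does not reference them.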

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/action-recognition/activitynet-temporal-localization
potato start config.yaml

Dataset & paper

Caba Heilbron et al., CVPR 2015

Citation (BibTeX)

bibtex

@inproceedings{caba2015activitynet,
    title={ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding},
    author={Caba Heilbron, Fabian and Escorcia, Victor and Ghanem, Bernard and Niebles, Juan Carlos},
    booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
    pages={961--970},
    year={2015}
}

Details

Annotation Types

video_annotation

Domain

Computer Vision, Video Understanding

Use Cases

Activity Recognition, Temporal Localization, Action Detection

Related Designs

ActivityNet Captions: Dense Video Captioning Dataset

ActivityNet Captions pairs 20k untrimmed videos with 100k temporally localized sentence descriptions for dense-captioning research. This Potato config reproduces the segment-and-describe workflow.

video_annotation, text

AVA: Atomic Visual Actions Dataset

AVA spatio-temporally localizes 80 atomic actions on people in movie clips, labeled at 1 Hz keyframes. This Potato config reproduces the box-and-action labeling task for video clips.

multiselect, video_annotation

Charades Indoor Activity Segmentation

Multi-label temporal activity segmentation in indoor home videos. Annotators identify action instances using compositional verb-object labels (e.g., 'opening door', 'sitting on chair') with precise temporal boundaries.