
AVA Atomic Visual Actions

Spatio-temporal action annotation in movie clips. Annotators localize people with bounding boxes and label their atomic actions (pose, person-object, and person-person interactions) at 1-second intervals.


Configuration File: config.yaml

# AVA Atomic Visual Actions Configuration
# Based on Gu et al., CVPR 2018
# Task: Localize people and label their atomic actions in 1-second clips

annotation_task_name: "AVA Atomic Visual Actions"
task_dir: "."

# Data configuration
data_files:
  - data.json
item_properties:
  id_key: "id"
  text_key: "video_url"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

# Annotation schemes
annotation_schemes:
  # Track people with bounding boxes
  - name: "person_tracking"
    description: |
      Draw bounding boxes around each person visible in the frame.
      Track the same person across frames using consistent IDs.
    annotation_type: "video_annotation"
    mode: "tracking"
    labels:
      - name: "person"
        color: "#3B82F6"
    frame_stepping: true
    show_timecode: true
    video_fps: 30

  # Classify actions for each tracked person
  # Simplified subset of AVA's 80 action classes
  - name: "pose_actions"
    description: "Select all POSE actions the person is performing"
    annotation_type: multiselect
    labels:
      - "stand"
      - "sit"
      - "lie/sleep"
      - "bend/bow"
      - "crouch/kneel"
      - "walk"
      - "run/jog"
      - "jump/leap"
      - "swim"
      - "dance"
    keyboard_shortcuts:
      stand: "1"
      sit: "2"
      walk: "3"
      run/jog: "4"

  - name: "object_interactions"
    description: "Select all PERSON-OBJECT interactions"
    annotation_type: multiselect
    labels:
      - "carry/hold object"
      - "eat"
      - "drink"
      - "smoke"
      - "read"
      - "write"
      - "play musical instrument"
      - "use phone"
      - "work on computer"
      - "open (door/container)"
      - "close (door/container)"
      - "pour"
      - "throw"
      - "catch"
      - "hit (object)"
      - "kick (object)"
      - "drive"
      - "ride (bike/horse)"

  - name: "person_interactions"
    description: "Select all PERSON-PERSON interactions"
    annotation_type: multiselect
    labels:
      - "talk to"
      - "listen to"
      - "watch (person)"
      - "hug"
      - "kiss"
      - "hand shake"
      - "fight/hit (person)"
      - "push (person)"
      - "give/serve to"
      - "take from"
      - "dance with"
      - "sing to"
      - "martial art"

# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 30
annotation_per_instance: 2

# Instructions
annotation_instructions: |
  ## AVA Atomic Visual Actions Task

  Your goal is to annotate atomic human actions in video clips from movies.

  ### Step 1: Track People
  - Draw bounding boxes around each person visible in the frame
  - Track the same person across frames with consistent IDs
  - Only annotate people who are at least partially visible

  ### Step 2: Label Actions (for each person)
  For each tracked person, select ALL actions they are performing:

  **Pose Actions:** Body position/movement
  - stand, sit, lie, walk, run, jump, dance, etc.

  **Person-Object Interactions:** Actions with objects
  - eating, drinking, using phone, driving, etc.

  **Person-Person Interactions:** Actions with other people
  - talking, hugging, fighting, shaking hands, etc.

  ### Important Notes:
  - Actions are annotated per 1-second intervals
  - One person can have MULTIPLE actions simultaneously
    (e.g., "sit" + "talk to" + "use phone")
  - Focus on the KEYFRAME (middle frame of the 1-second clip)
  - Only annotate clearly visible actions

  ### Action Definitions:
  - **Atomic**: Simple, basic actions (not complex activities)
  - **Visual**: Action must be visually apparent in the frame
  - **Current**: Action happening at this moment, not before/after
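The instructions above let one person carry labels from all three schemes at the same time (e.g., "sit" + "talk to" + "use phone"). A minimal sketch of such a per-person record, with a check that each selected label belongs to its scheme's label set. The field names and record shape are illustrative, not Potato's actual output schema, and the object/person label sets below are subsets of those in config.yaml:

```python
# Label sets taken from the annotation_schemes in config.yaml
# (object and person interaction sets are abbreviated subsets).
POSE = {"stand", "sit", "lie/sleep", "bend/bow", "crouch/kneel",
        "walk", "run/jog", "jump/leap", "swim", "dance"}
OBJECT = {"carry/hold object", "eat", "drink", "use phone", "drive"}
PERSON = {"talk to", "listen to", "hug", "hand shake"}

# Hypothetical per-person annotation record; field names are illustrative
# and do not reflect Potato's real output format.
annotation = {
    "clip_id": "clip_001",
    "person_id": 1,
    "box": [0.21, 0.10, 0.58, 0.97],   # normalized [x1, y1, x2, y2]
    "pose_actions": ["sit"],
    "object_interactions": ["use phone"],
    "person_interactions": ["talk to"],
}

# One person may hold multiple labels simultaneously; each selected
# label must come from its scheme's vocabulary.
assert set(annotation["pose_actions"]) <= POSE
assert set(annotation["object_interactions"]) <= OBJECT
assert set(annotation["person_interactions"]) <= PERSON
```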

Sample Data: sample-data.json

[
  {
    "id": "clip_001",
    "video_url": "https://example.com/videos/movie_scene_001.mp4",
    "movie_id": "movie_001",
    "timestamp_start": 120,
    "timestamp_end": 121,
    "keyframe_timestamp": 120.5,
    "scene_description": "Indoor office scene - two people having a conversation",
    "num_expected_persons": 2
  },
  {
    "id": "clip_002",
    "video_url": "https://example.com/videos/movie_scene_002.mp4",
    "movie_id": "movie_001",
    "timestamp_start": 245,
    "timestamp_end": 246,
    "keyframe_timestamp": 245.5,
    "scene_description": "Restaurant scene - group dining",
    "num_expected_persons": 4
  }
]
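The sample items encode two invariants stated in the instructions: each clip spans a 1-second interval, and the keyframe sits at its midpoint. A small validation sketch over records shaped like sample-data.json (the field names mirror the sample; the helper itself is not part of Potato):

```python
def validate_clip(clip):
    """Return a list of problems with a clip record (empty if valid)."""
    problems = []
    # AVA clips span exactly one second.
    duration = clip["timestamp_end"] - clip["timestamp_start"]
    if duration != 1:
        problems.append(f"{clip['id']}: interval is {duration}s, expected 1s")
    # The keyframe is the middle of the 1-second interval.
    midpoint = (clip["timestamp_start"] + clip["timestamp_end"]) / 2
    if clip["keyframe_timestamp"] != midpoint:
        problems.append(f"{clip['id']}: keyframe not at midpoint {midpoint}")
    return problems

clips = [
    {"id": "clip_001", "timestamp_start": 120, "timestamp_end": 121,
     "keyframe_timestamp": 120.5},
    {"id": "clip_002", "timestamp_start": 245, "timestamp_end": 246,
     "keyframe_timestamp": 245.5},
]
for clip in clips:
    assert validate_clip(clip) == []
```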

// ... and 3 more items

Get This Design

View on GitHub

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/action-recognition/ava-atomic-actions
potato start config.yaml

Details

Annotation Types

multiselect, video_annotation

Domain

Computer Vision, Video Understanding

Use Cases

Action Recognition, Person Detection, Activity Understanding

Tags

video, actions, bounding-box, spatio-temporal, ava, movies

Found an issue or want to improve this design?

Open an Issue