
AVA Atomic Visual Actions

Spatio-temporal action annotation in movie clips. Annotators localize people with bounding boxes and label their atomic actions (pose, person-object, and person-person interactions) at 1-second intervals.


Configuration File: config.yaml

# AVA Atomic Visual Actions Configuration
# Based on Gu et al., CVPR 2018
# Task: Localize people and label their atomic actions in 1-second clips

annotation_task_name: "AVA Atomic Visual Actions"
task_dir: "."

# Data configuration
data_files:
  - data.json
item_properties:
  id_key: "id"
  text_key: "video_url"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

# Annotation schemes
annotation_schemes:
  # Track people with bounding boxes
  - name: "person_tracking"
    description: |
      Draw bounding boxes around each person visible in the frame.
      Track the same person across frames using consistent IDs.
    annotation_type: "video_annotation"
    mode: "tracking"
    labels:
      - name: "person"
        color: "#3B82F6"
    frame_stepping: true
    show_timecode: true
    video_fps: 30

  # Classify actions for each tracked person
  # Simplified subset of AVA's 80 action classes
  - name: "pose_actions"
    description: "Select all POSE actions the person is performing"
    annotation_type: multiselect
    labels:
      - "stand"
      - "sit"
      - "lie/sleep"
      - "bend/bow"
      - "crouch/kneel"
      - "walk"
      - "run/jog"
      - "jump/leap"
      - "swim"
      - "dance"
    keyboard_shortcuts:
      stand: "1"
      sit: "2"
      walk: "3"
      run/jog: "4"

  - name: "object_interactions"
    description: "Select all PERSON-OBJECT interactions"
    annotation_type: multiselect
    labels:
      - "carry/hold object"
      - "eat"
      - "drink"
      - "smoke"
      - "read"
      - "write"
      - "play musical instrument"
      - "use phone"
      - "work on computer"
      - "open (door/container)"
      - "close (door/container)"
      - "pour"
      - "throw"
      - "catch"
      - "hit (object)"
      - "kick (object)"
      - "drive"
      - "ride (bike/horse)"

  - name: "person_interactions"
    description: "Select all PERSON-PERSON interactions"
    annotation_type: multiselect
    labels:
      - "talk to"
      - "listen to"
      - "watch (person)"
      - "hug"
      - "kiss"
      - "hand shake"
      - "fight/hit (person)"
      - "push (person)"
      - "give/serve to"
      - "take from"
      - "dance with"
      - "sing to"
      - "martial art"

# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 30
annotation_per_instance: 2

# Instructions
annotation_instructions: |
  ## AVA Atomic Visual Actions Task

  Your goal is to annotate atomic human actions in video clips from movies.

  ### Step 1: Track People
  - Draw bounding boxes around each person visible in the frame
  - Track the same person across frames with consistent IDs
  - Only annotate people who are at least partially visible

  ### Step 2: Label Actions (for each person)
  For each tracked person, select ALL actions they are performing:

  **Pose Actions:** Body position/movement
  - stand, sit, lie, walk, run, jump, dance, etc.

  **Person-Object Interactions:** Actions with objects
  - eating, drinking, using phone, driving, etc.

  **Person-Person Interactions:** Actions with other people
  - talking, hugging, fighting, shaking hands, etc.

  ### Important Notes:
  - Actions are annotated per 1-second intervals
  - One person can have MULTIPLE actions simultaneously
    (e.g., "sit" + "talk to" + "use phone")
  - Focus on the KEYFRAME (middle frame of the 1-second clip)
  - Only annotate clearly visible actions

  ### Action Definitions:
  - **Atomic**: Simple, basic actions (not complex activities)
  - **Visual**: Action must be visually apparent in the frame
  - **Current**: Action happening at this moment, not before/after
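The instructions above let one person carry labels from all three schemes at the same time (e.g., "sit" + "talk to" + "use phone"). A minimal sketch of such a per-person record, with a check that each selected label belongs to its scheme's label set. The field names and record shape are illustrative, not Potato's actual output schema, and the object/person label sets below are subsets of those in config.yaml:

```python
# Label sets taken from the annotation_schemes in config.yaml
# (object and person interaction sets are abbreviated subsets).
POSE = {"stand", "sit", "lie/sleep", "bend/bow", "crouch/kneel",
        "walk", "run/jog", "jump/leap", "swim", "dance"}
OBJECT = {"carry/hold object", "eat", "drink", "use phone", "drive"}
PERSON = {"talk to", "listen to", "hug", "hand shake"}

# Hypothetical per-person annotation record; field names are illustrative
# and do not reflect Potato's real output format.
annotation = {
    "clip_id": "clip_001",
    "person_id": 1,
    "box": [0.21, 0.10, 0.58, 0.97],   # normalized [x1, y1, x2, y2]
    "pose_actions": ["sit"],
    "object_interactions": ["use phone"],
    "person_interactions": ["talk to"],
}

# One person may hold multiple labels simultaneously; each selected
# label must come from its scheme's vocabulary.
assert set(annotation["pose_actions"]) <= POSE
assert set(annotation["object_interactions"]) <= OBJECT
assert set(annotation["person_interactions"]) <= PERSON
```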

Sample Data: sample-data.json

[
  {
    "id": "clip_001",
    "video_url": "https://example.com/videos/movie_scene_001.mp4",
    "movie_id": "movie_001",
    "timestamp_start": 120,
    "timestamp_end": 121,
    "keyframe_timestamp": 120.5,
    "scene_description": "Indoor office scene - two people having a conversation",
    "num_expected_persons": 2
  },
  {
    "id": "clip_002",
    "video_url": "https://example.com/videos/movie_scene_002.mp4",
    "movie_id": "movie_001",
    "timestamp_start": 245,
    "timestamp_end": 246,
    "keyframe_timestamp": 245.5,
    "scene_description": "Restaurant scene - group dining",
    "num_expected_persons": 4
  }
]
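The sample items encode two invariants stated in the instructions: each clip spans a 1-second interval, and the keyframe sits at its midpoint. A small validation sketch over records shaped like sample-data.json (the field names mirror the sample; the helper itself is not part of Potato):

```python
def validate_clip(clip):
    """Return a list of problems with a clip record (empty if valid)."""
    problems = []
    # AVA clips span exactly one second.
    duration = clip["timestamp_end"] - clip["timestamp_start"]
    if duration != 1:
        problems.append(f"{clip['id']}: interval is {duration}s, expected 1s")
    # The keyframe is the middle of the 1-second interval.
    midpoint = (clip["timestamp_start"] + clip["timestamp_end"]) / 2
    if clip["keyframe_timestamp"] != midpoint:
        problems.append(f"{clip['id']}: keyframe not at midpoint {midpoint}")
    return problems

clips = [
    {"id": "clip_001", "timestamp_start": 120, "timestamp_end": 121,
     "keyframe_timestamp": 120.5},
    {"id": "clip_002", "timestamp_start": 245, "timestamp_end": 246,
     "keyframe_timestamp": 245.5},
]
for clip in clips:
    assert validate_clip(clip) == []
```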

// ... and 3 more items

Get This Design

View on GitHub

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/action-recognition/ava-atomic-actions
potato start config.yaml

Details

Annotation Types

multiselect, video_annotation

Domain

Computer Vision, Video Understanding

Use Cases

Action Recognition, Person Detection, Activity Understanding

Tags

video, actions, bounding-box, spatio-temporal, ava, movies

Found an issue or want to improve this design?

Open an Issue