AVA Atomic Visual Actions
Spatio-temporal action annotation in movie clips. Annotators localize people with bounding boxes and label their atomic actions (pose, person-object, person-person interactions) in 1-second intervals.
Configuration File: config.yaml

```yaml
# AVA Atomic Visual Actions Configuration
# Based on Gu et al., CVPR 2018
# Task: Localize people and label their atomic actions in 1-second clips
annotation_task_name: "AVA Atomic Visual Actions"
task_dir: "."

# Data configuration
data_files:
  - data.json

item_properties:
  id_key: "id"
  text_key: "video_url"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

# Annotation schemes
annotation_schemes:
  # Track people with bounding boxes
  - name: "person_tracking"
    description: |
      Draw bounding boxes around each person visible in the frame.
      Track the same person across frames using consistent IDs.
    annotation_type: "video_annotation"
    mode: "tracking"
    labels:
      - name: "person"
        color: "#3B82F6"
    frame_stepping: true
    show_timecode: true
    video_fps: 30

  # Classify actions for each tracked person
  # Simplified subset of AVA's 80 action classes
  - name: "pose_actions"
    description: "Select all POSE actions the person is performing"
    annotation_type: "multiselect"
    labels:
      - "stand"
      - "sit"
      - "lie/sleep"
      - "bend/bow"
      - "crouch/kneel"
      - "walk"
      - "run/jog"
      - "jump/leap"
      - "swim"
      - "dance"
    keyboard_shortcuts:
      stand: "1"
      sit: "2"
      walk: "3"
      run/jog: "4"

  - name: "object_interactions"
    description: "Select all PERSON-OBJECT interactions"
    annotation_type: "multiselect"
    labels:
      - "carry/hold object"
      - "eat"
      - "drink"
      - "smoke"
      - "read"
      - "write"
      - "play musical instrument"
      - "use phone"
      - "work on computer"
      - "open (door/container)"
      - "close (door/container)"
      - "pour"
      - "throw"
      - "catch"
      - "hit (object)"
      - "kick (object)"
      - "drive"
      - "ride (bike/horse)"

  - name: "person_interactions"
    description: "Select all PERSON-PERSON interactions"
    annotation_type: "multiselect"
    labels:
      - "talk to"
      - "listen to"
      - "watch (person)"
      - "hug"
      - "kiss"
      - "hand shake"
      - "fight/hit (person)"
      - "push (person)"
      - "give/serve to"
      - "take from"
      - "dance with"
      - "sing to"
      - "martial art"

# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 30
annotation_per_instance: 2

# Instructions
annotation_instructions: |
  ## AVA Atomic Visual Actions Task

  Your goal is to annotate atomic human actions in video clips from movies.

  ### Step 1: Track People
  - Draw bounding boxes around each person visible in the frame
  - Track the same person across frames with consistent IDs
  - Only annotate people who are at least partially visible

  ### Step 2: Label Actions (for each person)
  For each tracked person, select ALL actions they are performing:

  **Pose Actions:** body position/movement
  - stand, sit, lie, walk, run, jump, dance, etc.

  **Person-Object Interactions:** actions with objects
  - eating, drinking, using phone, driving, etc.

  **Person-Person Interactions:** actions with other people
  - talking, hugging, fighting, shaking hands, etc.

  ### Important Notes
  - Actions are annotated per 1-second interval
  - One person can have MULTIPLE actions simultaneously
    (e.g., "sit" + "talk to" + "use phone")
  - Focus on the KEYFRAME (the middle frame of the 1-second clip)
  - Only annotate clearly visible actions

  ### Action Definitions
  - **Atomic**: a simple, basic action (not a complex activity)
  - **Visual**: the action must be visually apparent in the frame
  - **Current**: the action is happening at this moment, not before or after
```
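Two properties of this design are easy to check mechanically: every keyboard shortcut must reference a label that its scheme actually defines, and each clip's keyframe should sit at the midpoint of its 1-second span. The sketch below inlines values copied from the config and sample data above rather than parsing the files, so it is a consistency check, not part of the Potato toolchain:

```python
# Pose labels and shortcuts, copied from the "pose_actions" scheme above.
pose_labels = [
    "stand", "sit", "lie/sleep", "bend/bow", "crouch/kneel",
    "walk", "run/jog", "jump/leap", "swim", "dance",
]
keyboard_shortcuts = {"stand": "1", "sit": "2", "walk": "3", "run/jog": "4"}

# Every shortcut must point at a defined label, and no key may be reused.
assert set(keyboard_shortcuts) <= set(pose_labels)
assert len(set(keyboard_shortcuts.values())) == len(keyboard_shortcuts)

# Clip timing fields, copied from sample-data.json below: each clip spans
# exactly 1 second and its keyframe is the temporal midpoint.
items = [
    {"id": "clip_001", "timestamp_start": 120, "timestamp_end": 121,
     "keyframe_timestamp": 120.5},
    {"id": "clip_002", "timestamp_start": 245, "timestamp_end": 246,
     "keyframe_timestamp": 245.5},
]
for item in items:
    assert item["timestamp_end"] - item["timestamp_start"] == 1
    midpoint = (item["timestamp_start"] + item["timestamp_end"]) / 2
    assert item["keyframe_timestamp"] == midpoint
```

The same checks extend naturally to a full data.json before launching the task.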
Sample Data: sample-data.json

```json
[
  {
    "id": "clip_001",
    "video_url": "https://example.com/videos/movie_scene_001.mp4",
    "movie_id": "movie_001",
    "timestamp_start": 120,
    "timestamp_end": 121,
    "keyframe_timestamp": 120.5,
    "scene_description": "Indoor office scene - two people having a conversation",
    "num_expected_persons": 2
  },
  {
    "id": "clip_002",
    "video_url": "https://example.com/videos/movie_scene_002.mp4",
    "movie_id": "movie_001",
    "timestamp_start": 245,
    "timestamp_end": 246,
    "keyframe_timestamp": 245.5,
    "scene_description": "Restaurant scene - group dining",
    "num_expected_persons": 4
  }
]
```
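Since the config sets `output_annotation_format: "json"`, each completed clip yields one JSON record in `annotation_output/`. The exact schema is determined by Potato, so the record below is only an illustrative assumption of the information a finished clip carries, combining tracked boxes with the three multiselect schemes; every field name is hypothetical:

```python
import json

# Hypothetical output record -- field names are illustrative assumptions,
# not Potato's guaranteed output schema.
record = {
    "id": "clip_001",
    "annotator": "user_01",
    # One normalized [x1, y1, x2, y2] box per person at the keyframe.
    "person_tracking": [
        {"person_id": 1, "box": [0.10, 0.20, 0.45, 0.95]},
        {"person_id": 2, "box": [0.55, 0.25, 0.90, 0.97]},
    ],
    # Per-person multiselect labels; one person can hold several at once.
    "pose_actions": {"1": ["sit"], "2": ["sit"]},
    "object_interactions": {"1": ["use phone"], "2": []},
    "person_interactions": {"1": ["talk to"], "2": ["listen to"]},
}
print(json.dumps(record, indent=2))
```

Note how person 1 carries "sit", "use phone", and "talk to" simultaneously, matching the multi-action example in the instructions.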
… and 3 more items in sample-data.json

Get This Design
Clone or download from the repository
Quick start:
```shell
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/action-recognition/ava-atomic-actions
potato start config.yaml
```
Related Designs
ActivityNet Captions Dense Annotation
Dense temporal annotation with natural language descriptions. Annotators segment videos into events and write descriptive captions for each temporal segment.
ActivityNet Temporal Localization
Temporal activity localization in untrimmed videos. Annotators identify activity instances by marking precise start and end timestamps across 200 activity classes.
Charades Indoor Activity Segmentation
Multi-label temporal activity segmentation in indoor home videos. Annotators identify action instances using compositional verb-object labels (e.g., 'opening door', 'sitting on chair') with precise temporal boundaries.