ActivityNet Captions Dense Annotation
Dense temporal annotation with natural language descriptions. Annotators segment videos into events and write descriptive captions for each temporal segment.
Configuration File: `config.yaml`
```yaml
# ActivityNet Captions Dense Annotation Configuration
# Based on Krishna et al., ICCV 2017
# Task: Segment videos and write captions for each event

annotation_task_name: "ActivityNet Dense Captioning"
task_dir: "."

data_files:
  - data.json

item_properties:
  id_key: "id"
  text_key: "video_url"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

annotation_schemes:
  - name: "event_segments"
    description: |
      Mark temporal segments for each distinct event in the video.
      Events should be semantically meaningful and non-overlapping.
    annotation_type: "video_annotation"
    mode: "segment"
    labels:
      - name: "event"
        color: "#3B82F6"
        key_value: "e"
    zoom_enabled: true
    playback_rate_control: true
    frame_stepping: true
    show_timecode: true
    timeline_height: 80

  - name: "event_caption"
    description: |
      Write a natural language description of the event you just marked.
      Be specific about WHO does WHAT. Start with a verb.
    annotation_type: text
    min_length: 10
    max_length: 200
    placeholder: "e.g., 'A man in a red shirt kicks a soccer ball into the goal'"

allow_all_users: true
instances_per_annotator: 30
annotation_per_instance: 2

annotation_instructions: |
  ## Dense Video Captioning Task

  Your goal is to segment the video into events and describe each one.

  ### Step 1: Identify Events
  - Watch the video and identify distinct events
  - Mark the START and END of each event
  - Events should be meaningful actions/happenings

  ### Step 2: Write Captions
  - Describe WHAT happens in each segment
  - Be specific: mention people, objects, actions
  - Start with a verb (e.g., "A woman picks up...")
  - Keep it concise but complete

  ### Caption Guidelines
  - Describe visible actions, not intentions
  - Include relevant details (clothing, objects, location)
  - Use present tense
  - Don't describe audio unless relevant

  ### Example Captions
  - "A chef chops vegetables on a cutting board"
  - "Two children run across a playground"
  - "The camera pans across a mountain landscape"
```
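The `event_segments` scheme asks for non-overlapping segments that fall within the video. A post-hoc check like the sketch below can catch violations before collected annotations are aggregated. The segment dictionary shape (`start`, `end`, `caption` keys) is an illustrative assumption, not Potato's actual output format; also enforces the caption length bounds from the `event_caption` scheme.

```python
# Hypothetical quality check for collected annotations. The segment
# dict shape here is an ASSUMPTION for illustration, not Potato's
# actual output format -- adapt the key names to your export.

def validate_segments(segments, duration_seconds,
                      min_caption=10, max_caption=200):
    """Return a list of human-readable problems (empty list = OK)."""
    problems = []
    ordered = sorted(segments, key=lambda s: s["start"])
    for seg in ordered:
        # Segment must be non-inverted and within the video bounds.
        if not (0 <= seg["start"] < seg["end"] <= duration_seconds):
            problems.append(f"segment {seg['start']}-{seg['end']} "
                            "is out of bounds or inverted")
        # Caption must respect the config's min_length/max_length.
        caption = seg.get("caption", "")
        if not (min_caption <= len(caption) <= max_caption):
            problems.append(f"caption {caption!r} violates length bounds")
    # Adjacent segments (after sorting by start) must not overlap.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur["start"] < prev["end"]:
            problems.append(f"segments ending at {prev['end']} and "
                            f"starting at {cur['start']} overlap")
    return problems

segments = [
    {"start": 0.0, "end": 45.5,
     "caption": "A chef chops vegetables on a cutting board"},
    {"start": 45.5, "end": 90.0,
     "caption": "The chef stirs a pot of soup on the stove"},
]
print(validate_segments(segments, duration_seconds=300))  # prints []
```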
Sample Data: `sample-data.json`
```json
[
  {
    "id": "anetcap_001",
    "video_url": "https://example.com/videos/cooking_video.mp4",
    "duration_seconds": 300,
    "category": "Cooking"
  },
  {
    "id": "anetcap_002",
    "video_url": "https://example.com/videos/sports_clip.mp4",
    "duration_seconds": 180,
    "category": "Sports"
  }
]
```

Get This Design
Clone or download from the repository
Quick start:
```shell
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/temporal-grounding/activitynet-captions
potato start config.yaml
```
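Before launching, it can help to sanity-check that every item in the data file carries the keys the config's `item_properties` section points at (`id` and `video_url`). A minimal sketch, with the sample items inlined for illustration (in practice you would load `data.json`):

```python
import json

# REQUIRED_KEYS mirrors the config's item_properties section
# (id_key: "id", text_key: "video_url"); adjust if you change it.
REQUIRED_KEYS = ("id", "video_url")

def check_items(items):
    """Return ids (or positions) of items missing a required key."""
    bad = []
    for i, item in enumerate(items):
        if any(k not in item for k in REQUIRED_KEYS):
            bad.append(item.get("id", f"<item {i}>"))
    return bad

# Inlined copy of the sample data; normally: json.load(open("data.json"))
items = json.loads("""[
  {"id": "anetcap_001", "video_url": "https://example.com/videos/cooking_video.mp4"},
  {"id": "anetcap_002", "video_url": "https://example.com/videos/sports_clip.mp4"}
]""")
print(check_items(items))  # prints [] -- both items are well-formed
```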
Related Designs
NExT-QA - Temporal and Causal Video Question Answering
Temporal and causal video question answering requiring reasoning about actions, events, and their relationships over time. Based on the NExT-QA dataset (Xiao et al., CVPR 2021), annotators answer multiple-choice questions about video content with an emphasis on temporal and causal understanding.
VSTAR Video-grounded Dialogue
Video-grounded dialogue annotation. Annotators watch videos and answer questions requiring situated understanding, write dialogue turns grounded in specific video moments, and mark relevant temporal segments.
DiDeMo Moment Retrieval
Localizing natural language descriptions to specific video moments. Given a text query, annotators identify the corresponding temporal segment in the video.