ActivityNet Captions Dense Annotation
Dense temporal annotation with natural language descriptions. Annotators segment videos into events and write descriptive captions for each temporal segment.
Configuration File: `config.yaml`
```yaml
# ActivityNet Captions Dense Annotation Configuration
# Based on Krishna et al., ICCV 2017
# Task: Segment videos and write captions for each event

annotation_task_name: "ActivityNet Dense Captioning"
task_dir: "."

data_files:
  - data.json

item_properties:
  id_key: "id"
  text_key: "video_url"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

annotation_schemes:
  - name: "event_segments"
    description: |
      Mark temporal segments for each distinct event in the video.
      Events should be semantically meaningful and non-overlapping.
    annotation_type: "video_annotation"
    mode: "segment"
    labels:
      - name: "event"
        color: "#3B82F6"
        key_value: "e"
    zoom_enabled: true
    playback_rate_control: true
    frame_stepping: true
    show_timecode: true
    timeline_height: 80

  - name: "event_caption"
    description: |
      Write a natural language description of the event you just marked.
      Be specific about WHO does WHAT. Start with a verb.
    annotation_type: text
    min_length: 10
    max_length: 200
    placeholder: "e.g., 'A man in a red shirt kicks a soccer ball into the goal'"

allow_all_users: true
instances_per_annotator: 30
annotation_per_instance: 2

annotation_instructions: |
  ## Dense Video Captioning Task

  Your goal is to segment the video into events and describe each one.

  ### Step 1: Identify Events
  - Watch the video and identify distinct events
  - Mark the START and END of each event
  - Events should be meaningful actions/happenings

  ### Step 2: Write Captions
  - Describe WHAT happens in each segment
  - Be specific: mention people, objects, actions
  - Start with a verb (e.g., "A woman picks up...")
  - Keep it concise but complete

  ### Caption Guidelines
  - Describe visible actions, not intentions
  - Include relevant details (clothing, objects, location)
  - Use present tense
  - Don't describe audio unless relevant

  ### Example Captions
  - "A chef chops vegetables on a cutting board"
  - "Two children run across a playground"
  - "The camera pans across a mountain landscape"
```
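The `event_segments` scheme asks for non-overlapping segments that fall within the video. A post-hoc check like the sketch below can catch violations before collected annotations are aggregated. The segment dictionary shape (`start`, `end`, `caption` keys) is an illustrative assumption, not Potato's actual output format; also enforces the caption length bounds from the `event_caption` scheme.

```python
# Hypothetical quality check for collected annotations. The segment
# dict shape here is an ASSUMPTION for illustration, not Potato's
# actual output format -- adapt the key names to your export.

def validate_segments(segments, duration_seconds,
                      min_caption=10, max_caption=200):
    """Return a list of human-readable problems (empty list = OK)."""
    problems = []
    ordered = sorted(segments, key=lambda s: s["start"])
    for seg in ordered:
        # Segment must be non-inverted and within the video bounds.
        if not (0 <= seg["start"] < seg["end"] <= duration_seconds):
            problems.append(f"segment {seg['start']}-{seg['end']} "
                            "is out of bounds or inverted")
        # Caption must respect the config's min_length/max_length.
        caption = seg.get("caption", "")
        if not (min_caption <= len(caption) <= max_caption):
            problems.append(f"caption {caption!r} violates length bounds")
    # Adjacent segments (after sorting by start) must not overlap.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur["start"] < prev["end"]:
            problems.append(f"segments ending at {prev['end']} and "
                            f"starting at {cur['start']} overlap")
    return problems

segments = [
    {"start": 0.0, "end": 45.5,
     "caption": "A chef chops vegetables on a cutting board"},
    {"start": 45.5, "end": 90.0,
     "caption": "The chef stirs a pot of soup on the stove"},
]
print(validate_segments(segments, duration_seconds=300))  # prints []
```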
Sample Data: `sample-data.json`
```json
[
  {
    "id": "anetcap_001",
    "video_url": "https://example.com/videos/cooking_video.mp4",
    "duration_seconds": 300,
    "category": "Cooking"
  },
  {
    "id": "anetcap_002",
    "video_url": "https://example.com/videos/sports_clip.mp4",
    "duration_seconds": 180,
    "category": "Sports"
  }
]
```

Get This Design
Clone or download from the repository
Quick start:
```shell
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/temporal-grounding/activitynet-captions
potato start config.yaml
```
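Before launching, it can help to sanity-check that every item in the data file carries the keys the config's `item_properties` section points at (`id` and `video_url`). A minimal sketch, with the sample items inlined for illustration (in practice you would load `data.json`):

```python
import json

# REQUIRED_KEYS mirrors the config's item_properties section
# (id_key: "id", text_key: "video_url"); adjust if you change it.
REQUIRED_KEYS = ("id", "video_url")

def check_items(items):
    """Return ids (or positions) of items missing a required key."""
    bad = []
    for i, item in enumerate(items):
        if any(k not in item for k in REQUIRED_KEYS):
            bad.append(item.get("id", f"<item {i}>"))
    return bad

# Inlined copy of the sample data; normally: json.load(open("data.json"))
items = json.loads("""[
  {"id": "anetcap_001", "video_url": "https://example.com/videos/cooking_video.mp4"},
  {"id": "anetcap_002", "video_url": "https://example.com/videos/sports_clip.mp4"}
]""")
print(check_items(items))  # prints [] -- both items are well-formed
```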
Related Designs
NExT-QA - Temporal and Causal Video Question Answering
Temporal and causal video question answering requiring reasoning about actions, events, and their relationships over time. Based on the NExT-QA dataset (Xiao et al., CVPR 2021), annotators answer multiple-choice questions about video content with an emphasis on temporal and causal understanding.
VSTAR Video-grounded Dialogue
Video-grounded dialogue annotation. Annotators watch videos and answer questions requiring situated understanding, write dialogue turns grounded in specific video moments, and mark relevant temporal segments.
DiDeMo Moment Retrieval
Localizing natural language descriptions to specific video moments. Given a text query, annotators identify the corresponding temporal segment in the video.