Video Annotation

How to annotate video in Potato: frame-by-frame navigation, temporal segment labeling, per-frame classification, and object tracking across frames.

Video annotation adds a time axis to image work. The same clip can be labeled as a whole, segmented into time intervals ("the goal happens from 0:12 to 0:15"), or annotated frame by frame. Potato provides frame navigation and temporal controls so annotators can move through a clip precisely.

Video tasks are central to activity recognition and object tracking.

Clip-level classification

The simplest task: one label for the whole clip.

yaml
annotation_schemes:
  - annotation_type: radio
    name: action
    description: "What is the main action in this clip?"
    labels: [Walking, Running, Sitting, Jumping, Other]

Temporal segments: when something happens

To mark intervals on the timeline, use a span over the video's time axis, just like sound event detection does for audio.

yaml
annotation_schemes:
  - annotation_type: span
    name: events
    description: "Mark the start and end of each event and label it."
    labels: [Goal, Foul, Substitution, Replay]
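Downstream of annotation, a labeled interval is usually easier to work with as numbers than as a timestamp string. The sketch below is not part of Potato: `parse_timestamp`, `Segment`, and `frame_range` are hypothetical helper names, and the conversion to frame indices assumes a constant frame rate.

```python
from dataclasses import dataclass


def parse_timestamp(text: str) -> float:
    """Convert "M:SS" or "H:MM:SS" (seconds may be fractional) to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


@dataclass
class Segment:
    start: float  # seconds from the beginning of the clip
    end: float    # seconds from the beginning of the clip
    label: str

    def frame_range(self, fps: float) -> tuple[int, int]:
        """Frame indices at the start and end times, assuming constant fps."""
        return int(self.start * fps), int(self.end * fps)


# The "goal happens from 0:12 to 0:15" example, on a hypothetical 25 fps clip.
goal = Segment(parse_timestamp("0:12"), parse_timestamp("0:15"), "Goal")
print(goal.frame_range(fps=25))  # (300, 375)
```

Storing times in seconds keeps segments independent of frame rate; frame indices are derived only when a specific encoding of the clip is known.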

Per-frame annotation and tracking

For frame-level work (classifying individual frames or tracking an object across frames), annotators step through the video and annotate at each frame. Decide on a sampling rate before you start: every frame, every Nth frame, or keyframes only. Labeling every frame is expensive, so most projects subsample.
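The three sampling options can be compared by counting how many frames each one asks annotators to label. This is a minimal sketch with a hypothetical helper, `sample_frames`; it only computes frame indices and does not read video files or call any Potato API.

```python
def sample_frames(total_frames: int, step: int = 1, keyframes=None) -> list[int]:
    """Return the sorted frame indices to annotate.

    step=1 selects every frame, step=N selects every Nth frame.
    If keyframes is given, only those indices (within range) are kept.
    """
    if total_frames < 0 or step < 1:
        raise ValueError("total_frames must be >= 0 and step must be >= 1")
    if keyframes is not None:
        return sorted({k for k in keyframes if 0 <= k < total_frames})
    return list(range(0, total_frames, step))


# A hypothetical 10-second clip at 30 fps has 300 frames.
every_frame = sample_frames(300)
every_15th = sample_frames(300, step=15)
only_keys = sample_frames(300, keyframes=[0, 120, 299, 450])

print(len(every_frame), len(every_15th), only_keys)  # 300 20 [0, 120, 299]
```

Here, sampling every 15th frame cuts the per-clip workload from 300 frames to 20, which is the kind of trade-off to settle before annotation begins.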

Keeping video annotation consistent

  • Boundary precision. Agree how exact segment start/end must be; frame-level precision is costly.
  • Occlusion and exit. Write rules for when a tracked object is hidden or leaves the frame.
  • Workload. Video is the most time-consuming modality. Run a pilot to estimate cost before scaling, and consider LLM/vision pre-annotation to seed labels.
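One way to make the boundary-precision rule checkable is to compare two annotators' segments numerically. The sketch below uses temporal intersection-over-union plus a start/end tolerance; both helper names and the 0.5-second tolerance are illustrative assumptions, not Potato features or recommended values.

```python
def temporal_iou(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) intervals in seconds."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def boundaries_agree(a: tuple[float, float], b: tuple[float, float],
                     tolerance: float) -> bool:
    """True if both start and end differ by at most `tolerance` seconds."""
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance


# Two annotators mark the same event half a second apart.
annotator_1 = (12.0, 15.0)
annotator_2 = (12.5, 15.5)

print(round(temporal_iou(annotator_1, annotator_2), 3))           # 0.714
print(boundaries_agree(annotator_1, annotator_2, tolerance=0.5))  # True
```

Running a check like this on pilot data shows whether the agreed tolerance is realistic before the full annotation effort starts.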

Further reading