Video Annotation
How to annotate video in Potato: frame-by-frame navigation, temporal segment labeling, per-frame classification, and tracking objects across frames.
Video annotation adds a time axis to image work. The same clip can be labeled as a whole, segmented into time intervals ("the goal happens from 0:12 to 0:15"), or annotated frame by frame. Potato provides frame navigation and temporal controls so annotators can move through a clip precisely.
Video tasks are central to activity recognition and object tracking.
Clip-level classification
The simplest task: one label for the whole clip.
annotation_schemes:
  - annotation_type: radio
    name: action
    description: "What is the main action in this clip?"
    labels: [Walking, Running, Sitting, Jumping, Other]
Temporal segments: when something happens
To mark intervals on the timeline, use a span over the video's time axis, just like sound event detection does for audio.
annotation_schemes:
  - annotation_type: span
    name: events
    description: "Mark the start and end of each event and label it."
    labels: [Goal, Foul, Substitution, Replay]
Per-frame annotation and tracking
For frame-level work (classifying individual frames or tracking an object across frames), annotators step through the video and annotate at each sampled frame. Decide on a sampling rate up front: every frame, every Nth frame, or keyframes only. Labeling every frame is expensive, so most projects subsample.
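The every-Nth-frame option can be sketched in a few lines of Python. This is a minimal illustration, not part of Potato: the function names `sample_frame_indices` and `frame_to_timestamp`, and the 30 fps clip, are assumptions chosen for the example. It shows which frames an annotator would visit for a given stride and how a frame index maps back to a time in the clip.

```python
def sample_frame_indices(total_frames: int, stride: int) -> list[int]:
    """Return the index of every `stride`-th frame, starting at frame 0.

    Hypothetical helper for planning a subsampled annotation pass.
    """
    if total_frames < 0:
        raise ValueError("total_frames must be non-negative")
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return list(range(0, total_frames, stride))


def frame_to_timestamp(frame_index: int, fps: float) -> float:
    """Convert a frame index to seconds from the start of the clip."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame_index / fps


# Assumed example: a 10-second clip at 30 fps has 300 frames.
# Sampling every 15th frame leaves 20 frames to label instead of 300.
indices = sample_frame_indices(total_frames=300, stride=15)
print(len(indices), indices[:4])
print(frame_to_timestamp(indices[1], fps=30.0))
```

The number of sampled frames times the expected seconds per label gives a rough first estimate of annotation time for a clip, which can then be checked against a pilot.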
Keeping video annotation consistent
- Boundary precision. Agree how exact segment start/end must be; frame-level precision is costly.
- Occlusion and exit. Write rules for when a tracked object is hidden or leaves the frame.
- Workload. Video is the most time-consuming modality. Pilot to estimate cost before scaling, and consider LLM/vision pre-annotation to seed labels.
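One way to make the boundary-precision rule concrete is to compare two annotators' segments numerically. The sketch below is an illustration under stated assumptions, not a Potato feature: segments are taken to be `(start, end)` pairs in seconds, and the helper names `temporal_iou` and `boundaries_agree` are hypothetical. It computes temporal intersection-over-union and a simple per-boundary tolerance check.

```python
Segment = tuple[float, float]  # (start_seconds, end_seconds), an assumed representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals, in the range [0, 1]."""
    a_start, a_end = a
    b_start, b_end = b
    if a_end < a_start or b_end < b_start:
        raise ValueError("segment end must not precede its start")
    intersection = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - intersection
    return intersection / union if union > 0 else 0.0


def boundaries_agree(a: Segment, b: Segment, tolerance_s: float) -> bool:
    """True if both the start and the end differ by at most `tolerance_s` seconds."""
    return abs(a[0] - b[0]) <= tolerance_s and abs(a[1] - b[1]) <= tolerance_s


# Two annotators mark the same goal, half a second apart.
annotator_1 = (12.0, 15.0)
annotator_2 = (12.5, 15.5)
print(temporal_iou(annotator_1, annotator_2))
print(boundaries_agree(annotator_1, annotator_2, tolerance_s=0.5))
```

Whatever tolerance the guidelines set, writing it down as a number in seconds or frames lets disagreements be counted during the pilot rather than argued case by case.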
Further reading
- Audio Annotation: the same temporal-span ideas
- Image Annotation
- Span Annotation