# Audio Annotation

Segment audio files in Potato and assign labels to time regions. The tool displays an interactive waveform with playback controls, speed adjustment, and time-boundary marking.
Potato's audio annotation tool enables annotators to segment audio files and assign labels to time regions through a waveform-based interface.
## Features
- Waveform visualization
- Time-based segment creation
- Label assignment to segments
- Playback controls with variable speed
- Zoom and scroll navigation
- Keyboard shortcuts
- Server-side waveform caching
## Basic Configuration

```yaml
annotation_schemes:
  - name: "speakers"
    description: "Mark when each speaker is talking"
    annotation_type: "audio_annotation"
    labels:
      - name: "Speaker 1"
        color: "#3B82F6"
      - name: "Speaker 2"
        color: "#10B981"
```

## Configuration Options
| Field | Type | Default | Description |
|---|---|---|---|
| `name` | string | Required | Unique identifier for the annotation |
| `description` | string | Required | Instructions shown to annotators |
| `annotation_type` | string | Required | Must be `"audio_annotation"` |
| `mode` | string | `"label"` | Annotation mode: `"label"`, `"questions"`, or `"both"` |
| `labels` | list | Conditional | Required for `"label"` or `"both"` modes |
| `segment_schemes` | list | Conditional | Required for `"questions"` or `"both"` modes |
| `min_segments` | integer | 0 | Minimum number of segments required |
| `max_segments` | integer | null | Maximum number of segments allowed (null = unlimited) |
| `zoom_enabled` | boolean | true | Enable zoom controls |
| `playback_rate_control` | boolean | false | Show the playback speed selector |
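The required/conditional rules in the table above can be checked programmatically before launching a task. The sketch below is illustrative, not part of Potato itself; the helper name `validate_audio_scheme` is hypothetical.

```python
def validate_audio_scheme(scheme: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    # name, description, and annotation_type are always required
    for field in ("name", "description", "annotation_type"):
        if field not in scheme:
            errors.append(f"missing required field: {field}")
    if scheme.get("annotation_type") != "audio_annotation":
        errors.append('annotation_type must be "audio_annotation"')
    mode = scheme.get("mode", "label")  # "label" is the documented default
    if mode not in ("label", "questions", "both"):
        errors.append(f"unknown mode: {mode}")
    # labels and segment_schemes are conditionally required by mode
    if mode in ("label", "both") and not scheme.get("labels"):
        errors.append(f'"{mode}" mode requires a non-empty labels list')
    if mode in ("questions", "both") and not scheme.get("segment_schemes"):
        errors.append(f'"{mode}" mode requires segment_schemes')
    return errors

scheme = {
    "name": "speakers",
    "description": "Mark when each speaker is talking",
    "annotation_type": "audio_annotation",
    "labels": [{"name": "Speaker 1", "color": "#3B82F6"}],
}
assert validate_audio_scheme(scheme) == []
```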
## Label Configuration

Each label takes a display name, a waveform color, and an optional keyboard shortcut:

```yaml
labels:
  - name: "speech"
    color: "#3B82F6"
    key_value: "1"
  - name: "music"
    color: "#10B981"
    key_value: "2"
  - name: "silence"
    color: "#64748B"
    key_value: "3"
```

## Annotation Modes
### Label Mode (Default)

Segments receive category labels:

```yaml
annotation_schemes:
  - name: "emotion"
    description: "Label the emotion in each segment"
    annotation_type: "audio_annotation"
    mode: "label"
    labels:
      - name: "happy"
        color: "#22C55E"
      - name: "sad"
        color: "#3B82F6"
      - name: "angry"
        color: "#EF4444"
      - name: "neutral"
        color: "#64748B"
```

### Questions Mode
Each segment answers dedicated questions:
```yaml
annotation_schemes:
  - name: "transcription"
    description: "Transcribe each segment"
    annotation_type: "audio_annotation"
    mode: "questions"
    segment_schemes:
      - name: "transcript"
        annotation_type: "text"
        description: "Enter the transcription"
      - name: "confidence"
        annotation_type: "likert"
        description: "How confident are you?"
        size: 5
```

### Both Mode
Combines labeling with per-segment questionnaires:
```yaml
annotation_schemes:
  - name: "detailed_diarization"
    description: "Label speakers and add notes"
    annotation_type: "audio_annotation"
    mode: "both"
    labels:
      - name: "Speaker A"
        color: "#3B82F6"
      - name: "Speaker B"
        color: "#10B981"
    segment_schemes:
      - name: "notes"
        annotation_type: "text"
        description: "Any notes about this segment?"
```

## Global Audio Configuration
Configure waveform handling in your config file:
```yaml
audio_annotation:
  waveform_cache_dir: "waveform_cache/"
  waveform_look_ahead: 5
  waveform_cache_max_size: 1000
  client_fallback_max_duration: 1800
```

| Field | Description |
|---|---|
| `waveform_cache_dir` | Directory for cached waveform data |
| `waveform_look_ahead` | Number of upcoming instances to pre-compute |
| `waveform_cache_max_size` | Maximum number of cached waveform files |
| `client_fallback_max_duration` | Maximum audio length, in seconds, for browser-side waveform generation (default: 1800) |
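To make the caching behavior concrete, here is a rough sketch of what server-side waveform pre-computation can look like: downsample the audio to a fixed number of peak values, write them to a JSON file in the cache directory, and serve the cached file on later requests. This is illustrative only, not Potato's actual implementation; the cache layout and function names are assumptions, and the demo synthesizes its own WAV file with the standard library.

```python
import json
import math
import struct
import wave
from pathlib import Path

def waveform_peaks(wav_path: str, buckets: int = 200) -> list[float]:
    """Downsample a mono 16-bit WAV to `buckets` absolute peak values in [0, 1]."""
    with wave.open(wav_path, "rb") as w:
        n = w.getnframes()
        samples = struct.unpack(f"<{n}h", w.readframes(n))
    size = max(1, n // buckets)
    return [max(abs(s) for s in samples[i:i + size]) / 32768
            for i in range(0, n, size)][:buckets]

def cached_peaks(wav_path: str, cache_dir: str = "waveform_cache") -> list[float]:
    """Compute peaks once, then serve the cached JSON on later requests."""
    cache = Path(cache_dir) / (Path(wav_path).stem + ".json")
    if cache.exists():
        return json.loads(cache.read_text())
    peaks = waveform_peaks(wav_path)
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_text(json.dumps(peaks))
    return peaks

# Demo: write one second of a 440 Hz tone at 8 kHz, then cache its peaks.
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    frames = [int(32767 * math.sin(2 * math.pi * 440 * t / 8000))
              for t in range(8000)]
    w.writeframes(struct.pack("<8000h", *frames))

peaks = cached_peaks("tone.wav")
```

The second call to `cached_peaks` skips decoding entirely, which is what makes the `waveform_look_ahead` pre-computation worthwhile for long recordings.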
## Examples

### Speaker Diarization
```yaml
annotation_schemes:
  - name: "diarization"
    description: "Identify who is speaking at each moment"
    annotation_type: "audio_annotation"
    mode: "label"
    labels:
      - name: "Interviewer"
        color: "#8B5CF6"
        key_value: "1"
      - name: "Guest"
        color: "#EC4899"
        key_value: "2"
      - name: "Overlap"
        color: "#F59E0B"
        key_value: "3"
    zoom_enabled: true
    playback_rate_control: true
```

### Sound Event Detection
```yaml
annotation_schemes:
  - name: "sound_events"
    description: "Mark all sound events"
    annotation_type: "audio_annotation"
    labels:
      - name: "speech"
        color: "#3B82F6"
      - name: "music"
        color: "#10B981"
      - name: "applause"
        color: "#F59E0B"
      - name: "laughter"
        color: "#EC4899"
      - name: "silence"
        color: "#64748B"
    min_segments: 1
```

### Transcription Review
```yaml
annotation_schemes:
  - name: "transcription_review"
    description: "Review and correct the transcription for each segment"
    annotation_type: "audio_annotation"
    mode: "questions"
    segment_schemes:
      - name: "transcript"
        annotation_type: "text"
        description: "Enter or correct the transcription"
        multiline: true
      - name: "quality"
        annotation_type: "radio"
        description: "Audio quality"
        labels:
          - "Clear"
          - "Noisy"
          - "Unintelligible"
```

## Keyboard Shortcuts
| Key | Action |
|---|---|
| `Space` | Play/pause |
| `←` / `→` | Seek backward/forward |
| `[` | Mark segment start |
| `]` | Mark segment end |
| `Enter` | Create segment |
| `Delete` | Remove selected segment |
| `1`–`9` | Select label |
| `+` / `-` | Zoom in/out |
| `0` | Fit view |
## Data Format

### Input Data

Your data file should include audio file paths or URLs:
```json
[
  {
    "id": "audio_001",
    "audio_url": "https://example.com/audio/recording1.mp3"
  },
  {
    "id": "audio_002",
    "audio_url": "/data/audio/recording2.wav"
  }
]
```

Configure the audio field:
```yaml
item_properties:
  id_key: id
  text_key: audio_url
```

### Output Format
```json
{
  "id": "audio_001",
  "annotations": {
    "diarization": [
      {
        "start": 0.0,
        "end": 5.5,
        "label": "Interviewer"
      },
      {
        "start": 5.5,
        "end": 12.3,
        "label": "Guest"
      },
      {
        "start": 12.3,
        "end": 14.0,
        "label": "Overlap"
      }
    ]
  }
}
```

For questions mode, segments include nested responses:
```json
{
  "start": 0.0,
  "end": 5.5,
  "transcript": "Hello and welcome to the show.",
  "quality": "Clear"
}
```

## Supported Audio Formats
- MP3 (recommended)
- WAV
- OGG
- M4A
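The label-mode output shown under Output Format is straightforward to post-process. As a small sketch, here is how you might total up how many seconds each label was applied across all schemes in one item; the field names (`start`, `end`, `label`) follow the output format documented above, and the helper name is my own.

```python
from collections import defaultdict

def seconds_per_label(annotations: dict) -> dict[str, float]:
    """Sum (end - start) per label across all label-mode schemes in one item."""
    totals = defaultdict(float)
    for segments in annotations.values():
        for seg in segments:
            totals[seg["label"]] += seg["end"] - seg["start"]
    return dict(totals)

# The diarization output from the Output Format section above.
output = {
    "id": "audio_001",
    "annotations": {
        "diarization": [
            {"start": 0.0, "end": 5.5, "label": "Interviewer"},
            {"start": 5.5, "end": 12.3, "label": "Guest"},
            {"start": 12.3, "end": 14.0, "label": "Overlap"},
        ]
    },
}
totals = seconds_per_label(output["annotations"])
# totals["Interviewer"] == 5.5
```

Note this assumes label mode; questions-mode segments carry per-scheme answer keys instead of a `label` field, so they would need different handling.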
## Best Practices

- Pre-cache waveforms - Use server-side caching for large datasets
- Enable playback control - Variable speed helps with precise segmentation
- Use keyboard shortcuts - They are much faster than clicking
- Define clear boundaries - Specify what constitutes a segment's start and end
- Choose the appropriate mode - Use `"label"` for classification, `"questions"` for detailed annotation
- Set segment limits - Use `min_segments` to ensure coverage
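The `min_segments` / `max_segments` options amount to a simple count check on submission. A hedged sketch of that rule, using the defaults from the Configuration Options table (the helper name is hypothetical, not part of Potato's API):

```python
def check_segment_count(segments, min_segments=0, max_segments=None):
    """Return None if the segment count is acceptable, else an error message."""
    n = len(segments)
    if n < min_segments:
        return f"need at least {min_segments} segment(s), got {n}"
    # max_segments of None means unlimited, per the documented default
    if max_segments is not None and n > max_segments:
        return f"at most {max_segments} segment(s) allowed, got {n}"
    return None

assert check_segment_count([], min_segments=1) == "need at least 1 segment(s), got 0"
assert check_segment_count([{"start": 0.0, "end": 2.5, "label": "speech"}],
                           min_segments=1) is None
```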