# Audio Annotation

Potato's audio annotation tool enables annotators to segment audio files and assign labels to time regions through a waveform-based interface.
## Features
- Waveform visualization
- Time-based segment creation
- Label assignment to segments
- Playback controls with variable speed
- Zoom and scroll navigation
- Keyboard shortcuts
- Server-side waveform caching
## Basic Configuration

```yaml
annotation_schemes:
  - name: "speakers"
    description: "Mark when each speaker is talking"
    annotation_type: "audio_annotation"
    labels:
      - name: "Speaker 1"
        color: "#3B82F6"
      - name: "Speaker 2"
        color: "#10B981"
```

## Configuration Options
| Field | Type | Default | Description |
|---|---|---|---|
| `name` | string | Required | Unique identifier for the annotation |
| `description` | string | Required | Instructions shown to annotators |
| `annotation_type` | string | Required | Must be `"audio_annotation"` |
| `mode` | string | `"label"` | Annotation mode: `"label"`, `"questions"`, or `"both"` |
| `labels` | list | Conditional | Required for `"label"` or `"both"` modes |
| `segment_schemes` | list | Conditional | Required for `"questions"` or `"both"` modes |
| `min_segments` | integer | `0` | Minimum number of segments required |
| `max_segments` | integer | `null` | Maximum number of segments allowed (`null` = unlimited) |
| `zoom_enabled` | boolean | `true` | Enable zoom controls |
| `playback_rate_control` | boolean | `false` | Show the playback speed selector |
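For instance, a sketch of a scheme combining the segment-count and UI options above (the scheme and label names are illustrative):

```yaml
annotation_schemes:
  - name: "coverage_check"
    description: "Segment the full recording"
    annotation_type: "audio_annotation"
    mode: "label"
    min_segments: 1          # at least one segment must be created
    max_segments: 20         # cap the number of segments
    zoom_enabled: true
    playback_rate_control: true
    labels:
      - name: "speech"
        color: "#3B82F6"
      - name: "other"
        color: "#64748B"
```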
## Label Configuration

```yaml
labels:
  - name: "speech"
    color: "#3B82F6"
    key_value: "1"
  - name: "music"
    color: "#10B981"
    key_value: "2"
  - name: "silence"
    color: "#64748B"
    key_value: "3"
```

## Annotation Modes
### Label Mode (Default)

Segments receive category labels:

```yaml
annotation_schemes:
  - name: "emotion"
    description: "Label the emotion in each segment"
    annotation_type: "audio_annotation"
    mode: "label"
    labels:
      - name: "happy"
        color: "#22C55E"
      - name: "sad"
        color: "#3B82F6"
      - name: "angry"
        color: "#EF4444"
      - name: "neutral"
        color: "#64748B"
```

### Questions Mode
Each segment answers a dedicated set of questions:

```yaml
annotation_schemes:
  - name: "transcription"
    description: "Transcribe each segment"
    annotation_type: "audio_annotation"
    mode: "questions"
    segment_schemes:
      - name: "transcript"
        annotation_type: "text"
        description: "Enter the transcription"
      - name: "confidence"
        annotation_type: "likert"
        description: "How confident are you?"
        size: 5
```

### Both Mode
Combines labeling with per-segment questionnaires:

```yaml
annotation_schemes:
  - name: "detailed_diarization"
    description: "Label speakers and add notes"
    annotation_type: "audio_annotation"
    mode: "both"
    labels:
      - name: "Speaker A"
        color: "#3B82F6"
      - name: "Speaker B"
        color: "#10B981"
    segment_schemes:
      - name: "notes"
        annotation_type: "text"
        description: "Any notes about this segment?"
```

## Global Audio Configuration
Configure waveform handling in your config file:
```yaml
audio_annotation:
  waveform_cache_dir: "waveform_cache/"
  waveform_look_ahead: 5
  waveform_cache_max_size: 1000
  client_fallback_max_duration: 1800
```

| Field | Description |
|---|---|
| `waveform_cache_dir` | Directory for cached waveform data |
| `waveform_look_ahead` | Number of upcoming instances whose waveforms are pre-computed |
| `waveform_cache_max_size` | Maximum number of cached waveform files |
| `client_fallback_max_duration` | Maximum audio duration, in seconds, for browser-side waveform generation (default: `1800`) |
## Examples

### Speaker Diarization

```yaml
annotation_schemes:
  - name: "diarization"
    description: "Identify who is speaking at each moment"
    annotation_type: "audio_annotation"
    mode: "label"
    labels:
      - name: "Interviewer"
        color: "#8B5CF6"
        key_value: "1"
      - name: "Guest"
        color: "#EC4899"
        key_value: "2"
      - name: "Overlap"
        color: "#F59E0B"
        key_value: "3"
    zoom_enabled: true
    playback_rate_control: true
```

### Sound Event Detection
```yaml
annotation_schemes:
  - name: "sound_events"
    description: "Mark all sound events"
    annotation_type: "audio_annotation"
    labels:
      - name: "speech"
        color: "#3B82F6"
      - name: "music"
        color: "#10B981"
      - name: "applause"
        color: "#F59E0B"
      - name: "laughter"
        color: "#EC4899"
      - name: "silence"
        color: "#64748B"
    min_segments: 1
```

### Transcription Review
```yaml
annotation_schemes:
  - name: "transcription_review"
    description: "Review and correct the transcription for each segment"
    annotation_type: "audio_annotation"
    mode: "questions"
    segment_schemes:
      - name: "transcript"
        annotation_type: "text"
        description: "Enter or correct the transcription"
        textarea: true
      - name: "quality"
        annotation_type: "radio"
        description: "Audio quality"
        labels:
          - "Clear"
          - "Noisy"
          - "Unintelligible"
```

## Keyboard Shortcuts
| Key | Action |
|---|---|
| `Space` | Play/pause |
| `←` / `→` | Seek backward/forward |
| `[` | Mark segment start |
| `]` | Mark segment end |
| `Enter` | Create segment |
| `Delete` | Remove selected segment |
| `1`–`9` | Select label |
| `+` / `-` | Zoom in/out |
| `0` | Fit view |
## Data Format

### Input Data
Your data file should include audio file paths or URLs:
```json
[
  {
    "id": "audio_001",
    "audio_url": "https://example.com/audio/recording1.mp3"
  },
  {
    "id": "audio_002",
    "audio_url": "/data/audio/recording2.wav"
  }
]
```

Configure the audio field:

```yaml
item_properties:
  id_key: id
  text_key: audio_url
```

### Output Format
```json
{
  "id": "audio_001",
  "annotations": {
    "diarization": [
      {
        "start": 0.0,
        "end": 5.5,
        "label": "Interviewer"
      },
      {
        "start": 5.5,
        "end": 12.3,
        "label": "Guest"
      },
      {
        "start": 12.3,
        "end": 14.0,
        "label": "Overlap"
      }
    ]
  }
}
```

For questions mode, segments include nested responses:
```json
{
  "start": 0.0,
  "end": 5.5,
  "transcript": "Hello and welcome to the show.",
  "quality": "Clear"
}
```

## Supported Audio Formats
- MP3 (recommended)
- WAV
- OGG
- M4A
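Before launching a task, it can help to verify that every item points at a supported format. A minimal pre-flight sketch, assuming the JSON input layout shown above (the sample items here are illustrative):

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

SUPPORTED = {".mp3", ".wav", ".ogg", ".m4a"}

def unsupported_items(items):
    """Return (id, audio_url) pairs whose file extension is not a supported format."""
    bad = []
    for item in items:
        # Works for both URLs and local paths: parse the URL, then take the path suffix.
        path = urlparse(item["audio_url"]).path
        if PurePosixPath(path).suffix.lower() not in SUPPORTED:
            bad.append((item["id"], item["audio_url"]))
    return bad

# Illustrative data items in the input format above; audio_002 uses FLAC, which is unsupported.
items = [
    {"id": "audio_001", "audio_url": "https://example.com/audio/recording1.mp3"},
    {"id": "audio_002", "audio_url": "/data/audio/recording2.flac"},
]
print(unsupported_items(items))  # [('audio_002', '/data/audio/recording2.flac')]
```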
## Best Practices

- **Pre-cache waveforms**: Use server-side caching for large datasets
- **Enable playback control**: Variable speed helps with precise segmentation
- **Use keyboard shortcuts**: They are much faster than clicking
- **Define clear boundaries**: Specify what constitutes a segment start and end
- **Choose the appropriate mode**: Use `"label"` for classification, `"questions"` for detailed annotation
- **Set segment limits**: Use `min_segments` to ensure coverage
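As a sketch of downstream processing, the segment output is easy to aggregate. Assuming a record in the diarization output format shown earlier (the record here is the illustrative one from that example), this sums the labeled time per label:

```python
from collections import defaultdict

def duration_per_label(record, scheme):
    """Sum segment durations (end - start, in seconds) for each label in one scheme."""
    totals = defaultdict(float)
    for segment in record["annotations"][scheme]:
        totals[segment["label"]] += segment["end"] - segment["start"]
    return dict(totals)

# Record in the output format shown above.
record = {
    "id": "audio_001",
    "annotations": {
        "diarization": [
            {"start": 0.0, "end": 5.5, "label": "Interviewer"},
            {"start": 5.5, "end": 12.3, "label": "Guest"},
            {"start": 12.3, "end": 14.0, "label": "Overlap"},
        ]
    },
}
print(duration_per_label(record, "diarization"))
```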