Audio Annotation
Annotate audio files with waveform visualization and playback controls.
Potato 2.0 supports audio annotation with waveform visualization powered by Peaks.js, segment labeling, and keyboard shortcuts for playback, navigation, and labeling.
Use Cases
- Speech transcription and review
- Speaker diarization
- Music analysis
- Audio event detection
- Emotion recognition in speech
- Call center quality assurance
Enabling Audio Support
Add a scheme with annotation_type: audio to the annotation_schemes section of your configuration:
annotation_schemes:
  - annotation_type: audio
    name: audio_segments
    description: "Segment and label the audio"
    labels:
      - Speech
      - Music
      - Silence
      - Noise
Operational Modes
Potato supports three audio annotation modes:
Label Mode
Segment audio and assign category labels to each segment:
annotation_schemes:
  - annotation_type: audio
    name: speaker_diarization
    mode: label
    description: "Identify speakers in the audio"
    labels:
      - Speaker A
      - Speaker B
      - Overlap
    label_colors:
      "Speaker A": "#3b82f6"
      "Speaker B": "#10b981"
      "Overlap": "#f59e0b"
Questions Mode
Add per-segment annotation questions:
annotation_schemes:
  - annotation_type: audio
    name: speech_quality
    mode: questions
    description: "Evaluate speech segments"
    segment_questions:
      - name: clarity
        type: likert
        size: 5
        min_label: "Unclear"
        max_label: "Very clear"
      - name: emotion
        type: radio
        labels: [Neutral, Happy, Sad, Angry]
Both Mode
Combine labeling with per-segment questions:
annotation_schemes:
  - annotation_type: audio
    name: full_analysis
    mode: both
    description: "Label and analyze audio segments"
    labels:
      - Speech
      - Music
      - Noise
    segment_questions:
      - name: quality
        type: likert
        size: 5
Configuration Options
Basic Setup
annotation_schemes:
  - annotation_type: audio
    name: segments
    description: "Create audio segments"
    labels:
      - Label A
      - Label B
    # Optional constraints
    min_segments: 1
    max_segments: 50
Keyboard Shortcuts
Labels can be assigned using number keys 1-9:
annotation_schemes:
  - annotation_type: audio
    name: speakers
    labels:
      - Speaker A  # Press 1
      - Speaker B  # Press 2
      - Overlap    # Press 3
Label Colors
Customize segment colors:
annotation_schemes:
  - annotation_type: audio
    name: segments
    labels:
      - Speech
      - Music
      - Silence
    label_colors:
      "Speech": "#3b82f6"
      "Music": "#10b981"
      "Silence": "#6b7280"
Waveform Performance
For optimal performance with long audio files, install the BBC audiowaveform tool:
# macOS
brew install audiowaveform
# Ubuntu/Debian
sudo apt-get install audiowaveform
# Or build from source
# https://github.com/bbc/audiowaveform
This enables server-side waveform generation. Without it, client-side generation is used (suitable for files under 30 minutes).
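The fallback rule above can be sketched in Python. This is only an illustration of the decision, not Potato's actual implementation: `pick_waveform_backend` and the `"too_long_for_client"` result are hypothetical, and the 1800-second limit is simply the 30-minute figure mentioned above. The only real call used is `shutil.which`, which checks whether an `audiowaveform` executable is on the PATH.

```python
import shutil

# Assumption: 30 minutes, the client-side limit mentioned above.
CLIENT_MAX_SECONDS = 1800


def pick_waveform_backend(duration_seconds, has_audiowaveform=None,
                          client_max=CLIENT_MAX_SECONDS):
    """Hypothetical helper: report which waveform path a file would take."""
    if has_audiowaveform is None:
        # shutil.which returns None when the executable is not on PATH.
        has_audiowaveform = shutil.which("audiowaveform") is not None
    if has_audiowaveform:
        return "server"
    if duration_seconds <= client_max:
        return "client"
    # The text only says client-side generation suits shorter files;
    # what happens beyond the limit is not specified here.
    return "too_long_for_client"
```

Running such a check over a dataset before launching a task would show which recordings depend on audiowaveform being installed.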
Waveform Caching
Configure caching for better performance:
audio_config:
  cache_dir: "audio_cache/"
  precompute_depth: 100  # Pre-generate waveforms for first N items
  client_fallback_max_duration: 1800  # 30 minutes in seconds
Data Format
Simple Audio Reference
[
  {"id": "1", "audio_path": "audio/recording_001.wav"},
  {"id": "2", "audio_path": "audio/recording_002.wav"}
]
Point the configuration at this file and name the keys that hold the ID and the audio path:
data_files:
  - "data/audio_data.json"
item_properties:
  id_key: id
  audio_key: audio_path
With Transcripts
[
  {
    "id": "1",
    "audio_path": "audio/call_001.wav",
    "transcript": "Hello, how can I help you today?"
  }
]
Output Format
Annotations are saved with segment timestamps:
{
  "id": "audio_1",
  "annotations": {
    "segments": [
      {
        "start": 0.0,
        "end": 2.5,
        "label": "Speaker A",
        "questions": {
          "clarity": 4,
          "emotion": "Neutral"
        }
      },
      {
        "start": 2.5,
        "end": 5.2,
        "label": "Speaker B"
      }
    ]
  }
}
Keyboard Shortcuts
Potato provides extensive keyboard shortcuts for efficient annotation:
| Shortcut | Action |
|---|---|
| Space | Play/Pause |
| [ | Set segment start at current position |
| ] | Set segment end at current position |
| 1-9 | Assign label to current segment |
| Delete | Remove current segment |
| Left Arrow | Seek backward 5 seconds |
| Right Arrow | Seek forward 5 seconds |
| Up Arrow | Zoom in |
| Down Arrow | Zoom out |
| Home | Go to start |
| End | Go to end |
| + | Increase playback speed |
| - | Decrease playback speed |
Example Configurations
Speaker Diarization
task_name: "Speaker Diarization"
task_dir: "."
port: 8000
data_files:
  - "data/recordings.json"
item_properties:
  id_key: id
  audio_key: audio_path
annotation_schemes:
  - annotation_type: audio
    name: speakers
    mode: label
    description: "Identify who is speaking"
    labels:
      - Speaker 1
      - Speaker 2
      - Speaker 3
      - Overlap
      - Silence
    label_colors:
      "Speaker 1": "#3b82f6"
      "Speaker 2": "#10b981"
      "Speaker 3": "#f59e0b"
      "Overlap": "#ef4444"
      "Silence": "#6b7280"
    min_segments: 1
audio_config:
  cache_dir: "audio_cache/"
  precompute_depth: 50
output_annotation_dir: "output/"
output_annotation_format: "json"
allow_all_users: true
Transcription Review
task_name: "Transcription Quality Review"
task_dir: "."
port: 8000
data_files:
  - "data/transcripts.json"
item_properties:
  id_key: id
  text_key: transcript
  audio_key: audio_path
annotation_schemes:
  - annotation_type: audio
    name: errors
    mode: questions
    description: "Mark transcription errors"
    segment_questions:
      - name: error_type
        type: radio
        labels:
          - Missing word
          - Wrong word
          - Extra word
          - Spelling error
      - name: severity
        type: likert
        size: 3
        min_label: "Minor"
        max_label: "Major"
  - annotation_type: radio
    name: overall_accuracy
    description: "Overall transcript accuracy"
    labels:
      - Accurate
      - Minor errors
      - Major errors
      - Unusable
output_annotation_dir: "output/"
output_annotation_format: "json"
Call Center QA
task_name: "Call Center Quality Assurance"
task_dir: "."
port: 8000
data_files:
  - "data/calls.json"
item_properties:
  id_key: call_id
  audio_key: recording_path
annotation_schemes:
  # Segment-level annotation
  - annotation_type: audio
    name: conversation
    mode: both
    description: "Segment the conversation"
    labels:
      - Agent
      - Customer
      - Hold
      - Silence
    segment_questions:
      - name: sentiment
        type: radio
        labels: [Positive, Neutral, Negative, Frustrated]
  # Call-level assessment
  - annotation_type: likert
    name: professionalism
    description: "Agent professionalism"
    size: 5
    min_label: "Poor"
    max_label: "Excellent"
  - annotation_type: likert
    name: resolution
    description: "Issue resolution"
    size: 5
    min_label: "Unresolved"
    max_label: "Fully resolved"
  - annotation_type: multiselect
    name: issues
    description: "Select any issues observed"
    labels:
      - Long hold time
      - Agent interrupted
      - Incorrect information
      - Missing greeting
      - Unprofessional language
  - annotation_type: text
    name: notes
    description: "Additional observations"
    textarea: true
output_annotation_dir: "output/"
output_annotation_format: "json"
Supported Audio Formats
- WAV (recommended for best quality)
- MP3
- OGG
- FLAC
- M4A
- WebM
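As a small illustration, the list above can be used as a pre-flight filter when building a data file. This sketch assumes the conventional file extension for each format (the list names formats, not extensions) and reuses the `id` and `audio_path` keys from the Simple Audio Reference example. `build_items` is a hypothetical helper, not part of Potato.

```python
import json
from pathlib import PurePosixPath

# Assumption: conventional extensions for the formats listed above.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".ogg", ".flac", ".m4a", ".webm"}


def build_items(paths):
    """Hypothetical helper: split paths into data-file items and rejects."""
    items, rejected = [], []
    for path in paths:
        if PurePosixPath(path).suffix.lower() in SUPPORTED_EXTENSIONS:
            # Ids are sequential strings, as in the example data file.
            items.append({"id": str(len(items) + 1), "audio_path": path})
        else:
            rejected.append(path)
    return items, rejected


items, rejected = build_items([
    "audio/recording_001.wav",
    "audio/recording_002.MP3",
    "audio/notes.txt",
])
print(json.dumps(items, indent=2))
print("skipped:", rejected)
```

The extension check is case-insensitive, so `recording_002.MP3` is kept while `notes.txt` is reported as skipped. It only inspects file names, so it cannot tell whether a file actually decodes.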
Performance Tips
- Install audiowaveform - Essential for long audio files
- Enable caching - Use cache_dir to store pre-generated waveforms
- Use WAV for quality - Compressed formats may introduce artifacts
- Pre-process audio - Normalize levels, trim unnecessary silence
- Consider file sizes - Large files slow down loading
- Use precompute - Pre-generate waveforms for initial instances
Troubleshooting
Waveform Not Loading
- Check audio file path is correct
- Verify file format is supported
- Install audiowaveform for long files
- Check browser console for errors
Slow Performance
- Install audiowaveform tool
- Enable waveform caching
- Reduce audio file sizes
- Use precompute_depth setting
Segments Not Saving
- Ensure output directory is writable
- Check annotation format configuration
- Verify segment has both start and end times
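To inspect saved output, a short script can read one record in the output format shown earlier, flag segments that lack a usable start or end time, and total the labeled time per label. This is a sketch that assumes records look exactly like the example in this document. `summarize_segments` is a hypothetical helper, not a Potato API.

```python
from collections import defaultdict


def summarize_segments(record):
    """Hypothetical helper: return (seconds per label, problem segments)."""
    totals = defaultdict(float)
    problems = []
    for segment in record.get("annotations", {}).get("segments", []):
        start, end = segment.get("start"), segment.get("end")
        # A usable segment needs both times, with start before end.
        if start is None or end is None or start >= end:
            problems.append(segment)
            continue
        totals[segment.get("label", "(unlabeled)")] += end - start
    return dict(totals), problems


record = {
    "id": "audio_1",
    "annotations": {"segments": [
        {"start": 0.0, "end": 2.5, "label": "Speaker A"},
        {"start": 2.5, "end": 5.2, "label": "Speaker B"},
        {"start": 6.0, "label": "Speaker A"},  # missing end time
    ]},
}
totals, problems = summarize_segments(record)
print(totals)
print(problems)
```

Here the third segment has no end time, so it is returned in `problems` and left out of the totals.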