Audio Annotation
A complete guide to audio annotation in Potato: classification, tagging, sound event detection on the waveform, transcription, quality (MOS) ratings, emotion, and speaker diarization.
Audio annotation covers everything from labeling a whole clip ("is this speech or music?") to marking the exact moment a sound occurs on the waveform. Potato displays an interactive waveform with playback and time markers, so the same tool handles classification, tagging, time-aligned event detection, transcription, quality ratings, emotion labeling, and speaker diarization. For the feature reference, see Audio Annotation.
This guide maps each common audio task to a Potato setup and a runnable showcase design.
Clip-level classification
Label the whole clip with one category. This covers acoustic scene classification, environmental sound classification, keyword spotting, and respiratory sound classification.
annotation_schemes:
  - annotation_type: radio
    name: scene
    description: "What environment was this recorded in?"
    labels: [Street, Park, Office, Home, Vehicle]

Multi-label tagging
When several sounds or tags apply at once, as in music tagging and AudioSet-style event classification, use multiselect.
annotation_schemes:
  - annotation_type: multiselect
    name: tags
    description: "Select every instrument you can hear."
    labels: [Guitar, Drums, Piano, Vocals, Bass, Synth]

Sound event detection: spans on the waveform
To mark when a sound starts and ends, use a span over the audio timeline. This is sound event detection, the audio version of span annotation.
annotation_schemes:
  - annotation_type: span
    name: events
    description: "Mark the start and end of each sound event and label it."
    labels: [Speech, Music, Dog bark, Siren, Silence]

Transcription
For audio transcription, pair playback with a free-text field. Annotators can scrub the waveform while they type.
annotation_schemes:
  - annotation_type: text
    name: transcript
    description: "Type what is said in this clip."

Quality ratings: MOS and intelligibility
Subjective audio quality is measured with a mean opinion score, a 1–5 Likert rating averaged across listeners. This covers speech quality (MOS) and speech intelligibility.
annotation_schemes:
  - annotation_type: likert
    name: mos
    description: "Rate the overall quality of this audio."
    size: 5
    min_label: "Bad"
    max_label: "Excellent"

See Rating Scales for scale-design tips.
Emotion and sentiment
Speech emotion recognition and audio sentiment analysis combine a category (the emotion) with dimensional ratings (arousal, valence) using radio plus slider or likert.
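As a rough sketch of the radio-plus-likert combination, the config below pairs one categorical scheme with two 5-point dimensional scales. It uses only the keys that appear in the other examples in this guide. The scheme names, descriptions, emotion labels, and endpoint labels are illustrative assumptions, not a recommended taxonomy. A slider variant is not shown because its options are not covered here.

```yaml
annotation_schemes:
  # Categorical emotion: one label per clip (label set is illustrative)
  - annotation_type: radio
    name: emotion
    description: "Which emotion best describes the speaker?"
    labels: [Neutral, Happy, Sad, Angry, Fearful]
  # Dimensional rating: arousal on an assumed 5-point scale
  - annotation_type: likert
    name: arousal
    description: "How energetic does the speaker sound?"
    size: 5
    min_label: "Calm"
    max_label: "Excited"
  # Dimensional rating: valence on an assumed 5-point scale
  - annotation_type: likert
    name: valence
    description: "How positive or negative does the speaker sound?"
    size: 5
    min_label: "Negative"
    max_label: "Positive"
```

Keeping the category and the two dimensions as separate schemes lets each be analyzed on its own, for example agreement on the emotion label versus spread in the arousal ratings.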
Speaker diarization
Speaker diarization answers "who spoke when". Annotators mark time spans and link each to a speaker, which is span annotation plus a linking step.
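One simple way to approximate this, assuming speaker identity can be carried by the span label itself, is a span scheme with one label per speaker. This is a sketch built from the span example earlier in the guide. The scheme name, the speaker labels, and the "Overlap" label are hypothetical, and a dedicated linking step may be configured differently.

```yaml
annotation_schemes:
  # Each span marks one speaker turn; the label identifies the speaker.
  # Labels are placeholders: adjust the count to the expected number of speakers.
  - annotation_type: span
    name: speakers
    description: "Mark each stretch of speech and label it with the speaker."
    labels: [Speaker 1, Speaker 2, Speaker 3, Overlap]
```

With this layout, annotators need a shared convention for which voice counts as "Speaker 1" within a clip, for example the first voice heard.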
Practical tips
- Keep clips short enough to judge in one or two plays; long clips lower agreement.
- For event detection, agree on how precise boundaries must be and measure agreement at the span level; see Inter-Annotator Agreement.
- Normalize loudness across clips so quality ratings aren't driven by volume.