Audio Annotation

A complete guide to audio annotation in Potato: classification, tagging, sound event detection on the waveform, transcription, quality (MOS) ratings, emotion, and speaker diarization.

Audio annotation covers everything from labeling a whole clip ("is this speech or music?") to marking the exact moment a sound occurs on the waveform. Potato displays an interactive waveform with playback and time markers, so the same tool handles classification, tagging, time-aligned event detection, transcription, quality ratings, and speaker diarization. For the feature reference, see Audio Annotation.

This guide maps each common audio task to a Potato setup and a runnable showcase design.

Clip-level classification

Label the whole clip with one category. This covers acoustic scene classification, environmental sound classification, keyword spotting, and respiratory sound classification.

yaml
annotation_schemes:
  - annotation_type: radio
    name: scene
    description: "What environment was this recorded in?"
    labels: [Street, Park, Office, Home, Vehicle]

Multi-label tagging

When several sounds or tags apply at once, as in music tagging and AudioSet-style event classification, use multiselect.

yaml
annotation_schemes:
  - annotation_type: multiselect
    name: tags
    description: "Select every instrument you can hear."
    labels: [Guitar, Drums, Piano, Vocals, Bass, Synth]

Sound event detection: spans on the waveform

To mark when a sound starts and ends, use a span over the audio timeline. This is sound event detection, the audio version of span annotation.

yaml
annotation_schemes:
  - annotation_type: span
    name: events
    description: "Mark the start and end of each sound event and label it."
    labels: [Speech, Music, Dog bark, Siren, Silence]

Transcription

For audio transcription, pair playback with a free-text field. Annotators can scrub the waveform while they type.

yaml
annotation_schemes:
  - annotation_type: text
    name: transcript
    description: "Type what is said in this clip."

Quality ratings: MOS and intelligibility

Subjective audio quality is measured with a mean opinion score (MOS): each listener rates the clip on a 1–5 scale, and the ratings are averaged across listeners. The same setup covers speech quality and speech intelligibility.

yaml
annotation_schemes:
  - annotation_type: likert
    name: mos
    description: "Rate the overall quality of this audio."
    size: 5
    min_label: "Bad"
    max_label: "Excellent"

See Rating Scales for scale-design tips.

Emotion and sentiment

Speech emotion recognition and audio sentiment analysis combine a category (the emotion) with dimensional ratings (arousal, valence). In Potato, pair a radio scheme for the category with a slider or likert scheme for each dimension.
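
A minimal sketch of one such setup, using only the scheme keys that appear elsewhere in this guide. The scheme names, emotion labels, and endpoint labels are illustrative assumptions, not a recommended taxonomy; a slider could replace either likert scheme.

yaml
annotation_schemes:
  # Categorical label: one emotion per clip (label set is an assumption)
  - annotation_type: radio
    name: emotion
    description: "Which emotion best describes the speaker?"
    labels: [Neutral, Happy, Sad, Angry, Fearful]
  # Dimensional rating: arousal on a 5-point scale
  - annotation_type: likert
    name: arousal
    description: "How energetic does the speaker sound?"
    size: 5
    min_label: "Calm"
    max_label: "Excited"
  # Dimensional rating: valence on a 5-point scale
  - annotation_type: likert
    name: valence
    description: "How positive or negative does the speaker sound?"
    size: 5
    min_label: "Negative"
    max_label: "Positive"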

Speaker diarization

Speaker diarization answers "who spoke when". Annotators mark time spans and link each to a speaker, which is span annotation plus a linking step.
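
A minimal sketch of the span part only, reusing the span scheme shown for sound event detection. It assumes a fixed, known set of speakers, so the speaker is recorded directly as the span label; the scheme name and speaker labels are hypothetical, and the separate linking step is not shown here.

yaml
annotation_schemes:
  # One span per speaker turn; the label identifies who is speaking
  - annotation_type: span
    name: speaker_turns
    description: "Mark each stretch of speech and label it with the speaker."
    labels: [Speaker A, Speaker B, Speaker C]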

Practical tips

  • Keep clips short enough to judge in one or two plays; long clips lower agreement.
  • For event detection, agree on how precise boundaries must be, and measure agreement at the span level (see Inter-Annotator Agreement).
  • Normalize loudness across clips so quality ratings aren't driven by volume.

Further reading