# Audio Annotation

Source: https://www.potatoannotator.com/docs/guides/audio-annotation

**Audio annotation covers everything from labeling a whole clip ("is this speech or music?") to marking the exact moment a sound occurs on the waveform.** Potato displays an interactive waveform with playback and time markers, so the same tool handles classification, tagging, time-aligned event detection, transcription, quality ratings, and speaker diarization. For the feature reference, see [Audio Annotation](/docs/features/audio-annotation).

This guide maps each common audio task to a Potato setup and a runnable showcase design.

## Clip-level classification

Label the whole clip with one category. This covers [acoustic scene classification](/showcase/acoustic-scene-classification), [environmental sound classification](/showcase/environmental-sound-classification), [keyword spotting](/showcase/keyword-spotting), and [respiratory sound classification](/showcase/respiratory-sound-classification).

```yaml
annotation_schemes:
  - annotation_type: radio
    name: scene
    description: "What environment was this recorded in?"
    labels: [Street, Park, Office, Home, Vehicle]
```

## Multi-label tagging

When several sounds or tags apply at once, as in [music tagging](/showcase/music-tagging) and [AudioSet-style event classification](/showcase/audioset-event-classification), use `multiselect`.

```yaml
annotation_schemes:
  - annotation_type: multiselect
    name: tags
    description: "Select every instrument you can hear."
    labels: [Guitar, Drums, Piano, Vocals, Bass, Synth]
```

## Sound event detection: spans on the waveform

To mark *when* a sound starts and ends, use a span over the audio timeline. This is [sound event detection](/showcase/sound-event-detection), the audio version of [span annotation](/docs/guides/span-annotation).

```yaml
annotation_schemes:
  - annotation_type: span
    name: events
    description: "Mark the start and end of each sound event and label it."
    labels: [Speech, Music, Dog bark, Siren, Silence]
```

## Transcription

For [audio transcription](/showcase/audio-transcription), pair playback with a free-text field. Annotators can scrub the waveform while they type.

```yaml
annotation_schemes:
  - annotation_type: text
    name: transcript
    description: "Type what is said in this clip."
```

## Quality ratings: MOS and intelligibility

Subjective audio quality is typically measured with a [mean opinion score](https://en.wikipedia.org/wiki/Mean_opinion_score) (MOS): listeners rate each clip on a 1–5 scale, and the ratings are averaged across listeners. The same setup covers [speech quality (MOS)](/showcase/speech-quality-mos) and [speech intelligibility](/showcase/speech-intelligibility-rating).

```yaml
annotation_schemes:
  - annotation_type: likert
    name: mos
    description: "Rate the overall quality of this audio."
    size: 5
    min_label: "Bad"
    max_label: "Excellent"
```
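For intelligibility, the same `likert` keys can be reused with a different question. The sketch below is illustrative only: the scheme name, wording, and endpoint labels are assumptions rather than the showcase's actual configuration.

```yaml
annotation_schemes:
  - annotation_type: likert
    name: intelligibility          # hypothetical scheme name
    description: "How easy is it to understand what is said?"
    size: 5
    min_label: "Unintelligible"    # illustrative endpoint labels
    max_label: "Fully intelligible"
```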

See [Rating Scales](/docs/guides/rating-scales) for scale-design tips.

## Emotion and sentiment

[Speech emotion recognition](/showcase/speech-emotion-recognition) and [audio sentiment analysis](/showcase/audio-sentiment-analysis) combine a category (the emotion) with dimensional ratings (arousal, valence) using `radio` plus `slider` or `likert`.
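A minimal sketch of that combination, reusing only the `radio` and `likert` keys shown earlier in this guide. The scheme names, label set, and endpoint labels are illustrative assumptions, not taken from the showcase designs.

```yaml
annotation_schemes:
  # Categorical judgment: one emotion per clip (label set is illustrative)
  - annotation_type: radio
    name: emotion
    description: "Which emotion does the speaker convey?"
    labels: [Neutral, Happy, Sad, Angry, Fearful]
  # Dimensional ratings, each on a 5-point scale
  - annotation_type: likert
    name: arousal
    description: "How calm or energetic does the speaker sound?"
    size: 5
    min_label: "Calm"
    max_label: "Excited"
  - annotation_type: likert
    name: valence
    description: "How negative or positive does the speaker sound?"
    size: 5
    min_label: "Negative"
    max_label: "Positive"
```

Swapping `likert` for `slider` would give a continuous scale instead of discrete points; check the feature reference for the keys that `slider` expects.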

## Speaker diarization

[Speaker diarization](/showcase/speaker-diarization) answers "who spoke when". Annotators mark time spans and link each to a speaker, which is span annotation plus a linking step.
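One way to approximate this with the `span` scheme from the sound event detection section is to use speaker identities as the span labels, so that choosing a label does the linking. This is a simplified sketch under that assumption: the scheme name and speaker labels are hypothetical, and the showcase design may configure the linking step differently.

```yaml
annotation_schemes:
  - annotation_type: span
    name: speaker_turns            # hypothetical scheme name
    description: "Mark each stretch of speech and label it with the speaker."
    # Generic speaker slots; rename or extend them to fit the recordings
    labels: [Speaker 1, Speaker 2, Speaker 3, Overlap]
```

Fixed speaker slots work when the number of speakers per clip is small and known in advance; recordings with many or unknown speakers need a more flexible linking step.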

## Practical tips

- Keep clips short enough to judge in one or two plays; long clips lower agreement.
- For event detection, agree on how precise boundaries must be and measure agreement at the span level; see [Inter-Annotator Agreement](/docs/guides/inter-annotator-agreement).
- Normalize loudness across clips so quality ratings aren't driven by volume.

## Further reading

- [Audio Annotation feature reference](/docs/features/audio-annotation)
- [Video Annotation](/docs/guides/video-annotation)
- [Span Annotation](/docs/guides/span-annotation)