# Audio Event Detection and Tagging

Source: https://www.potatoannotator.com/blog/audio-event-detection

Audio event detection is about finding specific sounds inside a recording: a dog barking, a siren, a stretch of music, a door slamming. This tutorial covers timestamp-based annotation for training sound recognition models. For the audio configuration options behind it, see the [audio annotation documentation](https://github.com/davidjurgens/potato/blob/master/docs/annotation-types/multimedia/audio_annotation.md).

## Types of Audio Event Annotation

The simplest form is clip-level tagging, where you label a whole clip with the sounds it contains. Temporal detection goes further and marks the start and end of each event. People usually call precise per-event timestamps "strong" labeling and presence-or-absence without timing "weak" labeling. Which one you want depends on what your model needs to learn.

## Clip-Level Sound Tagging

For short clips with single events:

```yaml
annotation_task_name: "Sound Event Classification"

data_files:
  - data/audio_clips.json

item_properties:
  audio_path: audio_path

annotation_schemes:
  - annotation_type: audio_annotation
    audio_display: waveform
    waveform_color: "#10B981"
    progress_color: "#34D399"
    name: sound_class
    description: "What sound is in this clip?"
    labels:
      - Dog bark
      - Car horn
      - Siren
      - Music
      - Speech
      - Footsteps
      - Door knock
      - Glass breaking
      - Gunshot
      - Baby cry
      - Other
      - Silence/noise only
```

## Temporal Sound Event Detection

Mark when events occur:

```yaml
annotation_task_name: "Sound Event Detection"

data_files:
  - data/recordings.json

item_properties:
  audio_path: audio_path

annotation_schemes:
  - annotation_type: audio_annotation
    audio_display: waveform
    height: 150
    waveform_color: "#6366F1"
    progress_color: "#A5B4FC"
    show_timestamps: true
    enable_regions: true
    speed_control: true
    name: events
    description: "Mark all sound events with timestamps"
    labels:
      - name: speech
        color: "#3B82F6"
      - name: music
        color: "#8B5CF6"
      - name: vehicle
        color: "#EF4444"
      - name: animal
        color: "#F59E0B"
      - name: nature
        color: "#10B981"
      - name: mechanical
        color: "#6B7280"
    allow_overlap: true
    min_duration: 0.1
```

## Complete Audio Event Configuration

```yaml
annotation_task_name: "AudioSet-Style Event Detection"

data_files:
  - data/audio_10sec.json

item_properties:
  audio_path: audio_url

annotation_schemes:
  # Temporal event marking with audio playback
  - annotation_type: audio_annotation
    audio_display: waveform
    waveform_color: "#059669"
    progress_color: "#34D399"
    cursor_color: "#F59E0B"
    height: 128
    show_timestamps: true
    time_format: "ss.ms"
    show_duration: true
    speed_control: true
    speed_options: [0.5, 0.75, 1.0, 1.5]
    enable_regions: true
    region_snap: 0.05
    name: sound_events
    description: "Mark all distinct sound events"
    labels:
      # Human sounds
      - name: Speech
        color: "#3B82F6"
        keyboard_shortcut: "1"
        category: human
      - name: Singing
        color: "#8B5CF6"
        keyboard_shortcut: "2"
        category: human
      - name: Laughter
        color: "#EC4899"
        category: human
      - name: Cough/Sneeze
        color: "#F472B6"
        category: human

      # Music
      - name: Music
        color: "#A855F7"
        keyboard_shortcut: "m"
        category: music
      - name: Musical instrument
        color: "#7C3AED"
        category: music

      # Animals
      - name: Dog
        color: "#F59E0B"
        keyboard_shortcut: "d"
        category: animal
      - name: Cat
        color: "#FBBF24"
        category: animal
      - name: Bird
        color: "#FCD34D"
        category: animal

      # Vehicles
      - name: Car
        color: "#EF4444"
        keyboard_shortcut: "c"
        category: vehicle
      - name: Motorcycle
        color: "#DC2626"
        category: vehicle
      - name: Siren
        color: "#B91C1C"
        category: vehicle
      - name: Aircraft
        color: "#991B1B"
        category: vehicle

      # Environment
      - name: Rain
        color: "#06B6D4"
        category: nature
      - name: Thunder
        color: "#0891B2"
        category: nature
      - name: Wind
        color: "#0E7490"
        category: nature
      - name: Water
        color: "#0D9488"
        category: nature

      # Domestic
      - name: Door
        color: "#84CC16"
        category: domestic
      - name: Alarm
        color: "#65A30D"
        category: domestic
      - name: Appliance
        color: "#4D7C0F"
        category: domestic

      # Other
      - name: Noise/Unknown
        color: "#6B7280"
        keyboard_shortcut: "n"
        category: other

    allow_overlap: true
    min_duration: 0.1
    show_labels_on_waveform: true

    # Segment attributes
    segment_attributes:
      - name: confidence
        type: radio
        options: [Clear, Moderate, Faint]
      - name: foreground
        type: checkbox
        description: "Is this the main/foreground sound?"

  # Clip-level tags (weak labels)
  - annotation_type: multiselect
    name: clip_tags
    description: "What sounds are present anywhere in this clip?"
    labels:
      - Speech
      - Music
      - Vehicle sounds
      - Animal sounds
      - Nature sounds
      - Domestic sounds
      - Silence
    min_selections: 1

  # Audio quality
  - annotation_type: radio
    name: quality
    description: "Recording quality"
    labels:
      - Clean (clear sounds)
      - Moderate noise
      - Very noisy
      - Distorted/clipped

annotation_guidelines:
  title: "Sound Event Detection Guide"
  content: |
    ## Your Task
    Mark the START and END times of each distinct sound event.

    ## Event Detection Rules
    - Mark sounds that are clearly audible
    - Include overlapping sounds (use multiple labels)
    - Short sounds (<100ms) may be a single point

    ## Segment Boundaries
    - Start: When sound becomes audible
    - End: When sound fades or stops

    ## Confidence Levels
    - Clear: Easily identifiable
    - Moderate: Reasonably sure
    - Faint: Background, hard to identify

    ## Foreground vs Background
    - Foreground: Main focus of audio
    - Background: Ambient sounds

```

## Output Format

```json
{
  "id": "clip_001",
  "audio_url": "/audio/street_scene.wav",
  "duration": 10.0,
  "annotations": {
    "sound_events": [
      {
        "label": "Speech",
        "start": 0.5,
        "end": 3.2,
        "attributes": {
          "confidence": "Clear",
          "foreground": true
        }
      },
      {
        "label": "Car",
        "start": 1.8,
        "end": 4.5,
        "attributes": {
          "confidence": "Moderate",
          "foreground": false
        }
      },
      {
        "label": "Dog",
        "start": 6.1,
        "end": 6.8,
        "attributes": {
          "confidence": "Clear",
          "foreground": true
        }
      }
    ],
    "clip_tags": ["Speech", "Vehicle sounds", "Animal sounds"],
    "quality": "Moderate noise"
  }
}
```

## Pre-annotation with Detector

You can seed the interface with model predictions so annotators correct rather than start from scratch:

```yaml
pre_annotation:
  enabled: true
  field: detected_events
  show_confidence: true
  confidence_threshold: 0.3
  allow_modification: true
```

## Tips for Audio Event Annotation

Good headphones make a real difference here, since a lot of events are faint, and a quiet room helps too. Most annotators work in two passes: one to spot the events, a second to tighten the timestamps. Dropping playback to 0.5x makes boundaries much easier to place. Decide up front what counts as "audible" so everyone draws the line in the same place.

## Next Steps

- Add [music classification](/blog/music-genre-classification) for music content
- Learn [speaker diarization](/blog/speaker-diarization-annotation) for speech
- Set up [quality control](/blog/quality-control-strategies) for event detection

---

*Full audio documentation at [/docs/features/audio-annotation](/docs/features/audio-annotation).*
