Audio event detection is about finding specific sounds inside a recording: a dog barking, a siren, a stretch of music, a door slamming. This tutorial covers timestamp-based annotation for training sound recognition models. For the audio configuration options behind it, see the audio annotation documentation.

Types of Audio Event Annotation

The simplest form is clip-level tagging, where you label a whole clip with the sounds it contains. Temporal detection goes further and marks the start and end of each event. People usually call precise per-event timestamps "strong" labeling and presence-or-absence without timing "weak" labeling. Which one you want depends on what your model needs to learn.
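
To make the strong/weak distinction concrete, here is a minimal Python sketch that collapses strong (timestamped) events into weak clip-level tags. The event dictionaries follow the `label` / `start` / `end` shape used in the output format later in this tutorial; the `strong_to_weak` helper is a hypothetical name for illustration, not part of any annotation tool.

```python
from typing import Dict, List


def strong_to_weak(events: List[Dict]) -> List[str]:
    """Collapse timestamped (strong) events into clip-level (weak) tags."""
    # A dict keeps first-seen order while dropping duplicate labels;
    # all timing information is discarded on purpose.
    seen = {}
    for event in events:
        seen.setdefault(event["label"], None)
    return list(seen)


strong = [
    {"label": "Speech", "start": 0.5, "end": 3.2},
    {"label": "Car", "start": 1.8, "end": 4.5},
    {"label": "Speech", "start": 5.0, "end": 6.0},
]
weak = strong_to_weak(strong)
print(weak)  # ['Speech', 'Car']
```

Going the other way is not possible: weak labels cannot be turned back into timestamps, which is why strong labeling costs more annotation time.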

Clip-Level Sound Tagging

For short clips that contain a single event, one label per clip is enough:

yaml

annotation_task_name: "Sound Event Classification"
 
data_files:
  - data/audio_clips.json
 
item_properties:
  audio_path: audio_path
 
annotation_schemes:
  - annotation_type: audio_annotation
    audio_display: waveform
    waveform_color: "#10B981"
    progress_color: "#34D399"
    name: sound_class
    description: "What sound is in this clip?"
    labels:
      - Dog bark
      - Car horn
      - Siren
      - Music
      - Speech
      - Footsteps
      - Door knock
      - Glass breaking
      - Gunshot
      - Baby cry
      - Other
      - Silence/noise only

Temporal Sound Event Detection

For longer recordings, enable regions so annotators can mark when each event starts and ends:

yaml

annotation_task_name: "Sound Event Detection"
 
data_files:
  - data/recordings.json
 
item_properties:
  audio_path: audio_path
 
annotation_schemes:
  - annotation_type: audio_annotation
    audio_display: waveform
    height: 150
    waveform_color: "#6366F1"
    progress_color: "#A5B4FC"
    show_timestamps: true
    enable_regions: true
    speed_control: true
    name: events
    description: "Mark all sound events with timestamps"
    labels:
      - name: speech
        color: "#3B82F6"
      - name: music
        color: "#8B5CF6"
      - name: vehicle
        color: "#EF4444"
      - name: animal
        color: "#F59E0B"
      - name: nature
        color: "#10B981"
      - name: mechanical
        color: "#6B7280"
    allow_overlap: true
    min_duration: 0.1
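
If you want to sanity-check exported regions against the same constraints, the sketch below shows one way to read `min_duration` and `allow_overlap`. It assumes `min_duration` is in seconds and that regions are dictionaries with `label`, `start`, and `end` keys. The helper names are hypothetical, and this is not how the annotation tool itself enforces the settings.

```python
from typing import Dict, List, Tuple

MIN_DURATION = 0.1  # seconds, mirrors min_duration in the config above
EPS = 1e-9          # tolerance so 0.6 - 0.5 is not flagged by float rounding


def too_short(regions: List[Dict], min_duration: float = MIN_DURATION) -> List[Dict]:
    """Return regions whose length is below the minimum duration."""
    return [r for r in regions if r["end"] - r["start"] < min_duration - EPS]


def overlapping_pairs(regions: List[Dict]) -> List[Tuple[str, str]]:
    """Return label pairs for regions that overlap in time."""
    ordered = sorted(regions, key=lambda r: r["start"])
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b["start"] >= a["end"]:
                break  # later regions start even later, so none overlap `a`
            pairs.append((a["label"], b["label"]))
    return pairs


regions = [
    {"label": "speech", "start": 0.5, "end": 3.2},
    {"label": "vehicle", "start": 1.8, "end": 4.5},
    {"label": "animal", "start": 6.10, "end": 6.15},
]
print([r["label"] for r in too_short(regions)])  # ['animal']
print(overlapping_pairs(regions))                # [('speech', 'vehicle')]
```

With `allow_overlap: true`, pairs like the speech/vehicle one are expected rather than errors, so a check like this is mostly useful for spotting accidental duplicates.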

Complete Audio Event Configuration

yaml

annotation_task_name: "AudioSet-Style Event Detection"
 
data_files:
  - data/audio_10sec.json
 
item_properties:
  audio_path: audio_url
 
annotation_schemes:
  # Temporal event marking with audio playback
  - annotation_type: audio_annotation
    audio_display: waveform
    waveform_color: "#059669"
    progress_color: "#34D399"
    cursor_color: "#F59E0B"
    height: 128
    show_timestamps: true
    time_format: "ss.ms"
    show_duration: true
    speed_control: true
    speed_options: [0.5, 0.75, 1.0, 1.5]
    enable_regions: true
    region_snap: 0.05
    name: sound_events
    description: "Mark all distinct sound events"
    labels:
      # Human sounds
      - name: Speech
        color: "#3B82F6"
        keyboard_shortcut: "1"
        category: human
      - name: Singing
        color: "#8B5CF6"
        keyboard_shortcut: "2"
        category: human
      - name: Laughter
        color: "#EC4899"
        category: human
      - name: Cough/Sneeze
        color: "#F472B6"
        category: human
 
      # Music
      - name: Music
        color: "#A855F7"
        keyboard_shortcut: "m"
        category: music
      - name: Musical instrument
        color: "#7C3AED"
        category: music
 
      # Animals
      - name: Dog
        color: "#F59E0B"
        keyboard_shortcut: "d"
        category: animal
      - name: Cat
        color: "#FBBF24"
        category: animal
      - name: Bird
        color: "#FCD34D"
        category: animal
 
      # Vehicles
      - name: Car
        color: "#EF4444"
        keyboard_shortcut: "c"
        category: vehicle
      - name: Motorcycle
        color: "#DC2626"
        category: vehicle
      - name: Siren
        color: "#B91C1C"
        category: vehicle
      - name: Aircraft
        color: "#991B1B"
        category: vehicle
 
      # Environment
      - name: Rain
        color: "#06B6D4"
        category: nature
      - name: Thunder
        color: "#0891B2"
        category: nature
      - name: Wind
        color: "#0E7490"
        category: nature
      - name: Water
        color: "#0D9488"
        category: nature
 
      # Domestic
      - name: Door
        color: "#84CC16"
        category: domestic
      - name: Alarm
        color: "#65A30D"
        category: domestic
      - name: Appliance
        color: "#4D7C0F"
        category: domestic
 
      # Other
      - name: Noise/Unknown
        color: "#6B7280"
        keyboard_shortcut: "n"
        category: other
 
    allow_overlap: true
    min_duration: 0.1
    show_labels_on_waveform: true
 
    # Segment attributes
    segment_attributes:
      - name: confidence
        type: radio
        options: [Clear, Moderate, Faint]
      - name: foreground
        type: checkbox
        description: "Is this the main/foreground sound?"
 
  # Clip-level tags (weak labels)
  - annotation_type: multiselect
    name: clip_tags
    description: "What sounds are present anywhere in this clip?"
    labels:
      - Speech
      - Music
      - Vehicle sounds
      - Animal sounds
      - Nature sounds
      - Domestic sounds
      - Silence
    min_selections: 1
 
  # Audio quality
  - annotation_type: radio
    name: quality
    description: "Recording quality"
    labels:
      - Clean (clear sounds)
      - Moderate noise
      - Very noisy
      - Distorted/clipped
 
annotation_guidelines:
  title: "Sound Event Detection Guide"
  content: |
    ## Your Task
    Mark the START and END times of each distinct sound event.
 
    ## Event Detection Rules
    - Mark sounds that are clearly audible
    - Include overlapping sounds (draw a separate region for each label)
    - Very short sounds (<100ms): mark the shortest region allowed (0.1 s)
 
    ## Segment Boundaries
    - Start: When sound becomes audible
    - End: When sound fades or stops
 
    ## Confidence Levels
    - Clear: Easily identifiable
    - Moderate: Reasonably sure
    - Faint: Background, hard to identify
 
    ## Foreground vs Background
    - Foreground: Main focus of audio
    - Background: Ambient sounds

Output Format

json

{
  "id": "clip_001",
  "audio_url": "/audio/street_scene.wav",
  "duration": 10.0,
  "annotations": {
    "sound_events": [
      {
        "label": "Speech",
        "start": 0.5,
        "end": 3.2,
        "attributes": {
          "confidence": "Clear",
          "foreground": true
        }
      },
      {
        "label": "Car",
        "start": 1.8,
        "end": 4.5,
        "attributes": {
          "confidence": "Moderate",
          "foreground": false
        }
      },
      {
        "label": "Dog",
        "start": 6.1,
        "end": 6.8,
        "attributes": {
          "confidence": "Clear",
          "foreground": true
        }
      }
    ],
    "clip_tags": ["Speech", "Vehicle sounds", "Animal sounds"],
    "quality": "Moderate noise"
  }
}
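
Most sound event detection models train on frame-level targets rather than raw timestamps. As a rough sketch, the Python below turns the `sound_events` list from a record like the one above into a multi-hot matrix with one row per frame. The 0.5 s hop, the label order, and the `events_to_frames` helper are assumptions chosen for illustration; pick a hop that matches your model's frame rate.

```python
import json
import math

# Trimmed copy of the record above (attributes omitted for brevity).
RECORD = json.loads("""
{
  "duration": 10.0,
  "annotations": {
    "sound_events": [
      {"label": "Speech", "start": 0.5, "end": 3.2},
      {"label": "Car", "start": 1.8, "end": 4.5},
      {"label": "Dog", "start": 6.1, "end": 6.8}
    ]
  }
}
""")
LABELS = ["Speech", "Car", "Dog"]  # assumed column order
HOP = 0.5                          # assumed frame length in seconds


def events_to_frames(events, labels, duration, hop=HOP):
    """Build a frames x labels multi-hot matrix from timestamped events."""
    n_frames = math.ceil(duration / hop)
    index = {label: i for i, label in enumerate(labels)}
    targets = [[0] * len(labels) for _ in range(n_frames)]
    for event in events:
        col = index[event["label"]]
        # A frame is active if the event overlaps any part of it.
        first = max(0, int(event["start"] / hop))
        last = min(n_frames, math.ceil(event["end"] / hop))
        for frame in range(first, last):
            targets[frame][col] = 1
    return targets


targets = events_to_frames(
    RECORD["annotations"]["sound_events"], LABELS, RECORD["duration"]
)
print(len(targets))  # 20
print(targets[3])    # [1, 1, 0]  (speech and car overlap at 1.5-2.0 s)
```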

Pre-annotation with a Detector

You can seed the interface with model predictions so annotators correct rather than start from scratch:

yaml

pre_annotation:
  enabled: true
  field: detected_events
  show_confidence: true
  confidence_threshold: 0.3
  allow_modification: true
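
The predictions have to be present in each data item under the configured field. The sketch below shows one way to prepare them in Python. The per-event keys (`label`, `start`, `end`, `confidence`) mirror the output format above and are an assumption, since the exact schema expected under `detected_events` is not spelled out here. The `attach_predictions` helper is hypothetical. Filtering in advance is optional if the interface already applies `confidence_threshold`.

```python
import json

CONFIDENCE_THRESHOLD = 0.3  # mirrors confidence_threshold in the config above


def attach_predictions(item, predictions, threshold=CONFIDENCE_THRESHOLD):
    """Return a copy of `item` with confident predictions attached."""
    kept = [p for p in predictions if p["confidence"] >= threshold]
    enriched = dict(item)  # shallow copy so the original item is untouched
    # Assumed field name, taken from `field: detected_events` in the config.
    enriched["detected_events"] = sorted(kept, key=lambda p: p["start"])
    return enriched


item = {"id": "clip_001", "audio_url": "/audio/street_scene.wav"}
predictions = [
    {"label": "Speech", "start": 0.5, "end": 3.2, "confidence": 0.91},
    {"label": "Dog", "start": 6.1, "end": 6.8, "confidence": 0.30},
    {"label": "Car", "start": 1.8, "end": 4.5, "confidence": 0.12},
]
enriched = attach_predictions(item, predictions)
print(json.dumps(enriched, indent=2))
```

A low threshold such as 0.3 leans toward showing too many candidates: deleting a wrong region is usually quicker for an annotator than finding a missed one.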

Tips for Audio Event Annotation

Good headphones make a real difference here, since a lot of events are faint, and a quiet room helps too. Most annotators work in two passes: one to spot the events, a second to tighten the timestamps. Dropping playback to 0.5x makes boundaries much easier to place. Decide up front what counts as "audible" so everyone draws the line in the same place.
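
One way to check whether annotators really do draw the line in the same place is to compare their regions for the same event. The Python sketch below computes temporal intersection-over-union (IoU) for two `(start, end)` pairs in seconds. The function name and the example values are illustrative assumptions, not part of any tool.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time segments (0.0 if disjoint)."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


annotator_a = (0.5, 3.2)  # same speech event, marked by two people
annotator_b = (0.7, 3.0)
print(f"{temporal_iou(annotator_a, annotator_b):.2f}")  # 0.85
```

Low IoU on events that both annotators found usually points to a disagreement about where "audible" begins, which is a guideline problem rather than an annotator problem.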

Next Steps

Add music classification for music content
Learn speaker diarization for speech
Set up quality control for event detection

The full audio documentation is at /docs/features/audio-annotation.