
Audio Event Detection and Tagging

Set up annotation for detecting specific sounds like speech, music, applause, or environmental noises with timestamp spans.

By Potato Team

Audio event detection identifies specific sounds within recordings, from speech and music to environmental and other acoustic events. This tutorial covers timestamp-based annotation for training sound recognition models.

Types of Audio Event Annotation

  1. Clip-level tagging: Assign one or more labels to an entire audio clip
  2. Temporal detection: Mark the start and end time of each event
  3. Strong labeling: Record precise onset and offset timestamps for every event
  4. Weak labeling: Record only whether a sound is present, without timestamps
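
The relationship between strong and weak labels can be made concrete with a small sketch: weak labels are what remains of strong labels once the timestamps are discarded. The example assumes strong labels are stored as (label, start, end) tuples in seconds; the `Event` type and the `to_weak_labels` helper are hypothetical names used for illustration and are not part of Potato.

```python
from typing import NamedTuple


class Event(NamedTuple):
    """One strongly labeled event (hypothetical structure for illustration)."""
    label: str
    start: float  # onset in seconds
    end: float    # offset in seconds


def to_weak_labels(events: list[Event]) -> list[str]:
    """Collapse strong labels into a sorted list of clip-level tags."""
    return sorted({event.label for event in events})


strong = [
    Event("Speech", 0.5, 3.2),
    Event("Car", 1.8, 4.5),
    Event("Speech", 5.0, 6.0),
]
weak = to_weak_labels(strong)
print(weak)  # ['Car', 'Speech']
```

The conversion only works in one direction: weak labels cannot be turned back into timestamps, which is why strong labeling costs more annotation time.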

Clip-Level Sound Tagging

For short clips with single events:

annotation_task_name: "Sound Event Classification"
 
data_files:
  - data/audio_clips.json
 
item_properties:
  audio_path: audio_path
 
annotation_schemes:
  - annotation_type: audio_annotation
    audio_display: waveform
    waveform_color: "#10B981"
    progress_color: "#34D399"
    name: sound_class
    description: "What sound is in this clip?"
    labels:
      - Dog bark
      - Car horn
      - Siren
      - Music
      - Speech
      - Footsteps
      - Door knock
      - Glass breaking
      - Gunshot
      - Baby cry
      - Other
      - Silence/noise only
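
The config above reads items from `data/audio_clips.json` and locates each clip through the `audio_path` field. The following is a minimal sketch of generating such items. The `id` key, the one-record-per-clip layout, and the `build_items` helper are assumptions made for illustration; check the Potato data format documentation for the exact file layout your version expects.

```python
import json
from pathlib import PurePosixPath


def build_items(audio_paths: list[str]) -> list[dict]:
    """Build one record per clip, using the file stem as a (hypothetical) id."""
    items = []
    for path in sorted(audio_paths):
        clip_id = PurePosixPath(path).stem
        items.append({"id": clip_id, "audio_path": path})
    return items


# Example paths are placeholders, not files shipped with the tutorial.
items = build_items(["audio/dog_01.wav", "audio/car_07.wav"])
print(json.dumps(items, indent=2))
```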

Temporal Sound Event Detection

Mark the start and end time of each event:

annotation_task_name: "Sound Event Detection"
 
data_files:
  - data/recordings.json
 
item_properties:
  audio_path: audio_path
 
annotation_schemes:
  - annotation_type: audio_annotation
    audio_display: waveform
    height: 150
    waveform_color: "#6366F1"
    progress_color: "#A5B4FC"
    show_timestamps: true
    enable_regions: true
    speed_control: true
    name: events
    description: "Mark all sound events with timestamps"
    labels:
      - name: speech
        color: "#3B82F6"
      - name: music
        color: "#8B5CF6"
      - name: vehicle
        color: "#EF4444"
      - name: animal
        color: "#F59E0B"
      - name: nature
        color: "#10B981"
      - name: mechanical
        color: "#6B7280"
    allow_overlap: true
    min_duration: 0.1

Complete Audio Event Configuration

annotation_task_name: "AudioSet-Style Event Detection"
 
data_files:
  - data/audio_10sec.json
 
item_properties:
  audio_path: audio_url
 
annotation_schemes:
  # Temporal event marking with audio playback
  - annotation_type: audio_annotation
    audio_display: waveform
    waveform_color: "#059669"
    progress_color: "#34D399"
    cursor_color: "#F59E0B"
    height: 128
    show_timestamps: true
    time_format: "ss.ms"
    show_duration: true
    speed_control: true
    speed_options: [0.5, 0.75, 1.0, 1.5]
    enable_regions: true
    region_snap: 0.05
    name: sound_events
    description: "Mark all distinct sound events"
    labels:
      # Human sounds
      - name: Speech
        color: "#3B82F6"
        keyboard_shortcut: "1"
        category: human
      - name: Singing
        color: "#8B5CF6"
        keyboard_shortcut: "2"
        category: human
      - name: Laughter
        color: "#EC4899"
        category: human
      - name: Cough/Sneeze
        color: "#F472B6"
        category: human
 
      # Music
      - name: Music
        color: "#A855F7"
        keyboard_shortcut: "m"
        category: music
      - name: Musical instrument
        color: "#7C3AED"
        category: music
 
      # Animals
      - name: Dog
        color: "#F59E0B"
        keyboard_shortcut: "d"
        category: animal
      - name: Cat
        color: "#FBBF24"
        category: animal
      - name: Bird
        color: "#FCD34D"
        category: animal
 
      # Vehicles
      - name: Car
        color: "#EF4444"
        keyboard_shortcut: "c"
        category: vehicle
      - name: Motorcycle
        color: "#DC2626"
        category: vehicle
      - name: Siren
        color: "#B91C1C"
        category: vehicle
      - name: Aircraft
        color: "#991B1B"
        category: vehicle
 
      # Environment
      - name: Rain
        color: "#06B6D4"
        category: nature
      - name: Thunder
        color: "#0891B2"
        category: nature
      - name: Wind
        color: "#0E7490"
        category: nature
      - name: Water
        color: "#0D9488"
        category: nature
 
      # Domestic
      - name: Door
        color: "#84CC16"
        category: domestic
      - name: Alarm
        color: "#65A30D"
        category: domestic
      - name: Appliance
        color: "#4D7C0F"
        category: domestic
 
      # Other
      - name: Noise/Unknown
        color: "#6B7280"
        keyboard_shortcut: "n"
        category: other
 
    allow_overlap: true
    min_duration: 0.1
    show_labels_on_waveform: true
 
    # Segment attributes
    segment_attributes:
      - name: confidence
        type: radio
        options: [Clear, Moderate, Faint]
      - name: foreground
        type: checkbox
        description: "Is this the main/foreground sound?"
 
  # Clip-level tags (weak labels)
  - annotation_type: multiselect
    name: clip_tags
    description: "What sounds are present anywhere in this clip?"
    labels:
      - Speech
      - Music
      - Vehicle sounds
      - Animal sounds
      - Nature sounds
      - Domestic sounds
      - Silence
    min_selections: 1
 
  # Audio quality
  - annotation_type: radio
    name: quality
    description: "Recording quality"
    labels:
      - Clean (clear sounds)
      - Moderate noise
      - Very noisy
      - Distorted/clipped
 
annotation_guidelines:
  title: "Sound Event Detection Guide"
  content: |
    ## Your Task
    Mark the START and END times of each distinct sound event.
 
    ## Event Detection Rules
    - Mark sounds that are clearly audible
    - Include overlapping sounds (use multiple labels)
    - Short sounds (<100ms) may be a single point
 
    ## Segment Boundaries
    - Start: When sound becomes audible
    - End: When sound fades or stops
 
    ## Confidence Levels
    - Clear: Easily identifiable
    - Moderate: Reasonably sure
    - Faint: Background, hard to identify
 
    ## Foreground vs Background
    - Foreground: Main focus of audio
    - Background: Ambient sounds
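
With this many labels and hand-assigned `keyboard_shortcut` values, a duplicated key is easy to introduce. The sketch below checks a label list for clashes. It assumes the labels have already been loaded into Python dicts with the same `name` and `keyboard_shortcut` keys as the config; `duplicate_shortcuts` is a hypothetical helper, and treating shortcuts as case-insensitive is an assumption.

```python
from collections import defaultdict


def duplicate_shortcuts(labels):
    """Map each shortcut used more than once to the label names sharing it."""
    seen = defaultdict(list)
    for label in labels:
        key = label.get("keyboard_shortcut")
        if key is not None:  # labels without a shortcut are skipped
            seen[key.lower()].append(label["name"])
    return {key: names for key, names in seen.items() if len(names) > 1}


labels = [
    {"name": "Speech", "keyboard_shortcut": "1"},
    {"name": "Music", "keyboard_shortcut": "m"},
    {"name": "Motorcycle", "keyboard_shortcut": "m"},  # deliberate clash for the demo
    {"name": "Cat"},
]
print(duplicate_shortcuts(labels))  # {'m': ['Music', 'Motorcycle']}
```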
 

Output Format

{
  "id": "clip_001",
  "audio_url": "/audio/street_scene.wav",
  "duration": 10.0,
  "annotations": {
    "sound_events": [
      {
        "label": "Speech",
        "start": 0.5,
        "end": 3.2,
        "attributes": {
          "confidence": "Clear",
          "foreground": true
        }
      },
      {
        "label": "Car",
        "start": 1.8,
        "end": 4.5,
        "attributes": {
          "confidence": "Moderate",
          "foreground": false
        }
      },
      {
        "label": "Dog",
        "start": 6.1,
        "end": 6.8,
        "attributes": {
          "confidence": "Clear",
          "foreground": true
        }
      }
    ],
    "clip_tags": ["Speech", "Vehicle sounds", "Animal sounds"],
    "quality": "Moderate noise"
  }
}
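
A common next step for strong labels is converting them into frame-level multi-hot targets for model training. The sketch below assumes a record with `duration` and `annotations.sound_events` entries carrying `label`, `start`, and `end`, as in the example output. The `frame_targets` helper, the class list, and the 0.5-second hop are illustrative choices, not something Potato provides.

```python
import math


def frame_targets(record, classes, hop=0.5):
    """Return a frames x classes grid; 1 where a class is active in a frame."""
    # round() absorbs float noise such as 3.2 / 0.1 == 32.00000000000001
    n_frames = math.ceil(round(record["duration"] / hop, 6))
    index = {name: i for i, name in enumerate(classes)}
    grid = [[0] * len(classes) for _ in range(n_frames)]
    for event in record["annotations"]["sound_events"]:
        first = math.floor(round(event["start"] / hop, 6))
        last = min(n_frames, math.ceil(round(event["end"] / hop, 6)))
        for frame in range(first, last):
            grid[frame][index[event["label"]]] = 1
    return grid


record = {
    "duration": 10.0,
    "annotations": {
        "sound_events": [
            {"label": "Speech", "start": 0.5, "end": 3.2},
            {"label": "Car", "start": 1.8, "end": 4.5},
            {"label": "Dog", "start": 6.1, "end": 6.8},
        ]
    },
}
classes = ["Speech", "Car", "Dog"]
grid = frame_targets(record, classes)

for col, name in enumerate(classes):
    active = [frame for frame, row in enumerate(grid) if row[col]]
    print(name, active)
```

A frame is marked active if the event touches any part of it, so overlapping events such as Speech and Car simply set two columns in the same row.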

Pre-annotation with Detector

Use model predictions as a starting point:

pre_annotation:
  enabled: true
  field: detected_events
  show_confidence: true
  confidence_threshold: 0.3
  allow_modification: true
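
To illustrate what a `confidence_threshold` of 0.3 does, the sketch below filters detector output before attaching it to an item under the `detected_events` field named in the config. The structure of each prediction (`label`, `start`, `end`, `confidence`) and the `attach_predictions` helper are assumptions for illustration; the exact format Potato expects in that field may differ.

```python
def attach_predictions(item, predictions, threshold=0.3):
    """Return a copy of item with predictions at or above threshold attached."""
    kept = [p for p in predictions if p["confidence"] >= threshold]
    return {**item, "detected_events": kept}


# Hypothetical detector output for one clip
predictions = [
    {"label": "Speech", "start": 0.4, "end": 3.1, "confidence": 0.92},
    {"label": "Car", "start": 1.9, "end": 4.4, "confidence": 0.41},
    {"label": "Bird", "start": 7.0, "end": 7.3, "confidence": 0.12},
]
item = attach_predictions({"id": "clip_001"}, predictions)
print([p["label"] for p in item["detected_events"]])  # ['Speech', 'Car']
```

A low threshold keeps more candidate regions for annotators to confirm or delete; a higher one shows fewer regions but risks hiding faint events the detector was unsure about.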

Tips for Audio Event Annotation

  1. Good headphones: Essential for hearing faint or subtle sounds
  2. Quiet environment: Background noise in the room affects what annotators perceive
  3. Multiple passes: Identify events on the first pass, then refine timestamps on the second
  4. Slow playback: Use 0.5x speed to place boundaries precisely
  5. Consistent criteria: Define the "audible" threshold clearly before annotation begins

Next Steps

Full audio documentation is available at /docs/features/audio-annotation.