
Audio Annotation

Segment audio files and assign labels to time regions with waveform visualization.

Potato's audio annotation tool enables annotators to segment audio files and assign labels to time regions through a waveform-based interface.

Features

  • Waveform visualization
  • Time-based segment creation
  • Label assignment to segments
  • Playback controls with variable speed
  • Zoom and scroll navigation
  • Keyboard shortcuts
  • Server-side waveform caching

Basic Configuration

```yaml
annotation_schemes:
  - name: "speakers"
    description: "Mark when each speaker is talking"
    annotation_type: "audio_annotation"
    labels:
      - name: "Speaker 1"
        color: "#3B82F6"
      - name: "Speaker 2"
        color: "#10B981"
```

Configuration Options

| Field | Type | Default | Description |
|---|---|---|---|
| `name` | string | Required | Unique identifier for the annotation |
| `description` | string | Required | Instructions shown to annotators |
| `annotation_type` | string | Required | Must be `"audio_annotation"` |
| `mode` | string | `"label"` | Annotation mode: `"label"`, `"questions"`, or `"both"` |
| `labels` | list | Conditional | Required for `label` or `both` modes |
| `segment_schemes` | list | Conditional | Required for `questions` or `both` modes |
| `min_segments` | integer | `0` | Minimum number of segments required |
| `max_segments` | integer | `null` | Maximum segments allowed (`null` = unlimited) |
| `zoom_enabled` | boolean | `true` | Enable zoom controls |
| `playback_rate_control` | boolean | `false` | Show playback speed selector |
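The Required/Conditional rules in this table can be enforced before a task is launched. The following sketch is illustrative only; `validate_scheme` is a hypothetical helper, not part of Potato:

```python
def validate_scheme(scheme: dict) -> list[str]:
    """Return a list of configuration problems for one audio_annotation scheme."""
    errors = []
    # name, description, and annotation_type are always required
    for field in ("name", "description", "annotation_type"):
        if field not in scheme:
            errors.append(f"missing required field: {field}")
    if scheme.get("annotation_type") != "audio_annotation":
        errors.append('annotation_type must be "audio_annotation"')
    mode = scheme.get("mode", "label")  # "label" is the documented default
    if mode not in ("label", "questions", "both"):
        errors.append(f"unknown mode: {mode}")
    # labels and segment_schemes are conditionally required by mode
    if mode in ("label", "both") and not scheme.get("labels"):
        errors.append('labels are required for "label" and "both" modes')
    if mode in ("questions", "both") and not scheme.get("segment_schemes"):
        errors.append('segment_schemes are required for "questions" and "both" modes')
    return errors
```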

Label Configuration

```yaml
labels:
  - name: "speech"
    color: "#3B82F6"
    key_value: "1"
  - name: "music"
    color: "#10B981"
    key_value: "2"
  - name: "silence"
    color: "#64748B"
    key_value: "3"
```

Annotation Modes

Label Mode (Default)

Segments receive category labels:

```yaml
annotation_schemes:
  - name: "emotion"
    description: "Label the emotion in each segment"
    annotation_type: "audio_annotation"
    mode: "label"
    labels:
      - name: "happy"
        color: "#22C55E"
      - name: "sad"
        color: "#3B82F6"
      - name: "angry"
        color: "#EF4444"
      - name: "neutral"
        color: "#64748B"
```

Questions Mode

Annotators answer a dedicated set of questions for each segment:

```yaml
annotation_schemes:
  - name: "transcription"
    description: "Transcribe each segment"
    annotation_type: "audio_annotation"
    mode: "questions"
    segment_schemes:
      - name: "transcript"
        annotation_type: "text"
        description: "Enter the transcription"
      - name: "confidence"
        annotation_type: "likert"
        description: "How confident are you?"
        size: 5
```

Both Mode

Combines labeling with per-segment questionnaires:

```yaml
annotation_schemes:
  - name: "detailed_diarization"
    description: "Label speakers and add notes"
    annotation_type: "audio_annotation"
    mode: "both"
    labels:
      - name: "Speaker A"
        color: "#3B82F6"
      - name: "Speaker B"
        color: "#10B981"
    segment_schemes:
      - name: "notes"
        annotation_type: "text"
        description: "Any notes about this segment?"
```

Global Audio Configuration

Configure waveform handling in your config file:

```yaml
audio_annotation:
  waveform_cache_dir: "waveform_cache/"
  waveform_look_ahead: 5
  waveform_cache_max_size: 1000
  client_fallback_max_duration: 1800
```
| Field | Description |
|---|---|
| `waveform_cache_dir` | Directory for cached waveform data |
| `waveform_look_ahead` | Number of upcoming instances to pre-compute |
| `waveform_cache_max_size` | Maximum number of cached waveform files |
| `client_fallback_max_duration` | Maximum seconds for browser-side waveform generation (default: `1800`) |
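To illustrate what a waveform cache stores, here is a rough sketch of peak extraction for a 16-bit mono WAV file using only the Python standard library. Potato's actual cache format and pipeline may differ; `waveform_peaks` is a hypothetical helper:

```python
import array
import wave

def waveform_peaks(path: str, buckets: int = 800) -> list[tuple[int, int]]:
    """Downsample a 16-bit mono WAV file to (min, max) peak pairs per bucket,
    roughly the shape of data a server-side waveform cache would store."""
    with wave.open(path, "rb") as wav:
        # Sketch assumption: mono, 16-bit PCM. Stereo files would need
        # de-interleaving before bucketing.
        assert wav.getnchannels() == 1 and wav.getsampwidth() == 2
        samples = array.array("h", wav.readframes(wav.getnframes()))
    step = max(1, len(samples) // buckets)
    return [(min(chunk), max(chunk))
            for chunk in (samples[i:i + step] for i in range(0, len(samples), step))]
```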

Examples

Speaker Diarization

```yaml
annotation_schemes:
  - name: "diarization"
    description: "Identify who is speaking at each moment"
    annotation_type: "audio_annotation"
    mode: "label"
    labels:
      - name: "Interviewer"
        color: "#8B5CF6"
        key_value: "1"
      - name: "Guest"
        color: "#EC4899"
        key_value: "2"
      - name: "Overlap"
        color: "#F59E0B"
        key_value: "3"
    zoom_enabled: true
    playback_rate_control: true
```

Sound Event Detection

```yaml
annotation_schemes:
  - name: "sound_events"
    description: "Mark all sound events"
    annotation_type: "audio_annotation"
    labels:
      - name: "speech"
        color: "#3B82F6"
      - name: "music"
        color: "#10B981"
      - name: "applause"
        color: "#F59E0B"
      - name: "laughter"
        color: "#EC4899"
      - name: "silence"
        color: "#64748B"
    min_segments: 1
```

Transcription Review

```yaml
annotation_schemes:
  - name: "transcription_review"
    description: "Review and correct the transcription for each segment"
    annotation_type: "audio_annotation"
    mode: "questions"
    segment_schemes:
      - name: "transcript"
        annotation_type: "text"
        description: "Enter or correct the transcription"
        textarea: true
      - name: "quality"
        annotation_type: "radio"
        description: "Audio quality"
        labels:
          - "Clear"
          - "Noisy"
          - "Unintelligible"
```

Keyboard Shortcuts

| Key | Action |
|---|---|
| Space | Play/pause |
| ← / → | Seek backward/forward |
| [ | Mark segment start |
| ] | Mark segment end |
| Enter | Create segment |
| Delete | Remove selected segment |
| 1-9 | Select label |
| + / - | Zoom in/out |
| 0 | Fit view |

Data Format

Input Data

Your data file should include audio file paths or URLs:

```json
[
  {
    "id": "audio_001",
    "audio_url": "https://example.com/audio/recording1.mp3"
  },
  {
    "id": "audio_002",
    "audio_url": "/data/audio/recording2.wav"
  }
]
```

Point `item_properties` at the audio field, so each item's `text_key` supplies the audio path or URL:

```yaml
item_properties:
  id_key: id
  text_key: audio_url
```
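Before starting an annotation job, it can help to sanity-check the data file against the `id_key`/`text_key` mapping above. This `check_items` helper is a hypothetical sketch, not a Potato API:

```python
import json

# Formats from the "Supported Audio Formats" list below
AUDIO_EXTENSIONS = {".mp3", ".wav", ".ogg", ".m4a"}

def check_items(path: str) -> list[str]:
    """Flag input items missing the keys named in item_properties
    (id_key="id", text_key="audio_url" in the config above)."""
    problems = []
    with open(path) as f:
        items = json.load(f)
    for i, item in enumerate(items):
        if "id" not in item:
            problems.append(f"item {i}: missing id")
        url = item.get("audio_url", "")
        if not url:
            problems.append(f"item {i}: missing audio_url")
        elif not any(url.lower().endswith(ext) for ext in AUDIO_EXTENSIONS):
            problems.append(f"item {i}: unrecognized audio extension in {url}")
    return problems
```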

Output Format

```json
{
  "id": "audio_001",
  "annotations": {
    "diarization": [
      {
        "start": 0.0,
        "end": 5.5,
        "label": "Interviewer"
      },
      {
        "start": 5.5,
        "end": 12.3,
        "label": "Guest"
      },
      {
        "start": 12.3,
        "end": 14.0,
        "label": "Overlap"
      }
    ]
  }
}
```
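A small post-processing sketch for this output format, totalling the annotated seconds per label in one record (`label_durations` is a hypothetical helper, not part of Potato):

```python
def label_durations(record: dict, scheme: str = "diarization") -> dict:
    """Sum the annotated seconds per label across one record's segments."""
    totals: dict[str, float] = {}
    for seg in record.get("annotations", {}).get(scheme, []):
        # Each segment carries start/end times in seconds and a label
        totals[seg["label"]] = totals.get(seg["label"], 0.0) + (seg["end"] - seg["start"])
    return totals
```

On the example record above this yields roughly 5.5 s for Interviewer, 6.8 s for Guest, and 1.7 s for Overlap.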

For questions mode, segments include nested responses:

```json
{
  "start": 0.0,
  "end": 5.5,
  "transcript": "Hello and welcome to the show.",
  "quality": "Clear"
}
```

Supported Audio Formats

  • MP3 (recommended)
  • WAV
  • OGG
  • M4A

Best Practices

  1. Pre-cache waveforms - Use server-side caching for large datasets
  2. Enable playback control - Variable speed helps with precise segmentation
  3. Use keyboard shortcuts - Much faster than clicking
  4. Define clear boundaries - Specify what constitutes segment start/end
  5. Choose appropriate mode - Use `"label"` for classification, `"questions"` for detailed annotation
  6. Set segment limits - Use `min_segments` to ensure coverage