Audio Annotation

Annotate audio files with waveform visualization and playback controls.

Potato 2.0 supports audio annotation with waveform visualization (rendered by Peaks.js), segment labeling, and keyboard shortcuts for playback, navigation, and labeling.

Use Cases

  • Speech transcription and review
  • Speaker diarization
  • Music analysis
  • Audio event detection
  • Emotion recognition in speech
  • Call center quality assurance

Enabling Audio Support

Add a scheme with annotation_type: audio to the annotation_schemes section of your configuration:

annotation_schemes:
  - annotation_type: audio
    name: audio_segments
    description: "Segment and label the audio"
    labels:
      - Speech
      - Music
      - Silence
      - Noise

Operational Modes

Potato supports three audio annotation modes, selected with the mode key: label, questions, and both.

Label Mode

Segment audio and assign category labels to each segment:

annotation_schemes:
  - annotation_type: audio
    name: speaker_diarization
    mode: label
    description: "Identify speakers in the audio"
    labels:
      - Speaker A
      - Speaker B
      - Overlap
    label_colors:
      "Speaker A": "#3b82f6"
      "Speaker B": "#10b981"
      "Overlap": "#f59e0b"

Questions Mode

Add per-segment annotation questions:

annotation_schemes:
  - annotation_type: audio
    name: speech_quality
    mode: questions
    description: "Evaluate speech segments"
    segment_questions:
      - name: clarity
        type: likert
        size: 5
        min_label: "Unclear"
        max_label: "Very clear"
      - name: emotion
        type: radio
        labels: [Neutral, Happy, Sad, Angry]

Both Mode

Combine labeling with per-segment questions:

annotation_schemes:
  - annotation_type: audio
    name: full_analysis
    mode: both
    description: "Label and analyze audio segments"
    labels:
      - Speech
      - Music
      - Noise
    segment_questions:
      - name: quality
        type: likert
        size: 5

Configuration Options

Basic Setup

annotation_schemes:
  - annotation_type: audio
    name: segments
    description: "Create audio segments"
    labels:
      - Label A
      - Label B
 
    # Optional constraints
    min_segments: 1
    max_segments: 50

Keyboard Shortcuts

Labels can be assigned using number keys 1-9:

annotation_schemes:
  - annotation_type: audio
    name: speakers
    labels:
      - Speaker A  # Press 1
      - Speaker B  # Press 2
      - Overlap    # Press 3
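
The position-to-key mapping in the comments above can be sketched in Python. This is an illustration only: shortcut_keys is a hypothetical helper, not part of Potato, and it assumes that labels after the ninth simply receive no number shortcut.

```python
def shortcut_keys(labels):
    """Map number keys "1".."9" to labels in the order they are listed."""
    # Assumption: only the first nine labels receive a number shortcut.
    return {str(i): label for i, label in enumerate(labels[:9], start=1)}


keys = shortcut_keys(["Speaker A", "Speaker B", "Overlap"])
# keys["1"] is "Speaker A", keys["3"] is "Overlap"
```

Listing the most frequently used labels first keeps them on the keys that are easiest to reach.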

Label Colors

Customize segment colors:

annotation_schemes:
  - annotation_type: audio
    name: segments
    labels:
      - Speech
      - Music
      - Silence
    label_colors:
      "Speech": "#3b82f6"
      "Music": "#10b981"
      "Silence": "#6b7280"
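
A misspelled key in label_colors is easy to miss, so it can help to check a scheme before starting the server. The sketch below is a hypothetical helper, not part of Potato. It takes one scheme as a plain dict (for example, one entry of annotation_schemes after loading the YAML) and only accepts the six-digit hex form used in the examples here. Whether Potato accepts other color formats is not covered by this guide.

```python
import re

# Assumption: colors are written as six-digit hex values, as in the examples.
HEX_COLOR = re.compile(r"#[0-9a-fA-F]{6}")


def check_label_colors(scheme):
    """Return a list of problems found in one scheme's label_colors mapping."""
    labels = set(scheme.get("labels", []))
    problems = []
    for label, color in scheme.get("label_colors", {}).items():
        if label not in labels:
            problems.append(f"color defined for unknown label: {label!r}")
        if not HEX_COLOR.fullmatch(str(color)):
            problems.append(f"not a #RRGGBB color for {label!r}: {color!r}")
    return problems


scheme = {
    "labels": ["Speech", "Music", "Silence"],
    "label_colors": {"Speech": "#3b82f6", "Music": "#10b981", "Silence": "#6b7280"},
}
problems = check_label_colors(scheme)  # empty list for the configuration above
```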

Waveform Performance

For optimal performance with long audio files, install the BBC audiowaveform tool:

# macOS
brew install audiowaveform
 
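# Ubuntu (the package is published in a PPA; on Debian, see the project README)
sudo add-apt-repository ppa:chris-needham/ppa
sudo apt-get update
sudo apt-get install audiowaveform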
 
# Or build from source
# https://github.com/bbc/audiowaveform

With audiowaveform installed, Potato generates waveforms on the server. Without it, waveforms are generated client-side, which is suitable only for files under 30 minutes.
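
To find out in advance which recordings exceed the 30-minute guideline, you can inspect the files before loading them. The sketch below uses only the Python standard library. Its function names are hypothetical, and it reads uncompressed WAV files only, since the wave module does not decode MP3, OGG, FLAC, M4A, or WebM.

```python
import shutil
import wave

CLIENT_SIDE_LIMIT_SECONDS = 30 * 60  # the 30-minute guideline from the text


def wav_duration_seconds(path):
    """Duration of an uncompressed WAV file, computed from its header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def exceeds_client_side_limit(path, limit=CLIENT_SIDE_LIMIT_SECONDS):
    """True if the file is too long for client-side waveform generation."""
    return wav_duration_seconds(path) >= limit


def audiowaveform_installed():
    """True if an audiowaveform executable is found on PATH."""
    return shutil.which("audiowaveform") is not None
```

If exceeds_client_side_limit returns True for any file and audiowaveform_installed returns False, install the tool before annotating.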

Waveform Caching

Configure caching for better performance:

audio_config:
  cache_dir: "audio_cache/"
  precompute_depth: 100  # Pre-generate waveforms for first N items
  client_fallback_max_duration: 1800  # 30 minutes in seconds

Data Format

Simple Audio Reference

[
  {"id": "1", "audio_path": "audio/recording_001.wav"},
  {"id": "2", "audio_path": "audio/recording_002.wav"}
]

Then reference the data file and map its keys in your configuration:

data_files:
  - "data/audio_data.json"
 
item_properties:
  id_key: id
  audio_key: audio_path
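
For a large collection, the data file can be generated rather than written by hand. The following is a minimal sketch, assuming the audio lives in one directory tree and that sequential numeric ids are acceptable. build_audio_items and write_data_file are hypothetical helpers, not part of Potato, and the extension list mirrors the supported formats listed in this guide.

```python
import json
from pathlib import Path

# Extensions taken from the "Supported Audio Formats" list in this guide.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".ogg", ".flac", ".m4a", ".webm"}


def build_audio_items(audio_dir, base_dir="."):
    """Return one {"id", "audio_path"} record per audio file under audio_dir."""
    base = Path(base_dir)
    files = sorted(
        p for p in Path(audio_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )
    # Paths are stored relative to base_dir, with forward slashes.
    return [
        {"id": str(i), "audio_path": p.relative_to(base).as_posix()}
        for i, p in enumerate(files, start=1)
    ]


def write_data_file(items, out_path):
    """Write the records as a JSON array in the format shown above."""
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(items, indent=2), encoding="utf-8")
```

Calling write_data_file(build_audio_items("audio"), "data/audio_data.json") from the task directory would produce a file matching the data_files entry above.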

With Transcripts

[
  {
    "id": "1",
    "audio_path": "audio/call_001.wav",
    "transcript": "Hello, how can I help you today?"
  }
]

Output Format

Annotations are saved with segment timestamps:

{
  "id": "audio_1",
  "annotations": {
    "segments": [
      {
        "start": 0.0,
        "end": 2.5,
        "label": "Speaker A",
        "questions": {
          "clarity": 4,
          "emotion": "Neutral"
        }
      },
      {
        "start": 2.5,
        "end": 5.2,
        "label": "Speaker B"
      }
    ]
  }
}
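
Because each segment carries start and end times in seconds, per-label totals are straightforward to compute. The sketch below assumes a record shaped exactly like the example above (an annotations object containing a segments list). seconds_per_label is a hypothetical helper, and grouping segments that have no label under "(unlabeled)" is an assumption for questions-only output.

```python
from collections import defaultdict


def seconds_per_label(record):
    """Sum segment durations (end - start) for each label in one record."""
    totals = defaultdict(float)
    for seg in record["annotations"]["segments"]:
        # Assumption: a segment without a label is counted as "(unlabeled)".
        label = seg.get("label", "(unlabeled)")
        totals[label] += seg["end"] - seg["start"]
    return dict(totals)


record = {
    "id": "audio_1",
    "annotations": {
        "segments": [
            {"start": 0.0, "end": 2.5, "label": "Speaker A"},
            {"start": 2.5, "end": 5.2, "label": "Speaker B"},
        ]
    },
}
totals = seconds_per_label(record)  # about 2.5 s for Speaker A, 2.7 s for Speaker B
```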

Keyboard Shortcuts

Potato provides extensive keyboard shortcuts for efficient annotation:

Shortcut       Action
Space          Play/Pause
[              Set segment start at current position
]              Set segment end at current position
1-9            Assign label to current segment
Delete         Remove current segment
Left Arrow     Seek backward 5 seconds
Right Arrow    Seek forward 5 seconds
Up Arrow       Zoom in
Down Arrow     Zoom out
Home           Go to start
End            Go to end
+              Increase playback speed
-              Decrease playback speed

Example Configurations

Speaker Diarization

task_name: "Speaker Diarization"
task_dir: "."
port: 8000
 
data_files:
  - "data/recordings.json"
 
item_properties:
  id_key: id
  audio_key: audio_path
 
annotation_schemes:
  - annotation_type: audio
    name: speakers
    mode: label
    description: "Identify who is speaking"
    labels:
      - Speaker 1
      - Speaker 2
      - Speaker 3
      - Overlap
      - Silence
    label_colors:
      "Speaker 1": "#3b82f6"
      "Speaker 2": "#10b981"
      "Speaker 3": "#f59e0b"
      "Overlap": "#ef4444"
      "Silence": "#6b7280"
    min_segments: 1
 
audio_config:
  cache_dir: "audio_cache/"
  precompute_depth: 50
 
output_annotation_dir: "output/"
output_annotation_format: "json"
allow_all_users: true

Transcription Review

task_name: "Transcription Quality Review"
task_dir: "."
port: 8000
 
data_files:
  - "data/transcripts.json"
 
item_properties:
  id_key: id
  text_key: transcript
  audio_key: audio_path
 
annotation_schemes:
  - annotation_type: audio
    name: errors
    mode: questions
    description: "Mark transcription errors"
    segment_questions:
      - name: error_type
        type: radio
        labels:
          - Missing word
          - Wrong word
          - Extra word
          - Spelling error
      - name: severity
        type: likert
        size: 3
        min_label: "Minor"
        max_label: "Major"
 
  - annotation_type: radio
    name: overall_accuracy
    description: "Overall transcript accuracy"
    labels:
      - Accurate
      - Minor errors
      - Major errors
      - Unusable
 
output_annotation_dir: "output/"
output_annotation_format: "json"

Call Center QA

task_name: "Call Center Quality Assurance"
task_dir: "."
port: 8000
 
data_files:
  - "data/calls.json"
 
item_properties:
  id_key: call_id
  audio_key: recording_path
 
annotation_schemes:
  # Segment-level annotation
  - annotation_type: audio
    name: conversation
    mode: both
    description: "Segment the conversation"
    labels:
      - Agent
      - Customer
      - Hold
      - Silence
    segment_questions:
      - name: sentiment
        type: radio
        labels: [Positive, Neutral, Negative, Frustrated]
 
  # Call-level assessment
  - annotation_type: likert
    name: professionalism
    description: "Agent professionalism"
    size: 5
    min_label: "Poor"
    max_label: "Excellent"
 
  - annotation_type: likert
    name: resolution
    description: "Issue resolution"
    size: 5
    min_label: "Unresolved"
    max_label: "Fully resolved"
 
  - annotation_type: multiselect
    name: issues
    description: "Select any issues observed"
    labels:
      - Long hold time
      - Agent interrupted
      - Incorrect information
      - Missing greeting
      - Unprofessional language
 
  - annotation_type: text
    name: notes
    description: "Additional observations"
    textarea: true
 
output_annotation_dir: "output/"
output_annotation_format: "json"

Supported Audio Formats

  • WAV (recommended for best quality)
  • MP3
  • OGG
  • FLAC
  • M4A
  • WebM

Performance Tips

  1. Install audiowaveform: essential for long audio files
  2. Enable caching: set cache_dir to store pre-generated waveforms
  3. Use WAV for quality: compressed formats may introduce artifacts
  4. Pre-process audio: normalize levels and trim unnecessary silence
  5. Keep files small: large files slow down loading
  6. Precompute waveforms: set precompute_depth to pre-generate waveforms for the first N items

Troubleshooting

Waveform Not Loading

  • Check that the audio file path is correct
  • Verify that the file format is supported
  • Install audiowaveform for long files
  • Check the browser console for errors

Slow Performance

  • Install audiowaveform tool
  • Enable waveform caching
  • Reduce audio file sizes
  • Use precompute_depth setting

Segments Not Saving

  • Ensure the output directory is writable
  • Check the output_annotation_format setting
  • Verify that each segment has both a start and an end time
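
The last check can also be run against saved or exported segments. This is a minimal sketch, assuming segments are dicts with start and end keys as in the Output Format section. incomplete_segments is a hypothetical helper, and treating a zero-length or reversed segment as incomplete is an assumption.

```python
def incomplete_segments(segments):
    """Return indexes of segments missing a boundary or with end <= start."""
    bad = []
    for i, seg in enumerate(segments):
        start, end = seg.get("start"), seg.get("end")
        # Assumption: zero-length and reversed segments count as incomplete.
        if start is None or end is None or end <= start:
            bad.append(i)
    return bad


segments = [
    {"start": 0.0, "end": 2.5, "label": "Speaker A"},
    {"start": 2.5, "label": "Speaker B"},  # end time was never set
]
bad = incomplete_segments(segments)  # index 1 is flagged
```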