# Audio Annotation

Source: https://www.potatoannotator.com/docs/features/audio-annotation

Potato 2.0 supports audio annotation with waveform visualization powered by Peaks.js, segment labeling, and keyboard shortcuts.

## Use Cases

- Speech transcription and review
- Speaker diarization
- Music analysis
- Audio event detection
- Emotion recognition in speech
- Call center quality assurance

## Enabling Audio Support

Add a scheme with `annotation_type: audio` to the `annotation_schemes` section of your configuration:

```yaml
annotation_schemes:
  - annotation_type: audio
    name: audio_segments
    description: "Segment and label the audio"
    labels:
      - Speech
      - Music
      - Silence
      - Noise
```

## Operational Modes

Potato supports three audio annotation modes:

### Label Mode

Segment audio and assign category labels to each segment:

```yaml
annotation_schemes:
  - annotation_type: audio
    name: speaker_diarization
    mode: label
    description: "Identify speakers in the audio"
    labels:
      - Speaker A
      - Speaker B
      - Overlap
    label_colors:
      "Speaker A": "#3b82f6"
      "Speaker B": "#10b981"
      "Overlap": "#f59e0b"
```

### Questions Mode

Add per-segment annotation questions:

```yaml
annotation_schemes:
  - annotation_type: audio
    name: speech_quality
    mode: questions
    description: "Evaluate speech segments"
    segment_questions:
      - name: clarity
        type: likert
        size: 5
        min_label: "Unclear"
        max_label: "Very clear"
      - name: emotion
        type: radio
        labels: [Neutral, Happy, Sad, Angry]
```

### Both Mode

Combine labeling with per-segment questions:

```yaml
annotation_schemes:
  - annotation_type: audio
    name: full_analysis
    mode: both
    description: "Label and analyze audio segments"
    labels:
      - Speech
      - Music
      - Noise
    segment_questions:
      - name: quality
        type: likert
        size: 5
```

## Configuration Options

### Basic Setup

```yaml
annotation_schemes:
  - annotation_type: audio
    name: segments
    description: "Create audio segments"
    labels:
      - Label A
      - Label B

    # Optional constraints
    min_segments: 1
    max_segments: 50
```

### Keyboard Shortcuts

Number keys 1-9 assign labels in the order they are listed:

```yaml
annotation_schemes:
  - annotation_type: audio
    name: speakers
    labels:
      - Speaker A  # Press 1
      - Speaker B  # Press 2
      - Overlap    # Press 3
```

### Label Colors

Customize segment colors:

```yaml
annotation_schemes:
  - annotation_type: audio
    name: segments
    labels:
      - Speech
      - Music
      - Silence
    label_colors:
      "Speech": "#3b82f6"
      "Music": "#10b981"
      "Silence": "#6b7280"
```
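
A mistyped label name or color value in `label_colors` is easy to miss. The following is a minimal sketch of a pre-flight check you could run on a loaded scheme. The function name `check_label_colors` is hypothetical, and the rules it enforces (every label has a color, every color is `#RRGGBB`) are assumptions that are stricter than anything this page states Potato requires.

```python
import re

# Assumption: colors are written as six-digit hex, as in the examples above.
HEX_COLOR = re.compile(r"^#[0-9a-fA-F]{6}$")


def check_label_colors(labels, label_colors):
    """Return a list of human-readable problems; an empty list means no issues."""
    problems = []
    for label in labels:
        if label not in label_colors:
            problems.append(f"no color for label {label!r}")
    for label, color in label_colors.items():
        if label not in labels:
            problems.append(f"color given for unknown label {label!r}")
        if not HEX_COLOR.match(color):
            problems.append(f"{label!r}: {color!r} is not a #RRGGBB color")
    return problems


if __name__ == "__main__":
    labels = ["Speech", "Music", "Silence"]
    colors = {"Speech": "#3b82f6", "Music": "#10b981", "Silence": "#6b7280"}
    for problem in check_label_colors(labels, colors):
        print(problem)
```

An empty result only means the scheme passes these assumed rules; it does not guarantee that Potato accepts the configuration.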

## Waveform Performance

For optimal performance with long audio files, install the BBC audiowaveform tool:

```bash
# macOS
brew install audiowaveform

# Ubuntu/Debian
sudo apt-get install audiowaveform

# Or build from source
# https://github.com/bbc/audiowaveform
```

Installing it enables server-side waveform generation. Without it, Potato falls back to client-side generation, which is suitable for files under 30 minutes.

### Waveform Caching

Configure caching for better performance:

```yaml
audio_config:
  cache_dir: "audio_cache/"
  precompute_depth: 100  # Pre-generate waveforms for first N items
  client_fallback_max_duration: 1800  # 30 minutes in seconds
```
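
To estimate in advance which files depend on `audiowaveform`, the rule described above can be sketched as a small helper. This illustrates the documented behavior and is not Potato's implementation: the function name `waveform_strategy` and the `"too_long"` result are hypothetical, and what Potato actually does with a file longer than `client_fallback_max_duration` when the tool is missing is not stated here.

```python
import shutil


def waveform_strategy(duration_seconds, tool_available=None,
                      client_fallback_max_duration=1800):
    """Guess how a waveform would be produced for a file of the given length."""
    if tool_available is None:
        # shutil.which returns None when the executable is not on PATH.
        tool_available = shutil.which("audiowaveform") is not None
    if tool_available:
        return "server"
    if duration_seconds <= client_fallback_max_duration:
        return "client"
    # Assumption: longer files have no suitable fallback without the tool.
    return "too_long"


if __name__ == "__main__":
    for seconds in (600, 1800, 5400):
        print(seconds, waveform_strategy(seconds))
```

Files reported as `"too_long"` are the ones for which installing `audiowaveform` matters most.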

## Data Format

### Simple Audio Reference

```json
[
  {"id": "1", "audio_path": "audio/recording_001.wav"},
  {"id": "2", "audio_path": "audio/recording_002.wav"}
]
```

```yaml
data_files:
  - "data/audio_data.json"

item_properties:
  id_key: id
  audio_key: audio_path
```
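
For a large collection, writing the data file by hand is tedious. Here is a minimal sketch that scans a directory and emits records in the shape shown above. The helper names `build_items` and `write_data_file` are hypothetical, and the choice to store paths relative to the parent of the audio directory is an assumption made to match the `audio/recording_001.wav` form; this page does not state how Potato resolves relative paths.

```python
import json
from pathlib import Path

# Extensions taken from the supported formats listed on this page.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".ogg", ".flac", ".m4a", ".webm"}


def build_items(audio_dir, base_dir=None):
    """Return one {"id", "audio_path"} record per audio file under audio_dir."""
    audio_dir = Path(audio_dir)
    base_dir = Path(base_dir) if base_dir is not None else audio_dir.parent
    files = sorted(
        p for p in audio_dir.rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )
    return [
        {"id": str(i), "audio_path": p.relative_to(base_dir).as_posix()}
        for i, p in enumerate(files, start=1)
    ]


def write_data_file(audio_dir, out_path, base_dir=None):
    """Write the records as a JSON array and return them."""
    items = build_items(audio_dir, base_dir)
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(items, indent=2), encoding="utf-8")
    return items
```

Calling `write_data_file("audio", "data/audio_data.json")` from the task directory would produce a file matching the `data_files` entry above, provided the assumed path layout suits your setup.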

### With Transcripts

```json
[
  {
    "id": "1",
    "audio_path": "audio/call_001.wav",
    "transcript": "Hello, how can I help you today?"
  }
]
```

## Output Format

Annotations are saved with segment timestamps:

```json
{
  "id": "audio_1",
  "annotations": {
    "segments": [
      {
        "start": 0.0,
        "end": 2.5,
        "label": "Speaker A",
        "questions": {
          "clarity": 4,
          "emotion": "Neutral"
        }
      },
      {
        "start": 2.5,
        "end": 5.2,
        "label": "Speaker B"
      }
    ]
  }
}
```
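
Downstream analysis usually starts by aggregating these segments. As a minimal sketch, assuming each saved record has the shape shown above, the following sums annotated time per label. The function name `label_durations` and the `"(unlabeled)"` placeholder for segments without a `label` key (as might occur in questions mode) are hypothetical.

```python
from collections import defaultdict


def label_durations(record):
    """Sum (end - start) per label for one output record."""
    totals = defaultdict(float)
    for segment in record["annotations"]["segments"]:
        # Assumption: segments without a label are grouped under a placeholder.
        label = segment.get("label", "(unlabeled)")
        totals[label] += segment["end"] - segment["start"]
    return dict(totals)


if __name__ == "__main__":
    record = {
        "id": "audio_1",
        "annotations": {
            "segments": [
                {"start": 0.0, "end": 2.5, "label": "Speaker A"},
                {"start": 2.5, "end": 5.2, "label": "Speaker B"},
            ]
        },
    }
    for label, total in label_durations(record).items():
        print(f"{label}: {total:.2f}")
```

The totals are in whatever time unit the `start` and `end` values use.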

## Keyboard Shortcuts

Potato provides keyboard shortcuts for efficient annotation:

| Shortcut | Action |
|----------|--------|
| `Space` | Play/Pause |
| `[` | Set segment start at current position |
| `]` | Set segment end at current position |
| `1-9` | Assign label to current segment |
| `Delete` | Remove current segment |
| `Left Arrow` | Seek backward 5 seconds |
| `Right Arrow` | Seek forward 5 seconds |
| `Up Arrow` | Zoom in |
| `Down Arrow` | Zoom out |
| `Home` | Go to start |
| `End` | Go to end |
| `+` | Increase playback speed |
| `-` | Decrease playback speed |

## Example Configurations

### Speaker Diarization

```yaml
task_name: "Speaker Diarization"
task_dir: "."
port: 8000

data_files:
  - "data/recordings.json"

item_properties:
  id_key: id
  audio_key: audio_path

annotation_schemes:
  - annotation_type: audio
    name: speakers
    mode: label
    description: "Identify who is speaking"
    labels:
      - Speaker 1
      - Speaker 2
      - Speaker 3
      - Overlap
      - Silence
    label_colors:
      "Speaker 1": "#3b82f6"
      "Speaker 2": "#10b981"
      "Speaker 3": "#f59e0b"
      "Overlap": "#ef4444"
      "Silence": "#6b7280"
    min_segments: 1

audio_config:
  cache_dir: "audio_cache/"
  precompute_depth: 50

output_annotation_dir: "output/"
output_annotation_format: "json"
allow_all_users: true
```

### Transcription Review

```yaml
task_name: "Transcription Quality Review"
task_dir: "."
port: 8000

data_files:
  - "data/transcripts.json"

item_properties:
  id_key: id
  text_key: transcript
  audio_key: audio_path

annotation_schemes:
  - annotation_type: audio
    name: errors
    mode: questions
    description: "Mark transcription errors"
    segment_questions:
      - name: error_type
        type: radio
        labels:
          - Missing word
          - Wrong word
          - Extra word
          - Spelling error
      - name: severity
        type: likert
        size: 3
        min_label: "Minor"
        max_label: "Major"

  - annotation_type: radio
    name: overall_accuracy
    description: "Overall transcript accuracy"
    labels:
      - Accurate
      - Minor errors
      - Major errors
      - Unusable

output_annotation_dir: "output/"
output_annotation_format: "json"
```

### Call Center QA

```yaml
task_name: "Call Center Quality Assurance"
task_dir: "."
port: 8000

data_files:
  - "data/calls.json"

item_properties:
  id_key: call_id
  audio_key: recording_path

annotation_schemes:
  # Segment-level annotation
  - annotation_type: audio
    name: conversation
    mode: both
    description: "Segment the conversation"
    labels:
      - Agent
      - Customer
      - Hold
      - Silence
    segment_questions:
      - name: sentiment
        type: radio
        labels: [Positive, Neutral, Negative, Frustrated]

  # Call-level assessment
  - annotation_type: likert
    name: professionalism
    description: "Agent professionalism"
    size: 5
    min_label: "Poor"
    max_label: "Excellent"

  - annotation_type: likert
    name: resolution
    description: "Issue resolution"
    size: 5
    min_label: "Unresolved"
    max_label: "Fully resolved"

  - annotation_type: multiselect
    name: issues
    description: "Select any issues observed"
    labels:
      - Long hold time
      - Agent interrupted
      - Incorrect information
      - Missing greeting
      - Unprofessional language

  - annotation_type: text
    name: notes
    description: "Additional observations"
    textarea: true

output_annotation_dir: "output/"
output_annotation_format: "json"
```

## Supported Audio Formats

- WAV (recommended for best quality)
- MP3
- OGG
- FLAC
- M4A
- WebM

## Performance Tips

1. **Install audiowaveform** - Essential for long audio files
2. **Enable caching** - Use `cache_dir` to store pre-generated waveforms
3. **Use WAV for quality** - Compressed formats may introduce artifacts
4. **Pre-process audio** - Normalize levels, trim unnecessary silence
5. **Consider file sizes** - Large files slow down loading
6. **Use `precompute_depth`** - Pre-generate waveforms for the first N items

## Troubleshooting

### Waveform Not Loading

- Check that the audio file path is correct
- Verify that the file format is supported
- Install audiowaveform for long files
- Check the browser console for errors

### Slow Performance

- Install the audiowaveform tool
- Enable waveform caching with `cache_dir`
- Reduce audio file sizes
- Use the `precompute_depth` setting

### Segments Not Saving

- Ensure the output directory is writable
- Check the `output_annotation_format` setting
- Verify that each segment has both a start and an end time
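
The last check in the list above can be automated against segment data in the output format shown earlier. This is a minimal sketch rather than Potato's own validation: the function name `segment_problems` is hypothetical, and treating a segment whose end is not after its start as invalid is an assumption.

```python
def segment_problems(segments):
    """Return a list of problems found in a list of segment dicts."""
    problems = []
    for index, segment in enumerate(segments):
        start = segment.get("start")
        end = segment.get("end")
        if start is None or end is None:
            problems.append(f"segment {index}: missing start or end")
        elif end <= start:
            # Assumption: zero-length and reversed segments are errors.
            problems.append(
                f"segment {index}: end ({end}) is not after start ({start})"
            )
    return problems


if __name__ == "__main__":
    segments = [
        {"start": 0.0, "end": 2.5, "label": "Speaker A"},
        {"start": 2.5, "label": "Speaker B"},
    ]
    for problem in segment_problems(segments):
        print(problem)
```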