# Audio Annotation

Source: https://www.potatoannotator.com/docs/annotation-types/audio-annotation

Potato's audio annotation tool enables annotators to segment audio files and assign labels to time regions through a waveform-based interface.

## Features

- Waveform visualization
- Time-based segment creation
- Label assignment to segments
- Playback controls with variable speed
- Zoom and scroll navigation
- Keyboard shortcuts
- Server-side waveform caching

## Basic Configuration

```yaml
annotation_schemes:
  - name: "speakers"
    description: "Mark when each speaker is talking"
    annotation_type: "audio_annotation"
    labels:
      - name: "Speaker 1"
        color: "#3B82F6"
      - name: "Speaker 2"
        color: "#10B981"
```

## Configuration Options

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `name` | string | Required | Unique identifier for the annotation scheme |
| `description` | string | Required | Instructions shown to annotators |
| `annotation_type` | string | Required | Must be `"audio_annotation"` |
| `mode` | string | `"label"` | Annotation mode: `"label"`, `"questions"`, or `"both"` |
| `labels` | list | Conditional | Required for `label` or `both` modes |
| `segment_schemes` | list | Conditional | Required for `questions` or `both` modes |
| `min_segments` | integer | 0 | Minimum segments required |
| `max_segments` | integer | null | Maximum segments allowed (null = unlimited) |
| `zoom_enabled` | boolean | true | Enable zoom controls |
| `playback_rate_control` | boolean | false | Show playback speed selector |
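
The conditional requirements in the table can be checked before launching a task. The following is a minimal sketch, not part of Potato: `check_audio_scheme` is a hypothetical helper that takes one scheme as a plain dict (for example, after loading the YAML yourself) and applies only the rules stated in the table above.

```python
VALID_MODES = {"label", "questions", "both"}


def check_audio_scheme(scheme):
    """Return a list of problems found in one audio_annotation scheme dict."""
    problems = []
    for field in ("name", "description"):
        if not scheme.get(field):
            problems.append(f"missing required field: {field}")
    if scheme.get("annotation_type") != "audio_annotation":
        problems.append('annotation_type must be "audio_annotation"')

    mode = scheme.get("mode", "label")  # "label" is the documented default
    if mode not in VALID_MODES:
        problems.append(f"unknown mode: {mode!r}")
    if mode in ("label", "both") and not scheme.get("labels"):
        problems.append(f"mode {mode!r} requires labels")
    if mode in ("questions", "both") and not scheme.get("segment_schemes"):
        problems.append(f"mode {mode!r} requires segment_schemes")

    min_seg = scheme.get("min_segments", 0)
    max_seg = scheme.get("max_segments")  # None (null) means unlimited
    if max_seg is not None and max_seg < min_seg:
        problems.append("max_segments is smaller than min_segments")
    return problems


scheme = {
    "name": "transcription",
    "description": "Transcribe each segment",
    "annotation_type": "audio_annotation",
    "mode": "questions",  # segment_schemes deliberately left out
}
print(check_audio_scheme(scheme))
# ["mode 'questions' requires segment_schemes"]
```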

## Label Configuration

Each label takes a `name` and a display `color`. The optional `key_value` appears to set the keyboard key that selects the label (see Keyboard Shortcuts).

```yaml
labels:
  - name: "speech"
    color: "#3B82F6"
    key_value: "1"
  - name: "music"
    color: "#10B981"
    key_value: "2"
  - name: "silence"
    color: "#64748B"
    key_value: "3"
```
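
A label list like the one above can be sanity-checked with a few lines of Python. This is a sketch under assumptions: `check_labels` is a hypothetical helper, and it treats colors as six-digit hex strings because every example on this page uses that form; Potato itself may accept other color notations.

```python
import re

# Assumption: colors are written as #RRGGBB, as in the examples on this page.
HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")


def check_labels(labels):
    """Return problems found in a list of label dicts (name, color, key_value)."""
    problems = []
    seen_names = set()
    seen_keys = {}  # key_value -> name of the label that first used it
    for label in labels:
        name = label.get("name")
        if not name:
            problems.append("label without a name")
            continue
        if name in seen_names:
            problems.append(f"duplicate label name: {name}")
        seen_names.add(name)

        color = label.get("color")
        if color is not None and not HEX_COLOR.fullmatch(color):
            problems.append(f"{name}: color {color!r} is not #RRGGBB")

        key = label.get("key_value")
        if key is not None:
            if key in seen_keys:
                problems.append(
                    f"{name}: key_value {key!r} already used by {seen_keys[key]}"
                )
            else:
                seen_keys[key] = name
    return problems


labels = [
    {"name": "speech", "color": "#3B82F6", "key_value": "1"},
    {"name": "music", "color": "#10B981", "key_value": "1"},  # clashes with speech
]
print(check_labels(labels))
# ["music: key_value '1' already used by speech"]
```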

## Annotation Modes

### Label Mode (Default)

Each segment is assigned a category label:

```yaml
annotation_schemes:
  - name: "emotion"
    description: "Label the emotion in each segment"
    annotation_type: "audio_annotation"
    mode: "label"
    labels:
      - name: "happy"
        color: "#22C55E"
      - name: "sad"
        color: "#3B82F6"
      - name: "angry"
        color: "#EF4444"
      - name: "neutral"
        color: "#64748B"
```

### Questions Mode

Annotators answer a dedicated set of questions for each segment:

```yaml
annotation_schemes:
  - name: "transcription"
    description: "Transcribe each segment"
    annotation_type: "audio_annotation"
    mode: "questions"
    segment_schemes:
      - name: "transcript"
        annotation_type: "text"
        description: "Enter the transcription"
      - name: "confidence"
        annotation_type: "likert"
        description: "How confident are you?"
        size: 5
```

### Both Mode

Combines labeling with per-segment questionnaires:

```yaml
annotation_schemes:
  - name: "detailed_diarization"
    description: "Label speakers and add notes"
    annotation_type: "audio_annotation"
    mode: "both"
    labels:
      - name: "Speaker A"
        color: "#3B82F6"
      - name: "Speaker B"
        color: "#10B981"
    segment_schemes:
      - name: "notes"
        annotation_type: "text"
        description: "Any notes about this segment?"
```

## Global Audio Configuration

Configure waveform handling in your config file:

```yaml
audio_annotation:
  waveform_cache_dir: "waveform_cache/"
  waveform_look_ahead: 5
  waveform_cache_max_size: 1000
  client_fallback_max_duration: 1800
```

| Field | Description |
|-------|-------------|
| `waveform_cache_dir` | Directory for cached waveform data |
| `waveform_look_ahead` | Number of upcoming instances to pre-compute |
| `waveform_cache_max_size` | Maximum number of cached waveform files |
| `client_fallback_max_duration` | Max seconds for browser-side waveform generation (default: 1800) |
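
This page does not say what happens to files longer than `client_fallback_max_duration`, so it can be useful to know in advance which files exceed it. The sketch below is an illustration only: it handles WAV files with Python's standard `wave` module, and both function names are hypothetical. Other formats would need a third-party audio library.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def exceeds_client_fallback(path, max_duration=1800):
    """True if the file is longer than the browser-side waveform limit.

    max_duration mirrors client_fallback_max_duration (documented default: 1800).
    """
    return wav_duration_seconds(path) > max_duration
```

Running `exceeds_client_fallback` over a dataset gives a list of recordings that presumably depend on server-side waveform generation.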

## Examples

### Speaker Diarization

```yaml
annotation_schemes:
  - name: "diarization"
    description: "Identify who is speaking at each moment"
    annotation_type: "audio_annotation"
    mode: "label"
    labels:
      - name: "Interviewer"
        color: "#8B5CF6"
        key_value: "1"
      - name: "Guest"
        color: "#EC4899"
        key_value: "2"
      - name: "Overlap"
        color: "#F59E0B"
        key_value: "3"
    zoom_enabled: true
    playback_rate_control: true
```

### Sound Event Detection

```yaml
annotation_schemes:
  - name: "sound_events"
    description: "Mark all sound events"
    annotation_type: "audio_annotation"
    labels:
      - name: "speech"
        color: "#3B82F6"
      - name: "music"
        color: "#10B981"
      - name: "applause"
        color: "#F59E0B"
      - name: "laughter"
        color: "#EC4899"
      - name: "silence"
        color: "#64748B"
    min_segments: 1
```

### Transcription Review

```yaml
annotation_schemes:
  - name: "transcription_review"
    description: "Review and correct the transcription for each segment"
    annotation_type: "audio_annotation"
    mode: "questions"
    segment_schemes:
      - name: "transcript"
        annotation_type: "text"
        description: "Enter or correct the transcription"
        multiline: true
      - name: "quality"
        annotation_type: "radio"
        description: "Audio quality"
        labels:
          - "Clear"
          - "Noisy"
          - "Unintelligible"
```

## Keyboard Shortcuts

| Key | Action |
|-----|--------|
| `Space` | Play/pause |
| `←` / `→` | Seek backward/forward |
| `[` | Mark segment start |
| `]` | Mark segment end |
| `Enter` | Create segment |
| `Delete` | Remove selected segment |
| `1-9` | Select label |
| `+` / `-` | Zoom in/out |
| `0` | Fit view |

## Data Format

### Input Data

Your data file should include audio file paths or URLs:

```json
[
  {
    "id": "audio_001",
    "audio_url": "https://example.com/audio/recording1.mp3"
  },
  {
    "id": "audio_002",
    "audio_url": "/data/audio/recording2.wav"
  }
]
```

Point `text_key` at the field that holds the audio path or URL:

```yaml
item_properties:
  id_key: id
  text_key: audio_url
```
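
For a folder of recordings, the input file can be generated rather than written by hand. The sketch below is one possible approach, not a Potato utility: `build_items` and `write_data_file` are hypothetical names, the extension set follows the formats listed under Supported Audio Formats, and the field names match the `id_key` and `text_key` shown above.

```python
import json
from pathlib import Path

# Extensions taken from the Supported Audio Formats list on this page.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".ogg", ".m4a"}


def build_items(audio_dir, url_prefix=""):
    """List supported audio files in audio_dir as {"id", "audio_url"} records.

    The file stem is used as the id, so files that differ only by extension
    would collide; rename them or change the id scheme if that applies.
    """
    items = []
    for path in sorted(Path(audio_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS:
            # With a prefix, emit a URL; otherwise emit the local path.
            audio_url = url_prefix + path.name if url_prefix else str(path)
            items.append({"id": path.stem, "audio_url": audio_url})
    return items


def write_data_file(items, out_path):
    """Write the records as a JSON array, matching the input format above."""
    Path(out_path).write_text(json.dumps(items, indent=2), encoding="utf-8")
```

A typical call would be `write_data_file(build_items("data/audio"), "data/audio.json")`, where both paths are placeholders.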

### Output Format

```json
{
  "id": "audio_001",
  "annotations": {
    "diarization": [
      {
        "start": 0.0,
        "end": 5.5,
        "label": "Interviewer"
      },
      {
        "start": 5.5,
        "end": 12.3,
        "label": "Guest"
      },
      {
        "start": 12.3,
        "end": 14.0,
        "label": "Overlap"
      }
    ]
  }
}
```
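
Because every segment carries `start`, `end`, and `label`, simple statistics fall out directly. As a rough illustration, the hypothetical helper below sums the time covered by each label, assuming one output record has already been loaded into a dict with the structure shown above.

```python
from collections import defaultdict


def seconds_per_label(record, scheme_name):
    """Sum segment durations per label for one annotated record."""
    totals = defaultdict(float)
    for segment in record["annotations"][scheme_name]:
        totals[segment["label"]] += segment["end"] - segment["start"]
    # Round to milliseconds to hide floating-point noise.
    return {label: round(total, 3) for label, total in totals.items()}


record = {
    "id": "audio_001",
    "annotations": {
        "diarization": [
            {"start": 0.0, "end": 5.5, "label": "Interviewer"},
            {"start": 5.5, "end": 12.3, "label": "Guest"},
            {"start": 12.3, "end": 14.0, "label": "Overlap"},
        ]
    },
}
print(seconds_per_label(record, "diarization"))
# {'Interviewer': 5.5, 'Guest': 6.8, 'Overlap': 1.7}
```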

In questions mode, each segment also carries the answers, keyed by the names defined in `segment_schemes`:

```json
{
  "start": 0.0,
  "end": 5.5,
  "transcript": "Hello and welcome to the show.",
  "quality": "Clear"
}
```
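
Segments in this shape are easy to turn into a readable, time-stamped transcript. The following sketch assumes the text answer is stored under the key `transcript`, as in the example above; the function names are hypothetical and the key depends on how `segment_schemes` is named in your config.

```python
def format_timestamp(seconds):
    """Format seconds as MM:SS.s, rounding to tenths of a second."""
    total_tenths = round(seconds * 10)
    minutes, tenths = divmod(total_tenths, 600)
    return f"{minutes:02d}:{tenths / 10:04.1f}"


def transcript_lines(segments, text_key="transcript"):
    """Return one '[start - end] text' line per segment, ordered by start time."""
    ordered = sorted(segments, key=lambda s: s["start"])
    return [
        f"[{format_timestamp(s['start'])} - {format_timestamp(s['end'])}] {s[text_key]}"
        for s in ordered
    ]


segments = [
    {
        "start": 0.0,
        "end": 5.5,
        "transcript": "Hello and welcome to the show.",
        "quality": "Clear",
    }
]
print("\n".join(transcript_lines(segments)))
# [00:00.0 - 00:05.5] Hello and welcome to the show.
```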

## Supported Audio Formats

- MP3 (recommended)
- WAV
- OGG
- M4A

## Best Practices

1. **Pre-cache waveforms** - Use server-side waveform caching for large datasets
2. **Enable playback control** - Set `playback_rate_control: true`; variable speed helps with precise segmentation
3. **Use keyboard shortcuts** - They are much faster than clicking
4. **Define clear boundaries** - Tell annotators what counts as the start and end of a segment
5. **Choose the appropriate mode** - Use `label` for classification and `questions` for detailed per-segment annotation
6. **Set segment limits** - Use `min_segments` to ensure coverage and `max_segments` to cap the number of segments