# Speaker Diarization Annotation

Source: https://www.potatoannotator.com/blog/speaker-diarization-annotation

Speaker diarization answers the question "who spoke when?" This tutorial covers building interfaces for annotating speaker turns, correcting automatic diarization, and handling conversations with several speakers. For the full audio scheme reference, see the [source documentation](https://github.com/davidjurgens/potato/blob/master/docs/annotation-types/multimedia/audio_annotation.md).

## What Is Speaker Diarization?

Speaker diarization splits audio into stretches that belong to a single speaker. It comes up in:

- Meeting transcription
- Call center analytics
- Podcast production
- Interview processing
- Court/legal recordings

## Basic Diarization Setup

```yaml
annotation_task_name: "Speaker Diarization"

data_files:
  - "data/conversations.json"

annotation_schemes:
  - annotation_type: audio_annotation
    name: speakers
    description: "Mark when each speaker talks"
    labels:
      - name: Speaker 1
        color: "#FF6B6B"
        keyboard_shortcut: "1"
      - name: Speaker 2
        color: "#4ECDC4"
        keyboard_shortcut: "2"
      - name: Speaker 3
        color: "#45B7D1"
        keyboard_shortcut: "3"
      - name: Overlap
        color: "#FFEAA7"
        keyboard_shortcut: "o"
      - name: Silence
        color: "#9CA3AF"
        keyboard_shortcut: "s"
```

## Creating Speaker Segments

### Workflow

1. Play the audio or click the waveform to navigate
2. Click and drag on the waveform to select a time range
3. Press a number key or click a speaker label
4. The segment is colored and labeled
5. Adjust boundaries by dragging edges
6. Continue until the entire audio is segmented
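
To confirm that step 6 is really done, a small script can list the stretches of audio that no segment covers. This is a minimal sketch, not part of Potato: `find_gaps` and its `tolerance` parameter are hypothetical names, and it assumes segments are dictionaries with `start` and `end` times in seconds and that the total audio duration is known.

```python
def find_gaps(segments, duration, tolerance=0.05):
    """Return (start, end) ranges that no segment covers.

    Gaps shorter than `tolerance` seconds are ignored so that tiny
    boundary mismatches are not reported.
    """
    gaps = []
    cursor = 0.0  # end of the audio covered so far
    for seg in sorted(segments, key=lambda s: s["start"]):
        if seg["start"] - cursor > tolerance:
            gaps.append((cursor, seg["start"]))
        cursor = max(cursor, seg["end"])
    if duration - cursor > tolerance:
        gaps.append((cursor, duration))
    return gaps


segments = [
    {"start": 0.0, "end": 3.5},
    {"start": 4.0, "end": 8.0},
]
print(find_gaps(segments, duration=10.0))
```

Any range the script reports is a candidate for a speaker, `Overlap`, or `Silence` segment.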

### Keyboard Controls

Potato provides built-in keyboard shortcuts for audio playback control, including play/pause and navigation.

## Pre-annotated Diarization Correction

Often you are not labeling from scratch but correcting the output of an automatic diarizer:

```yaml
data_files:
  - "data/auto_diarized.json"
```

Data format:

```json
{
  "id": "meeting_001",
  "audio_path": "/audio/meeting_001.wav",
  "auto_segments": [
    {"start": 0.0, "end": 3.5, "speaker": "Speaker 1"},
    {"start": 3.5, "end": 8.2, "speaker": "Speaker 2"},
    {"start": 8.2, "end": 12.0, "speaker": "Speaker 1"}
  ]
}
```
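
Many diarization tools write their results as RTTM files, so one way to produce records in the shape above is a small converter. The sketch below assumes RTTM input, which the original setup does not specify. `rttm_to_record` and its `label_map` argument are hypothetical names, and the output field names simply follow the example record.

```python
import json


def rttm_to_record(rttm_lines, record_id, audio_path, label_map=None):
    """Build one pre-annotated record from RTTM "SPEAKER" lines.

    label_map (hypothetical argument) renames diarizer speaker IDs
    to the label names used in the annotation scheme.
    """
    label_map = label_map or {}
    segments = []
    for line in rttm_lines:
        fields = line.split()
        # RTTM columns: type, file, channel, onset, duration, ortho,
        # speaker type, speaker name, confidence, lookahead
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])
        end = start + float(fields[4])
        speaker = fields[7]
        segments.append({
            "start": round(start, 3),
            "end": round(end, 3),
            "speaker": label_map.get(speaker, speaker),
        })
    segments.sort(key=lambda seg: seg["start"])
    return {"id": record_id, "audio_path": audio_path, "auto_segments": segments}


rttm = [
    "SPEAKER meeting_001 1 0.00 3.50 <NA> <NA> spk0 <NA> <NA>",
    "SPEAKER meeting_001 1 3.50 4.70 <NA> <NA> spk1 <NA> <NA>",
]
record = rttm_to_record(
    rttm, "meeting_001", "/audio/meeting_001.wav",
    label_map={"spk0": "Speaker 1", "spk1": "Speaker 2"},
)
print(json.dumps(record, indent=2))
```

Mapping the diarizer's speaker IDs to the scheme's label names up front means annotators only need to fix boundaries and wrong assignments.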

## Detailed Speaker Information

Capture extra metadata about each speaker:

```yaml
annotation_schemes:
  - annotation_type: audio_annotation
    name: speakers
    labels:
      - name: Speaker A
        color: "#FF6B6B"
      - name: Speaker B
        color: "#4ECDC4"
      - name: Speaker C
        color: "#45B7D1"
      - name: Unknown
        color: "#9CA3AF"

  # Speaker characteristics
  - annotation_type: radio
    name: speaker_a_gender
    description: "Speaker A Gender"
    labels:
      - Male
      - Female
      - Unknown

  - annotation_type: text
    name: speaker_a_role
    description: "Speaker A Role (if identifiable)"

  - annotation_type: radio
    name: speaker_b_gender
    description: "Speaker B Gender"
    labels:
      - Male
      - Female
      - Unknown
```

## Handling Overlapping Speech

When two or more speakers talk at once, mark that stretch with a dedicated `Overlap` label:

```yaml
annotation_schemes:
  - annotation_type: audio_annotation
    name: speakers
    labels:
      - name: Speaker 1
        color: "#FF6B6B"
      - name: Speaker 2
        color: "#4ECDC4"
      - name: Overlap
        color: "#FFEAA7"
```
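
One possible consistency check, if annotators also mark each speaker's full turn, is to compute where turns from different speakers intersect and compare those ranges with the `Overlap` segments. The following is a rough sketch under that assumption. `find_overlaps` is a hypothetical helper, and the segment dictionaries are assumed to carry `start`, `end`, and `label` keys.

```python
def find_overlaps(segments):
    """Return (start, end) ranges where turns with different labels intersect."""
    ordered = sorted(segments, key=lambda s: s["start"])
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second["start"] >= first["end"]:
                break  # sorted by start, so no later segment can intersect `first`
            if second["label"] != first["label"]:
                overlaps.append(
                    (second["start"], min(first["end"], second["end"]))
                )
    return overlaps


turns = [
    {"start": 0.0, "end": 5.0, "label": "Speaker 1"},
    {"start": 4.0, "end": 8.0, "label": "Speaker 2"},
    {"start": 8.0, "end": 10.0, "label": "Speaker 1"},
]
print(find_overlaps(turns))
```

Segments that merely touch at a boundary are not reported, so back-to-back turns do not count as overlapping speech.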

## Meeting/Interview Diarization

```yaml
annotation_task_name: "Meeting Diarization"

data_files:
  - "data/meetings.json"

annotation_schemes:
  # Speaker turns
  - annotation_type: audio_annotation
    name: turns
    description: "Mark each speaker turn"
    labels:
      - name: Moderator
        color: "#EF4444"
        keyboard_shortcut: "m"
      - name: Participant 1
        color: "#3B82F6"
        keyboard_shortcut: "1"
      - name: Participant 2
        color: "#10B981"
        keyboard_shortcut: "2"
      - name: Participant 3
        color: "#F59E0B"
        keyboard_shortcut: "3"
      - name: Participant 4
        color: "#8B5CF6"
        keyboard_shortcut: "4"
      - name: Unknown
        color: "#6B7280"
        keyboard_shortcut: "u"
      - name: Overlap
        color: "#FCD34D"
        keyboard_shortcut: "o"
      - name: Silence/Noise
        color: "#D1D5DB"
        keyboard_shortcut: "s"

  # Speech type annotation
  - annotation_type: radio
    name: speech_type
    description: "Type of speech"
    labels:
      - Statement
      - Question
      - Response
      - Interruption
      - Backchannel

  # Overall quality
  - annotation_type: radio
    name: recording_quality
    description: "Overall recording quality"
    labels:
      - Excellent - All speakers clear
      - Good - Most speech understandable
      - Fair - Some difficulty
      - Poor - Significant issues
```
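
With this many labels it is easy to assign the same key twice. The sketch below checks a list of label entries for repeated `keyboard_shortcut` values. It assumes the labels have already been parsed from the YAML into Python dictionaries, and `duplicate_shortcuts` is a hypothetical helper, not a Potato function.

```python
from collections import defaultdict


def duplicate_shortcuts(labels):
    """Map each shortcut used more than once to the label names sharing it."""
    by_key = defaultdict(list)
    for label in labels:
        key = label.get("keyboard_shortcut")
        if key is not None:
            by_key[str(key)].append(label["name"])
    return {key: names for key, names in by_key.items() if len(names) > 1}


labels = [
    {"name": "Moderator", "keyboard_shortcut": "m"},
    {"name": "Participant 1", "keyboard_shortcut": "1"},
    {"name": "Unknown", "keyboard_shortcut": "u"},
    {"name": "Silence/Noise", "keyboard_shortcut": "s"},
]
print(duplicate_shortcuts(labels))
```

An empty result means every shortcut in the list is unique.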

## Output Format

```json
{
  "id": "meeting_001",
  "audio_path": "/audio/meeting_001.wav",
  "annotations": {
    "turns": [
      {
        "start": 0.0,
        "end": 5.2,
        "label": "Moderator",
        "attributes": {
          "speech_type": "Statement"
        }
      },
      {
        "start": 5.2,
        "end": 12.8,
        "label": "Participant 1",
        "attributes": {
          "speech_type": "Response"
        }
      },
      {
        "start": 11.5,
        "end": 12.8,
        "label": "Overlap"
      }
    ],
    "recording_quality": "Good - Most speech understandable"
  }
}
```
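
Once annotations are exported, a quick summary of how long each label was active helps spot obvious problems, such as a speaker with no time at all. This is a minimal sketch that assumes records shaped exactly like the example above. `speaking_time` is a hypothetical helper name.

```python
from collections import defaultdict


def speaking_time(record, scheme="turns"):
    """Sum segment durations per label for one annotated record."""
    totals = defaultdict(float)
    for seg in record["annotations"][scheme]:
        totals[seg["label"]] += seg["end"] - seg["start"]
    # Round to milliseconds to hide floating-point noise
    return {label: round(seconds, 3) for label, seconds in totals.items()}


record = {
    "id": "meeting_001",
    "annotations": {
        "turns": [
            {"start": 0.0, "end": 5.2, "label": "Moderator"},
            {"start": 5.2, "end": 12.8, "label": "Participant 1"},
            {"start": 11.5, "end": 12.8, "label": "Overlap"},
        ]
    },
}
print(speaking_time(record))
```

Note that the `Overlap` segment here sits inside the `Participant 1` turn, so the per-label totals can add up to more than the length of the recording.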

## Tips for Diarization

1. **Listen first**: Get familiar with speakers before annotating
2. **Note speaker characteristics**: Pitch, accent, speaking style
3. **Handle overlaps consistently**: Decide on a strategy upfront
4. **Use speed control**: Slow down for difficult sections
5. **Mark uncertainty**: It's okay to use "Unknown" when needed

## Next Steps

- Combine with [transcription](/blog/audio-transcription-task) for full meeting notes
- Add [emotion detection](/blog/audio-emotion-classification) per speaker
- Set up [quality control](/blog/quality-control-strategies) for multi-annotator agreement

---

*See [/docs/features/audio-annotation](/docs/features/audio-annotation) for complete audio documentation.*