# Setting Up Audio Transcription Review

Source: https://www.potatoannotator.com/blog/audio-transcription-task

Good ASR training data usually starts with a human checking the machine's first draft. This tutorial shows how to build an interface where annotators listen to the audio, see the waveform, and fix the machine-generated transcript. For the audio options it relies on, see the [audio annotation documentation](https://github.com/davidjurgens/potato/blob/master/docs/annotation-types/multimedia/audio_annotation.md).

## What We're Building

An interface with:
- Waveform visualization
- Playback controls (play, pause, speed adjustment)
- Editable transcript text
- Quality rating for audio
- Confidence marking for uncertain segments

## Basic Configuration

```yaml
annotation_task_name: "Transcription Review"

data_files:
  - "data/transcripts.json"

item_properties:
  id_key: id
  text_key: asr_transcript

annotation_schemes:
  # Audio playback
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_path

  # Corrected transcript
  - annotation_type: text
    name: corrected_transcript
    description: "Edit the transcript to match what you hear"
    multiline: true
    placeholder: "Type the corrected transcript..."
    required: true

  # Quality rating
  - annotation_type: radio
    name: audio_quality
    description: "Rate the audio quality"
    labels:
      - Clear
      - Slightly noisy
      - Very noisy
      - Unintelligible
```

## Sample Data Format

Create `data/transcripts.json` with one JSON object per line:

```json
{"id": "audio_001", "audio_path": "/audio/recording_001.wav", "asr_transcript": "Hello how are you doing today"}
{"id": "audio_002", "audio_path": "/audio/recording_002.wav", "asr_transcript": "The weather is nice outside"}
{"id": "audio_003", "audio_path": "/audio/recording_003.wav", "asr_transcript": "Please call me back when your free"}
```
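
If your ASR system hands you (audio file, transcript) pairs, a file in this shape can be generated rather than written by hand. The following is a minimal sketch, not part of Potato: `build_items` and `to_json_lines` are hypothetical helper names, and the `audio_NNN` id scheme simply mirrors the sample above.

```python
import json


def build_items(pairs):
    """Turn (audio_path, asr_transcript) pairs into records shaped like the sample data."""
    items = []
    for index, (audio_path, transcript) in enumerate(pairs, start=1):
        items.append({
            "id": f"audio_{index:03d}",        # assumed id scheme, copied from the sample
            "audio_path": audio_path,
            "asr_transcript": transcript,
        })
    return items


def to_json_lines(items):
    """Serialize records as one JSON object per line."""
    return "".join(json.dumps(item, ensure_ascii=False) + "\n" for item in items)
```

To produce the file, write the returned string out, for example with `Path("data/transcripts.json").write_text(to_json_lines(items), encoding="utf-8")` from `pathlib`.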

## Audio Annotation Setup

Audio annotation in Potato uses the `audio_annotation` type inside your annotation schemes. The player draws the waveform and adds playback controls on its own, so you do not have to wire those up:

```yaml
annotation_schemes:
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_path
    description: "Listen to the audio recording"
```

The audio player includes built-in controls for play/pause, seeking, and speed adjustment.

## Full Transcription Interface

```yaml
annotation_task_name: "ASR Correction and Annotation"

data_files:
  - "data/asr_output.json"

item_properties:
  id_key: id
  text_key: hypothesis

annotation_schemes:
  # Audio player
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_url

  # Main transcript correction
  - annotation_type: text
    name: transcript
    description: "Correct the transcript below"
    multiline: true
    rows: 4
    required: true

  # Speaker identification
  - annotation_type: radio
    name: num_speakers
    description: "How many speakers are in this recording?"
    labels:
      - "1 speaker"
      - "2 speakers"
      - "3+ speakers"
      - "Cannot determine"

  # Audio quality
  - annotation_type: radio
    name: quality
    description: "Overall audio quality"
    labels:
      - name: Excellent
        description: "Crystal clear, studio quality"
      - name: Good
        description: "Clear speech, minor background noise"
      - name: Fair
        description: "Understandable but noisy"
      - name: Poor
        description: "Very difficult to understand"
      - name: Unusable
        description: "Cannot transcribe accurately"

  # Issues checklist
  - annotation_type: multiselect
    name: issues
    description: "Select all issues present (if any)"
    labels:
      - Background noise
      - Overlapping speech
      - Accented speech
      - Fast speech
      - Mumbling/unclear
      - Technical audio issues
      - Non-English words
      - Profanity present
      - None

  # Confidence
  - annotation_type: likert
    name: confidence
    description: "How confident are you in your transcription?"
    size: 5
    min_label: "Guessing"
    max_label: "Certain"

annotation_guidelines:
  title: "Transcription Guidelines"
  content: |
    ## Your Task
    Listen to the audio and correct the ASR transcript.

    ## Transcription Rules
    - Transcribe exactly what is said
    - Include filler words (um, uh, like)
    - Use proper punctuation and capitalization
    - Mark unintelligible sections with [unintelligible]
    - Mark uncertain words with [word?]

    ## Special Notations
    - [unintelligible] - Cannot understand
    - [word?] - Uncertain about word
    - [crosstalk] - Overlapping speech
    - [noise] - Non-speech sound
    - [pause] - Significant silence
```
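
Because the guidelines fix a small set of bracketed notations, submitted transcripts can be checked afterwards for tags that are not on the list. The sketch below is a hypothetical post-processing helper, not a Potato feature; the function name, the regular expressions, and the case-sensitive matching are all assumptions made for illustration.

```python
import re

# Fixed notations taken from the guidelines above.
ALLOWED_TAGS = {"unintelligible", "crosstalk", "noise", "pause"}

# Matches any bracketed token such as [noise] or [free?].
BRACKET_PATTERN = re.compile(r"\[([^\[\]]*)\]")

# An uncertain-word mark: a single token with no spaces, ending in "?".
UNCERTAIN_WORD = re.compile(r"[^\s?]+\?")


def find_invalid_notations(transcript):
    """Return bracketed tokens that are neither a fixed tag nor an uncertain-word mark."""
    invalid = []
    for match in BRACKET_PATTERN.finditer(transcript):
        inner = match.group(1)
        if inner in ALLOWED_TAGS:
            continue
        if UNCERTAIN_WORD.fullmatch(inner):
            continue
        invalid.append(match.group(0))
    return invalid
```

A transcript such as `"Hello [inaudible] there"` would be flagged, since `[inaudible]` is not one of the agreed notations.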

## Word-Level Annotation

For detailed word-level corrections, you can use span annotation alongside text fields:

```yaml
annotation_schemes:
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_path

  - annotation_type: text
    name: transcript
    multiline: true

  - annotation_type: span
    name: word_corrections
    description: "Mark words that needed correction"
    source_field: transcript
    labels:
      - name: corrected
        color: "#FCD34D"
        description: "Word was changed"
      - name: inserted
        color: "#4ADE80"
        description: "Word was added"
      - name: uncertain
        color: "#F87171"
        description: "Still not sure"
```
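
The `corrected` and `inserted` labels can also be approximated offline by diffing the ASR draft against the corrected transcript, which is handy for spot-checking the spans annotators marked. This is only a sketch using Python's standard `difflib`; `suggest_word_labels` is a hypothetical helper, and a word-level diff will not always agree with how a human would describe the change.

```python
from difflib import SequenceMatcher


def suggest_word_labels(asr_text, corrected_text):
    """Return (label, word) pairs for words in the corrected text that differ from the draft."""
    asr_words = asr_text.split()
    corrected_words = corrected_text.split()
    matcher = SequenceMatcher(a=asr_words, b=corrected_words, autojunk=False)

    suggestions = []
    for tag, _a_start, _a_end, b_start, b_end in matcher.get_opcodes():
        if tag == "replace":
            label = "corrected"   # draft words were swapped for different ones
        elif tag == "insert":
            label = "inserted"    # words present only in the corrected text
        else:
            continue              # "equal" and "delete" leave nothing to mark
        for word in corrected_words[b_start:b_end]:
            suggestions.append((label, word))
    return suggestions
```

Words the annotator deleted have no position in the corrected transcript, so they produce no suggestion here.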

## Segment-Based Transcription

For long audio files, you can split the recording into segments and prepare one data item per segment, each carrying its own timing information:

```yaml
data_files:
  - "data/segments.json"

item_properties:
  id_key: id
  text_key: asr_text

annotation_schemes:
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_path

  - annotation_type: text
    name: transcript
    multiline: true
    description: "Correct the transcript for this segment"
```

Data format with segment timing:

```json
{
  "id": "seg_001",
  "audio_path": "/audio/long_recording.wav",
  "start_time": 0.0,
  "end_time": 5.5,
  "asr_text": "Welcome to today's presentation"
}
```
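
If you already have segment boundaries, for example from your ASR system's output, the per-segment records can be generated in a few lines. This is a minimal sketch: `make_segment_records` is a hypothetical helper, the `seg_NNN` ids copy the sample above, and the code only prepares the data without assuming anything about how the player uses `start_time` and `end_time`.

```python
def make_segment_records(audio_path, segments):
    """Build one record per (start_time, end_time, asr_text) tuple for a single recording."""
    records = []
    for index, (start, end, text) in enumerate(segments, start=1):
        if end <= start:
            raise ValueError(f"segment {index} has end {end} not after start {start}")
        records.append({
            "id": f"seg_{index:03d}",      # assumed id scheme, copied from the sample
            "audio_path": audio_path,      # every segment points at the same file
            "start_time": float(start),
            "end_time": float(end),
            "asr_text": text,
        })
    return records
```

Calling `make_segment_records("/audio/long_recording.wav", [(0.0, 5.5, "Welcome to today's presentation")])` yields the record shown above.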

## Output Format

```json
{
  "id": "audio_001",
  "audio_path": "/audio/recording_001.wav",
  "original_transcript": "Hello how are you doing today",
  "annotations": {
    "transcript": "Hello, how are you doing today?",
    "num_speakers": "1 speaker",
    "quality": "Good",
    "issues": ["None"],
    "confidence": 5
  },
  "annotator": "transcriber_01",
  "time_spent_seconds": 45
}
```
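
Since each record keeps both the ASR draft and the corrected text, you can estimate how much the annotator actually changed. The sketch below computes a word error rate for the draft, treating the annotator's transcript as the reference. It assumes records shaped like the example above; the function names and the normalization (lowercasing, dropping ASCII punctuation) are illustrative choices, not something Potato provides.

```python
import string

_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase and strip ASCII punctuation so formatting-only edits do not count."""
    return text.lower().translate(_PUNCTUATION_TABLE).split()


def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two word lists."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing from the draft
                current[j - 1] + 1,      # extra word in the draft
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def word_error_rate(record):
    """WER of the ASR draft against the annotator's corrected transcript."""
    reference = normalize(record["annotations"]["transcript"])
    hypothesis = normalize(record["original_transcript"])
    if not reference:
        # Convention chosen for this sketch: empty reference means 0 or 1.
        return 0.0 if not hypothesis else 1.0
    return word_edit_distance(reference, hypothesis) / len(reference)
```

For the record shown above the result is `0.0`, because the annotator only added punctuation, which the normalization ignores.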

## Quality Control

Potato tracks annotation time automatically. For quality control, mix a few attention-check items into your data file: clips with a known correct answer that let you spot annotators who are not actually listening.
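
One way to score those attention checks after export is sketched below. It assumes records shaped like the Output Format example and a hypothetical `gold` dictionary that maps each attention-check item id to its known transcript; neither the helper names nor the punctuation-insensitive comparison come from Potato.

```python
from collections import defaultdict


def _canonical(text):
    """Lowercase, drop punctuation, and collapse whitespace before comparing."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def attention_check_pass_rates(records, gold):
    """Return {annotator: fraction of attention-check items transcribed correctly}."""
    passed = defaultdict(int)
    seen = defaultdict(int)
    for record in records:
        expected = gold.get(record["id"])
        if expected is None:
            continue  # not an attention-check item
        annotator = record["annotator"]
        seen[annotator] += 1
        if _canonical(record["annotations"]["transcript"]) == _canonical(expected):
            passed[annotator] += 1
    return {annotator: passed[annotator] / seen[annotator] for annotator in seen}
```

An exact match is a strict criterion; for longer clips you may prefer a tolerance based on word-level differences instead.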

You can configure where and how annotations are written:

```yaml
output_annotation_dir: "annotation_output"
export_annotation_format: "json"
```

## Tips for Transcription Tasks

Decent headphones and a quiet room do most of the work for accuracy. Slow the audio down for the parts you cannot quite make out, and plan on more than one pass: listen, transcribe, then go back and verify. Transcription is mentally draining, so build in regular breaks.

## Next Steps

- Add [speaker diarization](/blog/speaker-diarization-annotation) for multi-speaker audio
- Set up [emotion classification](/blog/audio-emotion-classification) alongside transcription
- Configure [crowdsourcing](/blog/prolific-integration) for large-scale transcription

---

*Full audio documentation at [/docs/features/audio-annotation](/docs/features/audio-annotation).*