# Setting Up Audio Transcription Review
Configure waveform visualization, playback controls, and text correction interfaces for audio transcription tasks.
Transcription review is essential for quality ASR training data. This tutorial shows you how to build an interface where annotators can listen to audio, view waveforms, and correct machine-generated transcriptions.
## What We're Building
An interface with:
- Waveform visualization
- Playback controls (play, pause, speed adjustment)
- Editable transcript text
- Quality rating for audio
- Confidence marking for uncertain segments
## Basic Configuration

```yaml
annotation_task_name: "Transcription Review"

data_files:
  - "data/transcripts.json"

item_properties:
  id_key: id
  text_key: asr_transcript

annotation_schemes:
  # Audio playback
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_path

  # Corrected transcript
  - annotation_type: text
    name: corrected_transcript
    description: "Edit the transcript to match what you hear"
    textarea: true
    placeholder: "Type the corrected transcript..."
    required: true

  # Quality rating
  - annotation_type: radio
    name: audio_quality
    description: "Rate the audio quality"
    labels:
      - Clear
      - Slightly noisy
      - Very noisy
      - Unintelligible
```

## Sample Data Format
Create `data/transcripts.json` with one JSON object per line:
```json
{"id": "audio_001", "audio_path": "/audio/recording_001.wav", "asr_transcript": "Hello how are you doing today"}
{"id": "audio_002", "audio_path": "/audio/recording_002.wav", "asr_transcript": "The weather is nice outside"}
{"id": "audio_003", "audio_path": "/audio/recording_003.wav", "asr_transcript": "Please call me back when your free"}
```

## Audio Annotation Setup
Audio annotation in Potato uses the `audio_annotation` type within `annotation_schemes`. The audio player provides waveform visualization and playback controls automatically.
```yaml
annotation_schemes:
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_path
    description: "Listen to the audio recording"
```

The audio player includes built-in controls for play/pause, seeking, and speed adjustment.
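Since the player can only load files that actually exist, it can help to sanity-check the data file before launching a task. A minimal sketch, not part of Potato itself; the `validate_items` helper, the key names, and the audio-root layout are illustrative assumptions:

```python
import json
from pathlib import Path

def validate_items(jsonl_path, audio_root=".",
                   required_keys=("id", "audio_path", "asr_transcript")):
    """Report missing keys, duplicate ids, and unresolvable audio paths
    in a one-JSON-object-per-line data file."""
    problems = []
    seen_ids = set()
    for lineno, line in enumerate(Path(jsonl_path).read_text().splitlines(), start=1):
        item = json.loads(line)
        # Every item needs the keys the config refers to.
        for key in required_keys:
            if key not in item:
                problems.append(f"line {lineno}: missing key '{key}'")
        # Duplicate ids would collide in the annotation output.
        item_id = item.get("id")
        if item_id in seen_ids:
            problems.append(f"line {lineno}: duplicate id '{item_id}'")
        seen_ids.add(item_id)
        # Resolve the (site-absolute) audio path against a local root.
        audio = item.get("audio_path")
        if audio and not (Path(audio_root) / audio.lstrip("/")).exists():
            problems.append(f"line {lineno}: audio file not found: {audio}")
    return problems
```

Running it before deployment surfaces broken items early, rather than leaving annotators staring at a player that will not load.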
## Comprehensive Transcription Interface
```yaml
annotation_task_name: "ASR Correction and Annotation"

data_files:
  - "data/asr_output.json"

item_properties:
  id_key: id
  text_key: hypothesis

annotation_schemes:
  # Audio player
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_url

  # Main transcript correction
  - annotation_type: text
    name: transcript
    description: "Correct the transcript below"
    textarea: true
    rows: 4
    required: true

  # Speaker identification
  - annotation_type: radio
    name: num_speakers
    description: "How many speakers are in this recording?"
    labels:
      - "1 speaker"
      - "2 speakers"
      - "3+ speakers"
      - "Cannot determine"

  # Audio quality
  - annotation_type: radio
    name: quality
    description: "Overall audio quality"
    labels:
      - name: Excellent
        description: "Crystal clear, studio quality"
      - name: Good
        description: "Clear speech, minor background noise"
      - name: Fair
        description: "Understandable but noisy"
      - name: Poor
        description: "Very difficult to understand"
      - name: Unusable
        description: "Cannot transcribe accurately"

  # Issues checklist
  - annotation_type: multiselect
    name: issues
    description: "Select all issues present (if any)"
    labels:
      - Background noise
      - Overlapping speech
      - Accented speech
      - Fast speech
      - Mumbling/unclear
      - Technical audio issues
      - Non-English words
      - Profanity present
      - None

  # Confidence
  - annotation_type: likert
    name: confidence
    description: "How confident are you in your transcription?"
    size: 5
    min_label: "Guessing"
    max_label: "Certain"

annotation_guidelines:
  title: "Transcription Guidelines"
  content: |
    ## Your Task
    Listen to the audio and correct the ASR transcript.

    ## Transcription Rules
    - Transcribe exactly what is said
    - Include filler words (um, uh, like)
    - Use proper punctuation and capitalization
    - Mark unintelligible sections with [unintelligible]
    - Mark uncertain words with [word?]

    ## Special Notations
    - [unintelligible] - Cannot understand
    - [word?] - Uncertain about word
    - [crosstalk] - Overlapping speech
    - [noise] - Non-speech sound
    - [pause] - Significant silence
```

## Word-Level Annotation
For detailed word-level corrections, you can use span annotation alongside text fields:
```yaml
annotation_schemes:
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_path

  - annotation_type: text
    name: transcript
    textarea: true

  - annotation_type: span
    name: word_corrections
    description: "Mark words that needed correction"
    source_field: transcript
    labels:
      - name: corrected
        color: "#FCD34D"
        description: "Word was changed"
      - name: inserted
        color: "#4ADE80"
        description: "Word was added"
      - name: uncertain
        color: "#F87171"
        description: "Still not sure"
```

## Segment-Based Transcription
For long audio files, you can prepare your data as segments with timing information:
```yaml
data_files:
  - "data/segments.json"

item_properties:
  id_key: id
  text_key: asr_text

annotation_schemes:
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_path

  - annotation_type: text
    name: transcript
    textarea: true
    description: "Correct the transcript for this segment"
```

Data format with segment timing:
```json
{
  "id": "seg_001",
  "audio_path": "/audio/long_recording.wav",
  "start_time": 0.0,
  "end_time": 5.5,
  "asr_text": "Welcome to today's presentation"
}
```

## Output Format
```json
{
  "id": "audio_001",
  "audio_path": "/audio/recording_001.wav",
  "original_transcript": "Hello how are you doing today",
  "annotations": {
    "transcript": "Hello, how are you doing today?",
    "num_speakers": "1 speaker",
    "quality": "Good",
    "issues": ["None"],
    "confidence": 5
  },
  "annotator": "transcriber_01",
  "time_spent_seconds": 45
}
```

## Quality Control
Potato tracks annotation time automatically. For quality control, consider including attention check items in your data file: items with known correct answers that let you verify annotator accuracy.
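Those attention checks can be scored offline once annotations come in. A minimal sketch, assuming annotations are saved as one JSON record per line in the shape shown under Output Format, and that you maintain your own `gold` dict of expected answers; neither is a built-in Potato feature:

```python
import json
from collections import defaultdict

def attention_check_accuracy(output_path, gold):
    """Score each annotator on attention-check items.

    `gold` maps item id -> expected answers per field, e.g.
    {"audio_check_01": {"quality": "Excellent"}}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(output_path) as f:
        for line in f:
            record = json.loads(line)
            expected = gold.get(record["id"])
            if expected is None:
                continue  # regular item, not an attention check
            annotator = record["annotator"]
            for field, value in expected.items():
                total[annotator] += 1
                if record["annotations"].get(field) == value:
                    correct[annotator] += 1
    # Fraction of attention-check fields each annotator got right.
    return {a: correct[a] / total[a] for a in total}
```

Annotators who fall below a threshold you choose (say, 0.8) are candidates for retraining or exclusion.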
You can configure output settings to track annotations:
```yaml
output_annotation_dir: "annotation_output"
output_annotation_format: "json"
```

## Tips for Transcription Tasks
- **Good headphones**: Essential for accuracy
- **Quiet environment**: Reduces fatigue
- **Speed adjustment**: Slow down for difficult sections
- **Multiple passes**: Listen once, transcribe, then verify
- **Regular breaks**: Transcription is mentally demanding
## Next Steps
- Add speaker diarization for multi-speaker audio
- Set up emotion classification alongside transcription
- Configure crowdsourcing for large-scale transcription
Full audio documentation is available at `/docs/features/audio-annotation`.