Transcript review is essential for producing high-quality ASR training data. This tutorial shows how to build an interface where annotators can listen to audio, view waveforms, and correct machine-generated transcripts.

What we are building

An interface with:

Waveform visualization
Playback controls (play, pause, speed adjustment)
Editable transcript text
Audio quality rating
Confidence flagging for uncertain segments

Basic configuration

yaml

annotation_task_name: "Transcription Review"
 
data_files:
  - "data/transcripts.json"
 
item_properties:
  id_key: id
  text_key: asr_transcript
 
annotation_schemes:
  # Audio playback
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_path
 
  # Corrected transcript
  - annotation_type: text
    name: corrected_transcript
    description: "Edit the transcript to match what you hear"
    multiline: true
    placeholder: "Type the corrected transcript..."
    required: true
 
  # Quality rating
  - annotation_type: radio
    name: audio_quality
    description: "Rate the audio quality"
    labels:
      - Clear
      - Slightly noisy
      - Very noisy
      - Unintelligible

Sample data format

Create data/transcripts.json with one JSON object per line:

json

{"id": "audio_001", "audio_path": "/audio/recording_001.wav", "asr_transcript": "Hello how are you doing today"}
{"id": "audio_002", "audio_path": "/audio/recording_002.wav", "asr_transcript": "The weather is nice outside"}
{"id": "audio_003", "audio_path": "/audio/recording_003.wav", "asr_transcript": "Please call me back when your free"}
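A file in this shape can be generated from an existing list of ASR results. The following is a minimal sketch, assuming one JSON object per line as in the sample above; `write_items` and the `records` list are hypothetical names used for illustration, not part of Potato.

```python
import json
from pathlib import Path

# Keys referenced by the basic configuration: id_key, audio_key, text_key.
REQUIRED_KEYS = {"id", "audio_path", "asr_transcript"}


def write_items(records, path):
    """Write one JSON object per line, matching the sample format above."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            missing = REQUIRED_KEYS - set(rec)
            if missing:
                raise ValueError(f"record {rec.get('id')!r} is missing {sorted(missing)}")
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


records = [
    {"id": "audio_001", "audio_path": "/audio/recording_001.wav",
     "asr_transcript": "Hello how are you doing today"},
    {"id": "audio_002", "audio_path": "/audio/recording_002.wav",
     "asr_transcript": "The weather is nice outside"},
]
write_items(records, "data/transcripts.json")
```

Checking for the required keys at write time surfaces a malformed record before annotators ever see it.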

Audio annotation configuration

Audio annotation in Potato uses the audio_annotation type in the annotation schemes. The audio player automatically provides waveform visualization and playback controls.

yaml

annotation_schemes:
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_path
    description: "Listen to the audio recording"

The audio player includes built-in controls for play/pause, seeking, and speed adjustment.

Complete transcription interface

yaml

annotation_task_name: "ASR Correction and Annotation"
 
data_files:
  - "data/asr_output.json"
 
item_properties:
  id_key: id
  text_key: hypothesis
 
annotation_schemes:
  # Audio player
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_url
 
  # Main transcript correction
  - annotation_type: text
    name: transcript
    description: "Correct the transcript below"
    multiline: true
    rows: 4
    required: true
 
  # Speaker identification
  - annotation_type: radio
    name: num_speakers
    description: "How many speakers are in this recording?"
    labels:
      - "1 speaker"
      - "2 speakers"
      - "3+ speakers"
      - "Cannot determine"
 
  # Audio quality
  - annotation_type: radio
    name: quality
    description: "Overall audio quality"
    labels:
      - name: Excellent
        description: "Crystal clear, studio quality"
      - name: Good
        description: "Clear speech, minor background noise"
      - name: Fair
        description: "Understandable but noisy"
      - name: Poor
        description: "Very difficult to understand"
      - name: Unusable
        description: "Cannot transcribe accurately"
 
  # Issues checklist
  - annotation_type: multiselect
    name: issues
    description: "Select all issues present (if any)"
    labels:
      - Background noise
      - Overlapping speech
      - Accented speech
      - Fast speech
      - Mumbling/unclear
      - Technical audio issues
      - Non-English words
      - Profanity present
      - None
 
  # Confidence
  - annotation_type: likert
    name: confidence
    description: "How confident are you in your transcription?"
    size: 5
    min_label: "Guessing"
    max_label: "Certain"
 
annotation_guidelines:
  title: "Transcription Guidelines"
  content: |
    ## Your Task
    Listen to the audio and correct the ASR transcript.
 
    ## Transcription Rules
    - Transcribe exactly what is said
    - Include filler words (um, uh, like)
    - Use proper punctuation and capitalization
    - Mark unintelligible sections with [unintelligible]
    - Mark uncertain words with [word?]
 
    ## Special Notations
    - [unintelligible] - Cannot understand
    - [word?] - Uncertain about word
    - [crosstalk] - Overlapping speech
    - [noise] - Non-speech sound
    - [pause] - Significant silence

Word-level annotation

For detailed word-level corrections, you can use span annotation alongside the text fields:

yaml

annotation_schemes:
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_path
 
  - annotation_type: text
    name: transcript
    multiline: true
 
  - annotation_type: span
    name: word_corrections
    description: "Mark words that needed correction"
    source_field: transcript
    labels:
      - name: corrected
        color: "#FCD34D"
        description: "Word was changed"
      - name: inserted
        color: "#4ADE80"
        description: "Word was added"
      - name: uncertain
        color: "#F87171"
        description: "Still not sure"
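The corrected and inserted labels can also be derived automatically by diffing the ASR text against the corrected transcript, for example to double-check the spans annotators mark by hand. This is a sketch using Python's standard difflib, not a Potato feature; `diff_labels` is a hypothetical helper, and the label names simply mirror the ones in the config above.

```python
from difflib import SequenceMatcher

# Map difflib opcode tags to the label names used in the span scheme.
TAG_TO_LABEL = {"equal": "unchanged", "replace": "corrected", "insert": "inserted"}


def diff_labels(original, corrected):
    """Label each word of the corrected transcript by how it differs from the original."""
    old, new = original.split(), corrected.split()
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    labels = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            continue  # deleted words have no position in the corrected text
        labels.extend((word, TAG_TO_LABEL[tag]) for word in new[j1:j2])
    return labels


for word, label in diff_labels("call me when your free",
                               "please call me when you're free"):
    print(f"{word}\t{label}")
```

When a replaced run differs in length between the two texts, every word in the new run is labeled corrected, so the result is an approximation rather than an exact alignment.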

Segment-based transcription

For long audio files, you can prepare your data as segments with timing information:

yaml

data_files:
  - "data/segments.json"
 
item_properties:
  id_key: id
  text_key: asr_text
 
annotation_schemes:
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_path
 
  - annotation_type: text
    name: transcript
    multiline: true
    description: "Correct the transcript for this segment"

Data format with segment timing:

json

{
  "id": "seg_001",
  "audio_path": "/audio/long_recording.wav",
  "start_time": 0.0,
  "end_time": 5.5,
  "asr_text": "Welcome to today's presentation"
}
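This tutorial does not state whether the player restricts playback to start_time and end_time. If it does not in your setup, one option is to cut each segment into its own file ahead of time and point audio_path at the result. The sketch below assumes uncompressed PCM WAV input and uses only Python's standard wave module; `cut_segment` is a hypothetical helper.

```python
import wave


def cut_segment(src_path, dst_path, start_time, end_time):
    """Copy the [start_time, end_time) span of a PCM WAV file into a new file."""
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        start_frame = int(start_time * rate)
        # Clamp the end so a slightly overlong end_time does not read past the file.
        end_frame = min(int(end_time * rate), src.getnframes())
        if start_frame >= end_frame:
            raise ValueError("segment is empty or starts past the end of the file")
        src.setpos(start_frame)
        frames = src.readframes(end_frame - start_frame)
        params = src.getparams()
    with wave.open(dst_path, "wb") as dst:
        # Reuse channels, sample width, and rate; the frame count in the
        # header is corrected automatically when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)
```

For the sample record above, the call would be `cut_segment("long_recording.wav", "seg_001.wav", 0.0, 5.5)`, with file paths adjusted to wherever the audio lives on disk.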

Output format

json

{
  "id": "audio_001",
  "audio_path": "/audio/recording_001.wav",
  "original_transcript": "Hello how are you doing today",
  "annotations": {
    "transcript": "Hello, how are you doing today?",
    "num_speakers": "1 speaker",
    "quality": "Good",
    "issues": ["None"],
    "confidence": 5
  },
  "annotator": "transcriber_01",
  "time_spent_seconds": 45
}
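Because each output record keeps both the original ASR text and the corrected transcript, you can estimate how much the ASR system got wrong. The sketch below computes a word error rate (word-level edit distance divided by the number of reference words), treating the human correction as the reference. The field names follow the record above; the example transcripts and the tokenization rule are assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase and drop punctuation so only word changes are counted."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference transcript has no words")
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


record = {
    "original_transcript": "Please call me back when your free",
    "annotations": {"transcript": "Please call me back when you're free."},
}
wer = word_error_rate(record["annotations"]["transcript"],
                      record["original_transcript"])
print(f"{wer:.2f}")  # prints 0.14 (one substitution in seven words)
```

Punctuation and casing are ignored here on purpose; if your ASR system is expected to produce them, compare the raw strings instead.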

Quality control

Potato automatically tracks annotation time. For quality control, consider including attention-check items in your data file: items with known correct answers that you can use to verify annotator accuracy.
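Scoring attention checks can be done offline once the annotations are exported. The sketch below assumes records shaped like the output format shown earlier and a hypothetical `gold` mapping from item id to the expected value of the quality field; neither the mapping nor the function is provided by Potato.

```python
def attention_check_accuracy(outputs, gold):
    """Return each annotator's share of attention-check items answered correctly."""
    totals, correct = {}, {}
    for rec in outputs:
        expected = gold.get(rec["id"])
        if expected is None:
            continue  # regular item, not an attention check
        who = rec["annotator"]
        totals[who] = totals.get(who, 0) + 1
        if rec["annotations"].get("quality") == expected:
            correct[who] = correct.get(who, 0) + 1
    return {who: correct.get(who, 0) / n for who, n in totals.items()}


gold = {"check_001": "Unusable"}  # hypothetical attention-check item
outputs = [
    {"id": "check_001", "annotator": "transcriber_01",
     "annotations": {"quality": "Unusable"}},
    {"id": "check_001", "annotator": "transcriber_02",
     "annotations": {"quality": "Good"}},
    {"id": "audio_001", "annotator": "transcriber_01",
     "annotations": {"quality": "Good"}},
]
print(attention_check_accuracy(outputs, gold))
```

Annotators who fall below a threshold you choose can then be reviewed by hand before their work is accepted.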

You can configure where and in what format annotations are saved:

yaml

output_annotation_dir: "annotation_output"
export_annotation_format: "json"

Tips for transcription tasks

Good headphones: essential for accuracy
Quiet environment: reduces fatigue
Speed adjustment: slow down for difficult sections
Multiple passes: listen once, transcribe, then verify
Regular breaks: transcription is mentally demanding

Next steps

Add speaker diarization for multi-speaker audio
Set up emotion classification alongside transcription
Configure crowdsourcing for large-scale transcription

Full audio documentation is available at /docs/features/audio-annotation.