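Good ASR training data usually starts with a person checking a machine's draft. This tutorial shows how to build an interface in which annotators listen to the audio, view its waveform, and correct a machine-generated transcript. For the audio options this interface relies on, see the audio annotation documentation.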

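What you will build

An interface with the following:

Waveform visualization
Playback controls (play, pause, speed adjustment)
Editable transcript text
Audio quality rating
Confidence indicators for uncertain segments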

Basic configuration

yaml

annotation_task_name: "Transcription Review"
 
data_files:
  - "data/transcripts.json"
 
item_properties:
  id_key: id
  text_key: asr_transcript
 
annotation_schemes:
  # Audio playback
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_path
 
  # Corrected transcript
  - annotation_type: text
    name: corrected_transcript
    description: "Edit the transcript to match what you hear"
    multiline: true
    placeholder: "Type the corrected transcript..."
    required: true
 
  # Quality rating
  - annotation_type: radio
    name: audio_quality
    description: "Rate the audio quality"
    labels:
      - Clear
      - Slightly noisy
      - Very noisy
      - Unintelligible

Sample data format

Create data/transcripts.json with one JSON object per line:

json

{"id": "audio_001", "audio_path": "/audio/recording_001.wav", "asr_transcript": "Hello how are you doing today"}
{"id": "audio_002", "audio_path": "/audio/recording_002.wav", "asr_transcript": "The weather is nice outside"}
{"id": "audio_003", "audio_path": "/audio/recording_003.wav", "asr_transcript": "Please call me back when your free"}
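
A file in this shape can be generated from ASR output with a few lines of Python. The sketch below is illustrative only: the `write_items` helper and the tuple layout of `rows` are assumptions made for this example and are not part of Potato. The field names match the sample above.

```python
import json
from pathlib import Path


def write_items(rows, out_path):
    """Write (item_id, audio_path, transcript) tuples as one JSON object per line.

    Hypothetical helper for illustration; returns the number of items written.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    seen = set()
    with out_path.open("w", encoding="utf-8") as f:
        for item_id, audio_path, transcript in rows:
            # Duplicate ids would make annotations ambiguous, so fail early.
            if item_id in seen:
                raise ValueError(f"duplicate id: {item_id}")
            seen.add(item_id)
            record = {
                "id": item_id,
                "audio_path": audio_path,
                "asr_transcript": transcript,
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(seen)
```

Calling `write_items(rows, "data/transcripts.json")` with the three sample rows would reproduce the file shown above.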

Audio annotation setup

Potato's audio annotation uses the audio_annotation type inside an annotation scheme. The player draws the waveform and adds the playback controls on its own, so you do not need to wire them up yourself:

yaml

annotation_schemes:
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_path
    description: "Listen to the audio recording"

The audio player includes built-in controls for play/pause, seeking, and speed adjustment.

Complete transcription interface

yaml

annotation_task_name: "ASR Correction and Annotation"
 
data_files:
  - "data/asr_output.json"
 
item_properties:
  id_key: id
  text_key: hypothesis
 
annotation_schemes:
  # Audio player
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_url
 
  # Main transcript correction
  - annotation_type: text
    name: transcript
    description: "Correct the transcript below"
    multiline: true
    rows: 4
    required: true
 
  # Speaker identification
  - annotation_type: radio
    name: num_speakers
    description: "How many speakers are in this recording?"
    labels:
      - "1 speaker"
      - "2 speakers"
      - "3+ speakers"
      - "Cannot determine"
 
  # Audio quality
  - annotation_type: radio
    name: quality
    description: "Overall audio quality"
    labels:
      - name: Excellent
        description: "Crystal clear, studio quality"
      - name: Good
        description: "Clear speech, minor background noise"
      - name: Fair
        description: "Understandable but noisy"
      - name: Poor
        description: "Very difficult to understand"
      - name: Unusable
        description: "Cannot transcribe accurately"
 
  # Issues checklist
  - annotation_type: multiselect
    name: issues
    description: "Select all issues present (if any)"
    labels:
      - Background noise
      - Overlapping speech
      - Accented speech
      - Fast speech
      - Mumbling/unclear
      - Technical audio issues
      - Non-English words
      - Profanity present
      - None
 
  # Confidence
  - annotation_type: likert
    name: confidence
    description: "How confident are you in your transcription?"
    size: 5
    min_label: "Guessing"
    max_label: "Certain"
 
annotation_guidelines:
  title: "Transcription Guidelines"
  content: |
    ## Your Task
    Listen to the audio and correct the ASR transcript.
 
    ## Transcription Rules
    - Transcribe exactly what is said
    - Include filler words (um, uh, like)
    - Use proper punctuation and capitalization
    - Mark unintelligible sections with [unintelligible]
    - Mark uncertain words with [word?]
 
    ## Special Notations
    - [unintelligible] - Cannot understand
    - [word?] - Uncertain about word
    - [crosstalk] - Overlapping speech
    - [noise] - Non-speech sound
    - [pause] - Significant silence
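
The special notations in the guidelines above are easy to mistype, so it can help to check submitted transcripts after the fact. The following is a minimal sketch, not a Potato feature: `find_bad_tags` is a hypothetical helper, and it assumes tags are case-sensitive and that an uncertain word is a single token ending in `?`.

```python
import re

# Fixed tags listed under "Special Notations" in the guidelines.
FIXED_TAGS = {"unintelligible", "crosstalk", "noise", "pause"}

# Any bracketed run without nested brackets.
BRACKETED = re.compile(r"\[([^\[\]]*)\]")

# A single token followed by "?", e.g. "store?" (assumed form of [word?]).
UNCERTAIN_WORD = re.compile(r"[^\s\[\]]+\?")


def find_bad_tags(transcript):
    """Return bracketed tags that are neither a fixed tag nor an uncertain word."""
    bad = []
    for match in BRACKETED.finditer(transcript):
        inner = match.group(1)
        if inner in FIXED_TAGS or UNCERTAIN_WORD.fullmatch(inner):
            continue
        bad.append(match.group(0))
    return bad
```

A transcript such as `I [unintelligible] went to the [store?] [laughs]` would report only `[laughs]`, which an annotator or reviewer can then map to one of the agreed tags.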

Word-level annotation

For fine-grained, word-level corrections, you can use span annotation alongside the text field:

yaml

annotation_schemes:
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_path
 
  - annotation_type: text
    name: transcript
    multiline: true
 
  - annotation_type: span
    name: word_corrections
    description: "Mark words that needed correction"
    source_field: transcript
    labels:
      - name: corrected
        color: "#FCD34D"
        description: "Word was changed"
      - name: inserted
        color: "#4ADE80"
        description: "Word was added"
      - name: uncertain
        color: "#F87171"
        description: "Still not sure"
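
To audit the spans annotators mark, or to spot-check that nothing was missed, the ASR hypothesis can be diffed against the corrected transcript word by word. This is a rough sketch using Python's standard `difflib`; `suggest_word_labels` is a hypothetical helper that runs outside Potato, and the label names simply mirror the `corrected` and `inserted` labels in the configuration above.

```python
from difflib import SequenceMatcher


def suggest_word_labels(hypothesis, corrected):
    """Compare two transcripts word by word and suggest a label per changed word.

    Returns (index_in_corrected, word, label) tuples, where label is
    "corrected" for replaced words and "inserted" for added words.
    """
    hyp_words = hypothesis.split()
    cor_words = corrected.split()
    matcher = SequenceMatcher(a=hyp_words, b=cor_words, autojunk=False)
    suggestions = []
    for op, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            label = "corrected"
        elif op == "insert":
            label = "inserted"
        else:
            # "equal" and "delete" leave nothing to mark in the corrected text.
            continue
        for j in range(j1, j2):
            suggestions.append((j, cor_words[j], label))
    return suggestions
```

Comparison is on whitespace-separated tokens, so a change in punctuation attached to a word counts as a correction. Words the annotator deleted do not appear in the result because there is nothing left in the corrected text to highlight.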

Segment-based transcription

For long audio files, you can prepare the data as segments with timing information:

yaml

data_files:
  - "data/segments.json"
 
item_properties:
  id_key: id
  text_key: asr_text
 
annotation_schemes:
  - annotation_type: audio_annotation
    name: audio_player
    audio_key: audio_path
 
  - annotation_type: text
    name: transcript
    multiline: true
    description: "Correct the transcript for this segment"

Data format with segment timing:

json

{
  "id": "seg_001",
  "audio_path": "/audio/long_recording.wav",
  "start_time": 0.0,
  "end_time": 5.5,
  "asr_text": "Welcome to today's presentation"
}
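
If the ASR system already emits timestamps, records in this shape can be built mechanically. The sketch below assumes the segments arrive as `(start_time, end_time, asr_text)` tuples in chronological order; `build_segment_items` and the `seg_NNN` id scheme are illustrative choices, not something Potato requires.

```python
def build_segment_items(audio_path, segments):
    """Turn (start_time, end_time, asr_text) tuples into segment records.

    Raises ValueError if a segment has non-positive length or starts
    before the previous segment ends.
    """
    items = []
    previous_end = 0.0
    for index, (start, end, text) in enumerate(segments, start=1):
        if end <= start:
            raise ValueError(f"segment {index} has non-positive duration")
        if start < previous_end:
            raise ValueError(f"segment {index} overlaps the previous segment")
        items.append({
            "id": f"seg_{index:03d}",
            "audio_path": audio_path,
            "start_time": float(start),
            "end_time": float(end),
            "asr_text": text,
        })
        previous_end = end
    return items
```

Validating the timing up front keeps malformed segments out of the annotation queue, where they would otherwise waste an annotator's time.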

Output format

json

{
  "id": "audio_001",
  "audio_path": "/audio/recording_001.wav",
  "original_transcript": "Hello how are you doing today",
  "annotations": {
    "transcript": "Hello, how are you doing today?",
    "num_speakers": "1 speaker",
    "quality": "Good",
    "issues": ["None"],
    "confidence": 5
  },
  "annotator": "transcriber_01",
  "time_spent_seconds": 45
}
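
Because each record carries both the original ASR text and the corrected transcript, the ASR word error rate can be estimated directly from the output. The following is a minimal sketch that assumes records have the `original_transcript` and `annotations.transcript` fields shown above; the actual export layout may differ. The normalization (lowercasing, dropping punctuation except apostrophes) is one reasonable choice among several.

```python
import re


def normalize(text):
    """Lowercase and drop punctuation (keeping apostrophes), then split into words."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else float("inf")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and hyp[:j]; it starts as the cost of inserting j words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def asr_wer_for_record(record):
    """Score the original ASR text against the annotator's corrected transcript."""
    return word_error_rate(
        record["annotations"]["transcript"],
        record["original_transcript"],
    )
```

With this normalization, the record above scores zero, since the annotator only added punctuation. Averaging the score over all records gives a rough picture of how much correction the ASR output needed.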

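Quality control

Potato tracks annotation time automatically. For quality control, mix a few attention-check items into the data file. Clips with a known correct answer let you identify annotators who are not actually listening.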
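
One way to do this is to shuffle the check items into the data before writing the file, then compare the submitted transcripts against the known answers afterwards. The sketch below runs outside Potato; `mix_in_checks` and `failed_check_ids` are hypothetical helpers, and the record layout assumes an `annotations.transcript` field like the one in the output example above.

```python
import random


def mix_in_checks(items, checks, seed=0):
    """Return a new list with each check item inserted at a random position.

    The relative order of the regular items is preserved.
    """
    rng = random.Random(seed)
    mixed = list(items)
    for check in checks:
        position = rng.randint(0, len(mixed))  # inclusive on both ends
        mixed.insert(position, check)
    return mixed


def failed_check_ids(records, gold_transcripts):
    """Return ids of check items whose submitted transcript differs from the known answer.

    gold_transcripts maps a check item's id to its expected transcript.
    Comparison ignores case and surrounding whitespace only.
    """
    failed = []
    for record in records:
        expected = gold_transcripts.get(record["id"])
        if expected is None:
            continue  # not a check item
        submitted = record["annotations"]["transcript"]
        if submitted.strip().lower() != expected.strip().lower():
            failed.append(record["id"])
    return failed
```

Exact matching is strict for free-text transcripts, so in practice a tolerance (for example, a small word error rate threshold) may be a fairer test than string equality.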

You can configure where and how annotations are written:

yaml

output_annotation_dir: "annotation_output"
export_annotation_format: "json"

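Transcription tips

Decent headphones and a quiet room account for most of the accuracy. Slow the audio down for passages that are hard to hear, and plan for more than one pass: listen, transcribe, then go back and verify. Transcription is mentally draining work, so schedule regular breaks.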

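Next steps

Add speaker diarization for multi-speaker audio
Set up emotion classification alongside transcription
Configure crowdsourcing for large-scale transcription

The full audio documentation is at /docs/features/audio-annotation.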