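Audio Annotation

Annotate audio files with a waveform display and playback controls.

Potato 2.0 provides audio annotation with a Peaks.js waveform display, segment labeling, and comprehensive keyboard shortcuts.

Use Cases

Speech transcription and review
Speaker diarization
Music analysis
Audio event detection
Emotion recognition from speech
Call center quality assurance

Enabling Audio Support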

Add a scheme with annotation_type: audio to the annotation_schemes section of your configuration:

yaml

annotation_schemes:
  - annotation_type: audio
    name: audio_segments
    description: "Segment and label the audio"
    labels:
      - Speech
      - Music
      - Silence
      - Noise

Modes

Potato supports three audio annotation modes:

Label Mode

Segment the audio and assign a category label to each segment:

yaml

annotation_schemes:
  - annotation_type: audio
    name: speaker_diarization
    mode: label
    description: "Identify speakers in the audio"
    labels:
      - Speaker A
      - Speaker B
      - Overlap
    label_colors:
      "Speaker A": "#3b82f6"
      "Speaker B": "#10b981"
      "Overlap": "#f59e0b"

Questions Mode

Add annotation questions that are answered for each segment:

yaml

annotation_schemes:
  - annotation_type: audio
    name: speech_quality
    mode: questions
    description: "Evaluate speech segments"
    segment_questions:
      - name: clarity
        type: likert
        size: 5
        min_label: "Unclear"
        max_label: "Very clear"
      - name: emotion
        type: radio
        labels: [Neutral, Happy, Sad, Angry]

Both Mode

Combine labeling with per-segment questions:

yaml

annotation_schemes:
  - annotation_type: audio
    name: full_analysis
    mode: both
    description: "Label and analyze audio segments"
    labels:
      - Speech
      - Music
      - Noise
    segment_questions:
      - name: quality
        type: likert
        size: 5

Configuration Options

Basic Configuration

yaml

annotation_schemes:
  - annotation_type: audio
    name: segments
    description: "Create audio segments"
    labels:
      - Label A
      - Label B
 
    # Optional constraints
    min_segments: 1
    max_segments: 50
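
The constraints above can be checked against exported annotations. The following is a minimal sketch, assuming that min_segments and max_segments bound the number of segments per item; the function name segment_count_problem is hypothetical and not part of Potato.

```python
def segment_count_problem(segments, min_segments=None, max_segments=None):
    """Return a message if the segment count is out of bounds, else None.

    Assumption: min_segments / max_segments limit how many segments
    a single item may have. This helper is illustrative only.
    """
    n = len(segments)
    if min_segments is not None and n < min_segments:
        return f"expected at least {min_segments} segment(s), found {n}"
    if max_segments is not None and n > max_segments:
        return f"expected at most {max_segments} segment(s), found {n}"
    return None


if __name__ == "__main__":
    # One segment with bounds 1..50 passes, so this prints None.
    print(segment_count_problem([{"start": 0.0, "end": 2.5}], 1, 50))
```

Running this over an output file after annotation is one way to spot items that were submitted with too few or too many segments.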

Keyboard Shortcuts

Labels can be assigned with the number keys 1-9, in the order the labels are listed:

yaml

annotation_schemes:
  - annotation_type: audio
    name: speakers
    labels:
      - Speaker A  # Press 1
      - Speaker B  # Press 2
      - Overlap    # Press 3
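
The key-to-label mapping can be sketched in a few lines, which is handy for generating an annotator cheat sheet. This assumes keys follow list order, as the comments in the example suggest, and that labels beyond the ninth have no number key; shortcut_keys is a hypothetical helper, not a Potato function.

```python
def shortcut_keys(labels):
    """Map number keys "1".."9" to labels in the order they are listed.

    Assumption: only the first nine labels get a number key.
    """
    return {str(i): label for i, label in enumerate(labels[:9], start=1)}


if __name__ == "__main__":
    for key, label in shortcut_keys(["Speaker A", "Speaker B", "Overlap"]).items():
        print(f"{key} -> {label}")
```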

Label Colors

Customize segment colors:

yaml

annotation_schemes:
  - annotation_type: audio
    name: segments
    labels:
      - Speech
      - Music
      - Silence
    label_colors:
      "Speech": "#3b82f6"
      "Music": "#10b981"
      "Silence": "#6b7280"

Waveform Performance

For the best performance with long audio files, install the BBC audiowaveform tool:

bash

# macOS
brew install audiowaveform
 
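# Ubuntu (the package is distributed through the maintainer's PPA; see the README linked below)
sudo add-apt-repository ppa:chris-needham/ppa
sudo apt-get update
sudo apt-get install audiowaveform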
 
# Or build from source
# https://github.com/bbc/audiowaveform

This enables server-side waveform generation. Without it, Potato falls back to client-side generation, which is suitable for files of 30 minutes or less.
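
Before starting a project, you can check which path your files are likely to take. The sketch below is a pre-flight check of our own, not Potato's internal logic: it looks for the audiowaveform binary on PATH and compares a file's duration with the 30-minute guideline. The function name and return strings are hypothetical.

```python
import shutil

# 30 minutes, the guideline for client-side waveform generation
CLIENT_FALLBACK_MAX_SECONDS = 1800


def waveform_strategy(duration_seconds, have_tool=None):
    """Guess how a waveform would be generated for a file of this length.

    have_tool: pass True/False to override detection; None means
    look for the audiowaveform binary on PATH.
    """
    if have_tool is None:
        have_tool = shutil.which("audiowaveform") is not None
    if have_tool:
        return "server"
    if duration_seconds <= CLIENT_FALLBACK_MAX_SECONDS:
        return "client"
    return "client (slow): consider installing audiowaveform"


if __name__ == "__main__":
    print(waveform_strategy(45 * 60))  # result depends on your machine
```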

Waveform Caching

Configure caching to improve performance:

yaml

audio_config:
  cache_dir: "audio_cache/"
  precompute_depth: 100  # Pre-generate waveforms for first N items
  client_fallback_max_duration: 1800  # 30 minutes in seconds

Data Format

Simple Audio Reference

json

[
  {"id": "1", "audio_path": "audio/recording_001.wav"},
  {"id": "2", "audio_path": "audio/recording_002.wav"}
]

yaml

data_files:
  - "data/audio_data.json"
 
item_properties:
  id_key: id
  audio_key: audio_path
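
A data file in this shape can be generated from a folder of recordings. The following is a minimal sketch: the helper names are hypothetical, ids are simply assigned in sorted filename order, and paths are written relative to wherever the script runs, so adjust them to match how your project resolves audio paths. The extension filter mirrors the supported formats listed later on this page.

```python
import json
from pathlib import Path

# Extensions taken from the supported formats list in this document
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".ogg", ".flac", ".m4a", ".webm"}


def build_items(audio_dir):
    """Return one {"id", "audio_path"} record per supported audio file."""
    audio_dir = Path(audio_dir)
    files = sorted(
        p for p in audio_dir.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
    return [
        {"id": str(i), "audio_path": p.as_posix()}
        for i, p in enumerate(files, start=1)
    ]


def write_data_file(audio_dir, out_path):
    """Write the records as a JSON array and return them."""
    items = build_items(audio_dir)
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(items, indent=2), encoding="utf-8")
    return items


if __name__ == "__main__":
    if Path("audio").is_dir():
        write_data_file("audio", "data/audio_data.json")
```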

With Transcripts

json

[
  {
    "id": "1",
    "audio_path": "audio/call_001.wav",
    "transcript": "Hello, how can I help you today?"
  }
]

Output Format

Annotations are saved with segment timestamps:

json

{
  "id": "audio_1",
  "annotations": {
    "segments": [
      {
        "start": 0.0,
        "end": 2.5,
        "label": "Speaker A",
        "questions": {
          "clarity": 4,
          "emotion": "Neutral"
        }
      },
      {
        "start": 2.5,
        "end": 5.2,
        "label": "Speaker B"
      }
    ]
  }
}
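
Output in this shape is straightforward to post-process. As a minimal sketch, the snippet below totals the annotated time per label for one record. It assumes start and end are in seconds and that a segment may lack a label (as in questions mode); seconds_per_label is a hypothetical helper.

```python
import json
from collections import defaultdict


def seconds_per_label(record):
    """Sum segment durations per label for one annotation record."""
    totals = defaultdict(float)
    segments = record.get("annotations", {}).get("segments", [])
    for seg in segments:
        # Segments without a label are grouped under a placeholder key.
        label = seg.get("label", "(unlabeled)")
        totals[label] += seg["end"] - seg["start"]
    return dict(totals)


example = json.loads("""
{
  "id": "audio_1",
  "annotations": {
    "segments": [
      {"start": 0.0, "end": 2.5, "label": "Speaker A"},
      {"start": 2.5, "end": 5.2, "label": "Speaker B"}
    ]
  }
}
""")

if __name__ == "__main__":
    for label, seconds in seconds_per_label(example).items():
        print(f"{label}: {seconds:.1f}s")
```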

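Keyboard Shortcuts

Potato provides a full set of keyboard shortcuts for efficient annotation:

Shortcut	Action
`Space`	Play/pause
`[`	Set segment start at the current position
`]`	Set segment end at the current position
`1-9`	Assign a label to the current segment
`Delete`	Delete the current segment
`Left Arrow`	Skip back 5 seconds
`Right Arrow`	Skip forward 5 seconds
`Up Arrow`	Zoom in
`Down Arrow`	Zoom out
`Home`	Jump to the start
`End`	Jump to the end
`+`	Increase playback speed
`-`	Decrease playback speed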

Example Configurations

Speaker Diarization

yaml

task_name: "Speaker Diarization"
task_dir: "."
port: 8000
 
data_files:
  - "data/recordings.json"
 
item_properties:
  id_key: id
  audio_key: audio_path
 
annotation_schemes:
  - annotation_type: audio
    name: speakers
    mode: label
    description: "Identify who is speaking"
    labels:
      - Speaker 1
      - Speaker 2
      - Speaker 3
      - Overlap
      - Silence
    label_colors:
      "Speaker 1": "#3b82f6"
      "Speaker 2": "#10b981"
      "Speaker 3": "#f59e0b"
      "Overlap": "#ef4444"
      "Silence": "#6b7280"
    min_segments: 1
 
audio_config:
  cache_dir: "audio_cache/"
  precompute_depth: 50
 
output_annotation_dir: "output/"
output_annotation_format: "json"
allow_all_users: true

Transcription Review

yaml

task_name: "Transcription Quality Review"
task_dir: "."
port: 8000
 
data_files:
  - "data/transcripts.json"
 
item_properties:
  id_key: id
  text_key: transcript
  audio_key: audio_path
 
annotation_schemes:
  - annotation_type: audio
    name: errors
    mode: questions
    description: "Mark transcription errors"
    segment_questions:
      - name: error_type
        type: radio
        labels:
          - Missing word
          - Wrong word
          - Extra word
          - Spelling error
      - name: severity
        type: likert
        size: 3
        min_label: "Minor"
        max_label: "Major"
 
  - annotation_type: radio
    name: overall_accuracy
    description: "Overall transcript accuracy"
    labels:
      - Accurate
      - Minor errors
      - Major errors
      - Unusable
 
output_annotation_dir: "output/"
output_annotation_format: "json"

Call Center QA

yaml

task_name: "Call Center Quality Assurance"
task_dir: "."
port: 8000
 
data_files:
  - "data/calls.json"
 
item_properties:
  id_key: call_id
  audio_key: recording_path
 
annotation_schemes:
  # Segment-level annotation
  - annotation_type: audio
    name: conversation
    mode: both
    description: "Segment the conversation"
    labels:
      - Agent
      - Customer
      - Hold
      - Silence
    segment_questions:
      - name: sentiment
        type: radio
        labels: [Positive, Neutral, Negative, Frustrated]
 
  # Call-level assessment
  - annotation_type: likert
    name: professionalism
    description: "Agent professionalism"
    size: 5
    min_label: "Poor"
    max_label: "Excellent"
 
  - annotation_type: likert
    name: resolution
    description: "Issue resolution"
    size: 5
    min_label: "Unresolved"
    max_label: "Fully resolved"
 
  - annotation_type: multiselect
    name: issues
    description: "Select any issues observed"
    labels:
      - Long hold time
      - Agent interrupted
      - Incorrect information
      - Missing greeting
      - Unprofessional language
 
  - annotation_type: text
    name: notes
    description: "Additional observations"
    textarea: true
 
output_annotation_dir: "output/"
output_annotation_format: "json"

Supported Audio Formats

WAV (recommended for best quality)
MP3
OGG
FLAC
M4A
WebM

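Performance Tips

Install audiowaveform - essential for long audio files
Enable caching - set cache_dir to store pre-generated waveforms
Use WAV for quality - compressed formats can introduce artifacts
Preprocess the audio - normalize levels and trim unneeded silence
Watch file sizes - large files load more slowly
Use precompute_depth - pre-generate waveforms for the first N items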

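Troubleshooting

Waveform does not load

Check that the audio file path is correct
Check that the file format is supported
Install audiowaveform for long files
Check the browser console for errors

Slow performance

Install the audiowaveform tool
Enable waveform caching
Reduce the audio file size
Use the precompute_depth setting

Segments are not saved

Make sure the output directory is writable
Check the annotation format setting
Make sure each segment has both a start time and an end time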
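
The last check can be automated over exported segments. This is a minimal sketch, assuming segments use the start and end keys shown in the output format; find_invalid_segments is a hypothetical helper, and treating an end that is not after the start as invalid is our own assumption.

```python
def find_invalid_segments(segments):
    """Return (index, reason) pairs for segments with bad timestamps."""
    problems = []
    for i, seg in enumerate(segments):
        start, end = seg.get("start"), seg.get("end")
        if start is None or end is None:
            problems.append((i, "missing start or end"))
        elif end <= start:
            # Assumption: zero-length or reversed segments are not intended.
            problems.append((i, "end is not after start"))
    return problems


if __name__ == "__main__":
    segments = [
        {"start": 0.0, "end": 2.5, "label": "Speaker A"},
        {"start": 2.5, "label": "Speaker B"},
    ]
    for index, reason in find_invalid_segments(segments):
        print(f"segment {index}: {reason}")
```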