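Speaker diarization answers the question "who spoke when?" This tutorial shows how to build interfaces for annotating speaker turns, correcting automatic diarization output, and handling multi-speaker conversations.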

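What is speaker diarization?

Speaker diarization segments audio into speaker-homogeneous regions, so that each region contains speech from a single speaker. Applications include:

Meeting transcription
Call center analytics
Podcast production
Interview processing
Court and legal recordings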

Basic diarization setup

yaml

annotation_task_name: "Speaker Diarization"
 
data_files:
  - "data/conversations.json"
 
annotation_schemes:
  - annotation_type: audio_annotation
    name: speakers
    description: "Mark when each speaker talks"
    labels:
      - name: Speaker 1
        color: "#FF6B6B"
        keyboard_shortcut: "1"
      - name: Speaker 2
        color: "#4ECDC4"
        keyboard_shortcut: "2"
      - name: Speaker 3
        color: "#45B7D1"
        keyboard_shortcut: "3"
      - name: Overlap
        color: "#FFEAA7"
        keyboard_shortcut: "o"
      - name: Silence
        color: "#9CA3AF"
        keyboard_shortcut: "s"
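
Because each label is bound to a single key, duplicate shortcuts become an easy mistake as the label list grows. The following is a minimal sketch of a pre-flight check, not a Potato feature: it assumes the labels have already been copied into Python dictionaries with the same keys as the config above, and the function name `find_duplicate_shortcuts` is hypothetical.

```python
from collections import defaultdict

# Labels mirrored by hand from the config above (assumption: same key names).
LABELS = [
    {"name": "Speaker 1", "keyboard_shortcut": "1"},
    {"name": "Speaker 2", "keyboard_shortcut": "2"},
    {"name": "Speaker 3", "keyboard_shortcut": "3"},
    {"name": "Overlap", "keyboard_shortcut": "o"},
    {"name": "Silence", "keyboard_shortcut": "s"},
]


def find_duplicate_shortcuts(labels):
    """Return {shortcut: [label names]} for shortcuts bound to more than one label."""
    by_key = defaultdict(list)
    for label in labels:
        key = label.get("keyboard_shortcut")
        if key is not None:  # labels without a shortcut are ignored
            by_key[key].append(label["name"])
    return {key: names for key, names in by_key.items() if len(names) > 1}


if __name__ == "__main__":
    duplicates = find_duplicate_shortcuts(LABELS)
    print(duplicates or "No duplicate shortcuts")
```

An empty result means every shortcut is unique.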

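Creating speaker segments

Workflow

Play the audio, or click the waveform to navigate
Click and drag on the waveform to select a time range
Press the label's keyboard shortcut (such as a number key) or click the speaker label
The segment is colored and labeled
Drag the edges to adjust the boundaries
Continue until the entire audio file is segmented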
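
The last step, segmenting the entire file, can be checked after the fact. The sketch below is an illustration rather than part of Potato: it assumes segments are dictionaries with `start` and `end` in seconds, that the audio duration is known, and the helper name `find_gaps` and the `tolerance` parameter are hypothetical.

```python
def find_gaps(segments, duration, tolerance=0.05):
    """Return (start, end) ranges of the audio not covered by any segment.

    Gaps shorter than `tolerance` seconds are ignored, since hand-drawn
    boundaries rarely line up exactly.
    """
    gaps = []
    cursor = 0.0  # end of the region covered so far
    for seg in sorted(segments, key=lambda s: s["start"]):
        if seg["start"] - cursor > tolerance:
            gaps.append((cursor, seg["start"]))
        cursor = max(cursor, seg["end"])
    if duration - cursor > tolerance:
        gaps.append((cursor, duration))
    return gaps


if __name__ == "__main__":
    segments = [
        {"start": 0.0, "end": 3.5},
        {"start": 4.0, "end": 8.0},
    ]
    for start, end in find_gaps(segments, duration=10.0):
        print(f"Unlabeled: {start:.2f}s to {end:.2f}s")
```

If silence is labeled explicitly, as in the config above, a fully annotated file should produce no gaps.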

Keyboard controls

Potato provides built-in keyboard shortcuts for audio playback, including playback controls and navigation.

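Correcting pre-annotated diarization

In many cases, you will be correcting the output of an automatic diarization system rather than annotating from scratch: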

yaml

data_files:
  - "data/auto_diarized.json"

Data format:

json

{
  "id": "meeting_001",
  "audio_path": "/audio/meeting_001.wav",
  "auto_segments": [
    {"start": 0.0, "end": 3.5, "speaker": "Speaker 1"},
    {"start": 3.5, "end": 8.2, "speaker": "Speaker 2"},
    {"start": 8.2, "end": 12.0, "speaker": "Speaker 1"}
  ]
}
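
Many automatic diarization tools can write their results as RTTM, a plain-text format with one `SPEAKER` line per segment. The sketch below shows one way such output might be converted into the item shape above. It is an assumption-laden illustration: the function name `rttm_to_item` is hypothetical, diarizer speaker IDs are renamed to "Speaker N" in order of first appearance, and times are rounded to milliseconds.

```python
import json


def rttm_to_item(rttm_text, item_id, audio_path):
    """Convert RTTM text into one data item with an `auto_segments` list."""
    raw = []
    for line in rttm_text.splitlines():
        fields = line.split()
        # RTTM SPEAKER lines: type, file, channel, onset, duration,
        # ortho, speaker type, speaker name, confidence, lookahead
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])
        end = start + float(fields[4])
        raw.append((start, end, fields[7]))
    raw.sort()

    names = {}  # diarizer ID -> "Speaker N", numbered by first appearance
    segments = []
    for start, end, speaker_id in raw:
        label = names.setdefault(speaker_id, f"Speaker {len(names) + 1}")
        segments.append(
            {"start": round(start, 3), "end": round(end, 3), "speaker": label}
        )
    return {"id": item_id, "audio_path": audio_path, "auto_segments": segments}


if __name__ == "__main__":
    rttm = "\n".join([
        "SPEAKER meeting_001 1 0.0 3.5 <NA> <NA> spk_b <NA> <NA>",
        "SPEAKER meeting_001 1 3.5 4.7 <NA> <NA> spk_a <NA> <NA>",
        "SPEAKER meeting_001 1 8.2 3.8 <NA> <NA> spk_b <NA> <NA>",
    ])
    item = rttm_to_item(rttm, "meeting_001", "/audio/meeting_001.wav")
    print(json.dumps(item, indent=2))
```

The "Speaker N" names should match the labels defined in the annotation scheme so that corrected segments keep a consistent label set.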

Detailed speaker information

Capture additional speaker metadata:

yaml

annotation_schemes:
  - annotation_type: audio_annotation
    name: speakers
    labels:
      - name: Speaker A
        color: "#FF6B6B"
      - name: Speaker B
        color: "#4ECDC4"
      - name: Speaker C
        color: "#45B7D1"
      - name: Unknown
        color: "#9CA3AF"
 
  # Speaker characteristics
  - annotation_type: radio
    name: speaker_a_gender
    description: "Speaker A Gender"
    labels:
      - Male
      - Female
      - Unknown
 
  - annotation_type: text
    name: speaker_a_role
    description: "Speaker A Role (if identifiable)"
 
  - annotation_type: radio
    name: speaker_b_gender
    description: "Speaker B Gender"
    labels:
      - Male
      - Female
      - Unknown
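
The per-speaker schemes above repeat the same pattern for each speaker, so they can be generated instead of typed by hand. The following sketch builds the gender and role schemes as Python dictionaries using the same keys as the config; the function name `speaker_metadata_schemes` is hypothetical, and writing the result back to YAML is left to whichever YAML library you use.

```python
import json


def speaker_metadata_schemes(speaker_ids):
    """Build one gender (radio) and one role (text) scheme per speaker."""
    schemes = []
    for sid in speaker_ids:
        key = sid.lower()
        schemes.append({
            "annotation_type": "radio",
            "name": f"speaker_{key}_gender",
            "description": f"Speaker {sid} Gender",
            "labels": ["Male", "Female", "Unknown"],
        })
        schemes.append({
            "annotation_type": "text",
            "name": f"speaker_{key}_role",
            "description": f"Speaker {sid} Role (if identifiable)",
        })
    return schemes


if __name__ == "__main__":
    # Printed as JSON to keep the sketch dependency-free.
    print(json.dumps(speaker_metadata_schemes(["A", "B", "C"]), indent=2))
```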

Handling overlapping speech

yaml

annotation_schemes:
  - annotation_type: audio_annotation
    name: speakers
    labels:
      - name: Speaker 1
        color: "#FF6B6B"
      - name: Speaker 2
        color: "#4ECDC4"
      - name: Overlap
        color: "#FFEAA7"

Meeting/interview diarization

yaml

annotation_task_name: "Meeting Diarization"
 
data_files:
  - "data/meetings.json"
 
annotation_schemes:
  # Speaker turns
  - annotation_type: audio_annotation
    name: turns
    description: "Mark each speaker turn"
    labels:
      - name: Moderator
        color: "#EF4444"
        keyboard_shortcut: "m"
      - name: Participant 1
        color: "#3B82F6"
        keyboard_shortcut: "1"
      - name: Participant 2
        color: "#10B981"
        keyboard_shortcut: "2"
      - name: Participant 3
        color: "#F59E0B"
        keyboard_shortcut: "3"
      - name: Participant 4
        color: "#8B5CF6"
        keyboard_shortcut: "4"
      - name: Unknown
        color: "#6B7280"
        keyboard_shortcut: "u"
      - name: Overlap
        color: "#FCD34D"
        keyboard_shortcut: "o"
      - name: Silence/Noise
        color: "#D1D5DB"
        keyboard_shortcut: "s"
 
  # Speech type annotation
  - annotation_type: radio
    name: speech_type
    description: "Type of speech"
    labels:
      - Statement
      - Question
      - Response
      - Interruption
      - Backchannel
 
  # Overall quality
  - annotation_type: radio
    name: recording_quality
    description: "Overall recording quality"
    labels:
      - Excellent - All speakers clear
      - Good - Most speech understandable
      - Fair - Some difficulty
      - Poor - Significant issues

Output format

json

{
  "id": "meeting_001",
  "audio_path": "/audio/meeting_001.wav",
  "annotations": {
    "turns": [
      {
        "start": 0.0,
        "end": 5.2,
        "label": "Moderator",
        "attributes": {
          "speech_type": "Statement"
        }
      },
      {
        "start": 5.2,
        "end": 12.8,
        "label": "Participant 1",
        "attributes": {
          "speech_type": "Response"
        }
      },
      {
        "start": 11.5,
        "end": 12.8,
        "label": "Overlap"
      }
    ],
    "recording_quality": "Good - Most speech understandable"
  }
}
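
Once annotations are exported in this shape, simple statistics such as talk time per label are easy to derive. The following is a minimal sketch that assumes the structure shown above (an `annotations` object containing a `turns` list); the function name `speaking_time` is hypothetical.

```python
import json
from collections import defaultdict

# Trimmed copy of the example output above.
EXAMPLE = """
{
  "id": "meeting_001",
  "annotations": {
    "turns": [
      {"start": 0.0, "end": 5.2, "label": "Moderator"},
      {"start": 5.2, "end": 12.8, "label": "Participant 1"},
      {"start": 11.5, "end": 12.8, "label": "Overlap"}
    ]
  }
}
"""


def speaking_time(item, scheme="turns"):
    """Sum segment durations per label, in seconds (rounded to milliseconds)."""
    totals = defaultdict(float)
    for seg in item["annotations"][scheme]:
        totals[seg["label"]] += seg["end"] - seg["start"]
    return {label: round(seconds, 3) for label, seconds in totals.items()}


if __name__ == "__main__":
    for label, seconds in speaking_time(json.loads(EXAMPLE)).items():
        print(f"{label}: {seconds}s")
```

Time marked "Overlap" is also counted under the speaker segment it intersects, so the per-label totals can add up to more than the recording length.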

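Diarization tips

Listen first: get familiar with the speakers before annotating
Note speaker characteristics: pitch, accent, speaking style
Handle overlap consistently: decide on a strategy in advance
Use speed control: slow down playback for difficult sections
Record uncertainty: use "Unknown" when necessary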

Next steps

Combine with transcription for complete meeting notes
Add per-speaker emotion detection
Set up quality control for inter-annotator agreement
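
As a starting point for the agreement check, two annotators' segmentations of the same file can be compared frame by frame. The sketch below computes raw frame-level agreement only, with no correction for chance. It assumes segments with `start`, `end`, and `label` keys; the helper names, the `step` parameter, and the "Unlabeled" placeholder are hypothetical, and where segments overlap it simply takes the first matching one.

```python
def label_at(segments, t, default="Unlabeled"):
    """Return the label of the first segment covering time t."""
    for seg in segments:
        if seg["start"] <= t < seg["end"]:
            return seg["label"]
    return default


def frame_agreement(segs_a, segs_b, duration, step=0.1):
    """Fraction of fixed-size frames on which both annotators chose the same label."""
    n_frames = int(round(duration / step))
    if n_frames == 0:
        return 1.0
    matches = 0
    for i in range(n_frames):
        t = (i + 0.5) * step  # sample each frame at its midpoint
        if label_at(segs_a, t) == label_at(segs_b, t):
            matches += 1
    return matches / n_frames


if __name__ == "__main__":
    annotator_a = [
        {"start": 0.0, "end": 5.0, "label": "Speaker 1"},
        {"start": 5.0, "end": 10.0, "label": "Speaker 2"},
    ]
    annotator_b = [
        {"start": 0.0, "end": 4.0, "label": "Speaker 1"},
        {"start": 4.0, "end": 10.0, "label": "Speaker 2"},
    ]
    score = frame_agreement(annotator_a, annotator_b, duration=10.0, step=1.0)
    print(f"Frame agreement: {score:.2f}")
```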

For the full audio annotation documentation, see /docs/features/audio-annotation.