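Audio Annotation

Annotate audio files with waveform visualization and playback controls.

Potato 2.0 supports audio annotation with waveform visualization powered by Peaks.js, segment labeling, and keyboard shortcuts for playback and segment editing.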

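Use Cases

Speech transcription and review
Speaker diarization
Music analysis
Audio event detection
Speech emotion recognition
Call center quality assurance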

Enabling Audio Support

Add a scheme with annotation_type: audio to the annotation_schemes section of your configuration:

yaml

annotation_schemes:
  - annotation_type: audio
    name: audio_segments
    description: "Segment and label the audio"
    labels:
      - Speech
      - Music
      - Silence
      - Noise

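Annotation Modes

Potato supports three audio annotation modes, selected with the mode key: label, questions, and both.

Label Mode

Segment the audio and assign a category label to each segment: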

yaml

annotation_schemes:
  - annotation_type: audio
    name: speaker_diarization
    mode: label
    description: "Identify speakers in the audio"
    labels:
      - Speaker A
      - Speaker B
      - Overlap
    label_colors:
      "Speaker A": "#3b82f6"
      "Speaker B": "#10b981"
      "Overlap": "#f59e0b"

Questions Mode

Attach annotation questions to each segment:

yaml

annotation_schemes:
  - annotation_type: audio
    name: speech_quality
    mode: questions
    description: "Evaluate speech segments"
    segment_questions:
      - name: clarity
        type: likert
        size: 5
        min_label: "Unclear"
        max_label: "Very clear"
      - name: emotion
        type: radio
        labels: [Neutral, Happy, Sad, Angry]

Combined Mode

Combine labels with per-segment questions by setting mode to both:

yaml

annotation_schemes:
  - annotation_type: audio
    name: full_analysis
    mode: both
    description: "Label and analyze audio segments"
    labels:
      - Speech
      - Music
      - Noise
    segment_questions:
      - name: quality
        type: likert
        size: 5

Configuration Options

Basic Settings

yaml

annotation_schemes:
  - annotation_type: audio
    name: segments
    description: "Create audio segments"
    labels:
      - Label A
      - Label B
 
    # Optional constraints
    min_segments: 1
    max_segments: 50
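
The effect of these two constraints can be illustrated with a small check. This is only a sketch: it assumes that min_segments and max_segments bound the number of segments created per item, and check_segment_count is a hypothetical helper, not part of Potato.

```python
def check_segment_count(segments, min_segments=None, max_segments=None):
    """Return an error message, or None when the count is within bounds."""
    count = len(segments)
    # Assumption: a bound left unset in the config means "no limit".
    if min_segments is not None and count < min_segments:
        return f"need at least {min_segments} segment(s), got {count}"
    if max_segments is not None and count > max_segments:
        return f"at most {max_segments} segment(s) allowed, got {count}"
    return None


# An item with no segments violates min_segments: 1.
print(check_segment_count([], min_segments=1, max_segments=50))
```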

Label Shortcuts

Labels can be assigned with the number keys 1-9, following the order in which they are listed:

yaml

annotation_schemes:
  - annotation_type: audio
    name: speakers
    labels:
      - Speaker A  # Press 1
      - Speaker B  # Press 2
      - Overlap    # Press 3

Label Colors

Customize the color of each label's segments:

yaml

annotation_schemes:
  - annotation_type: audio
    name: segments
    labels:
      - Speech
      - Music
      - Silence
    label_colors:
      "Speech": "#3b82f6"
      "Music": "#10b981"
      "Silence": "#6b7280"

Waveform Performance

For best performance with long audio files, install the BBC audiowaveform tool:

bash

# macOS
brew install audiowaveform
 
# Ubuntu/Debian
sudo apt-get install audiowaveform
 
# Or build from source
# https://github.com/bbc/audiowaveform

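This enables server-side waveform generation. Without it, Potato falls back to client-side generation, which is suitable for files up to 30 minutes long.

Waveform Caching

Configure caching for better performance: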

yaml

audio_config:
  cache_dir: "audio_cache/"
  precompute_depth: 100  # Pre-generate waveforms for first N items
  client_fallback_max_duration: 1800  # 30 minutes in seconds
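
The choice between server-side and client-side generation described above can be sketched as follows. This mirrors the description in this section, not Potato's actual implementation: waveform_strategy and its return values are hypothetical, and only shutil.which is a real standard-library call.

```python
import shutil

# Mirrors client_fallback_max_duration from the config example (seconds).
CLIENT_FALLBACK_MAX_DURATION = 1800


def waveform_strategy(duration_seconds, has_audiowaveform,
                      max_client_duration=CLIENT_FALLBACK_MAX_DURATION):
    """Pick a waveform generation path for one audio file."""
    if has_audiowaveform:
        return "server"
    if duration_seconds <= max_client_duration:
        return "client"
    # Too long for the client fallback and no server-side tool available.
    return "install-audiowaveform"


# shutil.which returns None when the binary is not on PATH.
has_tool = shutil.which("audiowaveform") is not None
print(waveform_strategy(600, has_tool))
```

Running this on the annotation server is a quick way to confirm that audiowaveform is visible on PATH before starting a task with long recordings.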

Data Format

Simple Audio Reference

json

[
  {"id": "1", "audio_path": "audio/recording_001.wav"},
  {"id": "2", "audio_path": "audio/recording_002.wav"}
]

yaml

data_files:
  - "data/audio_data.json"
 
item_properties:
  id_key: id
  audio_key: audio_path
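
A data file in this shape can be generated from a list of audio paths. The sketch below assumes sequential string IDs are acceptable; build_items is a hypothetical helper, and the field names match the id_key and audio_key values shown above.

```python
import json


def build_items(audio_paths):
    """Build one {"id", "audio_path"} record per audio file, sorted by path."""
    items = []
    for index, path in enumerate(sorted(audio_paths), start=1):
        items.append({"id": str(index), "audio_path": path})
    return items


items = build_items(["audio/recording_002.wav", "audio/recording_001.wav"])
# Write this output to data/audio_data.json to use it with the config above.
print(json.dumps(items, indent=2))
```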

With Transcripts

json

[
  {
    "id": "1",
    "audio_path": "audio/call_001.wav",
    "transcript": "Hello, how can I help you today?"
  }
]

Output Format

Annotations are saved with the start and end timestamp of each segment:

json

{
  "id": "audio_1",
  "annotations": {
    "segments": [
      {
        "start": 0.0,
        "end": 2.5,
        "label": "Speaker A",
        "questions": {
          "clarity": 4,
          "emotion": "Neutral"
        }
      },
      {
        "start": 2.5,
        "end": 5.2,
        "label": "Speaker B"
      }
    ]
  }
}
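
Output in this shape is straightforward to post-process. As a minimal sketch, the following totals the annotated time per label for one record; seconds_per_label is a hypothetical helper, and it assumes that segments without a label (for example in questions mode) should be grouped under "unlabeled".

```python
import json
from collections import defaultdict


def seconds_per_label(record):
    """Sum segment durations (end - start) for each label in one record."""
    totals = defaultdict(float)
    for segment in record["annotations"]["segments"]:
        label = segment.get("label", "unlabeled")
        totals[label] += segment["end"] - segment["start"]
    return dict(totals)


record = json.loads("""
{
  "id": "audio_1",
  "annotations": {
    "segments": [
      {"start": 0.0, "end": 2.5, "label": "Speaker A"},
      {"start": 2.5, "end": 5.2, "label": "Speaker B"}
    ]
  }
}
""")
for label, seconds in seconds_per_label(record).items():
    print(f"{label}: {seconds:.1f}")
```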

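Keyboard Shortcuts

Potato provides keyboard shortcuts for playback, navigation, and segment editing:

Shortcut	Action
`Space`	Play/pause
`[`	Set the segment start at the current position
`]`	Set the segment end at the current position
`1-9`	Assign a label to the current segment
`Delete`	Delete the current segment
`Left Arrow`	Skip back 5 seconds
`Right Arrow`	Skip forward 5 seconds
`Up Arrow`	Zoom in
`Down Arrow`	Zoom out
`Home`	Jump to the start
`End`	Jump to the end
`+`	Increase playback speed
`-`	Decrease playback speed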
Example Configurations

Speaker Diarization

yaml

task_name: "Speaker Diarization"
task_dir: "."
port: 8000
 
data_files:
  - "data/recordings.json"
 
item_properties:
  id_key: id
  audio_key: audio_path
 
annotation_schemes:
  - annotation_type: audio
    name: speakers
    mode: label
    description: "Identify who is speaking"
    labels:
      - Speaker 1
      - Speaker 2
      - Speaker 3
      - Overlap
      - Silence
    label_colors:
      "Speaker 1": "#3b82f6"
      "Speaker 2": "#10b981"
      "Speaker 3": "#f59e0b"
      "Overlap": "#ef4444"
      "Silence": "#6b7280"
    min_segments: 1
 
audio_config:
  cache_dir: "audio_cache/"
  precompute_depth: 50
 
output_annotation_dir: "output/"
output_annotation_format: "json"
allow_all_users: true

Transcription Review

yaml

task_name: "Transcription Quality Review"
task_dir: "."
port: 8000
 
data_files:
  - "data/transcripts.json"
 
item_properties:
  id_key: id
  text_key: transcript
  audio_key: audio_path
 
annotation_schemes:
  - annotation_type: audio
    name: errors
    mode: questions
    description: "Mark transcription errors"
    segment_questions:
      - name: error_type
        type: radio
        labels:
          - Missing word
          - Wrong word
          - Extra word
          - Spelling error
      - name: severity
        type: likert
        size: 3
        min_label: "Minor"
        max_label: "Major"
 
  - annotation_type: radio
    name: overall_accuracy
    description: "Overall transcript accuracy"
    labels:
      - Accurate
      - Minor errors
      - Major errors
      - Unusable
 
output_annotation_dir: "output/"
output_annotation_format: "json"

Call Center Quality Assurance

yaml

task_name: "Call Center Quality Assurance"
task_dir: "."
port: 8000
 
data_files:
  - "data/calls.json"
 
item_properties:
  id_key: call_id
  audio_key: recording_path
 
annotation_schemes:
  # Segment-level annotation
  - annotation_type: audio
    name: conversation
    mode: both
    description: "Segment the conversation"
    labels:
      - Agent
      - Customer
      - Hold
      - Silence
    segment_questions:
      - name: sentiment
        type: radio
        labels: [Positive, Neutral, Negative, Frustrated]
 
  # Call-level assessment
  - annotation_type: likert
    name: professionalism
    description: "Agent professionalism"
    size: 5
    min_label: "Poor"
    max_label: "Excellent"
 
  - annotation_type: likert
    name: resolution
    description: "Issue resolution"
    size: 5
    min_label: "Unresolved"
    max_label: "Fully resolved"
 
  - annotation_type: multiselect
    name: issues
    description: "Select any issues observed"
    labels:
      - Long hold time
      - Agent interrupted
      - Incorrect information
      - Missing greeting
      - Unprofessional language
 
  - annotation_type: text
    name: notes
    description: "Additional observations"
    textarea: true
 
output_annotation_dir: "output/"
output_annotation_format: "json"

Supported Audio Formats

WAV (recommended, best quality)
MP3
OGG
FLAC
M4A
WebM
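
Checking data files against this list before starting a task can catch unsupported files early. The sketch below assumes each format is identified by its usual file extension, which the list above does not state explicitly; is_supported_audio is a hypothetical helper.

```python
from pathlib import Path

# Assumption: the conventional extension for each format in the list above.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".ogg", ".flac", ".m4a", ".webm"}


def is_supported_audio(path):
    """Return True when the file extension matches a supported format."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


for path in ["audio/recording_001.wav", "audio/notes.txt"]:
    print(path, is_supported_audio(path))
```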

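Performance Tips

Install audiowaveform - essential for long audio files
Enable caching - set cache_dir to store pre-generated waveforms
Use WAV for quality - compressed formats may introduce artifacts
Preprocess audio - normalize volume and trim unnecessary silence
Watch file size - large files slow down loading
Use precomputation - set precompute_depth to pre-generate waveforms for the first items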

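Troubleshooting

Waveform Not Loading

Check that the audio file path is correct
Verify that the file format is supported
Install audiowaveform for long files
Check the browser console for error messages

Slow Performance

Install the audiowaveform tool
Enable waveform caching
Reduce the audio file size
Set precompute_depth to pre-generate waveforms

Segments Not Saving

Make sure the output directory is writable
Check the annotation format configuration
Verify that each segment has a start and an end time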
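
The last check can be run against saved output to find incomplete segments. This is a sketch under the assumption that segments use the start and end fields shown in the output format; find_invalid_segments is a hypothetical helper, not a Potato function.

```python
def find_invalid_segments(segments):
    """Return (index, reason) pairs for segments with unusable timestamps."""
    problems = []
    for index, segment in enumerate(segments):
        start = segment.get("start")
        end = segment.get("end")
        if start is None or end is None:
            problems.append((index, "missing start or end"))
        elif end <= start:
            problems.append((index, "end is not after start"))
    return problems


segments = [
    {"start": 0.0, "end": 2.5, "label": "Speaker A"},
    {"start": 2.5, "label": "Speaker B"},  # end time was never set
]
print(find_invalid_segments(segments))
```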