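Audio Annotation

Segment audio files and assign labels to time regions using a waveform visualization.

Potato's audio annotation tool lets annotators segment audio files and assign labels to time regions through a waveform-based interface.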

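Features

Waveform visualization
Time-based segment creation
Label assignment for segments
Playback controls with variable speed
Zoom and scroll navigation
Keyboard shortcuts
Server-side waveform caching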

Basic Configuration

yaml

annotation_schemes:
  - name: "speakers"
    description: "Mark when each speaker is talking"
    annotation_type: "audio_annotation"
    labels:
      - name: "Speaker 1"
        color: "#3B82F6"
      - name: "Speaker 2"
        color: "#10B981"

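Configuration Options

Field	Type	Default	Description
`name`	string	Required	Unique identifier for the annotation
`description`	string	Required	Instructions shown to annotators
`annotation_type`	string	Required	Must be `"audio_annotation"`
`mode`	string	`"label"`	Annotation mode: `"label"`, `"questions"`, or `"both"`
`labels`	list	Conditionally required	Required in `label` or `both` mode
`segment_schemes`	list	Conditionally required	Required in `questions` or `both` mode
`min_segments`	integer	0	Minimum number of segments required
`max_segments`	integer	null	Maximum number of segments allowed (null = unlimited)
`zoom_enabled`	boolean	true	Enable zoom controls
`playback_rate_control`	boolean	false	Show the playback speed selector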
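
The conditional requirements in the table can be checked before a configuration is deployed. The following is a minimal sketch, not part of Potato: `validate_audio_scheme` is a hypothetical helper that takes one scheme as a plain Python dictionary (for example, after loading the YAML with a parser of your choice) and applies only the rules listed above. The check that `max_segments` is not smaller than `min_segments` is an added assumption, not a documented rule.

```python
from typing import Any

VALID_MODES = {"label", "questions", "both"}


def validate_audio_scheme(scheme: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one audio_annotation scheme.

    Hypothetical helper: it encodes the rules from the options table
    and is not part of Potato itself.
    """
    problems = []
    for field in ("name", "description", "annotation_type"):
        if not scheme.get(field):
            problems.append(f"missing required field: {field}")
    if scheme.get("annotation_type") not in (None, "audio_annotation"):
        problems.append('annotation_type must be "audio_annotation"')

    mode = scheme.get("mode", "label")  # "label" is the documented default
    if mode not in VALID_MODES:
        problems.append(f"unknown mode: {mode!r}")
    if mode in ("label", "both") and not scheme.get("labels"):
        problems.append(f"mode {mode!r} requires labels")
    if mode in ("questions", "both") and not scheme.get("segment_schemes"):
        problems.append(f"mode {mode!r} requires segment_schemes")

    min_segments = scheme.get("min_segments", 0)
    max_segments = scheme.get("max_segments")  # None means unlimited
    if min_segments < 0:
        problems.append("min_segments must be >= 0")
    # Assumption: a maximum below the minimum can never be satisfied.
    if max_segments is not None and max_segments < min_segments:
        problems.append("max_segments must be >= min_segments")
    return problems
```

An empty list means the scheme passed every check; otherwise each entry names one problem to fix.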

Label Configuration

yaml

labels:
  - name: "speech"
    color: "#3B82F6"
    key_value: "1"
  - name: "music"
    color: "#10B981"
    key_value: "2"
  - name: "silence"
    color: "#64748B"
    key_value: "3"
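
In the example above each label has its own `key_value`, which appears to correspond to the number keys used to select a label. Assuming two labels should not share a key, a small hypothetical helper such as `find_duplicate_keys` can flag collisions in a label list before it goes into the configuration. This is a sketch over plain dictionaries and does not reflect how Potato itself handles duplicates.

```python
def find_duplicate_keys(labels: list[dict]) -> dict[str, list[str]]:
    """Map each key_value used by more than one label to those label names."""
    by_key: dict[str, list[str]] = {}
    for label in labels:
        key = label.get("key_value")
        if key is not None:  # labels without a key_value are ignored
            by_key.setdefault(str(key), []).append(label["name"])
    return {key: names for key, names in by_key.items() if len(names) > 1}
```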

Annotation Modes

Label Mode (Default)

Each segment receives a category label:

yaml

annotation_schemes:
  - name: "emotion"
    description: "Label the emotion in each segment"
    annotation_type: "audio_annotation"
    mode: "label"
    labels:
      - name: "happy"
        color: "#22C55E"
      - name: "sad"
        color: "#3B82F6"
      - name: "angry"
        color: "#EF4444"
      - name: "neutral"
        color: "#64748B"

Questions Mode

Each segment is annotated by answering a dedicated set of questions:

yaml

annotation_schemes:
  - name: "transcription"
    description: "Transcribe each segment"
    annotation_type: "audio_annotation"
    mode: "questions"
    segment_schemes:
      - name: "transcript"
        annotation_type: "text"
        description: "Enter the transcription"
      - name: "confidence"
        annotation_type: "likert"
        description: "How confident are you?"
        size: 5

Both Mode

Combine labels with a per-segment questionnaire:

yaml

annotation_schemes:
  - name: "detailed_diarization"
    description: "Label speakers and add notes"
    annotation_type: "audio_annotation"
    mode: "both"
    labels:
      - name: "Speaker A"
        color: "#3B82F6"
      - name: "Speaker B"
        color: "#10B981"
    segment_schemes:
      - name: "notes"
        annotation_type: "text"
        description: "Any notes about this segment?"

Global Audio Configuration

Configure waveform processing in the configuration file:

yaml

audio_annotation:
  waveform_cache_dir: "waveform_cache/"
  waveform_look_ahead: 5
  waveform_cache_max_size: 1000
  client_fallback_max_duration: 1800

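Field	Description
`waveform_cache_dir`	Directory where waveform data is cached
`waveform_look_ahead`	Number of upcoming instances whose waveforms are precomputed
`waveform_cache_max_size`	Maximum number of cached waveform files
`client_fallback_max_duration`	Maximum audio duration, in seconds, for browser-side waveform generation (default: 1800)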

Examples

Speaker Diarization

yaml

annotation_schemes:
  - name: "diarization"
    description: "Identify who is speaking at each moment"
    annotation_type: "audio_annotation"
    mode: "label"
    labels:
      - name: "Interviewer"
        color: "#8B5CF6"
        key_value: "1"
      - name: "Guest"
        color: "#EC4899"
        key_value: "2"
      - name: "Overlap"
        color: "#F59E0B"
        key_value: "3"
    zoom_enabled: true
    playback_rate_control: true

Sound Event Detection

yaml

annotation_schemes:
  - name: "sound_events"
    description: "Mark all sound events"
    annotation_type: "audio_annotation"
    labels:
      - name: "speech"
        color: "#3B82F6"
      - name: "music"
        color: "#10B981"
      - name: "applause"
        color: "#F59E0B"
      - name: "laughter"
        color: "#EC4899"
      - name: "silence"
        color: "#64748B"
    min_segments: 1

Transcription Review

yaml

annotation_schemes:
  - name: "transcription_review"
    description: "Review and correct the transcription for each segment"
    annotation_type: "audio_annotation"
    mode: "questions"
    segment_schemes:
      - name: "transcript"
        annotation_type: "text"
        description: "Enter or correct the transcription"
        textarea: true
      - name: "quality"
        annotation_type: "radio"
        description: "Audio quality"
        labels:
          - "Clear"
          - "Noisy"
          - "Unintelligible"

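Keyboard Shortcuts

Key	Action
`Space`	Play/pause
`←` / `→`	Seek backward/forward
`[`	Mark segment start
`]`	Mark segment end
`Enter`	Create segment
`Delete`	Delete the selected segment
`1-9`	Select label
`+` / `-`	Zoom in/out
`0`	Fit to view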

Data Format

Input Data

Your data file should contain audio file paths or URLs:

json

[
  {
    "id": "audio_001",
    "audio_url": "https://example.com/audio/recording1.mp3"
  },
  {
    "id": "audio_002",
    "audio_url": "/data/audio/recording2.wav"
  }
]
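
For a local folder of recordings, a data file in this shape can be generated rather than written by hand. The sketch below is a hypothetical helper script, not part of Potato. It assumes that the file name without its extension is an acceptable `id` and that local paths are usable as `audio_url` values, as in the second item above. It keeps only the extensions that this page lists as supported audio formats.

```python
import json
from pathlib import Path

# Extensions taken from the supported audio formats listed on this page.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".ogg", ".m4a"}


def build_items(audio_dir: str) -> list[dict[str, str]]:
    """List supported audio files in audio_dir as {"id", "audio_url"} items."""
    items = []
    for path in sorted(Path(audio_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS:
            # Assumption: the file stem is unique enough to serve as the id.
            # Two files such as a.mp3 and a.wav would collide.
            items.append({"id": path.stem, "audio_url": str(path)})
    return items


def write_data_file(audio_dir: str, out_path: str) -> int:
    """Write the items as a JSON array and return how many were written."""
    items = build_items(audio_dir)
    Path(out_path).write_text(json.dumps(items, indent=2), encoding="utf-8")
    return len(items)
```

Calling `write_data_file("audio/", "data.json")` with your own paths would produce a JSON array with one entry per audio file, sorted by file name.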

Configure the audio field:

yaml

item_properties:
  id_key: id
  text_key: audio_url

Output Format

json

{
  "id": "audio_001",
  "annotations": {
    "diarization": [
      {
        "start": 0.0,
        "end": 5.5,
        "label": "Interviewer"
      },
      {
        "start": 5.5,
        "end": 12.3,
        "label": "Guest"
      },
      {
        "start": 12.3,
        "end": 14.0,
        "label": "Overlap"
      }
    ]
  }
}
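
Because each segment carries `start`, `end`, and `label`, simple summaries are easy to derive from a record like the one above. As an illustration, the hypothetical helper `seconds_per_label` below totals the annotated time per label for one scheme. It assumes a record has already been parsed into a Python dictionary with the structure shown, and it is a sketch rather than a Potato utility.

```python
from collections import defaultdict


def seconds_per_label(record: dict, scheme_name: str) -> dict[str, float]:
    """Sum segment durations (end - start) per label for one scheme."""
    totals: dict[str, float] = defaultdict(float)
    for segment in record["annotations"].get(scheme_name, []):
        totals[segment["label"]] += segment["end"] - segment["start"]
    # Round to milliseconds to hide floating-point noise in the sums.
    return {label: round(total, 3) for label, total in totals.items()}
```

For the example record, `seconds_per_label(record, "diarization")` gives 5.5 seconds for `Interviewer`, 6.8 for `Guest`, and 1.7 for `Overlap`.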

In questions mode, each segment also contains the responses to its questions, keyed by scheme name:

json

{
  "start": 0.0,
  "end": 5.5,
  "transcript": "Hello and welcome to the show.",
  "quality": "Clear"
}
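
Segments in this shape can be turned into a readable transcript with a few lines of code. The sketch below assumes the segments for one audio file are available as a list of dictionaries like the one above. How they are wrapped in the full output record is not shown here, so extracting that list is left to the caller. `format_transcript` is a hypothetical helper, not part of Potato.

```python
def format_transcript(segments: list[dict]) -> str:
    """Render segments as '[start-end] text' lines in chronological order."""
    lines = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        # A segment without a transcript answer yields an empty text.
        text = seg.get("transcript", "").strip()
        lines.append(f"[{seg['start']:.2f}-{seg['end']:.2f}] {text}")
    return "\n".join(lines)
```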

Supported Audio Formats

MP3 (recommended)
WAV
OGG
M4A

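Best Practices

Pre-cache waveforms - use server-side caching for large datasets
Enable playback controls - variable speed helps with precise segmentation
Use keyboard shortcuts - much faster than clicking
Define clear boundaries - spell out what counts as the start and end of a segment
Choose the right mode - use "label" for classification and "questions" for detailed annotation
Set segment limits - use min_segments to ensure coverage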