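Audio Event Detection and Tagging
How to set up annotation for detecting specific sounds, such as speech, music, applause, and environmental sounds, using timestamped spans.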
Potato Team
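Audio event detection identifies specific sounds in a recording, from speech and music to environmental sounds and other acoustic events. This tutorial covers timestamp-based annotation for training sound recognition models.
Types of Audio Event Annotation
- Clip-level tagging: assign labels to an entire audio clip
- Temporal detection: mark the start and end time of each event
- Strong labeling: precise timestamps for each event
- Weak labeling: presence or absence only, without timestamps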
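The relationship between strong and weak labels can be illustrated with a short Python sketch: strong labels carry timestamps, and weak labels can be derived from them by discarding the timing. The event dictionaries (`label`, `start`, `end`), the `EVENT_TO_TAG` mapping, and the `strong_to_weak` helper are assumptions made for illustration, not part of Potato.

```python
from typing import Dict, List

# Hypothetical mapping from fine-grained event labels to clip-level tags.
# Adjust it to match your own label set.
EVENT_TO_TAG: Dict[str, str] = {
    "Speech": "Speech",
    "Music": "Music",
    "Car": "Vehicle sounds",
    "Siren": "Vehicle sounds",
    "Dog": "Animal sounds",
    "Rain": "Nature sounds",
}


def strong_to_weak(events: List[dict]) -> List[str]:
    """Collapse timestamped events into a sorted list of clip-level tags."""
    tags = set()
    for event in events:
        # Fall back to the raw label when no mapping is defined.
        tags.add(EVENT_TO_TAG.get(event["label"], event["label"]))
    return sorted(tags)


# Strong labels: each event has a label plus start/end times in seconds.
events = [
    {"label": "Speech", "start": 0.5, "end": 3.2},
    {"label": "Car", "start": 1.8, "end": 4.5},
    {"label": "Dog", "start": 6.1, "end": 6.8},
]

# Weak labels: only which sounds are present somewhere in the clip.
print(strong_to_weak(events))  # ['Animal sounds', 'Speech', 'Vehicle sounds']
```

The conversion only works in one direction: timestamps cannot be recovered from weak labels, which is why strong labeling costs more annotation time.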
Clip-Level Sound Tagging
For short clips that contain a single event:
yaml
annotation_task_name: "Sound Event Classification"
data_files:
  - data/audio_clips.json
item_properties:
  audio_path: audio_path
annotation_schemes:
  - annotation_type: audio_annotation
    audio_display: waveform
    waveform_color: "#10B981"
    progress_color: "#34D399"
    name: sound_class
    description: "What sound is in this clip?"
    labels:
      - Dog bark
      - Car horn
      - Siren
      - Music
      - Speech
      - Footsteps
      - Door knock
      - Glass breaking
      - Gunshot
      - Baby cry
      - Other
      - Silence/noise only
Temporal Sound Event Detection
Mark when each event occurs:
yaml
annotation_task_name: "Sound Event Detection"
data_files:
  - data/recordings.json
item_properties:
  audio_path: audio_path
annotation_schemes:
  - annotation_type: audio_annotation
    audio_display: waveform
    height: 150
    waveform_color: "#6366F1"
    progress_color: "#A5B4FC"
    show_timestamps: true
    enable_regions: true
    speed_control: true
    name: events
    description: "Mark all sound events with timestamps"
    labels:
      - name: speech
        color: "#3B82F6"
      - name: music
        color: "#8B5CF6"
      - name: vehicle
        color: "#EF4444"
      - name: animal
        color: "#F59E0B"
      - name: nature
        color: "#10B981"
      - name: mechanical
        color: "#6B7280"
    allow_overlap: true
    min_duration: 0.1
Complete Audio Event Configuration
yaml
annotation_task_name: "AudioSet-Style Event Detection"
data_files:
  - data/audio_10sec.json
item_properties:
  audio_path: audio_url
annotation_schemes:
  # Temporal event marking with audio playback
  - annotation_type: audio_annotation
    audio_display: waveform
    waveform_color: "#059669"
    progress_color: "#34D399"
    cursor_color: "#F59E0B"
    height: 128
    show_timestamps: true
    time_format: "ss.ms"
    show_duration: true
    speed_control: true
    speed_options: [0.5, 0.75, 1.0, 1.5]
    enable_regions: true
    region_snap: 0.05
    name: sound_events
    description: "Mark all distinct sound events"
    labels:
      # Human sounds
      - name: Speech
        color: "#3B82F6"
        keyboard_shortcut: "1"
        category: human
      - name: Singing
        color: "#8B5CF6"
        keyboard_shortcut: "2"
        category: human
      - name: Laughter
        color: "#EC4899"
        category: human
      - name: Cough/Sneeze
        color: "#F472B6"
        category: human
      # Music
      - name: Music
        color: "#A855F7"
        keyboard_shortcut: "m"
        category: music
      - name: Musical instrument
        color: "#7C3AED"
        category: music
      # Animals
      - name: Dog
        color: "#F59E0B"
        keyboard_shortcut: "d"
        category: animal
      - name: Cat
        color: "#FBBF24"
        category: animal
      - name: Bird
        color: "#FCD34D"
        category: animal
      # Vehicles
      - name: Car
        color: "#EF4444"
        keyboard_shortcut: "c"
        category: vehicle
      - name: Motorcycle
        color: "#DC2626"
        category: vehicle
      - name: Siren
        color: "#B91C1C"
        category: vehicle
      - name: Aircraft
        color: "#991B1B"
        category: vehicle
      # Environment
      - name: Rain
        color: "#06B6D4"
        category: nature
      - name: Thunder
        color: "#0891B2"
        category: nature
      - name: Wind
        color: "#0E7490"
        category: nature
      - name: Water
        color: "#0D9488"
        category: nature
      # Domestic
      - name: Door
        color: "#84CC16"
        category: domestic
      - name: Alarm
        color: "#65A30D"
        category: domestic
      - name: Appliance
        color: "#4D7C0F"
        category: domestic
      # Other
      - name: Noise/Unknown
        color: "#6B7280"
        keyboard_shortcut: "n"
        category: other
    allow_overlap: true
    min_duration: 0.1
    show_labels_on_waveform: true
    # Segment attributes
    segment_attributes:
      - name: confidence
        type: radio
        options: [Clear, Moderate, Faint]
      - name: foreground
        type: checkbox
        description: "Is this the main/foreground sound?"
  # Clip-level tags (weak labels)
  - annotation_type: multiselect
    name: clip_tags
    description: "What sounds are present anywhere in this clip?"
    labels:
      - Speech
      - Music
      - Vehicle sounds
      - Animal sounds
      - Nature sounds
      - Domestic sounds
      - Silence
    min_selections: 1
  # Audio quality
  - annotation_type: radio
    name: quality
    description: "Recording quality"
    labels:
      - Clean (clear sounds)
      - Moderate noise
      - Very noisy
      - Distorted/clipped
annotation_guidelines:
  title: "Sound Event Detection Guide"
  content: |
    ## Your Task
    Mark the START and END times of each distinct sound event.
    ## Event Detection Rules
    - Mark sounds that are clearly audible
    - Include overlapping sounds (use multiple labels)
    - Short sounds (<100ms) may be a single point
    ## Segment Boundaries
    - Start: When sound becomes audible
    - End: When sound fades or stops
    ## Confidence Levels
    - Clear: Easily identifiable
    - Moderate: Reasonably sure
    - Faint: Background, hard to identify
    ## Foreground vs Background
    - Foreground: Main focus of audio
    - Background: Ambient sounds
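With a label list this long, it is easy to assign the same keyboard shortcut to two labels by accident. The following Python sketch shows one way to check for that before deploying a task. The `find_duplicate_shortcuts` helper is hypothetical, the sample list contains a deliberate clash that is not in the configuration above, and the case-insensitive comparison is an assumption about how shortcuts are matched. In practice the label list would be loaded from the config with a YAML parser.

```python
from collections import defaultdict
from typing import Dict, List


def find_duplicate_shortcuts(labels: List[dict]) -> Dict[str, List[str]]:
    """Map each shortcut used more than once to the labels that share it."""
    by_key = defaultdict(list)
    for label in labels:
        key = label.get("keyboard_shortcut")
        if key is not None:
            # Normalise case so "M" and "m" are reported together
            # (an assumption about how shortcuts are matched).
            by_key[str(key).lower()].append(label["name"])
    return {key: names for key, names in by_key.items() if len(names) > 1}


# Illustrative label entries in the same shape as the config's `labels` list.
labels = [
    {"name": "Speech", "keyboard_shortcut": "1"},
    {"name": "Music", "keyboard_shortcut": "m"},
    {"name": "Motorcycle", "keyboard_shortcut": "m"},  # deliberate clash
    {"name": "Laughter"},  # no shortcut assigned
]

print(find_duplicate_shortcuts(labels))  # {'m': ['Music', 'Motorcycle']}
```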
Output Format
json
{
  "id": "clip_001",
  "audio_url": "/audio/street_scene.wav",
  "duration": 10.0,
  "annotations": {
    "sound_events": [
      {
        "label": "Speech",
        "start": 0.5,
        "end": 3.2,
        "attributes": {
          "confidence": "Clear",
          "foreground": true
        }
      },
      {
        "label": "Car",
        "start": 1.8,
        "end": 4.5,
        "attributes": {
          "confidence": "Moderate",
          "foreground": false
        }
      },
      {
        "label": "Dog",
        "start": 6.1,
        "end": 6.8,
        "attributes": {
          "confidence": "Clear",
          "foreground": true
        }
      }
    ],
    "clip_tags": ["Speech", "Vehicle sounds", "Animal sounds"],
    "quality": "Moderate noise"
  }
}
Pre-annotation with a Detector
Use model predictions as a starting point:
yaml
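pre_annotation:
  enabled: true
  field: detected_events
  show_confidence: true
  confidence_threshold: 0.3
  allow_modification: true
Tips for Audio Event Annotation
- Good headphones: essential for detecting subtle sounds
- Quiet environment: background noise affects what you perceive
- Multiple passes: identify events on the first pass, then refine timestamps on the second
- Slow playback: use 0.5x speed for precise boundaries
- Consistent criteria: clearly define the threshold for what counts as "audible"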
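Consistency can also be checked after the fact. As a minimal sketch, the Python function below scans events in the output format shown earlier and flags ones that fall outside the clip, have reversed boundaries, or are shorter than the configured `min_duration`. The `find_event_problems` helper and the sample events are hypothetical and serve only to illustrate the idea.

```python
from typing import List

MIN_DURATION = 0.1  # seconds; mirrors min_duration in the configuration


def find_event_problems(events: List[dict], clip_duration: float) -> List[str]:
    """Return human-readable problems found in a list of timestamped events."""
    problems = []
    for i, event in enumerate(events):
        start, end = event["start"], event["end"]
        name = f"event {i} ({event['label']})"
        if start < 0 or end > clip_duration:
            problems.append(f"{name}: outside clip bounds")
        if end <= start:
            problems.append(f"{name}: end is not after start")
        elif end - start < MIN_DURATION:
            problems.append(f"{name}: shorter than {MIN_DURATION}s")
    return problems


# Sample events: the first is valid, the other three each have one problem.
sample_events = [
    {"label": "Speech", "start": 0.5, "end": 3.2},
    {"label": "Dog", "start": 6.1, "end": 6.15},
    {"label": "Car", "start": 9.0, "end": 11.0},
    {"label": "Rain", "start": 4.0, "end": 4.0},
]

for problem in find_event_problems(sample_events, clip_duration=10.0):
    print(problem)
```

Flagged events are candidates for a second annotation pass rather than automatic rejection, since very short sounds can be legitimate.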
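Next Steps
- Add music classification for music content
- Learn about speaker diarization for speech
- Set up quality control for event detection
For the complete audio documentation, see /docs/features/audio-annotation.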