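Speech emotion recognition (SER) shows up in virtual assistants, mental health tools, and call center analytics, and all of these need labeled audio for training. This tutorial walks through annotation interfaces for categorical emotions, dimensional ratings, and clips in which more than one emotion appears together. For the underlying audio options, see the audio annotation documentation.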

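Emotion Annotation Approaches

There are several common ways to label emotion in speech. You can use discrete categories such as happiness, sadness, and anger. You can rate continuous dimensions such as valence, arousal, and dominance. You can let annotators mark several emotions at once, together with intensity ratings. Or, for longer clips, you can tag different emotions at different points in time.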

Categorical Emotion Classification

Basic Setup

yaml

annotation_task_name: "Speech Emotion Recognition"
 
data_files:
  - data/utterances.json
 
item_properties:
  id_key: id
  audio_key: audio_path
  text_key: transcript  # Optional transcript
 
audio:
  enabled: true
  display: waveform
  waveform_color: "#8B5CF6"
  progress_color: "#A78BFA"
  speed_control: true
  speed_options: [0.75, 1.0, 1.25]
 
annotation_schemes:
  - annotation_type: radio
    name: emotion
    description: "What emotion is expressed in this speech?"
    labels:
      - name: Happy
        description: "Joy, excitement, amusement"
        keyboard_shortcut: "h"
      - name: Sad
        description: "Sorrow, disappointment, grief"
        keyboard_shortcut: "s"
      - name: Angry
        description: "Frustration, irritation, rage"
        keyboard_shortcut: "a"
      - name: Fearful
        description: "Anxiety, worry, terror"
        keyboard_shortcut: "f"
      - name: Surprised
        description: "Astonishment, shock"
        keyboard_shortcut: "u"
      - name: Disgusted
        description: "Revulsion, distaste"
        keyboard_shortcut: "d"
      - name: Neutral
        description: "No clear emotion"
        keyboard_shortcut: "n"
    required: true
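
The config above reads each item from `data/utterances.json` using the keys named under `item_properties`. A minimal sketch of preparing such items follows. The one-object-per-line layout, the helper names `build_items` and `to_json_lines`, and the `utt_NNN` id pattern are assumptions for illustration, not something the Potato documentation is confirmed to require:

```python
import json


def build_items(audio_paths, transcripts=None):
    """Return one dict per clip, using the keys from item_properties."""
    transcripts = transcripts or {}
    items = []
    for index, path in enumerate(audio_paths, start=1):
        # id_key -> "id", audio_key -> "audio_path" (as in the config above).
        item = {"id": f"utt_{index:03d}", "audio_path": str(path)}
        # The transcript is optional, so add it only when one is known.
        if str(path) in transcripts:
            item["transcript"] = transcripts[str(path)]
        items.append(item)
    return items


def to_json_lines(items):
    """Serialize items as JSON lines (assumed layout: one object per line)."""
    return "\n".join(json.dumps(item, ensure_ascii=False) for item in items)
```

Writing the returned string to `data/utterances.json` would then produce a file whose fields line up with `id_key`, `audio_key`, and `text_key`.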

Potato renders an interactive waveform with playback controls alongside the annotation labels:

[Figure: Audio annotation interface showing an interactive waveform with playback controls and categorical emotion labels]

Adding Intensity

yaml

annotation_schemes:
  - annotation_type: radio
    name: emotion
    labels: [Happy, Sad, Angry, Fearful, Surprised, Disgusted, Neutral]
    required: true
 
  - annotation_type: likert
    name: intensity
    description: "How intense is this emotion?"
    size: 5
    min_label: "Very weak"
    max_label: "Very strong"
    conditional:
      depends_on: emotion
      hide_when: ["Neutral"]

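Dimensional Emotion Annotation

The VAD (valence-arousal-dominance) model rates each clip on three continuous scales instead of forcing it into a single category. The config here approximates each scale with a 7-point Likert item: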

yaml

annotation_task_name: "Dimensional Emotion Rating"
 
annotation_schemes:
  # Valence: negative to positive
  - annotation_type: likert
    name: valence
    description: "Valence: How positive or negative?"
    size: 7
    min_label: "Very negative"
    max_label: "Very positive"
 
  # Arousal: calm to excited
  - annotation_type: likert
    name: arousal
    description: "Arousal: How calm or excited?"
    size: 7
    min_label: "Very calm"
    max_label: "Very excited"
 
  # Dominance: submissive to dominant
  - annotation_type: likert
    name: dominance
    description: "Dominance: How submissive or dominant?"
    size: 7
    min_label: "Very submissive"
    max_label: "Very dominant"
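
Downstream models often expect dimensional labels on a symmetric range rather than as raw scale points. A minimal sketch of that conversion, assuming ratings are stored as integers from 1 to `size` (the helper names `rescale_likert` and `vad_vector` are hypothetical):

```python
def rescale_likert(rating, size=7):
    """Map a 1..size Likert rating linearly onto [-1.0, 1.0]."""
    if not 1 <= rating <= size:
        raise ValueError(f"rating {rating} is outside 1..{size}")
    midpoint = (size + 1) / 2
    return (rating - midpoint) / (midpoint - 1)


def vad_vector(annotation, size=7):
    """Return (valence, arousal, dominance) rescaled to [-1.0, 1.0]."""
    return tuple(
        rescale_likert(annotation[name], size)
        for name in ("valence", "arousal", "dominance")
    )
```

With a 7-point scale the midpoint rating of 4 maps to 0.0, so negative values mean negative valence, low arousal, or low dominance.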

Visual Scales (SAM)

In the style of the Self-Assessment Manikin (SAM):

yaml

annotation_schemes:
  - annotation_type: image_scale
    name: valence
    description: "Select the figure that matches the emotional valence"
    images:
      - path: /images/sam_valence_1.png
        value: 1
      - path: /images/sam_valence_2.png
        value: 2
      # ... etc
    size: 9

Mixed Emotion Detection

For speech that contains several emotions:

yaml

annotation_schemes:
  - annotation_type: multiselect
    name: emotions_present
    description: "Select ALL emotions you detect (can be multiple)"
    labels:
      - Happy
      - Sad
      - Angry
      - Fearful
      - Surprised
      - Disgusted
      - Contempt
    min_selections: 1
 
  - annotation_type: radio
    name: primary_emotion
    description: "Which emotion is MOST prominent?"
    labels:
      - Happy
      - Sad
      - Angry
      - Fearful
      - Surprised
      - Disgusted
      - Contempt
      - Mixed (no dominant)
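
Because the two schemes above are answered independently, the answers can contradict each other, for example a primary emotion that was not ticked in the multiselect. A minimal post-hoc consistency check is sketched below. The rules it enforces are an assumed policy, and `check_mixed_annotation` is a hypothetical helper, not part of Potato:

```python
MIXED_LABEL = "Mixed (no dominant)"


def check_mixed_annotation(emotions_present, primary_emotion):
    """Return a list of human-readable problems (empty if consistent)."""
    problems = []
    if len(emotions_present) < 1:
        # Mirrors min_selections: 1 in the multiselect scheme.
        problems.append("no emotions selected")
    if primary_emotion == MIXED_LABEL:
        # Assumed policy: "mixed" only makes sense with two or more emotions.
        if len(emotions_present) < 2:
            problems.append("'Mixed' chosen but fewer than two emotions selected")
    elif primary_emotion not in emotions_present:
        problems.append(f"primary emotion {primary_emotion!r} was not selected as present")
    return problems
```

Items that return a non-empty list can be routed back for review rather than silently dropped.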

Comprehensive Emotion Annotation

yaml

annotation_task_name: "Comprehensive Speech Emotion Annotation"
 
data_files:
  - data/speech_samples.json
 
item_properties:
  id_key: id
  audio_key: audio_url
  text_key: transcript
 
audio:
  enabled: true
  display: waveform
  waveform_color: "#EC4899"
  progress_color: "#F472B6"
  height: 120
  speed_control: true
  speed_options: [0.5, 0.75, 1.0, 1.25]
  show_duration: true
  autoplay: false
 
# Show transcript if available
display:
  show_text: true
  text_field: transcript
  text_label: "Transcript (for reference)"
 
annotation_schemes:
  # Primary categorical emotion
  - annotation_type: radio
    name: primary_emotion
    description: "Primary emotion expressed"
    labels:
      - name: Happiness
        color: "#FCD34D"
        keyboard_shortcut: "1"
      - name: Sadness
        color: "#60A5FA"
        keyboard_shortcut: "2"
      - name: Anger
        color: "#F87171"
        keyboard_shortcut: "3"
      - name: Fear
        color: "#A78BFA"
        keyboard_shortcut: "4"
      - name: Surprise
        color: "#34D399"
        keyboard_shortcut: "5"
      - name: Disgust
        color: "#FB923C"
        keyboard_shortcut: "6"
      - name: Neutral
        color: "#9CA3AF"
        keyboard_shortcut: "7"
    required: true
 
  # Emotional intensity
  - annotation_type: likert
    name: intensity
    description: "Emotional intensity"
    size: 5
    min_label: "Very mild"
    max_label: "Very intense"
    required: true
 
  # Dimensional ratings
  - annotation_type: likert
    name: valence
    description: "Valence (negative to positive)"
    size: 7
    min_label: "Negative"
    max_label: "Positive"
 
  - annotation_type: likert
    name: arousal
    description: "Arousal (calm to excited)"
    size: 7
    min_label: "Calm"
    max_label: "Excited"
 
  # Voice quality
  - annotation_type: multiselect
    name: voice_qualities
    description: "Voice characteristics (select all that apply)"
    labels:
      - Trembling voice
      - Raised pitch
      - Lowered pitch
      - Loud/shouting
      - Soft/whisper
      - Fast speech rate
      - Slow speech rate
      - Breathy
      - Tense/strained
      - Crying
      - Laughing
 
  # Genuineness
  - annotation_type: radio
    name: authenticity
    description: "Does the emotion seem genuine?"
    labels:
      - Clearly genuine
      - Likely genuine
      - Uncertain
      - Likely acted/fake
      - Clearly acted/fake
 
  # Confidence
  - annotation_type: likert
    name: confidence
    description: "How confident are you in your annotation?"
    size: 5
    min_label: "Guessing"
    max_label: "Certain"
 
annotation_guidelines:
  title: "Emotion Annotation Guidelines"
  content: |
    ## Listening Instructions
    1. Listen to the entire clip before annotating
    2. You may replay as many times as needed
    3. Focus on the VOICE, not just the words
 
    ## Emotion Categories
    - **Happiness**: Joy, amusement, contentment
    - **Sadness**: Sorrow, disappointment, melancholy
    - **Anger**: Frustration, irritation, rage
    - **Fear**: Anxiety, nervousness, terror
    - **Surprise**: Astonishment, startle
    - **Disgust**: Revulsion, contempt
    - **Neutral**: Calm, matter-of-fact
 
    ## Tips
    - Consider tone, pitch, speaking rate
    - The transcript may not match the emotion
    - When unsure between two emotions, choose the stronger one
    - Use the intensity scale for unclear cases
 
output_annotation_dir: annotations/
export_annotation_format: jsonl

Output Format

json

{
  "id": "utt_001",
  "audio_url": "/audio/sample_001.wav",
  "transcript": "I can't believe this happened!",
  "annotations": {
    "primary_emotion": "Surprise",
    "intensity": 4,
    "valence": 2,
    "arousal": 6,
    "voice_qualities": ["Raised pitch", "Fast speech rate"],
    "authenticity": "Clearly genuine",
    "confidence": 4
  },
  "annotator": "rater_01",
  "timestamp": "2024-12-05T10:30:00Z"
}
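
When several raters label the same clip, records like the one above usually need to be collapsed into one label per item. A minimal majority-vote sketch, assuming the export holds one such JSON object per line and one record per rater and item (`load_records` and `majority_label` are hypothetical helpers):

```python
import json
from collections import Counter, defaultdict


def load_records(lines):
    """Parse JSON lines, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def majority_label(records, field="primary_emotion"):
    """Return {item_id: winning label}, or None for an item when the top vote is tied."""
    votes = defaultdict(Counter)
    for record in records:
        votes[record["id"]][record["annotations"][field]] += 1

    result = {}
    for item_id, counter in votes.items():
        ranked = counter.most_common()
        top_label, top_count = ranked[0]
        # A tie for first place is left unresolved instead of picked arbitrarily.
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        result[item_id] = None if tied else top_label
    return result
```

Items that come back as `None` are natural candidates for adjudication or for discussion among annotators.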

Segment-Level Emotion

For longer audio in which the emotion changes:

yaml

annotation_schemes:
  - annotation_type: audio_segments
    name: emotion_segments
    description: "Mark time segments with different emotions"
    labels:
      - name: Happy
        color: "#FCD34D"
      - name: Sad
        color: "#60A5FA"
      - name: Angry
        color: "#F87171"
      - name: Neutral
        color: "#9CA3AF"
 
    segment_attributes:
      - name: intensity
        type: likert
        size: 5
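
The export format for segment annotations is not shown in this tutorial, so the following is only a sketch under a stated assumption: each segment is available as a dict with `start` and `end` times in seconds and a `label`. Under that assumption, two simple checks are the total time per emotion and a scan for overlapping segments (both helper names are hypothetical):

```python
def label_durations(segments):
    """Sum the duration in seconds covered by each label."""
    totals = {}
    for seg in segments:
        if seg["end"] <= seg["start"]:
            raise ValueError(f"segment has non-positive length: {seg}")
        totals[seg["label"]] = totals.get(seg["label"], 0.0) + (seg["end"] - seg["start"])
    return totals


def find_overlaps(segments):
    """Return (earlier, later) pairs where a segment starts before an earlier one has ended."""
    ordered = sorted(segments, key=lambda seg: seg["start"])
    overlaps = []
    if not ordered:
        return overlaps
    furthest = ordered[0]  # the segment seen so far that ends latest
    for current in ordered[1:]:
        if current["start"] < furthest["end"]:
            overlaps.append((furthest, current))
        if current["end"] > furthest["end"]:
            furthest = current
    return overlaps
```

Whether overlaps count as errors depends on the guidelines; for mixed emotions they may be intentional.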

Quality Control

yaml

quality_control:
  attention_checks:
    enabled: true
    gold_items:
      - audio: "/audio/gold/clearly_happy.wav"
        expected:
          primary_emotion: "Happiness"
          intensity: [4, 5]  # Accept 4 or 5
      - audio: "/audio/gold/clearly_angry.wav"
        expected:
          primary_emotion: "Anger"
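
How Potato scores gold items internally is not described here. The sketch below is an offline check that mirrors the config above: a scalar under `expected` must match exactly, and a list means any listed value is accepted. The helper names and the `responses` mapping from audio path to annotation are assumptions:

```python
def matches_expected(annotation, expected):
    """True if every expected field holds an accepted value."""
    for field, accepted in expected.items():
        # A list such as [4, 5] means "accept any of these values".
        accepted_values = accepted if isinstance(accepted, list) else [accepted]
        if annotation.get(field) not in accepted_values:
            return False
    return True


def gold_accuracy(responses, gold_items):
    """Fraction of answered gold items that pass, or None if none were answered."""
    answered = [gold for gold in gold_items if gold["audio"] in responses]
    if not answered:
        return None
    passed = sum(
        matches_expected(responses[gold["audio"]], gold["expected"])
        for gold in answered
    )
    return passed / len(answered)
```

A per-annotator accuracy computed this way can be compared against a threshold chosen for the project.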

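Emotion Annotation Tips

Listen to the whole clip before deciding, and pay attention to how the words are spoken rather than to the words themselves. Keep in mind that norms of expression differ across cultures: what reads as anger in one culture may read as emphasis in another. Emotion annotation is tiring work, so encourage breaks, and have the team discuss disagreements regularly to stay calibrated.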

Next Steps

Add speaker diarization for multi-speaker emotion tracking
Set up crowdsourcing for large-scale collection
Compute inter-annotator agreement for emotion tasks

Documentation is at /docs/features/audio-annotation.
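
As a starting point for the agreement step, Cohen's kappa for two raters on a categorical field such as `primary_emotion` can be computed without extra dependencies. This is a sketch that assumes both label lists are aligned item by item; `cohens_kappa` is a helper written for this example, not a Potato function:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's own label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

For more than two raters or for ordinal scales such as intensity, a different statistic (for example Krippendorff's alpha) is usually a better fit.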