IEMOCAP Dyadic Emotion Multi-Tier Annotation
Multi-tier ELAN-style annotation of emotional dyadic interactions. Annotators segment per-speaker behavior on parallel timeline tiers, classify discrete emotion categories, and rate dimensional affect (valence, activation, dominance) on 7-point Likert-style scales. Based on the IEMOCAP (Interactive Emotional Dyadic Motion Capture) database.
Configuration File: config.yaml
# IEMOCAP Dyadic Emotion Multi-Tier Annotation Configuration
# Based on Busso et al., Language Resources and Evaluation 2008
# Paper: https://doi.org/10.1007/s10579-008-9076-6
# Task: ELAN-style multi-tier annotation of emotional dyadic interactions
annotation_task_name: "IEMOCAP Dyadic Emotion Multi-Tier Annotation"
task_dir: "."
# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "video_url"
# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Annotation schemes - ELAN-style parallel tiers aligned to the video timeline
annotation_schemes:
  # Tier 1: Speaker A behavior segmentation
  - name: "speaker_a_tier"
    description: |
      Segment the timeline by Speaker A's observable behavior. Mark when they
      are speaking, listening, producing backchannels, laughing, or silent.
      This tier tracks Speaker A's contribution to the dyadic interaction.
    annotation_type: "video_annotation"
    mode: "segment"
    labels:
      - name: "speaking"
        color: "#3B82F6"
        tooltip: "Speaker A is actively speaking or vocalizing"
      - name: "listening"
        color: "#10B981"
        tooltip: "Speaker A is silently attending to Speaker B"
      - name: "backchannel"
        color: "#F59E0B"
        tooltip: "Speaker A produces a brief backchannel response (mm-hmm, yeah, uh-huh)"
      - name: "laughing"
        color: "#EC4899"
        tooltip: "Speaker A is laughing (with or without speech)"
      - name: "silence"
        color: "#9CA3AF"
        tooltip: "Speaker A is silent and not visibly engaged (pause, thinking)"
    show_timecode: true
    video_fps: 30
  # Tier 2: Speaker B behavior segmentation
  - name: "speaker_b_tier"
    description: |
      Segment the timeline by Speaker B's observable behavior. Mark when they
      are speaking, listening, producing backchannels, laughing, or silent.
      This tier tracks Speaker B's contribution to the dyadic interaction.
    annotation_type: "video_annotation"
    mode: "segment"
    labels:
      - name: "speaking"
        color: "#6366F1"
        tooltip: "Speaker B is actively speaking or vocalizing"
      - name: "listening"
        color: "#14B8A6"
        tooltip: "Speaker B is silently attending to Speaker A"
      - name: "backchannel"
        color: "#F97316"
        tooltip: "Speaker B produces a brief backchannel response (mm-hmm, yeah, uh-huh)"
      - name: "laughing"
        color: "#A855F7"
        tooltip: "Speaker B is laughing (with or without speech)"
      - name: "silence"
        color: "#6B7280"
        tooltip: "Speaker B is silent and not visibly engaged (pause, thinking)"
    show_timecode: true
    video_fps: 30
  # Tier 3: Emotion category classification
  - name: "emotion_category"
    description: "Classify the dominant emotion expressed in this segment of the interaction."
    annotation_type: radio
    labels:
      - "neutral"
      - "happiness"
      - "sadness"
      - "anger"
      - "frustration"
      - "excitement"
      - "fear"
      - "surprise"
      - "disgust"
      - "other"
    keyboard_shortcuts:
      neutral: "0"
      happiness: "1"
      sadness: "2"
      anger: "3"
      frustration: "4"
      excitement: "5"
  # Tier 4: Valence rating (7-point Likert-style)
  - name: "valence"
    description: "Rate the emotional valence (pleasantness) on a 7-point scale from very negative to very positive."
    annotation_type: radio
    labels:
      - "1-very-negative"
      - "2-negative"
      - "3-slightly-negative"
      - "4-neutral"
      - "5-slightly-positive"
      - "6-positive"
      - "7-very-positive"
  # Tier 5: Activation/arousal rating (7-point Likert-style)
  - name: "activation"
    description: "Rate the emotional activation/arousal on a 7-point scale from very calm to very active."
    annotation_type: radio
    labels:
      - "1-very-calm"
      - "2-calm"
      - "3-slightly-calm"
      - "4-neutral"
      - "5-slightly-active"
      - "6-active"
      - "7-very-active"
  # Tier 6: Dominance rating (7-point Likert-style)
  - name: "dominance"
    description: "Rate the perceived dominance/control on a 7-point scale from very submissive to very dominant."
    annotation_type: radio
    labels:
      - "1-very-submissive"
      - "2-submissive"
      - "3-slightly-submissive"
      - "4-neutral"
      - "5-slightly-dominant"
      - "6-dominant"
      - "7-very-dominant"
# HTML layout
html_layout: |
  <div style="max-width: 900px; margin: 0 auto;">
    <h3 style="margin-bottom: 8px;">IEMOCAP: Multi-Tier Dyadic Emotion Annotation</h3>
    <p style="color: #666; font-size: 14px; margin-bottom: 16px;">
      Annotate emotional dyadic interactions across parallel tiers for speaker behaviors,
      emotion categories, and dimensional affect ratings (valence, activation, dominance).
    </p>
    <div style="text-align: center; margin-bottom: 20px;">
      <video controls width="720" style="max-width: 100%; border-radius: 8px; border: 1px solid #ddd;">
        <source src="{{video_url}}" type="video/mp4">
        Your browser does not support video playback.
      </video>
    </div>
    <div style="background: #f8f9fa; padding: 12px; border-radius: 6px; margin-bottom: 16px; font-size: 13px;">
      <strong>Multi-Tier Instructions:</strong> Annotate the dyadic interaction across six
      parallel tiers: Speaker A behavior, Speaker B behavior, emotion category, valence,
      activation, and dominance. The two speaker tiers run in parallel to capture the
      dynamics of turn-taking and emotional co-regulation.
    </div>
  </div>
# User configuration
allow_all_users: true
# Task assignment
instances_per_annotator: 30
annotation_per_instance: 2
# Instructions
annotation_instructions: |
  ## IEMOCAP Dyadic Emotion Multi-Tier Annotation

  This task uses ELAN-style multi-tier annotation to capture emotional dynamics
  in dyadic (two-person) interactions from the IEMOCAP database.

  ### Tier 1: Speaker A Behavior
  - Segment Speaker A's behavior throughout the interaction:
    - **Speaking**: Actively talking or vocalizing
    - **Listening**: Silently attending to Speaker B
    - **Backchannel**: Brief vocal feedback (mm-hmm, yeah, uh-huh, right)
    - **Laughing**: Audible laughter (may co-occur with speech)
    - **Silence**: Not engaged in speaking or active listening

  ### Tier 2: Speaker B Behavior
  - Segment Speaker B's behavior using the same labels.
  - The two speaker tiers run in parallel, allowing analysis of:
    - Turn-taking patterns and timing
    - Overlap and simultaneous speech
    - Listener behavior during the other's turn
    - Mutual laughter episodes

  ### Tier 3: Emotion Category
  - Classify the dominant emotion for the current segment:
    - **Neutral**: No strong emotional expression
    - **Happiness**: Joy, amusement, contentment
    - **Sadness**: Sorrow, disappointment, grief
    - **Anger**: Irritation, rage, hostility
    - **Frustration**: Annoyance, exasperation (distinct from anger)
    - **Excitement**: High-energy positive arousal
    - **Fear**: Anxiety, worry, apprehension
    - **Surprise**: Unexpected reaction (positive or negative)
    - **Disgust**: Revulsion, distaste
    - **Other**: An emotion not captured by the above categories
  - Rate the emotion the speaker expresses, not what you think they feel internally.

  ### Tier 4: Valence (7-point scale)
  - How pleasant or unpleasant is the expressed emotion?
  - 1 = very negative/unpleasant, 4 = neutral, 7 = very positive/pleasant

  ### Tier 5: Activation (7-point scale)
  - How energetic or calm is the emotional expression?
  - 1 = very calm/low energy, 4 = neutral, 7 = very active/high energy
  - Note: both positive (excitement) and negative (anger) emotions can be high-activation.

  ### Tier 6: Dominance (7-point scale)
  - How dominant or submissive does the speaker appear?
  - 1 = very submissive/controlled, 4 = neutral, 7 = very dominant/in control
  - Consider vocal power, posture, and conversational control.

  ### Annotation Strategy
  - Watch each clip at least twice: once for an overall impression, once for detail.
  - Annotate the speaker tiers first to establish the interaction structure.
  - Then rate emotion, valence, activation, and dominance for each segment.
  - Consider both audio (voice, prosody) and visual (face, body) cues.
  - For scripted scenarios, rate the portrayed emotion, not the acting quality.
  - Frustration and anger are distinct categories: frustration is lower in arousal and less hostile.
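The three dimensional tiers above encode the numeric rating as a leading digit in each label (e.g. `5-slightly-positive`), so downstream analysis can recover a valence/activation/dominance triple directly from the exported label strings. A minimal sketch; the `segment` dict here is a hypothetical per-segment annotation record, not Potato's actual output schema:

```python
def scale_value(label: str) -> int:
    """Recover the numeric rating from labels like '5-slightly-positive'."""
    return int(label.split("-", 1)[0])

# Hypothetical per-segment annotation using the label strings from config.yaml.
segment = {
    "valence": "2-negative",
    "activation": "6-active",
    "dominance": "5-slightly-dominant",
}
vad = {tier: scale_value(label) for tier, label in segment.items()}
print(vad)  # {'valence': 2, 'activation': 6, 'dominance': 5}
```

Keeping the digit inside the label string makes the radio options self-documenting for annotators while staying trivially machine-readable.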
Sample Data: sample-data.json
[
{
"id": "iemocap_001",
"video_url": "https://example.com/videos/iemocap/ses01_script_argument_001.mp4",
"session_id": "session_01",
"scenario_type": "scripted",
"speaker_a_gender": "female",
"speaker_b_gender": "male",
"duration_seconds": 28.3
},
{
"id": "iemocap_002",
"video_url": "https://example.com/videos/iemocap/ses01_improv_breakup_001.mp4",
"session_id": "session_01",
"scenario_type": "improvised",
"speaker_a_gender": "female",
"speaker_b_gender": "male",
"duration_seconds": 35.7
}
]
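Before launching the task, it can help to verify that every item carries the keys named under `item_properties` in the config (`id` and `video_url`). A small sketch assuming only that the data file is a JSON array of objects; the malformed `iemocap_003` item is invented here to show a failure case:

```python
import json

# Keys that config.yaml's item_properties section expects on every item.
REQUIRED_KEYS = {"id", "video_url"}

def validate_items(items):
    """Return the ids of items missing a required key."""
    bad = []
    for item in items:
        if not REQUIRED_KEYS <= item.keys():
            bad.append(item.get("id", "<missing id>"))
    return bad

items = json.loads("""[
  {"id": "iemocap_001", "video_url": "https://example.com/videos/iemocap/ses01_script_argument_001.mp4"},
  {"id": "iemocap_003"}
]""")
print(validate_items(items))  # ['iemocap_003']
```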
// ... and 8 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/iemocap-dyadic-emotion
potato start config.yaml
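Because `annotation_per_instance: 2` assigns each clip to two annotators, chance-corrected agreement on the categorical tiers can be checked once annotations are collected. A self-contained sketch of Cohen's kappa; the label lists are hypothetical, and reading them out of `annotation_output/` is left aside:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' paired category labels."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical emotion_category labels from two annotators on the same six clips.
ann1 = ["anger", "frustration", "neutral", "sadness", "anger", "happiness"]
ann2 = ["anger", "anger", "neutral", "sadness", "frustration", "happiness"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.571
```

Anger/frustration confusions like the ones simulated above are a known difficulty in IEMOCAP-style annotation, which is why the instructions call the two out as distinct categories.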
Found an issue or want to improve this design?
Open an Issue

Related Designs
CMU-MOSEI Multimodal Sentiment Multi-Tier Annotation
Multi-tier ELAN-style annotation of multimodal sentiment and emotion in YouTube opinion videos. Annotators segment visual behaviors and acoustic events on parallel timeline tiers, classify emotions and sentiment polarity, and transcribe speech for the CMU-MOSEI dataset.
AMI Meeting Multi-Tier Annotation
Multi-tier ELAN-style annotation of multi-party meeting recordings. Annotators segment speaker turns, head gestures, and focus of attention on parallel timeline tiers, then classify dialogue acts and topic segments. Based on the AMI Meeting Corpus.
CHILDES Child Language Multi-Tier Annotation
Multi-tier ELAN-style annotation of child-adult interaction videos for language acquisition research. Annotators segment utterance boundaries on the timeline, provide morphological and syntactic annotations, and classify communicative context and error types. Based on the CHILDES/TalkBank project.