CMU-MOSEI Multimodal Sentiment Multi-Tier Annotation
Multi-tier ELAN-style annotation of multimodal sentiment and emotion in YouTube opinion videos. Annotators segment visual behaviors and acoustic events on parallel timeline tiers, classify emotions and sentiment polarity, and transcribe speech for the CMU-MOSEI dataset.
Configuration File: config.yaml
# CMU-MOSEI Multimodal Sentiment Multi-Tier Annotation Configuration
# Based on Zadeh et al., ACL 2018
# Paper: https://aclanthology.org/P18-1208/
# Task: ELAN-style multi-tier annotation of visual behavior, acoustic events, emotion, and sentiment
annotation_task_name: "CMU-MOSEI Multimodal Sentiment Multi-Tier Annotation"
task_dir: "."
# Data configuration
data_files:
- sample-data.json
item_properties:
id_key: "id"
text_key: "video_url"
# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Annotation schemes - ELAN-style parallel tiers aligned to the video timeline
annotation_schemes:
# Tier 1: Visual behavior segmentation
- name: "visual_behavior_tier"
description: |
Segment the video timeline by the speaker's visible facial expressions,
head movements, and gestures. Mark the onset and offset of each distinct
visual behavior observed.
annotation_type: "video_annotation"
mode: "segment"
labels:
- name: "neutral-face"
color: "#9CA3AF"
tooltip: "Neutral facial expression with no strong affect signal"
- name: "smile"
color: "#22C55E"
tooltip: "Visible smile or positive facial expression"
- name: "frown"
color: "#EF4444"
tooltip: "Frown, grimace, or negative facial expression"
- name: "eyebrow-raise"
color: "#A855F7"
tooltip: "Raised eyebrows indicating surprise or emphasis"
- name: "head-nod"
color: "#3B82F6"
tooltip: "Vertical head nod indicating agreement or affirmation"
- name: "head-shake"
color: "#F97316"
tooltip: "Horizontal head shake indicating disagreement or negation"
- name: "gesture"
color: "#14B8A6"
tooltip: "Hand or arm gesture accompanying speech"
- name: "gaze-away"
color: "#6B7280"
tooltip: "Speaker looking away from camera (thinking, reading, etc.)"
show_timecode: true
video_fps: 30
# Tier 2: Acoustic event segmentation
- name: "acoustic_tier"
description: |
Segment the audio timeline by notable acoustic events and prosodic
patterns. Mark pitch changes, emphasis, pauses, laughter, and fillers
that carry affective information.
annotation_type: "video_annotation"
mode: "segment"
labels:
- name: "rising-pitch"
color: "#3B82F6"
tooltip: "Rising intonation pattern (questions, uncertainty, excitement)"
- name: "falling-pitch"
color: "#6366F1"
tooltip: "Falling intonation pattern (statements, certainty, finality)"
- name: "emphasis"
color: "#EF4444"
tooltip: "Stressed or emphasized word/phrase with increased loudness"
- name: "pause"
color: "#9CA3AF"
tooltip: "Noticeable silence or pause in speech"
- name: "laughter"
color: "#22C55E"
tooltip: "Audible laughter or chuckling"
- name: "filler"
color: "#F59E0B"
tooltip: "Filler words or hesitation markers (um, uh, like, you know)"
show_timecode: true
video_fps: 30
# Tier 3: Emotion classification
- name: "emotion"
description: "Classify the dominant emotion expressed by the speaker in this segment."
annotation_type: radio
labels:
- "happiness"
- "sadness"
- "anger"
- "fear"
- "disgust"
- "surprise"
keyboard_shortcuts:
happiness: "1"
sadness: "2"
anger: "3"
fear: "4"
disgust: "5"
surprise: "6"
# Tier 4: Sentiment polarity (7-point Likert-style scale)
- name: "sentiment_polarity"
description: "Rate the overall sentiment polarity of the speaker's opinion on a 7-point scale."
annotation_type: radio
labels:
- "strongly-negative"
- "negative"
- "weakly-negative"
- "neutral"
- "weakly-positive"
- "positive"
- "strongly-positive"
# Tier 5: Speech transcription (free text)
- name: "transcription"
description: "Transcribe the speaker's utterance verbatim, including fillers and false starts."
annotation_type: text
textarea: true
# HTML layout
html_layout: |
<div style="max-width: 900px; margin: 0 auto;">
<h3 style="margin-bottom: 8px;">CMU-MOSEI: Multi-Tier Multimodal Sentiment Annotation</h3>
<p style="color: #666; font-size: 14px; margin-bottom: 16px;">
Annotate visual behaviors, acoustic events, emotion, and sentiment across
parallel timeline tiers for multimodal sentiment analysis.
</p>
<div style="text-align: center; margin-bottom: 20px;">
<video controls width="720" style="max-width: 100%; border-radius: 8px; border: 1px solid #ddd;">
<source src="{{video_url}}" type="video/mp4">
Your browser does not support video playback.
</video>
</div>
<div style="background: #f8f9fa; padding: 12px; border-radius: 6px; margin-bottom: 16px; font-size: 13px;">
<strong>Multi-Tier Instructions:</strong> Annotate the video across five parallel tiers:
visual behavior segments, acoustic event segments, emotion category, sentiment polarity,
and verbatim transcription. Each modality provides complementary sentiment cues.
</div>
</div>
# User configuration
allow_all_users: true
# Task assignment
instances_per_annotator: 30
annotation_per_instance: 2
# Instructions
annotation_instructions: |
## CMU-MOSEI Multimodal Sentiment Multi-Tier Annotation
This task uses ELAN-style multi-tier annotation to capture visual, acoustic,
and linguistic signals of sentiment and emotion in YouTube opinion videos.
### Tier 1: Visual Behavior Segmentation
- Segment the video timeline based on the speaker's facial expressions and movements:
- **Neutral face**: No strong affect visible
- **Smile**: Positive facial expression (Duchenne smile, grin, etc.)
- **Frown**: Negative facial expression (furrowed brow, pursed lips)
- **Eyebrow raise**: Surprise, emphasis, or question
- **Head nod/shake**: Agreement or disagreement signals
- **Gesture**: Communicative hand/arm movements
- **Gaze away**: Speaker looking away from camera
### Tier 2: Acoustic Event Segmentation
- Segment the audio timeline by prosodic and vocal events:
- **Rising/falling pitch**: Intonation contour changes
- **Emphasis**: Louder or stressed words/phrases
- **Pause**: Noticeable silence in the speech stream
- **Laughter**: Any audible laughter
- **Filler**: Hesitation markers (um, uh, like, you know)
### Tier 3: Emotion Classification
- Select the single dominant emotion expressed in this clip
- Choose from: happiness, sadness, anger, fear, disgust, surprise
### Tier 4: Sentiment Polarity
- Rate the overall opinion sentiment on a 7-point scale
- Consider both what is said and how it is said (facial expression, tone)
- Scale: strongly-negative to strongly-positive
### Tier 5: Speech Transcription
- Transcribe the speaker's words verbatim
- Include fillers (um, uh), false starts, and self-corrections
- Use standard punctuation to indicate prosodic phrasing
### Multimodal Integration Tips
- Visual and acoustic tiers may not align perfectly; annotate each independently
- A smile during negative words may indicate sarcasm; note this in transcription
- Pay attention to mismatches between modalities as these are analytically important
- Use slow-motion playback to catch subtle facial expressions
Sample Data: sample-data.json
[
{
"id": "mosei_001",
"video_url": "https://example.com/videos/cmu-mosei/opinion_electronics_001.mp4",
"speaker_id": "speaker_142",
"topic": "review of new wireless headphones",
"duration_seconds": 18.5,
"source": "YouTube"
},
{
"id": "mosei_002",
"video_url": "https://example.com/videos/cmu-mosei/opinion_movie_001.mp4",
"speaker_id": "speaker_087",
"topic": "reaction to a recent blockbuster film",
"duration_seconds": 22.1,
"source": "YouTube"
}
]
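As a quick sanity check before launching the task, each item in sample-data.json can be validated against the keys that `item_properties` in config.yaml expects (`id_key: "id"`, `text_key: "video_url"`). A minimal sketch, with two inline items mirroring the sample data above; the `validate_items` helper is illustrative, not part of the Potato toolkit:

```python
# Sketch: check that every item carries the keys referenced by
# item_properties in config.yaml (id_key="id", text_key="video_url").
import json

SAMPLE_ITEMS = json.loads("""
[
  {"id": "mosei_001",
   "video_url": "https://example.com/videos/cmu-mosei/opinion_electronics_001.mp4",
   "speaker_id": "speaker_142", "duration_seconds": 18.5},
  {"id": "mosei_002",
   "video_url": "https://example.com/videos/cmu-mosei/opinion_movie_001.mp4",
   "speaker_id": "speaker_087", "duration_seconds": 22.1}
]
""")

def validate_items(items, id_key="id", text_key="video_url"):
    """Return (index, missing_keys) pairs for items lacking a required key."""
    problems = []
    for i, item in enumerate(items):
        missing = [k for k in (id_key, text_key) if k not in item]
        if missing:
            problems.append((i, missing))
    return problems

# An empty result means every item is loadable by the configured keys.
assert validate_items(SAMPLE_ITEMS) == []
```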
// ... and 8 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/cmu-mosei-multimodal-sentiment
potato start config.yaml
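Both timeline tiers set `video_fps: 30` and `show_timecode: true`, so when post-processing exported segment annotations it can be handy to convert second offsets into frame indices and HH:MM:SS:FF timecodes. A small sketch of that arithmetic; the `start`/`end` field names in the example segment are assumptions about the export format, not a documented schema:

```python
# Sketch: convert segment boundaries in seconds to frame indices and
# HH:MM:SS:FF timecodes at the tiers' configured video_fps of 30.
FPS = 30

def to_frame(seconds: float, fps: int = FPS) -> int:
    """Nearest frame index for a time offset in seconds."""
    return round(seconds * fps)

def to_timecode(seconds: float, fps: int = FPS) -> str:
    """Render a time offset as an HH:MM:SS:FF timecode string."""
    total_frames = to_frame(seconds, fps)
    frames = total_frames % fps
    total_secs = total_frames // fps
    h, rem = divmod(total_secs, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}:{frames:02d}"

# Hypothetical exported segment; "start"/"end" key names are assumed.
segment = {"label": "smile", "start": 2.5, "end": 4.2}
print(to_frame(segment["start"]), to_timecode(segment["end"]))  # 75 00:00:04:06
```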
Found an issue or want to improve this design?
Open an Issue
Related Designs
IEMOCAP Dyadic Emotion Multi-Tier Annotation
Multi-tier ELAN-style annotation of emotional dyadic interactions. Annotators segment per-speaker behavior on parallel timeline tiers, classify discrete emotion categories, and rate dimensional affect (valence, activation, dominance) on Likert-style scales. Based on the IEMOCAP motion capture database.
CHILDES Child Language Multi-Tier Annotation
Multi-tier ELAN-style annotation of child-adult interaction videos for language acquisition research. Annotators segment utterance boundaries on the timeline, provide morphological and syntactic annotations, and classify communicative context and error types. Based on the CHILDES/TalkBank project.
DGS Corpus Sign Language Multi-Tier Annotation
Multi-tier ELAN-style annotation of German Sign Language (DGS) corpus videos. Annotators segment sign types, mouth gestures, non-manual signals, classify discourse functions, and provide German translations across parallel tiers aligned to the video timeline.