SODA: Social Dialogue Quality Evaluation
Evaluate the quality of socially-grounded dialogues generated through social commonsense contextualization. Annotators rate naturalness, coherence, social appropriateness, and commonsense alignment of dialogues situated in specific social contexts and interpersonal relationships.
Configuration file: config.yaml
# SODA: Social Dialogue Quality Evaluation
# Based on "SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization" (Kim et al., EMNLP 2023)
# Task: Evaluate quality of socially-grounded dialogues
annotation_task_name: "SODA Social Dialogue Evaluation"
task_dir: "."
# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "narrative"
# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Display layout showing social context and dialogue in chat format
html_layout: |
  <div class="soda-container" style="font-family: Arial, sans-serif; max-width: 800px; margin: 0 auto;">
    <div class="metadata-bar" style="display: flex; gap: 12px; margin-bottom: 14px; flex-wrap: wrap;">
      <div style="background: #e8eaf6; padding: 7px 14px; border-radius: 8px;">
        <strong>Relationship:</strong> {{relation}}
      </div>
    </div>
    <div class="context-section" style="background: #f3e5f5; padding: 16px; border-radius: 8px; margin-bottom: 16px; border-left: 5px solid #9c27b0;">
      <h3 style="margin-top: 0; color: #6a1b9a;">Social Context / Narrative</h3>
      <div style="font-size: 15px; line-height: 1.6;">{{narrative}}</div>
    </div>
    <div class="commonsense-section" style="background: #fff8e1; padding: 14px; border-radius: 8px; margin-bottom: 16px; border-left: 5px solid #ffa000;">
      <h4 style="margin-top: 0; color: #e65100;">Head Event (Social Commonsense)</h4>
      <div style="font-size: 14px; font-style: italic;">{{head_event}}</div>
    </div>
    <div class="dialogue-section" style="background: #fafafa; padding: 16px; border-radius: 8px; border: 1px solid #e0e0e0;">
      <h3 style="margin-top: 0; color: #424242;">Dialogue</h3>
      <div class="dialogue-turns" style="font-size: 14px; line-height: 1.8; white-space: pre-wrap;">{{dialogue}}</div>
    </div>
  </div>
# Annotation schemes
annotation_schemes:
  # Naturalness rating
  - name: "naturalness"
    description: "How natural does this dialogue sound? Would real people in this situation talk like this?"
    annotation_type: radio
    labels:
      - name: "very-natural"
        tooltip: "The dialogue sounds completely natural and realistic for the given context"
        key_value: "1"
      - name: "somewhat-natural"
        tooltip: "The dialogue mostly sounds natural with minor awkward moments"
        key_value: "2"
      - name: "neutral"
        tooltip: "The dialogue is neither clearly natural nor clearly unnatural"
        key_value: "3"
      - name: "somewhat-unnatural"
        tooltip: "The dialogue has several unnatural-sounding turns or phrasings"
        key_value: "4"
      - name: "very-unnatural"
        tooltip: "The dialogue sounds robotic, scripted, or implausible for the context"
        key_value: "5"

  # Coherence rating
  - name: "coherence"
    description: "How coherent is the dialogue? Do the turns logically follow each other?"
    annotation_type: radio
    labels:
      - name: "fully-coherent"
        tooltip: "All turns follow logically; the conversation flows perfectly"
        key_value: "q"
      - name: "mostly-coherent"
        tooltip: "Most turns connect well with minor coherence issues"
        key_value: "w"
      - name: "somewhat-coherent"
        tooltip: "Some turns connect but there are noticeable coherence gaps"
        key_value: "e"
      - name: "mostly-incoherent"
        tooltip: "Many turns do not follow logically from previous turns"
        key_value: "r"
      - name: "fully-incoherent"
        tooltip: "The turns seem disconnected or randomly ordered"
        key_value: "t"

  # Social appropriateness
  - name: "social_appropriateness"
    description: "Is the dialogue socially appropriate for the described relationship and context?"
    annotation_type: radio
    labels:
      - name: "appropriate"
        tooltip: "Perfectly appropriate for the relationship and social context"
      - name: "mostly-appropriate"
        tooltip: "Largely appropriate with minor social missteps"
      - name: "neutral"
        tooltip: "Neither clearly appropriate nor inappropriate"
      - name: "mostly-inappropriate"
        tooltip: "Several socially awkward or inappropriate exchanges"
      - name: "inappropriate"
        tooltip: "The dialogue is clearly inappropriate for the given context"

  # Commonsense alignment
  - name: "commonsense_alignment"
    description: "How well does the dialogue align with the social commonsense context (the head event and narrative)?"
    annotation_type: radio
    labels:
      - name: "strong-alignment"
        tooltip: "The dialogue clearly reflects and is grounded in the social context"
      - name: "moderate-alignment"
        tooltip: "The dialogue mostly connects to the social context"
      - name: "weak-alignment"
        tooltip: "The dialogue has only loose connection to the social context"
      - name: "no-alignment"
        tooltip: "The dialogue does not relate to the provided social context"

  # Free-text feedback
  - name: "feedback"
    description: "Provide specific feedback on dialogue quality. Note any unnatural turns, coherence breaks, or social missteps."
    annotation_type: text
# User configuration
allow_all_users: true
# Task assignment
instances_per_annotator: 50
annotation_per_instance: 2
# Detailed annotation instructions
annotation_instructions: |
  ## SODA: Social Dialogue Quality Evaluation

  You are evaluating dialogues that were generated using social commonsense
  contextualization. Each dialogue is grounded in a social narrative and a
  commonsense event describing the relationship between speakers.

  ### Your Tasks:

  1. **Read the social context** (narrative) and head event carefully.
  2. **Read the full dialogue** before rating.
  3. **Rate naturalness**: Does this sound like a real conversation between people
     in this situation?
  4. **Rate coherence**: Do the dialogue turns logically follow each other?
  5. **Rate social appropriateness**: Is the tone and content appropriate for the
     described relationship (friends, family, coworkers, strangers, etc.)?
  6. **Rate commonsense alignment**: Does the dialogue reflect the social context?
  7. **Provide feedback**: Note specific turns that are problematic and explain why.

  ### What to Look For:

  - **Natural dialogue** uses contractions, filler words, and varied sentence length.
  - **Coherent dialogue** maintains topic continuity and responds to previous turns.
  - **Socially appropriate dialogue** matches the register expected for the relationship
    (e.g., friends are casual, coworkers may be more formal).
  - **Aligned dialogue** reflects the narrative situation and social commonsense.

  ### Common Issues:

  - Overly formal language between close friends.
  - Sudden topic changes without transition.
  - Generic responses that could fit any context.
  - Ignoring or contradicting the provided social narrative.
  - Repetitive phrasing across turns.
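The assignment settings in the config (`instances_per_annotator: 50`, `annotation_per_instance: 2`) determine how large an annotator pool a dataset needs. A back-of-the-envelope check of that arithmetic (plain Python, not a Potato API):

```python
import math

def annotators_needed(n_items: int, per_instance: int, per_annotator: int) -> int:
    """Minimum number of annotators so every item receives `per_instance`
    labels, with each annotator capped at `per_annotator` items."""
    total_annotations = n_items * per_instance
    # Each item needs `per_instance` *distinct* annotators, so the answer
    # can never be smaller than per_instance itself.
    return max(per_instance, math.ceil(total_annotations / per_annotator))

# With the 10 sample items, 2 annotations per item, and a 50-item cap:
print(annotators_needed(10, 2, 50))   # → 2
# A larger pool of 200 items needs more people:
print(annotators_needed(200, 2, 50))  # → 8
```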
Sample data: sample-data.json
[
{
"id": "soda_001",
"narrative": "Alex and Jordan have been close friends since college. Alex just got promoted at work and is excited to share the news. They meet at their favorite coffee shop after work.",
"dialogue": "Speaker 1 (Alex): Hey! You're not going to believe this. I got the promotion!\nSpeaker 2 (Jordan): No way! Are you serious? That's amazing, Alex!\nSpeaker 1 (Alex): Dead serious. They told me this morning. I'm going to be leading the entire product team.\nSpeaker 2 (Jordan): I knew it was going to happen. You've been working so hard on that project. When do you start?\nSpeaker 1 (Alex): Next Monday. I'm honestly a little nervous. It's a big jump.\nSpeaker 2 (Jordan): You're going to crush it. Remember when you thought you couldn't handle the presentation last quarter? You ended up getting a standing ovation.\nSpeaker 1 (Alex): Ha, that's true. Thanks, Jordan. It really means a lot to have your support.\nSpeaker 2 (Jordan): Always. Now let me buy you a celebratory coffee. Actually, make that two.",
"relation": "friends",
"head_event": "PersonX gets promoted at work"
},
{
"id": "soda_002",
"narrative": "A mother calls her adult son who has been living abroad for two years. She is worried because he hasn't called in three weeks and she heard about severe weather in his area.",
"dialogue": "Speaker 1 (Mom): Sweetie, I've been trying to reach you for days. Are you okay?\nSpeaker 2 (Son): Hi Mom, yeah I'm fine. Sorry I haven't called. Things have been really hectic at the lab.\nSpeaker 1 (Mom): I saw on the news there were terrible storms over there. Your father and I were worried sick.\nSpeaker 2 (Son): Oh, the storms were mostly south of the city. We just had some rain here. I should have called to let you know.\nSpeaker 1 (Mom): Yes, you should have. Three weeks without hearing from you, and then I see flooding on television. What am I supposed to think?\nSpeaker 2 (Son): You're right, Mom. I'm sorry. I'll set a reminder to call every weekend. I promise.\nSpeaker 1 (Mom): That's all I ask. Now tell me about your research. Are you eating properly?",
"relation": "family",
"head_event": "PersonX worries about PersonY's safety"
}
]
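The `{{relation}}`, `{{narrative}}`, `{{head_event}}`, and `{{dialogue}}` placeholders in `html_layout` are filled from the item fields shown above. A minimal sketch of that substitution (a regex stand-in for illustration, not Potato's actual template engine):

```python
import re

def render_layout(layout: str, item: dict) -> str:
    """Replace each {{key}} placeholder with the matching item field."""
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: str(item.get(m.group(1), "")),
                  layout)

item = {
    "relation": "friends",
    "head_event": "PersonX gets promoted at work",
}
print(render_layout("<strong>Relationship:</strong> {{relation}}", item))
# → <strong>Relationship:</strong> friends
```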
// ... and 8 more items
Get this design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/multimodal/soda-eval-social-dialogue
potato start config.yaml
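Before starting the server, it can be worth checking that every item in `sample-data.json` carries the fields the layout and `item_properties` reference. A hypothetical pre-flight check (not part of Potato; field names taken from the sample data above):

```python
import json

REQUIRED_KEYS = {"id", "narrative", "dialogue", "relation", "head_event"}

def validate_items(items: list) -> list:
    """Return (item_id, missing_keys) pairs for items lacking required fields."""
    problems = []
    for i, item in enumerate(items):
        missing = sorted(REQUIRED_KEYS - item.keys())
        if missing:
            problems.append((item.get("id", f"index {i}"), missing))
    return problems

items = json.loads("""[
  {"id": "soda_001", "narrative": "...", "dialogue": "...",
   "relation": "friends", "head_event": "PersonX gets promoted at work"},
  {"id": "soda_bad", "narrative": "missing most fields"}
]""")
print(validate_items(items))
# → [('soda_bad', ['dialogue', 'head_event', 'relation'])]
```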
Found a problem or want to improve this design?
Open an issue
Related designs
MediTOD Medical Dialogue Annotation
Medical history-taking dialogue annotation based on the MediTOD dataset. Annotators label dialogue acts, identify medical entities (symptoms, conditions, medications, tests), and assess doctor-patient communication quality across multi-turn clinical conversations.
Bias Benchmark for QA (BBQ)
Annotate question-answering examples designed to probe social biases. Based on BBQ (Parrish et al., Findings of ACL 2022). Annotators select the correct answer given a context, assess the direction of bias in the question, categorize the type of bias, and explain their reasoning.
BIG-Bench Task Evaluation
Evaluate language model responses on diverse reasoning tasks from the BIG-Bench benchmark. Annotators assess correctness, provide reasoning explanations, and rate confidence for model outputs across multiple task categories.