DISPLACE 2024 - Speaker and Language Diarization

A speaker and language diarization task for multilingual conversational audio. Annotators mark speaker turn boundaries, identify speakers, and label the language of each segment (Kundu et al., INTERSPEECH 2024).

Configuration File: config.yaml

This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.

yaml

# DISPLACE 2024 - Speaker and Language Diarization
# Based on Kundu et al., INTERSPEECH 2024
# Paper: https://www.isca-archive.org/interspeech_2024/kundu24_interspeech.html
# Dataset: https://displace2024.github.io/
#
# Task: Speaker and language diarization in multilingual conversational audio.
# Mark speaker turn boundaries, identify speakers, and label language per segment.
#
# Guidelines:
# - Listen to the full conversation to identify distinct speakers
# - Mark temporal boundaries where speaker changes occur
# - Assign consistent speaker labels throughout the conversation
# - Identify the language spoken in each segment
# - Note any overlapping speech or code-switching

annotation_task_name: "DISPLACE 2024: Speaker and Language Diarization"
task_dir: "."

data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "audio_url"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

annotation_schemes:
  - annotation_type: span
    name: speaker_segments
    description: "Mark temporal speaker segments with start and end times (in seconds)"
    span_mode: temporal
    labels:
      - name: "Speaker 1"
        color: "#3B82F6"
        tooltip: "First identified speaker"
      - name: "Speaker 2"
        color: "#EF4444"
        tooltip: "Second identified speaker"
      - name: "Speaker 3"
        color: "#10B981"
        tooltip: "Third identified speaker"
      - name: "Speaker 4"
        color: "#F59E0B"
        tooltip: "Fourth identified speaker"
      - name: "Speaker 5"
        color: "#8B5CF6"
        tooltip: "Fifth identified speaker"
      - name: "Overlap"
        color: "#6B7280"
        tooltip: "Multiple speakers talking simultaneously"

  - annotation_type: radio
    name: speaker_identity
    description: "Identify the current speaker for the selected segment"
    labels:
      - name: "Speaker 1"
        tooltip: "First/primary speaker in the conversation"
        key_value: "1"
      - name: "Speaker 2"
        tooltip: "Second speaker in the conversation"
        key_value: "2"
      - name: "Speaker 3"
        tooltip: "Third speaker (if present)"
        key_value: "3"
      - name: "Speaker 4"
        tooltip: "Fourth speaker (if present)"
        key_value: "4"
      - name: "Speaker 5"
        tooltip: "Fifth speaker (if present)"
        key_value: "5"
      - name: "Unknown"
        tooltip: "Cannot determine the speaker"
        key_value: "0"

  - annotation_type: radio
    name: language
    description: "What language is spoken in this segment?"
    labels:
      - name: "Hindi"
        tooltip: "Speaker is using Hindi"
        key_value: "h"
      - name: "English"
        tooltip: "Speaker is using English"
        key_value: "e"
      - name: "Tamil"
        tooltip: "Speaker is using Tamil"
        key_value: "t"
      - name: "Telugu"
        tooltip: "Speaker is using Telugu"
        key_value: "l"
      - name: "Kannada"
        tooltip: "Speaker is using Kannada"
        key_value: "k"
      - name: "Code-switched"
        tooltip: "Speaker switches between languages within this segment"
        key_value: "c"
      - name: "Other"
        tooltip: "Language not listed above"
        key_value: "o"

  - annotation_type: radio
    name: audio_quality
    description: "Rate the audio quality for this segment"
    labels:
      - name: "Clear"
        tooltip: "Audio is clear and easy to understand"
      - name: "Acceptable"
        tooltip: "Some noise but speech is understandable"
      - name: "Poor"
        tooltip: "Significant noise or distortion"
      - name: "Unintelligible"
        tooltip: "Cannot understand the speech"

audio_display:
  show_waveform: true
  playback_controls: true
  allow_speed_control: true
  show_spectrogram: true

allow_all_users: true
instances_per_annotator: 30
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
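
Because annotation_per_instance is set to 2, every conversation is segmented by two annotators, and it can help to compare their speaker segments. The sketch below is one simple way to do that, not part of Potato itself: it samples both annotations on a fixed time grid and reports the fraction of frames with matching labels. The (start, end, label) tuple shape, the function names, and the 0.1 s step are assumptions made for illustration; the actual layout of the files in annotation_output/ may differ.

```python
from typing import List, Optional, Tuple

# Hypothetical record shape: (start_sec, end_sec, label).
Segment = Tuple[float, float, str]


def label_at(segments: List[Segment], t: float) -> Optional[str]:
    """Return the label of the first segment covering time t, or None if uncovered."""
    for start, end, label in segments:
        if start <= t < end:
            return label
    return None


def frame_agreement(a: List[Segment], b: List[Segment],
                    duration: float, step: float = 0.1) -> float:
    """Fraction of frames on which both annotators gave the same label."""
    n_frames = int(round(duration / step))
    if n_frames == 0:
        raise ValueError("duration must cover at least one frame")
    same = 0
    for i in range(n_frames):
        t = (i + 0.5) * step  # frame midpoint, keeps samples off segment boundaries
        if label_at(a, t) == label_at(b, t):
            same += 1
    return same / n_frames


# Two illustrative annotations of a 10-second clip (made-up values).
annotator_a = [(0.0, 4.0, "Speaker 1"), (4.0, 10.0, "Speaker 2")]
annotator_b = [(0.0, 5.0, "Speaker 1"), (5.0, 10.0, "Speaker 2")]
agreement = frame_agreement(annotator_a, annotator_b, duration=10.0)
print(f"{agreement:.2f}")  # 0.90
```

This naive comparison assumes both annotators use the same label for the same person. Speaker labels are arbitrary, so in practice the labels may need to be aligned between annotators before comparing.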

Sample Data: sample-data.json

json

[
  {
    "id": "displace_001",
    "audio_url": "https://example.com/audio/displace/conversation_001.wav",
    "duration": 45.2,
    "num_speakers": 3,
    "languages": [
      "Hindi",
      "English"
    ]
  },
  {
    "id": "displace_002",
    "audio_url": "https://example.com/audio/displace/conversation_002.wav",
    "duration": 62.8,
    "num_speakers": 2,
    "languages": [
      "Tamil",
      "English"
    ]
  }
]

// ... and 8 more items
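
Before starting the server, it can be worth checking that each item fits the config. The sketch below is a hedged example of such a check, not something Potato runs: it assumes the keys shown above (id, audio_url, duration, num_speakers, languages), compares num_speakers against the five speaker labels in the config, and flags languages that only the "Other" option would cover. The function name check_item and the specific rules are illustrative assumptions.

```python
from typing import Any, Dict, List

# Values copied from the config: five speaker labels and five named languages.
MAX_SPEAKERS = 5
CONFIG_LANGUAGES = {"Hindi", "English", "Tamil", "Telugu", "Kannada"}


def check_item(item: Dict[str, Any]) -> List[str]:
    """Return a list of problems for one item; an empty list means it looks fine."""
    problems: List[str] = []
    # id_key and text_key in the config point at these two fields.
    for key in ("id", "audio_url"):
        value = item.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"missing or empty '{key}'")
    duration = item.get("duration")
    if duration is not None and duration <= 0:
        problems.append("duration must be positive")
    num_speakers = item.get("num_speakers")
    if num_speakers is not None and num_speakers > MAX_SPEAKERS:
        problems.append(
            f"{num_speakers} speakers exceeds the {MAX_SPEAKERS} speaker labels"
        )
    unlisted = sorted(set(item.get("languages", [])) - CONFIG_LANGUAGES)
    if unlisted:
        problems.append(f"languages only covered by 'Other': {unlisted}")
    return problems


# The two items shown above, copied verbatim.
items = [
    {"id": "displace_001",
     "audio_url": "https://example.com/audio/displace/conversation_001.wav",
     "duration": 45.2, "num_speakers": 3, "languages": ["Hindi", "English"]},
    {"id": "displace_002",
     "audio_url": "https://example.com/audio/displace/conversation_002.wav",
     "duration": 62.8, "num_speakers": 2, "languages": ["Tamil", "English"]},
]

for item in items:
    print(item.get("id"), check_item(item) or "ok")
```

To run it against the full file, the inline list could be replaced with the result of json.load on sample-data.json.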

Get This Design

Clone or download the design from the GitHub repository.

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/audio/displace-speaker-diarization
potato start config.yaml

Dataset & paper

Kundu et al., INTERSPEECH 2024

Official dataset: https://displace2024.github.io/
Paper: https://www.isca-archive.org/interspeech_2024/kundu24_interspeech.html

Citation (BibTeX)

bibtex

@inproceedings{kundu24_interspeech,
    title = "{DISPLACE} 2024: {DI}arization of {SP}eaker and {LA}nguage in {C}onversational {E}nvironments",
    author = "Kundu, Shikha and others",
    booktitle = "Proceedings of INTERSPEECH 2024",
    year = "2024",
    publisher = "ISCA",
    url = "https://www.isca-archive.org/interspeech_2024/kundu24_interspeech.html"
}

Details

Annotation Types

radio, span

Domain

Speech Processing, Speaker Diarization

Use Cases

Speaker Identification, Language Identification, Conversation Analysis

Related Designs

Speaker Diarization

Identify and label different speakers in audio recordings with timestamp-based segment annotation.

span, radio

ToBI Prosodic Annotation

Multi-tier prosodic annotation following the Tones and Break Indices (ToBI) framework. Annotators label pitch accents, phrase accents, boundary tones, and break indices on speech utterances, producing a layered prosodic transcription aligned to the audio timeline (Silverman et al., Speech Communication 1992).

span, radio

Adverse Drug Event Extraction (CADEC)

Named entity recognition for adverse drug events from patient-reported experiences, based on the CADEC corpus (Karimi et al., 2015). Annotates drugs, adverse effects, symptoms, diseases, and findings from colloquial health forum posts with mapping to medical vocabularies (SNOMED-CT, MedDRA).

span, radio
