VoiceMOS Challenge 2024 - Speech Quality Assessment

A speech quality assessment task based on the Mean Opinion Score (MOS). Annotators rate synthesized or processed speech for naturalness, intelligibility, and overall quality, each on a 1-5 scale (Cooper et al., INTERSPEECH 2024).
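
As a rough illustration of how a MOS is obtained, the sketch below averages the 1-5 ratings collected for each clip on each of the three scales. The record layout (`id`, `scale`, `score`) and the example ratings are made up for this sketch; they are not Potato's output schema and not challenge data.

```python
from collections import defaultdict
from statistics import mean

# Scale names taken from the annotation schemes in this design.
SCALES = ("naturalness", "intelligibility", "overall_quality")


def mean_opinion_scores(ratings):
    """Average 1-5 ratings per (clip id, scale).

    `ratings` is an iterable of dicts with the hypothetical keys
    "id", "scale", and "score" (an assumed layout, see above).
    """
    buckets = defaultdict(list)
    for rating in ratings:
        if rating["scale"] not in SCALES:
            raise ValueError(f"unknown scale: {rating['scale']!r}")
        if not 1 <= rating["score"] <= 5:
            raise ValueError(f"score out of range: {rating['score']!r}")
        buckets[(rating["id"], rating["scale"])].append(rating["score"])
    return {key: mean(scores) for key, scores in buckets.items()}


# Illustrative ratings from five hypothetical annotators for one clip.
example = [
    {"id": "voicemos_001", "scale": "naturalness", "score": score}
    for score in (4, 5, 3, 4, 4)
]

if __name__ == "__main__":
    for (clip_id, scale), mos in mean_opinion_scores(example).items():
        print(clip_id, scale, mos)
```

A plain arithmetic mean is the usual definition of MOS; confidence intervals or per-listener normalization would be separate steps and are not shown here.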

Configuration file: config.yaml

# VoiceMOS Challenge 2024 - Speech Quality Assessment
# Based on Cooper et al., INTERSPEECH 2024
# Paper: https://www.isca-archive.org/interspeech_2024/cooper24_interspeech.html
# Dataset: https://voicemos-challenge-2024.github.io/
#
# Task: Rate synthesized/processed speech quality using Mean Opinion Score (MOS).
# Evaluate naturalness, intelligibility, and overall quality on 1-5 scales.
#
# Guidelines:
# - Listen to each clip fully before rating
# - Rate naturalness: How close to natural human speech?
# - Rate intelligibility: How easy is it to understand the words?
# - Rate overall quality: General impression of the speech quality
# - Use the full 1-5 scale; avoid always rating in the middle

annotation_task_name: "VoiceMOS Challenge 2024: Speech Quality Assessment"
task_dir: "."

data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "audio_url"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

annotation_schemes:
  - annotation_type: likert
    name: naturalness
    description: "How natural does the speech sound? (1 = Very unnatural, 5 = Completely natural)"
    size: 5
    min_label: "Very unnatural"
    max_label: "Completely natural"
    labels:
      - "1 - Very unnatural"
      - "2 - Somewhat unnatural"
      - "3 - Moderately natural"
      - "4 - Mostly natural"
      - "5 - Completely natural"

  - annotation_type: likert
    name: intelligibility
    description: "How easy is it to understand the speech? (1 = Unintelligible, 5 = Perfectly clear)"
    size: 5
    min_label: "Unintelligible"
    max_label: "Perfectly clear"
    labels:
      - "1 - Unintelligible"
      - "2 - Mostly unintelligible"
      - "3 - Somewhat intelligible"
      - "4 - Mostly intelligible"
      - "5 - Perfectly clear"

  - annotation_type: likert
    name: overall_quality
    description: "What is the overall quality of this speech? (1 = Bad, 5 = Excellent)"
    size: 5
    min_label: "Bad"
    max_label: "Excellent"
    labels:
      - "1 - Bad"
      - "2 - Poor"
      - "3 - Fair"
      - "4 - Good"
      - "5 - Excellent"

audio_display:
  show_waveform: true
  playback_controls: true
  allow_speed_control: true

allow_all_users: true
instances_per_annotator: 200
annotation_per_instance: 5
allow_skip: true
skip_reason_required: false
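
The last settings imply some simple workload arithmetic. The sketch below estimates the smallest number of annotators needed, reading `annotation_per_instance` as the number of ratings wanted per clip and `instances_per_annotator` as a cap on clips per annotator. That reading of the two settings, the assumption that one annotator rates a given clip at most once, and the function name `min_annotators` are assumptions of this sketch rather than documented Potato behavior.

```python
import math


def min_annotators(n_items, annotation_per_instance, instances_per_annotator):
    """Lower bound on the number of annotators needed.

    Assumes each annotator rates a given clip at most once, so at
    least `annotation_per_instance` distinct people are required,
    and that no annotator exceeds `instances_per_annotator` clips.
    """
    if n_items == 0:
        return 0
    total_ratings = n_items * annotation_per_instance
    needed_by_capacity = math.ceil(total_ratings / instances_per_annotator)
    return max(annotation_per_instance, needed_by_capacity)


if __name__ == "__main__":
    # 10 sample clips (2 shown plus 8 more), 5 ratings each, cap of 200.
    print(min_annotators(10, 5, 200))
    # A larger, hypothetical pool of 1000 clips under the same settings.
    print(min_annotators(1000, 5, 200))
```

With the small sample set the per-clip requirement dominates; with larger pools the per-annotator cap becomes the limiting factor.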

Sample data: sample-data.json

[
  {
    "id": "voicemos_001",
    "audio_url": "https://example.com/audio/voicemos/tts_system_a_001.wav",
    "system_id": "system_a",
    "duration": 4.2,
    "text_content": "The weather forecast for tomorrow predicts clear skies and mild temperatures."
  },
  {
    "id": "voicemos_002",
    "audio_url": "https://example.com/audio/voicemos/tts_system_b_001.wav",
    "system_id": "system_b",
    "duration": 3.8,
    "text_content": "Please remember to submit your report by the end of the business day."
  }
]

// ... and 8 more items
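
Before starting the server, it can help to confirm that every record carries the two fields named under `item_properties` in the config (`id` and `audio_url`). The check below is a minimal sketch using only the two records shown above; the helper name `check_items` is hypothetical and not part of Potato.

```python
import json

# Keys taken from item_properties in config.yaml.
ID_KEY = "id"
TEXT_KEY = "audio_url"


def check_items(items):
    """Return a list of human-readable problems found in `items`."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        # Flag keys that are absent or empty.
        for key in (ID_KEY, TEXT_KEY):
            if not item.get(key):
                problems.append(f"item {index}: missing {key!r}")
        # Flag ids that were already used by an earlier record.
        item_id = item.get(ID_KEY)
        if item_id is not None:
            if item_id in seen_ids:
                problems.append(f"item {index}: duplicate id {item_id!r}")
            seen_ids.add(item_id)
    return problems


# The two records shown in the sample data above.
sample = json.loads("""
[
  {"id": "voicemos_001",
   "audio_url": "https://example.com/audio/voicemos/tts_system_a_001.wav",
   "system_id": "system_a", "duration": 4.2},
  {"id": "voicemos_002",
   "audio_url": "https://example.com/audio/voicemos/tts_system_b_001.wav",
   "system_id": "system_b", "duration": 3.8}
]
""")

if __name__ == "__main__":
    print(check_items(sample))
```

To check the full file instead, the `sample` list could be replaced with `json.load(open("sample-data.json"))`, assuming the file is a single JSON array as shown.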

Get this design

View on GitHub

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/audio/voicemos-quality-assessment
potato start config.yaml

Details

Annotation types

likert

Domain

Speech Processing, Quality Assessment

Use cases

Speech Quality Rating, TTS Evaluation, Voice Synthesis Assessment

Tags

audio, speech-quality, mos, tts, voice-synthesis, interspeech2024


Related designs

EmoBox - Multilingual Speech Emotion Recognition

Multilingual speech emotion recognition across multiple languages and corpora. Annotators classify emotional states in speech clips and rate emotional intensity, based on the EmoBox toolkit and benchmark (Ma et al., INTERSPEECH 2024).

radio, likert

Acoustic Scene Classification

Classify audio recordings by acoustic environment following the TUT/DCASE dataset format.

radio, likert

Audio Transcription Review

Review and correct automatic speech recognition transcriptions with waveform visualization.

likert, multiselect