CoVoST 2 - Speech Translation Evaluation

Speech translation quality evaluation based on the CoVoST 2 dataset (Wang et al., arXiv 2020). Annotators listen to the source audio, review the source transcript, provide or correct a translation, assess its accuracy, label audio segments, and rate overall translation quality.

Configuration File: config.yaml

yaml

# CoVoST 2 - Speech Translation Evaluation
# Based on Wang et al., arXiv 2020
# Paper: https://arxiv.org/abs/2007.10310
# Dataset: https://github.com/facebookresearch/covost
#
# This task evaluates speech translation quality. Annotators listen to audio
# in the source language, review the source transcript, provide or correct
# a translation, assess accuracy, label audio segments, and rate quality.
#
# Translation Accuracy:
# - Accurate: Translation correctly conveys the meaning of the source
# - Minor Errors: Small mistakes that do not significantly affect meaning
# - Major Errors: Significant mistakes that change or obscure the meaning
# - Incomprehensible: Translation does not convey the source meaning at all
#
# Audio Segment Labels:
# - Speech: Portions containing spoken language
# - Noise: Background noise or interference
# - Silence: Periods of no audio content
# - Music: Musical content in the background
#
# Annotation Guidelines:
# 1. Listen to the source audio
# 2. Review the source transcript
# 3. Provide or correct the translation
# 4. Assess the translation accuracy
# 5. Label audio segments
# 6. Rate the overall translation quality

annotation_task_name: "CoVoST 2 - Speech Translation Evaluation"
task_dir: "."

data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

port: 8000
server_name: localhost

annotation_schemes:
  - annotation_type: text
    name: translation
    description: "Provide or correct the translation of the source audio"

  - annotation_type: radio
    name: translation_accuracy
    description: "How accurate is the translation?"
    labels:
      - "Accurate"
      - "Minor Errors"
      - "Major Errors"
      - "Incomprehensible"
    keyboard_shortcuts:
      "Accurate": "1"
      "Minor Errors": "2"
      "Major Errors": "3"
      "Incomprehensible": "4"
    tooltips:
      "Accurate": "Translation correctly conveys the meaning of the source"
      "Minor Errors": "Small mistakes that do not significantly affect meaning"
      "Major Errors": "Significant mistakes that change or obscure the meaning"
      "Incomprehensible": "Translation does not convey the source meaning at all"

  - annotation_type: audio_annotation
    name: audio_segments
    description: "Label segments of the audio by content type"
    mode: "label"
    labels:
      - "Speech"
      - "Noise"
      - "Silence"
      - "Music"

  - annotation_type: likert
    name: overall_quality
    description: "Rate the overall quality of the translation"
    min_label: "Very Poor"
    max_label: "Excellent"
    size: 5

annotation_instructions: |
  You will be shown an audio clip in a source language along with its transcript
  and language information. Your task is to:
  1. Listen to the source audio clip.
  2. Review the source transcript provided.
  3. Provide or correct the translation into the target language.
  4. Assess the accuracy of the translation.
  5. Label audio segments (Speech, Noise, Silence, Music).
  6. Rate the overall translation quality on a 5-point scale.

html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="display: flex; gap: 10px; margin-bottom: 12px;">
      <div style="background: #dbeafe; border-radius: 8px; padding: 8px 12px;">
        <strong style="color: #1e40af;">Source:</strong> {{source_language}}
      </div>
      <div style="background: #dcfce7; border-radius: 8px; padding: 8px 12px;">
        <strong style="color: #166534;">Target:</strong> {{target_language}}
      </div>
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px; text-align: center;">
      <audio controls style="width: 100%;">
        <source src="{{audio_url}}" type="audio/wav">
        Your browser does not support the audio element.
      </audio>
    </div>
    <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 16px;">
      <strong style="color: #475569;">Source Transcript:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
  </div>

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
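
The html_layout in the config above refers to item fields through {{name}} placeholders. As a minimal sketch, assuming each placeholder is filled from the data field of the same name, the following Python snippet lists the fields a data file would need to supply. The HTML_LAYOUT constant is an abbreviated copy of the layout (styling removed), and layout_fields is a hypothetical helper, not part of Potato.

```python
import re

# Abbreviated copy of the html_layout value from config.yaml
# (styling stripped; only the parts that carry placeholders are kept).
HTML_LAYOUT = """
<div><strong>Source:</strong> {{source_language}}</div>
<div><strong>Target:</strong> {{target_language}}</div>
<audio controls><source src="{{audio_url}}" type="audio/wav"></audio>
<p>{{text}}</p>
"""


def layout_fields(template):
    """Return the distinct {{name}} placeholders in order of first appearance."""
    seen = []
    for name in re.findall(r"\{\{\s*(\w+)\s*\}\}", template):
        if name not in seen:
            seen.append(name)
    return seen


print(layout_fields(HTML_LAYOUT))
# prints ['source_language', 'target_language', 'audio_url', 'text']
```

Together with the id_key from item_properties, this suggests every item should carry five fields: id, text, audio_url, source_language, and target_language.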

Sample Data: sample-data.json

json

[
  {
    "id": "covost_001",
    "text": "Le temps est magnifique aujourd'hui, nous devrions aller nous promener dans le parc.",
    "audio_url": "audio/covost_fr_001.wav",
    "source_language": "French",
    "target_language": "English"
  },
  {
    "id": "covost_002",
    "text": "Die Wissenschaftler haben eine neue Methode zur Behandlung von Krebs entdeckt.",
    "audio_url": "audio/covost_de_002.wav",
    "source_language": "German",
    "target_language": "English"
  }
]

// ... and 8 more items
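
Before starting the server, it can help to confirm that every item carries the fields the config relies on. The following Python sketch checks the two items shown above for missing or empty fields and duplicate ids. REQUIRED_FIELDS and find_problems are hypothetical names chosen for illustration; the field list is an assumption derived from item_properties and the html_layout placeholders, not a schema published by Potato.

```python
import json

# Assumed requirements: id_key and text_key from item_properties,
# plus the extra fields referenced in html_layout.
REQUIRED_FIELDS = ("id", "text", "audio_url", "source_language", "target_language")

# The two items shown in the sample data.
SAMPLE = json.loads("""
[
  {"id": "covost_001",
   "text": "Le temps est magnifique aujourd'hui, nous devrions aller nous promener dans le parc.",
   "audio_url": "audio/covost_fr_001.wav",
   "source_language": "French",
   "target_language": "English"},
  {"id": "covost_002",
   "text": "Die Wissenschaftler haben eine neue Methode zur Behandlung von Krebs entdeckt.",
   "audio_url": "audio/covost_de_002.wav",
   "source_language": "German",
   "target_language": "English"}
]
""")


def find_problems(items):
    """Return readable messages for missing/empty fields and duplicate ids."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        for field in REQUIRED_FIELDS:
            value = item.get(field)
            if value is None or not str(value).strip():
                problems.append(f"item {index}: missing or empty '{field}'")
        item_id = item.get("id")
        if item_id is not None and item_id in seen_ids:
            problems.append(f"item {index}: duplicate id '{item_id}'")
        seen_ids.add(item_id)
    return problems


print(find_problems(SAMPLE))  # prints [] for the two items above
```

To check the full file instead of the inline sample, the same function can be applied to the list returned by json.load on sample-data.json.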

Get This Design

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/audio/covost-speech-translation
potato start config.yaml

Details

Annotation Types

text, radio, audio_annotation, likert

Domain

Audio, NLP, Translation

Use Cases

Speech Translation, Translation Quality, Cross-lingual

Related Designs

Clotho Audio Captioning

Audio captioning and quality assessment based on the Clotho dataset (Drossos et al., ICASSP 2020). Annotators write natural language captions for audio clips, rate caption accuracy on a Likert scale, and classify the audio environment.

text, likert

Audio Transcription Review

Review and correct automatic speech recognition transcriptions with waveform visualization.

likert, multiselect

Speech Intelligibility Rating

Rate speech intelligibility for pathological speech following TORGO database annotation protocols.

likert, radio