ESA: Error Span Annotation for Machine Translation
Error span annotation for machine translation output. Annotators identify error spans in translations, classify error types (accuracy, fluency, terminology, style), and rate severity.
Configuration File: config.yaml
# ESA: Error Span Annotation for Machine Translation
# Based on "Error Span Annotation for Machine Translation Evaluation" (Kocmi et al., WMT@EMNLP 2024)
# Task: Identify error spans in translations, classify error types, and rate severity
annotation_task_name: "ESA: Error Span Annotation for MT"
task_dir: "."
# Data configuration
# Data configuration
data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

# Display layout showing source and translation
html_layout: |
  <div class="esa-container">
    <div class="source-section" style="background: #e8f5e9; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
      <h3 style="margin-top: 0;">Source Text ({{language_pair}}):</h3>
      <div class="source-text" style="font-size: 16px; line-height: 1.6;">{{source_text}}</div>
    </div>
    <div class="translation-section" style="background: #e3f2fd; padding: 15px; border-radius: 8px; border: 2px solid #1976d2;">
      <h3 style="margin-top: 0; color: #1976d2;">Translation (select error spans below):</h3>
      <div class="translation-text" style="font-size: 16px; line-height: 1.6;">{{text}}</div>
    </div>
  </div>

# Annotation schemes
annotation_schemes:
  # Span annotation for error identification
  - name: "error_spans"
    description: "Select spans in the translation that contain errors. Highlight each error span individually."
    annotation_type: span
    labels:
      - "Accuracy - Mistranslation"
      - "Accuracy - Omission"
      - "Accuracy - Addition"
      - "Fluency - Grammar"
      - "Fluency - Spelling/Punctuation"
      - "Fluency - Register"
      - "Terminology"
      - "Style"
    label_colors:
      "Accuracy - Mistranslation": "#ff5252"
      "Accuracy - Omission": "#ff7043"
      "Accuracy - Addition": "#ff9800"
      "Fluency - Grammar": "#ab47bc"
      "Fluency - Spelling/Punctuation": "#7e57c2"
      "Fluency - Register": "#5c6bc0"
      "Terminology": "#26a69a"
      "Style": "#78909c"

  # Error type classification
  - name: "primary_error_type"
    description: "What is the primary (most severe) error type in this translation?"
    annotation_type: radio
    labels:
      - "Accuracy"
      - "Fluency"
      - "Terminology"
      - "Style"
      - "No errors found"
    keyboard_shortcuts:
      "Accuracy": "1"
      "Fluency": "2"
      "Terminology": "3"
      "Style": "4"
      "No errors found": "0"

  # Severity rating
  - name: "error_severity"
    description: "Rate the overall severity of errors in this translation."
    annotation_type: likert
    size: 5
    min_label: "1 - No errors"
    max_label: "5 - Critical errors"
    labels:
      - "1 - No errors (perfect translation)"
      - "2 - Minor errors (meaning preserved)"
      - "3 - Moderate errors (some meaning lost)"
      - "4 - Major errors (significant meaning loss)"
      - "5 - Critical errors (wrong or incomprehensible)"
    keyboard_shortcuts:
      "1 - No errors (perfect translation)": "q"
      "2 - Minor errors (meaning preserved)": "w"
      "3 - Moderate errors (some meaning lost)": "e"
      "4 - Major errors (significant meaning loss)": "r"
      "5 - Critical errors (wrong or incomprehensible)": "t"

  # Overall quality rating
  - name: "overall_quality"
    description: "Rate the overall translation quality."
    annotation_type: radio
    labels:
      - "Perfect"
      - "Good"
      - "Acceptable"
      - "Poor"
      - "Unacceptable"
    keyboard_shortcuts:
      "Perfect": "z"
      "Good": "x"
      "Acceptable": "c"
      "Poor": "v"
      "Unacceptable": "b"

# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 100
annotation_per_instance: 2
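All three choice schemes (primary_error_type, error_severity, overall_quality) render on the same annotation page, so their keyboard shortcuts should not share keys. The sketch below checks this with plain Python; the shortcut dicts are copied from the config above, and the assumption that shortcuts are bound page-wide (rather than per-scheme) is ours, not documented Potato behavior.

```python
# Keyboard shortcuts copied verbatim from the three schemes in config.yaml.
shortcuts = {
    "primary_error_type": {
        "Accuracy": "1", "Fluency": "2", "Terminology": "3",
        "Style": "4", "No errors found": "0",
    },
    "error_severity": {
        "1 - No errors (perfect translation)": "q",
        "2 - Minor errors (meaning preserved)": "w",
        "3 - Moderate errors (some meaning lost)": "e",
        "4 - Major errors (significant meaning loss)": "r",
        "5 - Critical errors (wrong or incomprehensible)": "t",
    },
    "overall_quality": {
        "Perfect": "z", "Good": "x", "Acceptable": "c",
        "Poor": "v", "Unacceptable": "b",
    },
}

def find_collisions(schemes):
    """Return any key bound to more than one label across all schemes."""
    seen = {}
    for scheme, mapping in schemes.items():
        for label, key in mapping.items():
            seen.setdefault(key, []).append((scheme, label))
    return {k: v for k, v in seen.items() if len(v) > 1}

print(find_collisions(shortcuts))  # {} -- no conflicts in this config
```

Running this against the config yields an empty dict, confirming the 1-4/0, q-t, and z-b rows were chosen to stay disjoint.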
Sample Data: sample-data.json
[
  {
    "id": "esa_001",
    "text": "The committee decided to postpone the meeting until next week due to the absence of several key members.",
    "source_text": "Das Komitee beschloss, die Sitzung auf nächste Woche zu verschieben, da mehrere wichtige Mitglieder abwesend waren.",
    "language_pair": "German-English"
  },
  {
    "id": "esa_002",
    "text": "The new policy will effect all employees starting from January, including those who work part-time in the remote offices.",
    "source_text": "Die neue Richtlinie wird ab Januar alle Mitarbeiter betreffen, einschließlich derjenigen, die in Teilzeit in den Außenstellen arbeiten.",
    "language_pair": "German-English"
  }
]
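Beyond the two keys named in item_properties (id, text), the html_layout also interpolates source_text and language_pair, so every item needs all four fields. A minimal stdlib sketch for sanity-checking a data file before launching the task (the required-keys set is read off the config above; this checker is not part of Potato itself):

```python
import json

# Keys that item_properties and html_layout together expect on every item.
REQUIRED = {"id", "text", "source_text", "language_pair"}

def check_items(raw):
    """Parse a sample-data JSON string; return (id, missing_keys) per bad item."""
    problems = []
    for item in json.loads(raw):
        missing = REQUIRED - item.keys()
        if missing:
            problems.append((item.get("id", "<no id>"), sorted(missing)))
    return problems

sample = """[
  {"id": "esa_001",
   "text": "The committee decided to postpone the meeting until next week.",
   "source_text": "Das Komitee beschloss, die Sitzung zu verschieben.",
   "language_pair": "German-English"}
]"""
print(check_items(sample))  # [] -- all required keys present
```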
// ... and 8 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/esa-mt-error-spans
potato start config.yaml
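Because annotation_per_instance is 2, every translation is labeled by two annotators, and a natural post-processing step is measuring span-level agreement. The sketch below computes character-overlap F1 between two annotators' error spans. The (start, end) character-offset representation is an assumption for illustration, not a documented Potato output schema:

```python
def char_set(spans):
    """Expand (start, end) character spans into the set of covered offsets."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered

def span_overlap_f1(spans_a, spans_b):
    """Character-level F1 between two annotators' error-span sets."""
    a, b = char_set(spans_a), char_set(spans_b)
    if not a and not b:
        return 1.0  # both marked no errors: treat as perfect agreement
    if not a or not b:
        return 0.0  # one annotator marked errors, the other none
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(b)
    return 2 * precision * recall / (precision + recall)

# Hypothetical spans for esa_002: annotator A marks "effect" only,
# annotator B marks the longer phrase "will effect".
print(span_overlap_f1([(20, 26)], [(15, 26)]))
```

Character-level F1 rewards partial overlap between span boundaries, which suits ESA-style annotation where annotators often agree on the error but disagree on its exact extent.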
Related Designs
LongEval: Faithfulness Evaluation for Long-form Summarization
Faithfulness evaluation of long-form summaries. Annotators identify atomic content units in summaries, check each against source documents for faithfulness, and rate overall summary quality.
News Headline Emotion Roles (GoodNewsEveryone)
Annotate emotions in news headlines with semantic roles. Based on Bostan et al., LREC 2020. Identify emotion, experiencer, cause, target, and textual cue.
NLI with Explanations (e-SNLI)
Natural language inference with human explanations. Based on e-SNLI (Camburu et al., NeurIPS 2018). Classify entailment/contradiction/neutral and provide natural language justifications.