FLORES - Machine Translation Quality Estimation
Machine translation quality assessment using the FLORES-101 benchmark (Goyal et al., TACL 2022). Annotators rate translation quality on a 5-point Likert scale, identify the primary error category, and provide detailed error notes.
Configuration File: config.yaml
# FLORES - Machine Translation Quality Estimation
# Based on Goyal et al., TACL 2022
# Paper: https://aclanthology.org/2022.tacl-1.21/
# Dataset: https://github.com/facebookresearch/flores
#
# This task evaluates machine translation quality by presenting source
# text alongside its translation. Annotators rate overall quality on
# a 5-point Likert scale, identify the primary error category, and
# provide detailed notes about specific errors found.
#
# Quality Scale:
# 1 - Incomprehensible: Translation is unreadable or completely wrong
# 2 - Poor: Major errors that significantly impair understanding
# 3 - Acceptable: Some errors but the meaning is mostly conveyed
# 4 - Good: Minor errors that do not affect understanding
# 5 - Perfect: Flawless translation with natural fluency
#
# Error Categories:
# - Accuracy: Mistranslation, omission, or addition of meaning
# - Fluency: Unnatural phrasing, grammar, or word choice
# - Terminology: Incorrect domain-specific terms
# - Style: Inappropriate register, tone, or formality level
# - No Error: Translation is correct and natural
#
# Annotation Guidelines:
# 1. Read the source text carefully
# 2. Read the translation and compare it to the source
# 3. Rate overall quality on the 1-5 scale
# 4. Select the most prominent error category
# 5. Provide specific notes about errors found
annotation_task_name: "FLORES - Machine Translation Quality Estimation"
task_dir: "."

data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

port: 8000
server_name: localhost

annotation_schemes:
  # Step 1: Rate overall translation quality
  - annotation_type: likert
    name: translation_quality
    description: "Rate the overall quality of this translation"
    min_label: "Incomprehensible"
    max_label: "Perfect"
    size: 5

  # Step 2: Identify the primary error category
  - annotation_type: select
    name: error_category
    description: "What is the primary error category in this translation?"
    labels:
      - "Accuracy"
      - "Fluency"
      - "Terminology"
      - "Style"
      - "No Error"
    tooltips:
      "Accuracy": "Mistranslation, omission, or addition of meaning not present in the source"
      "Fluency": "Unnatural phrasing, grammatical errors, or awkward word choice in the target"
      "Terminology": "Incorrect or inconsistent use of domain-specific terms"
      "Style": "Inappropriate register, tone, or formality level for the context"
      "No Error": "The translation is correct and reads naturally"

  # Step 3: Provide error notes
  - annotation_type: text
    name: error_notes
    description: "Describe specific errors you found in the translation (if any)"

annotation_instructions: |
  You will be shown a source text and its machine translation. Your task is to:
  1. Read the source text carefully.
  2. Read the translation and compare it to the source.
  3. Rate the overall translation quality on a 1 (Incomprehensible) to 5 (Perfect) scale.
  4. Select the most prominent error category (or "No Error" if the translation is correct).
  5. Provide specific notes about any errors you identified.
  Focus on whether the translation accurately conveys the meaning, reads naturally in the
  target language, and uses appropriate terminology and style.

html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="display: flex; gap: 8px; margin-bottom: 12px;">
      <span style="background: #dbeafe; color: #1e40af; padding: 3px 10px; border-radius: 12px; font-size: 13px;">Source: {{source_lang}}</span>
      <span style="background: #dcfce7; color: #166534; padding: 3px 10px; border-radius: 12px; font-size: 13px;">Target: {{target_lang}}</span>
    </div>
    <div style="background: #eff6ff; border: 1px solid #bfdbfe; border-radius: 8px; padding: 16px; margin-bottom: 12px;">
      <strong style="color: #1d4ed8;">Source Text:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="background: #f0fdf4; border: 1px solid #bbf7d0; border-radius: 8px; padding: 16px; margin-bottom: 12px;">
      <strong style="color: #166534;">Translation:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{translation}}</p>
    </div>
  </div>

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
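Before launching the server, it can help to sanity-check the config for the keys this design relies on. The sketch below validates an already-loaded config dict (e.g. from `yaml.safe_load`); the key names mirror the file above, but note this is a minimal stand-in check, not Potato's own validation, which may be stricter.

```python
# Minimal sanity check for a loaded config dict. Key names are taken
# from the config.yaml shown above; Potato itself may enforce more.
REQUIRED_TOP_LEVEL = [
    "annotation_task_name",
    "data_files",
    "item_properties",
    "annotation_schemes",
]

def check_config(cfg):
    """Return a list of human-readable problems; empty means OK."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in cfg:
            problems.append(f"missing key: {key}")
    schemes = cfg.get("annotation_schemes", [])
    # Each scheme needs a type, and scheme names must be unique
    # because they key the annotation output.
    names = [s.get("name") for s in schemes]
    if len(names) != len(set(names)):
        problems.append("duplicate scheme names")
    for s in schemes:
        if "annotation_type" not in s:
            problems.append(f"scheme {s.get('name')} lacks annotation_type")
    return problems
```

Running `check_config` on the parsed config above should return an empty list.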
Sample Data: sample-data.json
[
{
"id": "flores_001",
"text": "The researchers published their findings in a peer-reviewed journal, demonstrating a significant correlation between air pollution and respiratory disease.",
"translation": "Los investigadores publicaron sus hallazgos en una revista revisada por pares, demostrando una correlación significativa entre la contaminación del aire y las enfermedades respiratorias.",
"source_lang": "English",
"target_lang": "Spanish"
},
{
"id": "flores_002",
"text": "The ancient temple was discovered during a routine archaeological survey of the region, dating back to approximately 300 BCE.",
"translation": "Le temple ancien a été découvert lors d'une enquête archéologique de routine de la région, remontant à environ 300 avant notre ère.",
"source_lang": "English",
"target_lang": "French"
}
]
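Because the `html_layout` template interpolates `{{text}}`, `{{translation}}`, `{{source_lang}}`, and `{{target_lang}}`, every item in the data file must carry those fields plus the `id` key. A small check like the following (field names taken from the files above; the function itself is just an illustrative helper, not part of Potato) catches malformed items before they reach annotators:

```python
# Verify each sample item has every field the config and html_layout
# reference. REQUIRED_FIELDS mirrors the placeholders used above.
REQUIRED_FIELDS = {"id", "text", "translation", "source_lang", "target_lang"}

def missing_fields(items):
    """Return {item_id: sorted missing field names} for malformed items."""
    problems = {}
    for i, item in enumerate(items):
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            problems[item.get("id", f"index {i}")] = sorted(missing)
    return problems
```

Load the file with `json.load` and pass the resulting list in; an empty dict means every item is complete.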
// ... and 8 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/text/cross-lingual/flores-mt-quality
potato start config.yaml
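Since the config sets `annotation_per_instance: 2`, each item is rated by two annotators, so a quick agreement check on the Likert scheme is a natural first analysis step. The sketch below assumes you have already flattened Potato's `annotation_output/` into per-item rating pairs; the loading step depends on Potato's output layout and is omitted here.

```python
# Pairwise agreement on the 1-5 quality ratings. Input shape is an
# assumption: {item_id: (rating_a, rating_b)} built from the JSON
# files Potato writes to annotation_output/.
def likert_agreement(pairs):
    """Return exact-match rate and mean absolute rating difference."""
    n = len(pairs)
    exact = sum(1 for a, b in pairs.values() if a == b)
    mad = sum(abs(a - b) for a, b in pairs.values()) / n
    return {"exact_match": exact / n, "mean_abs_diff": mad}
```

For ordinal scales like this one, mean absolute difference is often more informative than exact match, since a 4-vs-5 disagreement is much milder than 1-vs-5; a chance-corrected coefficient (e.g. Krippendorff's alpha with an ordinal metric) would be the next step.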
Found an issue or want to improve this design?
Open an Issue
Related Designs
Automated Essay Scoring
Holistic and analytic scoring of student essays using a deep-neural approach to automated essay scoring (Uto, arXiv 2022). Annotators provide overall quality ratings, holistic scores on a 1-6 scale, and detailed feedback comments for educational assessment.
Coreference Resolution (OntoNotes)
Link pronouns and noun phrases to the entities they refer to in text. Based on the OntoNotes coreference annotation guidelines and CoNLL shared tasks. Identify mention spans and cluster coreferent mentions together.
FinBERT - Financial Headline Sentiment Analysis
Classify sentiment of financial news headlines as positive, negative, or neutral, based on the FinBERT model (Araci, arXiv 2019). Annotators also rate market outlook on a bearish-to-bullish scale and provide reasoning for their sentiment judgment.