SHROOM: Shared-task on Hallucinations and Related Observable Overgeneration Mistakes

beginner, survey

Binary hallucination detection in natural language generation (NLG) outputs. Annotators judge whether a generated text contains a hallucination (overgeneration) relative to the input, then rate their confidence in that judgment. The design covers three NLG tasks: machine translation (MT), definition modeling (DM), and paraphrase generation (PG).

Configuration File: config.yaml

# SHROOM: Shared-task on Hallucinations and Related Observable Overgeneration Mistakes
# Based on Mickus et al., SemEval@NAACL 2024
# Paper: https://aclanthology.org/2024.semeval-1.289/
# Dataset: https://github.com/Helsinki-NLP/shroom
#
# This task performs binary hallucination detection in NLG outputs.
# Annotators judge whether a model-generated text contains hallucinations
# (overgeneration mistakes) relative to the input.
#
# Task Types:
# - MT (Machine Translation): Does the translation add/change meaning?
# - DM (Definition Modeling): Does the definition match the target word?
# - PG (Paraphrase Generation): Does the paraphrase preserve meaning?
#
# Hallucination Label:
# - Hallucination: The output contains information not supported by or
#   contradicting the input (overgeneration)
# - Not Hallucination: The output faithfully represents the input
#
# Annotation Guidelines:
# 1. Read the input text carefully
# 2. Read the model output (generated text)
# 3. Determine if the output adds, changes, or fabricates information
# 4. For MT: Check if the translation adds meaning not in the source
# 5. For DM: Check if the definition matches the word in context
# 6. For PG: Check if the paraphrase changes the original meaning
# 7. Rate your confidence in the judgment

annotation_task_name: "SHROOM: Hallucination Detection"
task_dir: "."

data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

annotation_schemes:
  # Step 1: Hallucination judgment
  - annotation_type: radio
    name: hallucination_label
    description: "Does the model output contain a hallucination (overgeneration) relative to the input?"
    labels:
      - "Hallucination"
      - "Not Hallucination"
    keyboard_shortcuts:
      "Hallucination": "h"
      "Not Hallucination": "n"
    tooltips:
      "Hallucination": "The output contains information that is not supported by or contradicts the input (added meaning, fabricated details, or changed facts)"
      "Not Hallucination": "The output faithfully represents the input without adding or changing meaning"

  # Step 2: Confidence rating
  - annotation_type: likert
    name: confidence
    description: "How confident are you in your hallucination judgment?"
    min_value: 1
    max_value: 5
    labels:
      1: "Very uncertain"
      2: "Somewhat uncertain"
      3: "Moderately confident"
      4: "Confident"
      5: "Very confident"

html_layout: |
  <div style="margin-bottom: 8px; padding: 6px; background: #e0e7ff; border-radius: 4px; font-size: 13px;">
    <strong>Task Type:</strong> {{task_type}}
  </div>
  <div style="margin-bottom: 10px; padding: 10px; background: #eff6ff; border-left: 4px solid #3b82f6; border-radius: 4px;">
    <strong>Input:</strong><br>{{input_text}}
  </div>
  <div style="margin-bottom: 10px; padding: 10px; background: #fef3c7; border-left: 4px solid #f59e0b; border-radius: 4px;">
    <strong>Model Output:</strong><br>{{text}}
  </div>

allow_all_users: true
instances_per_annotator: 150
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false

Sample Data: sample-data.json

[
  {
    "id": "shroom_001",
    "text": "The cat sat on the mat and watched the birds flying outside the window.",
    "input_text": "Le chat etait assis sur le tapis et regardait les oiseaux voler dehors.",
    "task_type": "MT"
  },
  {
    "id": "shroom_002",
    "text": "The government announced new tax reforms that will reduce income tax by 15% for all citizens starting next year.",
    "input_text": "The government announced new tax reforms that will affect income tax rates starting next year.",
    "task_type": "PG"
  }
]

The sample file contains 8 more items, not shown here.
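
Every item in the data file needs the keys named in item_properties plus every {{placeholder}} used in html_layout. The following is a minimal sketch of a pre-flight check, not part of Potato itself: the function names required_keys and find_problems are hypothetical, and it assumes placeholders always take the {{name}} form shown in the config above.

```python
import re

# Matches placeholders such as {{task_type}} in an html_layout string.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def required_keys(html_layout, id_key="id", text_key="text"):
    """Return the set of keys each data item must provide (hypothetical helper)."""
    return {id_key, text_key} | set(PLACEHOLDER.findall(html_layout))


def find_problems(items, keys, id_key="id", task_types=("MT", "DM", "PG")):
    """Return a list of human-readable problems found in the data items."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        # Report any key the config refers to that this item lacks.
        missing = sorted(keys - set(item))
        if missing:
            problems.append(f"item {index}: missing keys {missing}")
        # Item ids should be unique across the file.
        item_id = item.get(id_key)
        if item_id is not None:
            if item_id in seen_ids:
                problems.append(f"item {index}: duplicate id {item_id!r}")
            seen_ids.add(item_id)
        # The design only covers the three task types listed in the config.
        task_type = item.get("task_type")
        if task_type is not None and task_type not in task_types:
            problems.append(f"item {index}: unexpected task_type {task_type!r}")
    return problems


if __name__ == "__main__":
    # A shortened stand-in for the html_layout value in config.yaml.
    layout = "<div>{{task_type}}</div><div>{{input_text}}</div><div>{{text}}</div>"
    items = [
        {"id": "shroom_001", "text": "The cat sat on the mat.",
         "input_text": "Le chat etait assis sur le tapis.", "task_type": "MT"},
        {"id": "shroom_002", "text": "Taxes will change.", "task_type": "PG"},
    ]
    for problem in find_problems(items, required_keys(layout)):
        print(problem)
```

To check the real file, load it with json.load and pass the resulting list as items, using the html_layout string from config.yaml as the layout.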

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/text/fact-verification/shroom-hallucination-detection
potato start config.yaml
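
Because the config requests three annotations per instance, the collected labels usually need to be combined into one judgment per item. The sketch below shows one simple way to do that, by majority vote plus the fraction of annotators who chose "Hallucination". It is an illustration only: the aggregate function and the (instance_id, label, confidence) tuple format are assumptions, and Potato's actual output files would first have to be parsed into that shape.

```python
from collections import defaultdict

HALLUCINATION = "Hallucination"
NOT_HALLUCINATION = "Not Hallucination"


def aggregate(annotations):
    """Combine per-annotator judgments into one record per instance.

    `annotations` is an iterable of (instance_id, label, confidence) tuples.
    This input shape is an assumption for illustration, not Potato's format.
    """
    by_instance = defaultdict(list)
    for instance_id, label, confidence in annotations:
        by_instance[instance_id].append((label, confidence))

    results = {}
    for instance_id, votes in by_instance.items():
        n_votes = len(votes)
        n_halluc = sum(1 for label, _ in votes if label == HALLUCINATION)
        results[instance_id] = {
            # Strict majority; a tie (possible when an annotator skips)
            # falls back to "Not Hallucination".
            "label": HALLUCINATION if 2 * n_halluc > n_votes else NOT_HALLUCINATION,
            "p_hallucination": n_halluc / n_votes,
            "mean_confidence": sum(conf for _, conf in votes) / n_votes,
            "n_annotations": n_votes,
        }
    return results


if __name__ == "__main__":
    # Made-up judgments for one item, for demonstration only.
    demo = [
        ("shroom_002", HALLUCINATION, 5),
        ("shroom_002", HALLUCINATION, 4),
        ("shroom_002", NOT_HALLUCINATION, 2),
    ]
    print(aggregate(demo))
```

Since allow_skip is enabled, some instances may end up with fewer than three judgments, which is why the sketch records n_annotations alongside each aggregated label.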

Details

Annotation Types

radio, likert

Domain

NLP, Hallucination Detection

Use Cases

Hallucination Detection, NLG Evaluation, Quality Assessment
