advanced, survey

LongEval: Faithfulness Evaluation for Long-form Summarization

This design evaluates the faithfulness of long-form summaries. Annotators highlight atomic content units in each summary, check each unit against the source document, and rate the summary's overall faithfulness on a 1-5 scale.


Configuration File: config.yaml

# LongEval: Faithfulness Evaluation for Long-form Summarization
# Based on "LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization" (Krishna et al., EACL 2023)
# Task: Identify content units in summaries and check faithfulness against source documents

annotation_task_name: "LongEval Faithfulness Evaluation"
task_dir: "."

# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

# Display layout showing source document and summary
html_layout: |
  <div class="longeval-container">
    <div class="source-section" style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin-bottom: 15px; max-height: 400px; overflow-y: auto;">
      <h3 style="margin-top: 0;">Source Document:</h3>
      <div class="source-text" style="white-space: pre-wrap; line-height: 1.6;">{{source_document}}</div>
    </div>
    <div class="model-info" style="background: #fff3e0; padding: 10px; border-radius: 8px; margin-bottom: 15px;">
      <strong>Summary Model:</strong> {{summary_model}}
    </div>
    <div class="summary-section" style="background: #e3f2fd; padding: 15px; border-radius: 8px; border: 2px solid #1976d2;">
      <h3 style="margin-top: 0; color: #1976d2;">Summary (highlight content units below):</h3>
      <div class="summary-text" style="font-size: 16px; line-height: 1.8;">{{text}}</div>
    </div>
  </div>

# Annotation schemes
annotation_schemes:
  # Span annotation for identifying atomic content units
  - name: "content_units"
    description: "Highlight atomic content units (individual claims or facts) in the summary. Each span should represent one verifiable claim."
    annotation_type: span
    labels:
      - "Faithful"
      - "Unfaithful"
      - "Partially Faithful"
      - "Unverifiable"
    label_colors:
      "Faithful": "#4caf50"
      "Unfaithful": "#f44336"
      "Partially Faithful": "#ff9800"
      "Unverifiable": "#9e9e9e"

  # Verdict for the most problematic content unit
  - name: "faithfulness_verdict"
    description: "What is the faithfulness verdict for the most problematic content unit?"
    annotation_type: radio
    labels:
      - "Faithful - all claims supported by source"
      - "Unfaithful - contains claims contradicting source"
      - "Partially Faithful - some claims supported, others not"
      - "Unverifiable - claims cannot be checked against source"
    keyboard_shortcuts:
      "Faithful - all claims supported by source": "1"
      "Unfaithful - contains claims contradicting source": "2"
      "Partially Faithful - some claims supported, others not": "3"
      "Unverifiable - claims cannot be checked against source": "4"

  # Overall faithfulness rating
  - name: "overall_faithfulness"
    description: "Rate the overall faithfulness of the entire summary on a 1-5 scale."
    annotation_type: likert
    size: 5
    min_label: "1 - Completely unfaithful"
    max_label: "5 - Completely faithful"
    labels:
      - "1 - Completely unfaithful (major fabrications)"
      - "2 - Mostly unfaithful (several unsupported claims)"
      - "3 - Mixed (some faithful, some unfaithful)"
      - "4 - Mostly faithful (minor inaccuracies)"
      - "5 - Completely faithful (all claims supported)"
    keyboard_shortcuts:
      "1 - Completely unfaithful (major fabrications)": "q"
      "2 - Mostly unfaithful (several unsupported claims)": "w"
      "3 - Mixed (some faithful, some unfaithful)": "e"
      "4 - Mostly faithful (minor inaccuracies)": "r"
      "5 - Completely faithful (all claims supported)": "t"

# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 50
annotation_per_instance: 3
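
With three annotations per instance, the labels from the three schemes above usually need to be combined before analysis. The following is a minimal Python sketch of one way to do that, not Potato's own export logic: it assumes the span labels, Likert ratings, and radio verdicts have already been read from `annotation_output/` into plain Python lists. The function names `faithful_share` and `aggregate_instance` are hypothetical, and the choice of median and strict majority is an illustrative assumption.

```python
from collections import Counter
from statistics import median

# Label names copied from the "content_units" scheme in config.yaml.
UNIT_LABELS = ("Faithful", "Unfaithful", "Partially Faithful", "Unverifiable")


def faithful_share(unit_labels):
    """Fraction of highlighted content units labelled "Faithful" (None if no units)."""
    counts = Counter(unit_labels)
    unknown = set(counts) - set(UNIT_LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    return counts["Faithful"] / total if total else None


def aggregate_instance(ratings, verdicts):
    """Combine one instance's annotations from several annotators.

    ratings:  1-5 values from the "overall_faithfulness" scheme (non-empty)
    verdicts: label strings from the "faithfulness_verdict" scheme (non-empty)
    """
    top, votes = Counter(verdicts).most_common(1)[0]
    # Only report a majority verdict when more than half of the annotators agree.
    majority = top if votes * 2 > len(verdicts) else None
    return {"median_rating": median(ratings), "majority_verdict": majority}
```

For example, `faithful_share(["Faithful", "Unfaithful"])` gives the share of units one annotator marked as faithful, and `aggregate_instance` summarizes the three annotators' ratings and verdicts for a single summary.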

Sample Data: sample-data.json

[
  {
    "id": "le_001",
    "text": "The study found that regular consumption of green tea reduces the risk of heart disease by 25%. The researchers analyzed data from 50,000 participants over a 10-year period in Japan. They concluded that the catechins in green tea have a protective effect on cardiovascular health.",
    "source_document": "A large-scale prospective study conducted in Japan examined the relationship between green tea consumption and cardiovascular disease. The study followed 40,530 adults aged 40-79 for 11 years. Results showed that participants who consumed 5 or more cups of green tea per day had a 26% lower risk of death from cardiovascular disease compared to those who drank less than one cup per day. The researchers attributed this benefit to the polyphenolic compounds found in green tea, particularly catechins, which have antioxidant and anti-inflammatory properties.",
    "summary_model": "GPT-4"
  },
  {
    "id": "le_002",
    "text": "The Paris Agreement, signed in 2015, commits all signatory nations to limit global warming to 1.5 degrees Celsius above pre-industrial levels. Under the agreement, each country must submit nationally determined contributions every five years. The United States was the first major country to rejoin the agreement after initially withdrawing.",
    "source_document": "The Paris Agreement is an international treaty on climate change adopted on December 12, 2015, at COP21 in Paris, France. The treaty aims to limit the global average temperature increase to well below 2 degrees Celsius above pre-industrial levels, while pursuing efforts to limit it to 1.5 degrees Celsius. Each party to the agreement is required to submit nationally determined contributions (NDCs) that outline their climate action plans. These NDCs must be updated every five years, with each successive plan expected to be more ambitious. The United States formally withdrew from the agreement in November 2020 under the Trump administration, but rejoined on February 19, 2021, under the Biden administration.",
    "summary_model": "Claude-3"
  }
]

// ... and 7 more items
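
Each item has to supply every field the configuration refers to: `id` and `text` from `item_properties`, plus the `{{source_document}}` and `{{summary_model}}` placeholders in `html_layout`. A small pre-flight check along these lines can catch incomplete items before annotation starts. This is a sketch only: the helper names `find_missing_fields` and `check_file` are hypothetical and not part of Potato.

```python
import json

# Fields the config relies on: id_key / text_key from item_properties,
# plus the {{...}} placeholders used in html_layout.
REQUIRED_FIELDS = ("id", "text", "source_document", "summary_model")


def find_missing_fields(items, required=REQUIRED_FIELDS):
    """Return {item id or index: [missing field names]} for incomplete items."""
    problems = {}
    for index, item in enumerate(items):
        # A field counts as missing if it is absent or only whitespace.
        missing = [name for name in required if not str(item.get(name, "")).strip()]
        if missing:
            problems[item.get("id", f"index {index}")] = missing
    return problems


def check_file(path="sample-data.json"):
    """Load a JSON list of items and report missing fields."""
    with open(path, encoding="utf-8") as handle:
        return find_missing_fields(json.load(handle))
```

An empty dictionary from `check_file()` means every item carries all four fields.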

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/longeval-faithfulness
potato start config.yaml

Details

Annotation Types

span, radio, likert

Domain

NLP, Summarization, Evaluation

Use Cases

Faithfulness Evaluation, Summarization Quality, Hallucination Detection

Tags

summarization, faithfulness, long-form, content-units, hallucination, evaluation
