LongEval: Faithfulness Evaluation for Long-Form Summarization

LongEval is the EACL 2023 protocol for human evaluation of faithfulness in long-form summaries (Krishna et al.). This Potato config reproduces its fine-grained, clause-level faithfulness judgments against source documents.

About this dataset

LongEval is a set of guidelines for human evaluation of faithfulness in long-form summarization, introduced by Kalpesh Krishna, Erin Bransom, Bailey Kuehl, Mohit Iyyer, Pradeep Dasigi, Arman Cohan, and Kyle Lo at EACL 2023, where it received an Outstanding Paper award. The authors surveyed 162 long-form summarization papers and found that most reported no human evaluation of model summaries at all.

The protocol asks annotators to judge faithfulness at a fine granularity rather than scoring a whole summary at once. Each summary is broken into smaller content units (clause-level spans), and an annotator checks each unit against the source document to decide whether it is supported. The paper shows this finer granularity cuts inter-annotator variance, lowering the standard deviation of faithfulness scores from 18.5 at the summary level to 6.8 at the clause level.
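
The scoring idea behind this paragraph can be sketched in a few lines. This is a minimal illustration, not the LongEval library's API: it assumes each annotator's clause-level judgments are stored as booleans (supported or not), defines a summary's faithfulness score as the percentage of supported units, and measures annotator disagreement as the standard deviation of those scores. The function names and the judgments are made up for the example.

```python
from statistics import pstdev


def faithfulness_score(unit_judgments):
    """Percentage (0-100) of content units judged as supported by the source."""
    if not unit_judgments:
        raise ValueError("need at least one content unit judgment")
    return 100.0 * sum(unit_judgments) / len(unit_judgments)


def annotator_spread(judgments_by_annotator):
    """Population standard deviation of per-annotator scores for one summary."""
    scores = [faithfulness_score(j) for j in judgments_by_annotator.values()]
    return pstdev(scores)


# Hypothetical judgments: three annotators, four content units in one summary.
example = {
    "annotator_a": [True, True, False, True],
    "annotator_b": [True, True, False, False],
    "annotator_c": [True, True, True, True],
}

scores = {name: faithfulness_score(j) for name, j in example.items()}
spread = annotator_spread(example)
print(scores, round(spread, 2))
```

The same `annotator_spread` call works for summary-level ratings if each annotator contributes a single coarse score instead of a list of unit judgments, which is the comparison the paper's variance figures describe.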

LongEval was applied to two long-form datasets from different domains: SQuALITY, built on literary works and books, and PubMed scientific articles. To keep annotation tractable on long documents, the protocol supports partial annotation: scores computed from only a subset of the fine-grained units still correlate highly with scores from the full annotation (0.89 Kendall's tau using 50 percent of judgments). The released materials include human judgments, annotation templates, and a Python library.
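
To make the partial-annotation check concrete, the sketch below compares scores from all unit judgments with scores from a random half of them and reports Kendall's tau between the two rankings. It is an illustration under stated assumptions: the judgments are synthetic, the helper names are hypothetical, and the tau here is the simple tau-a variant (ties count as neither concordant nor discordant), which may differ from the exact statistic used in the paper.

```python
import random
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def score(judgments):
    """Fraction of content units judged as supported."""
    return sum(judgments) / len(judgments)


def partial_score(judgments, fraction, rng):
    """Score computed from a random subset of the unit judgments."""
    k = max(1, round(len(judgments) * fraction))
    return score(rng.sample(judgments, k))


# Synthetic data: five summaries with 40 units each and different support rates.
rng = random.Random(0)
summaries = [
    [rng.random() < p for _ in range(40)] for p in (0.2, 0.4, 0.6, 0.8, 0.95)
]
full = [score(j) for j in summaries]
half = [partial_score(j, 0.5, rng) for j in summaries]
tau = kendall_tau(full, half)
print(f"Kendall's tau between full and 50% annotation: {tau:.2f}")
```

A tau close to 1 means the half-annotated scores rank the summaries almost the same way as the full annotation does.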

The Potato config below reproduces the LongEval faithfulness task: annotators highlight atomic content units in a summary as spans labeled with a faithfulness verdict, record a radio-choice verdict for the most problematic unit, and rate the overall faithfulness of the summary on a Likert scale.

Venue: EACL 2023 (Outstanding Paper)
Papers surveyed: 162 long-form summarization papers
Domains: Books (SQuALITY) and PubMed science
Judgment unit: Clause-level content units
Variance reduction: Score std-dev 18.5 -> 6.8
Partial annotation: 0.89 Kendall's tau at 50% of units

Configuration File: config.yaml

This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.

yaml

# LongEval: Faithfulness Evaluation for Long-form Summarization
# Based on "LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization" (Krishna et al., EACL 2023)
# Task: Identify content units in summaries and check faithfulness against source documents

annotation_task_name: "LongEval Faithfulness Evaluation"
task_dir: "."

# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

# Display layout showing source document and summary
html_layout: |
  <div class="longeval-container">
    <div class="source-section" style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin-bottom: 15px; max-height: 400px; overflow-y: auto;">
      <h3 style="margin-top: 0;">Source Document:</h3>
      <div class="source-text" style="white-space: pre-wrap; line-height: 1.6;">{{source_document}}</div>
    </div>
    <div class="model-info" style="background: #fff3e0; padding: 10px; border-radius: 8px; margin-bottom: 15px;">
      <strong>Summary Model:</strong> {{summary_model}}
    </div>
    <div class="summary-section" style="background: #e3f2fd; padding: 15px; border-radius: 8px; border: 2px solid #1976d2;">
      <h3 style="margin-top: 0; color: #1976d2;">Summary (highlight content units below):</h3>
      <div class="summary-text" style="font-size: 16px; line-height: 1.8;">{{text}}</div>
    </div>
  </div>

# Annotation schemes
annotation_schemes:
  # Span annotation for identifying atomic content units
  - name: "content_units"
    description: "Highlight atomic content units (individual claims or facts) in the summary. Each span should represent one verifiable claim."
    annotation_type: span
    labels:
      - "Faithful"
      - "Unfaithful"
      - "Partially Faithful"
      - "Unverifiable"
    label_colors:
      "Faithful": "#4caf50"
      "Unfaithful": "#f44336"
      "Partially Faithful": "#ff9800"
      "Unverifiable": "#9e9e9e"

  # Verdict for the most problematic content unit (one choice per summary)
  - name: "faithfulness_verdict"
    description: "What is the overall faithfulness verdict for the most problematic content unit?"
    annotation_type: radio
    labels:
      - "Faithful - all claims supported by source"
      - "Unfaithful - contains claims contradicting source"
      - "Partially Faithful - some claims supported, others not"
      - "Unverifiable - claims cannot be checked against source"
    keyboard_shortcuts:
      "Faithful - all claims supported by source": "1"
      "Unfaithful - contains claims contradicting source": "2"
      "Partially Faithful - some claims supported, others not": "3"
      "Unverifiable - claims cannot be checked against source": "4"

  # Overall faithfulness rating
  - name: "overall_faithfulness"
    description: "Rate the overall faithfulness of the entire summary on a 1-5 scale."
    annotation_type: likert
    size: 5
    min_label: "1 - Completely unfaithful"
    max_label: "5 - Completely faithful"
    labels:
      - "1 - Completely unfaithful (major fabrications)"
      - "2 - Mostly unfaithful (several unsupported claims)"
      - "3 - Mixed (some faithful, some unfaithful)"
      - "4 - Mostly faithful (minor inaccuracies)"
      - "5 - Completely faithful (all claims supported)"
    keyboard_shortcuts:
      "1 - Completely unfaithful (major fabrications)": "q"
      "2 - Mostly unfaithful (several unsupported claims)": "w"
      "3 - Mixed (some faithful, some unfaithful)": "e"
      "4 - Mostly faithful (minor inaccuracies)": "r"
      "5 - Completely faithful (all claims supported)": "t"

# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 50
annotation_per_instance: 3
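
Because the keyboard_shortcuts entries in this config repeat the label strings verbatim, a typo in either place is easy to miss. The following is a small consistency check, written as a sketch: it assumes the annotation_schemes list has already been loaded into Python dictionaries (for example with a YAML parser, which is not shown), and the function name is hypothetical rather than part of Potato.

```python
def shortcut_problems(schemes):
    """Return human-readable problems found in the schemes' keyboard shortcuts."""
    problems = []
    key_owner = {}  # shortcut key -> "scheme/label" that first claimed it
    for scheme in schemes:
        name = scheme["name"]
        labels = scheme.get("labels", [])
        shortcuts = scheme.get("keyboard_shortcuts", {})
        for label in shortcuts:
            if label not in labels:
                problems.append(f"{name}: shortcut for unknown label {label!r}")
        for label in labels:
            # Only flag missing shortcuts when the scheme defines any at all.
            if shortcuts and label not in shortcuts:
                problems.append(f"{name}: label {label!r} has no shortcut")
        for label, key in shortcuts.items():
            if key in key_owner:
                problems.append(
                    f"{name}: key {key!r} already used by {key_owner[key]}"
                )
            else:
                key_owner[key] = f"{name}/{label}"
    return problems


# Abbreviated copy of two schemes from the config above.
schemes = [
    {
        "name": "faithfulness_verdict",
        "labels": [
            "Faithful - all claims supported by source",
            "Unfaithful - contains claims contradicting source",
        ],
        "keyboard_shortcuts": {
            "Faithful - all claims supported by source": "1",
            "Unfaithful - contains claims contradicting source": "2",
        },
    },
    {
        "name": "overall_faithfulness",
        "labels": ["1 - Completely unfaithful (major fabrications)"],
        "keyboard_shortcuts": {
            "1 - Completely unfaithful (major fabrications)": "q",
        },
    },
]
print(shortcut_problems(schemes))
```

The check also reports keys reused across schemes, since both the radio and Likert questions are shown on the same page.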

Sample Data: sample-data.json

json

[
  {
    "id": "le_001",
    "text": "The study found that regular consumption of green tea reduces the risk of heart disease by 25%. The researchers analyzed data from 50,000 participants over a 10-year period in Japan. They concluded that the catechins in green tea have a protective effect on cardiovascular health.",
    "source_document": "A large-scale prospective study conducted in Japan examined the relationship between green tea consumption and cardiovascular disease. The study followed 40,530 adults aged 40-79 for 11 years. Results showed that participants who consumed 5 or more cups of green tea per day had a 26% lower risk of death from cardiovascular disease compared to those who drank less than one cup per day. The researchers attributed this benefit to the polyphenolic compounds found in green tea, particularly catechins, which have antioxidant and anti-inflammatory properties.",
    "summary_model": "GPT-4"
  },
  {
    "id": "le_002",
    "text": "The Paris Agreement, signed in 2015, commits all signatory nations to limit global warming to 1.5 degrees Celsius above pre-industrial levels. Under the agreement, each country must submit nationally determined contributions every five years. The United States was the first major country to rejoin the agreement after initially withdrawing.",
    "source_document": "The Paris Agreement is an international treaty on climate change adopted on December 12, 2015, at COP21 in Paris, France. The treaty aims to limit the global average temperature increase to well below 2 degrees Celsius above pre-industrial levels, while pursuing efforts to limit it to 1.5 degrees Celsius. Each party to the agreement is required to submit nationally determined contributions (NDCs) that outline their climate action plans. These NDCs must be updated every five years, with each successive plan expected to be more ambitious. The United States formally withdrew from the agreement in November 2020 under the Trump administration, but rejoined on February 19, 2021, under the Biden administration.",
    "summary_model": "Claude-3"
  }
]

// ... and 7 more items
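
Each item needs the fields the config refers to: id and text (from item_properties) plus source_document and summary_model (from the html_layout placeholders). As a rough pre-flight check before starting the server, a sketch like the one below can flag items with missing or empty fields and duplicate ids. The function names are hypothetical, and the default path simply mirrors the data_files entry in the config.

```python
import json

REQUIRED_KEYS = ("id", "text", "source_document", "summary_model")


def find_problems(items):
    """Return a list of problems found in a list of data items."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        # A field counts as missing if it is absent or only whitespace.
        missing = [k for k in REQUIRED_KEYS if not str(item.get(k, "")).strip()]
        if missing:
            problems.append(f"item {index}: missing or empty {', '.join(missing)}")
        item_id = item.get("id")
        if item_id is not None and item_id in seen_ids:
            problems.append(f"item {index}: duplicate id {item_id!r}")
        seen_ids.add(item_id)
    return problems


def check_file(path="sample-data.json"):
    """Load a JSON array of items from disk and validate it."""
    with open(path, encoding="utf-8") as handle:
        return find_problems(json.load(handle))


# Tiny in-memory example with placeholder values.
items = [
    {"id": "x1", "text": "summary", "source_document": "source", "summary_model": "m"},
    {"id": "x1", "text": "summary", "source_document": "", "summary_model": "m"},
]
for problem in find_problems(items):
    print(problem)
```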

Get This Design

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/longeval-faithfulness
potato start config.yaml

Dataset & paper

Krishna et al., EACL 2023

Citation (BibTeX)

bibtex

@inproceedings{krishna2023longeval,
  title={LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization},
  author={Krishna, Kalpesh and Bransom, Erin and Kuehl, Bailey and Iyyer, Mohit and Dasigi, Pradeep and Cohan, Arman and Lo, Kyle},
  booktitle={Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
  year={2023}
}

Details

Annotation Types: span, radio, likert
Domain: NLP, Summarization, Evaluation
Use Cases: Faithfulness Evaluation, Summarization Quality, Hallucination Detection
