LongEval: Faithfulness Evaluation for Long-form Summarization
Faithfulness evaluation of long-form summaries. Annotators identify atomic content units in summaries, check each against source documents for faithfulness, and rate overall summary quality.
Configuration File: config.yaml
# LongEval: Faithfulness Evaluation for Long-form Summarization
# Based on "LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization" (Krishna et al., EACL 2023)
# Task: Identify content units in summaries and check faithfulness against source documents
annotation_task_name: "LongEval Faithfulness Evaluation"
task_dir: "."
# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Display layout showing source document and summary
html_layout: |
  <div class="longeval-container">
    <div class="source-section" style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin-bottom: 15px; max-height: 400px; overflow-y: auto;">
      <h3 style="margin-top: 0;">Source Document:</h3>
      <div class="source-text" style="white-space: pre-wrap; line-height: 1.6;">{{source_document}}</div>
    </div>
    <div class="model-info" style="background: #fff3e0; padding: 10px; border-radius: 8px; margin-bottom: 15px;">
      <strong>Summary Model:</strong> {{summary_model}}
    </div>
    <div class="summary-section" style="background: #e3f2fd; padding: 15px; border-radius: 8px; border: 2px solid #1976d2;">
      <h3 style="margin-top: 0; color: #1976d2;">Summary (highlight content units below):</h3>
      <div class="summary-text" style="font-size: 16px; line-height: 1.8;">{{text}}</div>
    </div>
  </div>
# Annotation schemes
annotation_schemes:
  # Span annotation for identifying atomic content units
  - name: "content_units"
    description: "Highlight atomic content units (individual claims or facts) in the summary. Each span should represent one verifiable claim."
    annotation_type: span
    labels:
      - "Faithful"
      - "Unfaithful"
      - "Partially Faithful"
      - "Unverifiable"
    label_colors:
      "Faithful": "#4caf50"
      "Unfaithful": "#f44336"
      "Partially Faithful": "#ff9800"
      "Unverifiable": "#9e9e9e"
  # Per-unit faithfulness verdict
  - name: "faithfulness_verdict"
    description: "What is the overall faithfulness verdict for the most problematic content unit?"
    annotation_type: radio
    labels:
      - "Faithful - all claims supported by source"
      - "Unfaithful - contains claims contradicting source"
      - "Partially Faithful - some claims supported, others not"
      - "Unverifiable - claims cannot be checked against source"
    keyboard_shortcuts:
      "Faithful - all claims supported by source": "1"
      "Unfaithful - contains claims contradicting source": "2"
      "Partially Faithful - some claims supported, others not": "3"
      "Unverifiable - claims cannot be checked against source": "4"
  # Overall faithfulness rating
  - name: "overall_faithfulness"
    description: "Rate the overall faithfulness of the entire summary on a 1-5 scale."
    annotation_type: likert
    size: 5
    min_label: "1 - Completely unfaithful"
    max_label: "5 - Completely faithful"
    labels:
      - "1 - Completely unfaithful (major fabrications)"
      - "2 - Mostly unfaithful (several unsupported claims)"
      - "3 - Mixed (some faithful, some unfaithful)"
      - "4 - Mostly faithful (minor inaccuracies)"
      - "5 - Completely faithful (all claims supported)"
    keyboard_shortcuts:
      "1 - Completely unfaithful (major fabrications)": "q"
      "2 - Mostly unfaithful (several unsupported claims)": "w"
      "3 - Mixed (some faithful, some unfaithful)": "e"
      "4 - Mostly faithful (minor inaccuracies)": "r"
      "5 - Completely faithful (all claims supported)": "t"
# User configuration
allow_all_users: true
# Task assignment
instances_per_annotator: 50
annotation_per_instance: 3
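The config above collects span labels from three annotators per item (`annotation_per_instance: 3`). One plausible way to turn those labels into a single summary-level number is the fraction of content units marked "Faithful", averaged across annotators. The sketch below assumes the labels have already been extracted into plain Python lists; the Potato output schema is not shown on this page, so the input shape and the helper names `unit_faithfulness` and `summary_score` are hypothetical.

```python
from statistics import mean

FAITHFUL = "Faithful"  # matches the span label defined in config.yaml


def unit_faithfulness(labels):
    """Return the fraction of content units labeled Faithful, or None if there are no units."""
    if not labels:
        return None
    return sum(1 for label in labels if label == FAITHFUL) / len(labels)


def summary_score(per_annotator_labels):
    """Average the per-annotator fractions, skipping annotators who marked no spans."""
    scores = [
        score
        for score in (unit_faithfulness(labels) for labels in per_annotator_labels)
        if score is not None
    ]
    return mean(scores) if scores else None


# Hypothetical span labels from three annotators for a single summary.
annotations = [
    ["Faithful", "Unfaithful", "Faithful", "Faithful"],
    ["Faithful", "Unfaithful", "Partially Faithful", "Faithful"],
    ["Faithful", "Faithful", "Faithful", "Faithful"],
]

print(summary_score(annotations))
```

Treating "Partially Faithful" and "Unverifiable" as not faithful is a strict choice made only for this sketch; a project might instead give partial credit or exclude unverifiable units from the denominator.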
Sample Data: sample-data.json
[
  {
    "id": "le_001",
    "text": "The study found that regular consumption of green tea reduces the risk of heart disease by 25%. The researchers analyzed data from 50,000 participants over a 10-year period in Japan. They concluded that the catechins in green tea have a protective effect on cardiovascular health.",
    "source_document": "A large-scale prospective study conducted in Japan examined the relationship between green tea consumption and cardiovascular disease. The study followed 40,530 adults aged 40-79 for 11 years. Results showed that participants who consumed 5 or more cups of green tea per day had a 26% lower risk of death from cardiovascular disease compared to those who drank less than one cup per day. The researchers attributed this benefit to the polyphenolic compounds found in green tea, particularly catechins, which have antioxidant and anti-inflammatory properties.",
    "summary_model": "GPT-4"
  },
  {
    "id": "le_002",
    "text": "The Paris Agreement, signed in 2015, commits all signatory nations to limit global warming to 1.5 degrees Celsius above pre-industrial levels. Under the agreement, each country must submit nationally determined contributions every five years. The United States was the first major country to rejoin the agreement after initially withdrawing.",
    "source_document": "The Paris Agreement is an international treaty on climate change adopted on December 12, 2015, at COP21 in Paris, France. The treaty aims to limit the global average temperature increase to well below 2 degrees Celsius above pre-industrial levels, while pursuing efforts to limit it to 1.5 degrees Celsius. Each party to the agreement is required to submit nationally determined contributions (NDCs) that outline their climate action plans. These NDCs must be updated every five years, with each successive plan expected to be more ambitious. The United States formally withdrew from the agreement in November 2020 under the Trump administration, but rejoined on February 19, 2021, under the Biden administration.",
    "summary_model": "Claude-3"
  }
]
// ... and 7 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/longeval-faithfulness
potato start config.yaml
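Before starting the server with your own data, it may help to confirm that every item supplies the fields the layout references (`{{source_document}}`, `{{summary_model}}`, `{{text}}`) along with the `id` and `text` keys named in `item_properties`. The following is a minimal sketch of such a check, not part of Potato itself; the function names `placeholders` and `check_items` are hypothetical, and the inline layout and items stand in for the real `config.yaml` and `sample-data.json`.

```python
import json
import re

# Keys named under item_properties in config.yaml.
REQUIRED_KEYS = {"id", "text"}


def placeholders(layout):
    """Collect the {{name}} placeholders that appear in an HTML layout string."""
    return set(re.findall(r"\{\{\s*(\w+)\s*\}\}", layout))


def check_items(items, layout):
    """Return a list of human-readable problems; an empty list means the data looks usable."""
    needed = REQUIRED_KEYS | placeholders(layout)
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        missing = sorted(needed - set(item))
        if missing:
            problems.append(f"item {index}: missing {missing}")
        item_id = item.get("id")
        if item_id in seen_ids:
            problems.append(f"item {index}: duplicate id {item_id!r}")
        seen_ids.add(item_id)
    return problems


# Shortened stand-ins for the layout and data shown on this page.
layout = "<div>{{source_document}}</div><div>{{summary_model}}</div><div>{{text}}</div>"
items = json.loads(
    '[{"id": "le_001", "text": "summary", "source_document": "source", "summary_model": "GPT-4"},'
    ' {"id": "le_002", "text": "summary", "source_document": "source"}]'
)

for problem in check_items(items, layout):
    print(problem)
```

To run the same check against the real files, load the list with `json.load` from `sample-data.json` and pass the `html_layout` string from the config as `layout`.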
Related Designs
ESA: Error Span Annotation for Machine Translation
Error span annotation for machine translation output. Annotators identify error spans in translations, classify error types (accuracy, fluency, terminology, style), and rate severity.
News Headline Emotion Roles (GoodNewsEveryone)
Annotate emotions in news headlines with semantic roles. Based on Bostan et al., LREC 2020. Identify emotion, experiencer, cause, target, and textual cue.
NLI with Explanations (e-SNLI)
Natural language inference with human explanations. Based on e-SNLI (Camburu et al., NeurIPS 2018). Classify entailment/contradiction/neutral and provide natural language justifications.