Summary Preference Comparison
Pairwise comparison of text summaries with axis-based quality ratings. Annotators select preferred summaries and rate them on accuracy, coverage, and coherence for reward model training.
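For reward-model training, the 5-point preference labels defined in this design are typically converted into a pairwise target. A minimal sketch of one such mapping — the numeric values here are an illustrative assumption, not part of the design itself:

```python
# Sketch: map the 5-point preference labels from this config to a
# pairwise target P(A preferred over B). The numeric values are an
# assumption for illustration, not specified by the original design.
PREFERENCE_TO_TARGET = {
    "Summary A is clearly better": 1.0,
    "Summary A is slightly better": 0.75,
    "About the same": 0.5,
    "Summary B is slightly better": 0.25,
    "Summary B is clearly better": 0.0,
}

def preference_target(label: str) -> float:
    """Return the probability that Summary A is preferred, implied by a label."""
    return PREFERENCE_TO_TARGET[label]

print(preference_target("Summary A is slightly better"))  # 0.75
```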
Configuration file: config.yaml
# Summary Preference Comparison Configuration
# Based on OpenAI summarize_from_feedback (Stiennon et al., NeurIPS 2020)
# Task: Compare two summaries and rate quality on multiple axes
annotation_task_name: "Summary Preference Comparison"
task_dir: "."
# Data configuration
data_files:
  - data.json
item_properties:
  id_key: "id"
  text_key: "source_text"
# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Display layout showing source text and both summaries
html_layout: |
  <div class="summary-comparison">
    <div class="source-section" style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin-bottom: 20px; max-height: 300px; overflow-y: auto;">
      <h3 style="margin-top: 0;">📄 Original Text:</h3>
      <div class="source-text" style="font-size: 14px; line-height: 1.6;">{{source_text}}</div>
    </div>
    <div class="summaries-row" style="display: flex; gap: 20px;">
      <div class="summary-a" style="flex: 1; background: #e8f5e9; padding: 15px; border-radius: 8px;">
        <h3 style="margin-top: 0; color: #2e7d32;">Summary A:</h3>
        <div class="summary-text">{{summary_a}}</div>
      </div>
      <div class="summary-b" style="flex: 1; background: #e3f2fd; padding: 15px; border-radius: 8px;">
        <h3 style="margin-top: 0; color: #1565c0;">Summary B:</h3>
        <div class="summary-text">{{summary_b}}</div>
      </div>
    </div>
  </div>
# Annotation schemes
annotation_schemes:
  # Overall preference
  - name: "preference"
    description: "Which summary is better overall?"
    annotation_type: radio
    labels:
      - "Summary A is clearly better"
      - "Summary A is slightly better"
      - "About the same"
      - "Summary B is slightly better"
      - "Summary B is clearly better"
    keyboard_shortcuts:
      "Summary A is clearly better": "1"
      "Summary A is slightly better": "2"
      "About the same": "3"
      "Summary B is slightly better": "4"
      "Summary B is clearly better": "5"
  # Confidence in preference
  - name: "confidence"
    description: "How confident are you in your preference?"
    annotation_type: likert
    size: 5
    min_label: "Not confident"
    max_label: "Very confident"
    labels:
      - "1 - Wild guess"
      - "2 - Somewhat uncertain"
      - "3 - Moderately confident"
      - "4 - Fairly confident"
      - "5 - Very confident"
  # Axis ratings for Summary A
  - name: "summary_a_accuracy"
    description: "Summary A - Accuracy: Does it only contain information from the source text?"
    annotation_type: likert
    size: 7
    min_label: "1 - Many errors"
    max_label: "7 - Perfectly accurate"
    labels:
      - "1 - Contains major false information"
      - "2"
      - "3"
      - "4 - Some minor inaccuracies"
      - "5"
      - "6"
      - "7 - Completely accurate"
  - name: "summary_a_coverage"
    description: "Summary A - Coverage: Does it capture the main points of the original?"
    annotation_type: likert
    size: 7
    min_label: "1 - Missing key info"
    max_label: "7 - Complete coverage"
    labels:
      - "1 - Misses most important points"
      - "2"
      - "3"
      - "4 - Captures some main points"
      - "5"
      - "6"
      - "7 - Covers all important points"
  - name: "summary_a_coherence"
    description: "Summary A - Coherence: Is it well-written and easy to understand?"
    annotation_type: likert
    size: 7
    min_label: "1 - Incoherent"
    max_label: "7 - Perfectly clear"
    labels:
      - "1 - Confusing and poorly written"
      - "2"
      - "3"
      - "4 - Understandable but awkward"
      - "5"
      - "6"
      - "7 - Clear and well-written"
  # Axis ratings for Summary B
  - name: "summary_b_accuracy"
    description: "Summary B - Accuracy: Does it only contain information from the source text?"
    annotation_type: likert
    size: 7
    min_label: "1 - Many errors"
    max_label: "7 - Perfectly accurate"
    labels:
      - "1 - Contains major false information"
      - "2"
      - "3"
      - "4 - Some minor inaccuracies"
      - "5"
      - "6"
      - "7 - Completely accurate"
  - name: "summary_b_coverage"
    description: "Summary B - Coverage: Does it capture the main points of the original?"
    annotation_type: likert
    size: 7
    min_label: "1 - Missing key info"
    max_label: "7 - Complete coverage"
    labels:
      - "1 - Misses most important points"
      - "2"
      - "3"
      - "4 - Captures some main points"
      - "5"
      - "6"
      - "7 - Covers all important points"
  - name: "summary_b_coherence"
    description: "Summary B - Coherence: Is it well-written and easy to understand?"
    annotation_type: likert
    size: 7
    min_label: "1 - Incoherent"
    max_label: "7 - Perfectly clear"
    labels:
      - "1 - Confusing and poorly written"
      - "2"
      - "3"
      - "4 - Understandable but awkward"
      - "5"
      - "6"
      - "7 - Clear and well-written"
# User configuration
allow_all_users: true
# Task assignment
instances_per_annotator: 100
annotation_per_instance: 2
# Instructions
annotation_instructions: |
  ## Summary Comparison Task
  Your goal is to compare two summaries of the same text and evaluate their quality.

  ### Step 1: Read the Original Text
  - Understand the main points and key information
  - Note what's most important to include in a summary

  ### Step 2: Compare Summaries
  - Read both Summary A and Summary B
  - Decide which one is better overall

  ### Step 3: Rate Each Summary on 3 Axes
  **Accuracy (1-7)**
  - Does the summary contain ONLY true information from the source?
  - Deduct points for: fabricated details, misrepresentation, factual errors
  - A summary can be accurate but incomplete

  **Coverage (1-7)**
  - Does the summary include the MAIN POINTS?
  - Deduct points for: missing key information, unbalanced emphasis
  - A good summary captures what's most important

  **Coherence (1-7)**
  - Is the summary WELL-WRITTEN and easy to follow?
  - Deduct points for: grammatical errors, awkward phrasing, poor flow
  - Consider readability independent of content

  ### What Makes a Good Summary?
  - Captures the essential information
  - Is factually accurate (no hallucinations)
  - Is concise without losing important details
  - Reads smoothly and is easy to understand

  ### Tips:
  - Re-read the source if needed to verify accuracy
  - A shorter summary isn't always better
  - Consider what a reader who only sees the summary would understand
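The three 1–7 axis ratings collected by this config are often combined into a single quality score downstream. A minimal sketch assuming a plain average — the design itself keeps the axes separate, so the aggregation rule here is an assumption:

```python
# Sketch: combine the three 1-7 axis ratings (accuracy, coverage,
# coherence) into one per-summary quality score. A plain average is an
# assumption for illustration; the original protocol reports axes separately.
def composite_score(accuracy: int, coverage: int, coherence: int) -> float:
    """Average the three 1-7 axis ratings into one quality score."""
    for rating in (accuracy, coverage, coherence):
        if not 1 <= rating <= 7:
            raise ValueError(f"axis ratings must be in 1..7, got {rating}")
    return (accuracy + coverage + coherence) / 3

print(composite_score(7, 6, 5))  # 6.0
```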
Sample data: sample-data.json
[
{
"id": "sum_001",
"source_text": "Scientists at MIT have developed a new type of battery that could revolutionize electric vehicles. The solid-state battery uses a lithium-metal anode instead of the graphite anodes in current lithium-ion batteries, potentially doubling energy density. In tests, the battery maintained 80% capacity after 10,000 charge cycles, far exceeding current batteries which typically last 1,000-2,000 cycles. The researchers say the technology could be commercially available within 5 years, though manufacturing challenges remain. Major automakers including Toyota and BMW have expressed interest in licensing the technology.",
"summary_a": "MIT scientists created a new solid-state battery with lithium-metal anode that doubles energy density and lasts 10,000 cycles. Commercial availability expected in 5 years, with Toyota and BMW interested in the technology.",
"summary_b": "A revolutionary new battery technology has been developed that will change everything about electric cars. The battery is much better than current ones and big car companies want to use it."
},
{
"id": "sum_002",
"source_text": "The city council voted 7-2 last night to approve a controversial new zoning ordinance that will allow higher-density housing developments in traditionally single-family neighborhoods. Supporters argue the measure will address the housing shortage and make homes more affordable for young families. Opponents, including several neighborhood associations, say it will increase traffic, strain schools, and change the character of established communities. The ordinance takes effect in 90 days but faces a potential ballot challenge from a coalition of homeowner groups who are gathering signatures for a referendum.",
"summary_a": "City council approved (7-2) a zoning ordinance allowing higher-density housing in single-family areas to address housing shortage. Opponents cite traffic and school concerns. The ordinance takes effect in 90 days but may face a referendum challenge.",
"summary_b": "The city approved new housing rules despite opposition from homeowner groups who are concerned."
}
]
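Each item must carry the fields the config reads: the `id_key` and `text_key` declared under `item_properties`, plus the two summary fields referenced in `html_layout`. A small validation sketch (the field names come directly from config.yaml; the helper itself is illustrative):

```python
import json

# Sketch: check that each data item has the fields this config expects
# (id_key "id", text_key "source_text") plus the two summary fields
# referenced in html_layout. The helper is illustrative, not part of Potato.
REQUIRED_KEYS = {"id", "source_text", "summary_a", "summary_b"}

def validate_items(raw_json: str) -> list[str]:
    """Return the ids of items missing any required key."""
    bad = []
    for item in json.loads(raw_json):
        if not REQUIRED_KEYS <= item.keys():
            bad.append(item.get("id", "<missing id>"))
    return bad

sample = '[{"id": "sum_001", "source_text": "...", "summary_a": "...", "summary_b": "..."}]'
print(validate_items(sample))  # []
```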
// ... and 3 more items
Get this design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/preference-learning/summary-preference-comparison
potato start config.yaml
Found a problem or want to improve this design?
Open an issue
Related designs
LongEval: Faithfulness Evaluation for Long-form Summarization
Faithfulness evaluation of long-form summaries. Annotators identify atomic content units in summaries, check each against source documents for faithfulness, and rate overall summary quality.
Moral Stories Annotation
Annotate moral reasoning in situated narratives. Based on Emelin et al., EMNLP 2021. Evaluate whether actions adhere to or diverge from social norms given situations and intentions.
SafeRLHF Dual-Dimension Preference
Safety-aware preference annotation with separate judgments for helpfulness and harmlessness. Includes safety category labeling across 19 harm types for constrained AI alignment.