WebGPT Answer Comparison
Compare answers with and without web search access. Evaluate whether web-augmented responses are more accurate, helpful, and well-sourced.
Configuration file: config.yaml
# WebGPT Answer Comparison Configuration
# Based on Nakano et al., 2021
# Task: Compare web-augmented vs standard AI answers
annotation_task_name: "WebGPT Answer Comparison"
task_dir: "."
data_files:
  - data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
annotation_schemes:
  - name: "overall_preference"
    description: |
      Which answer is BETTER overall?
      Consider accuracy, helpfulness, and sourcing.
    annotation_type: radio
    labels:
      - "Response A is much better"
      - "Response A is slightly better"
      - "About equal"
      - "Response B is slightly better"
      - "Response B is much better"
  - name: "factual_accuracy"
    description: "Which response is more FACTUALLY ACCURATE?"
    annotation_type: radio
    labels:
      - "A is more accurate"
      - "Both equally accurate"
      - "B is more accurate"
      - "Cannot determine"
  - name: "source_quality"
    description: "Which response has BETTER SOURCES/CITATIONS?"
    annotation_type: radio
    labels:
      - "A has better sources"
      - "Both have similar source quality"
      - "B has better sources"
      - "Neither provides sources"
  - name: "completeness"
    description: "Which response is more COMPLETE?"
    annotation_type: radio
    labels:
      - "A is more complete"
      - "Both equally complete"
      - "B is more complete"
  - name: "relevance"
    description: "Which response better ADDRESSES THE QUESTION?"
    annotation_type: radio
    labels:
      - "A is more relevant"
      - "Both equally relevant"
      - "B is more relevant"
  - name: "confidence"
    description: "How confident are you in your overall preference?"
    annotation_type: radio
    labels:
      - "Very confident"
      - "Somewhat confident"
      - "Not very confident"
      - "Guessing"
allow_all_users: true
instances_per_annotator: 80
annotation_per_instance: 3
annotation_instructions: |
  ## WebGPT Answer Comparison

  Compare two AI-generated answers to the same question.

  ### What to evaluate:

  **Factual Accuracy**
  - Are the facts correct?
  - Are claims supported?
  - Any obvious errors?

  **Source Quality**
  - Are sources provided?
  - Are sources reliable?
  - Do sources support the claims?

  **Completeness**
  - Does it fully answer the question?
  - Are important aspects covered?
  - Appropriate level of detail?

  **Relevance**
  - Does it address what was asked?
  - Is the information pertinent?
  - Stays on topic?

  ### Guidelines:
  - Read the question carefully first
  - Evaluate each dimension independently
  - "About equal" is valid when truly indistinguishable
  - Note: One response may have used web search

  ### Tips:
  - Look for specific details and citations
  - Check if claims seem verifiable
  - Consider whether you'd trust the answer
  - More sources ≠ better (quality matters)
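Each of the six radio schemes collects one label per annotator, and with `annotation_per_instance: 3` every question pair receives three overall-preference judgments. A minimal sketch of how those judgments might be reduced to a single comparison score, WebGPT-style; the -2..+2 mapping and the `comparison_score` helper are assumptions for illustration, not part of Potato:

```python
# Hypothetical mapping from the 5-point overall_preference labels to a
# signed score: positive favors Response A, negative favors Response B.
LABEL_SCORES = {
    "Response A is much better": 2,
    "Response A is slightly better": 1,
    "About equal": 0,
    "Response B is slightly better": -1,
    "Response B is much better": -2,
}

def comparison_score(labels):
    """Average the signed scores across the annotators of one instance."""
    scores = [LABEL_SCORES[label] for label in labels]
    return sum(scores) / len(scores)

# Example: three annotators, as configured by annotation_per_instance: 3
votes = ["Response A is slightly better", "About equal", "Response A is much better"]
score = comparison_score(votes)  # (1 + 0 + 2) / 3 = 1.0, i.e. a mild preference for A
```

A signed average like this keeps "About equal" at zero and lets downstream preference-learning code treat the pairwise judgment as a scalar.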
Sample data: sample-data.json
[
  {
    "id": "webgpt_001",
    "question": "What is the tallest building in the world as of 2024?",
    "response_a": "The Burj Khalifa in Dubai is the tallest building in the world, standing at 828 meters (2,717 feet). It was completed in 2010 and has held the record since then.",
    "response_b": "The Burj Khalifa in Dubai, UAE is the tallest building in the world at 828 meters [1]. It surpassed Taipei 101 when completed in 2010. The Jeddah Tower in Saudi Arabia is under construction and planned to exceed 1,000 meters when completed [2].\n\nSources:\n[1] https://www.burjkhalifa.ae\n[2] https://www.jeddahtower.com"
  },
  {
    "id": "webgpt_002",
    "question": "How does CRISPR gene editing work?",
    "response_a": "CRISPR uses a guide RNA to direct the Cas9 enzyme to a specific location in DNA. The enzyme cuts both strands, and the cell's repair mechanisms either disable the gene or allow insertion of new sequences.",
    "response_b": "CRISPR-Cas9 is a gene editing tool adapted from bacterial immune systems. A guide RNA (gRNA) complementary to the target sequence directs Cas9 to make a double-strand break. This triggers either non-homologous end joining (NHEJ), causing insertions/deletions, or homology-directed repair (HDR) for precise edits when a template is provided."
  }
]

Get this design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/preference-learning/webgpt-comparison
potato start config.yaml
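Before launching, it can help to confirm that your data file matches the item shape shown in sample-data.json (`id`, `question`, `response_a`, `response_b`). A small validation sketch; the `validate_items` helper and its error message are illustrative assumptions, not part of Potato:

```python
import json

# Fields each record is expected to carry, per sample-data.json.
REQUIRED_KEYS = {"id", "question", "response_a", "response_b"}

def validate_items(raw_json):
    """Parse a JSON array of items and return their ids.

    Raises ValueError if any record is missing a required field.
    """
    items = json.loads(raw_json)
    for item in items:
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"item {item.get('id', '?')} missing {sorted(missing)}")
    return [item["id"] for item in items]

# Example with one well-formed record:
sample = '[{"id": "webgpt_001", "question": "q", "response_a": "a", "response_b": "b"}]'
ids = validate_items(sample)  # → ["webgpt_001"]
```

Catching a malformed record here is cheaper than discovering it mid-annotation.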
Found a problem or want to improve this design?
Open an issue

Related designs
Bias Benchmark for QA (BBQ)
Annotate question-answering examples designed to probe social biases. Based on BBQ (Parrish et al., Findings of ACL 2022). Annotators select the correct answer given a context, assess the direction of bias in the question, categorize the type of bias, and explain their reasoning.
BRAINTEASER - Commonsense-Defying QA
Lateral thinking and commonsense-defying question answering task requiring annotators to select answers to brain teasers that defy default commonsense assumptions and provide explanations. Based on SemEval-2024 Task 9 (BRAINTEASER).
Community Question Answering
Answer quality classification for community forum question-answer pairs, based on SemEval-2017 Task 3. Annotators judge whether a given answer is good, potentially useful, or bad with respect to the original question.