GPQA - Graduate-Level Expert QA Evaluation
Expert-level question answering on graduate-level science questions from the GPQA benchmark (Rein et al., ICLR 2024). Questions span physics, chemistry, and biology, and are designed to be answerable only by domain experts.
Configuration File: config.yaml
# GPQA - Graduate-Level Expert QA Evaluation
# Based on Rein et al., ICLR 2024
# Paper: https://arxiv.org/abs/2311.12022
# Dataset: https://github.com/idavidrein/gpqa
#
# This task evaluates graduate-level science questions from the GPQA benchmark.
# Annotators review a question with four answer options and select the correct
# answer, provide a confidence score, and write an explanation for their choice.
#
# Answer Options:
# - A, B, C, D: Four possible answers; exactly one is correct
#
# Annotation Guidelines:
# 1. Read the question carefully
# 2. Review all four answer options
# 3. Select the best answer
# 4. Rate your confidence (0-100)
# 5. Provide a brief explanation for your choice
annotation_task_name: "GPQA - Graduate-Level Expert QA Evaluation"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  - annotation_type: radio
    name: answer_choice
    description: "Select the correct answer from the four options"
    labels:
      - "A"
      - "B"
      - "C"
      - "D"
    keyboard_shortcuts:
      "A": "1"
      "B": "2"
      "C": "3"
      "D": "4"
    tooltips:
      "A": "Select option A as the correct answer"
      "B": "Select option B as the correct answer"
      "C": "Select option C as the correct answer"
      "D": "Select option D as the correct answer"
  - annotation_type: number
    name: confidence_score
    description: "Confidence score (0-100)"
  - annotation_type: text
    name: explanation
    description: "Provide a brief explanation for your answer choice"
annotation_instructions: |
  You will be shown a graduate-level science question with four answer options.
  1. Read the question and all four options carefully.
  2. Select the correct answer (A, B, C, or D).
  3. Enter your confidence score from 0 (pure guess) to 100 (completely certain).
  4. Write a brief explanation justifying your answer choice.
html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="background: #fef3c7; border: 1px solid #fde68a; border-radius: 8px; padding: 8px 12px; margin-bottom: 12px; display: inline-block;">
      <span style="font-weight: bold; color: #92400e;">Subject:</span>
      <span style="color: #78350f;">{{subject}}</span>
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Question:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 10px;">
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">A:</strong> {{option_a}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">B:</strong> {{option_b}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">C:</strong> {{option_c}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">D:</strong> {{option_d}}
      </div>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
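Before launching the task, it can help to confirm that each item in the data file carries every field the config and the `{{...}}` placeholders in `html_layout` reference (`id`, `text`, `option_a`–`option_d`, `subject`). A minimal sketch — the `validate_items` helper is illustrative and not part of Potato itself:

```python
import json

# Fields referenced by item_properties and the html_layout template placeholders.
REQUIRED_KEYS = {"id", "text", "option_a", "option_b", "option_c", "option_d", "subject"}

def validate_items(path):
    """Return a list of (item_id, missing_keys) for items lacking required fields."""
    with open(path) as f:
        items = json.load(f)
    problems = []
    for item in items:
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append((item.get("id", "<no id>"), sorted(missing)))
    return problems
```

Running this against `sample-data.json` before `potato start` catches missing option or subject fields that would otherwise render as empty cells in the layout.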
Sample Data: sample-data.json
[
  {
    "id": "gpqa_001",
    "text": "Consider a quantum system with two spin-1/2 particles in a singlet state. If a measurement of spin along the z-axis is performed on the first particle and yields spin-up, what is the probability of measuring spin-down along an axis tilted 60 degrees from z on the second particle?",
    "option_a": "1/4",
    "option_b": "3/4",
    "option_c": "1/2",
    "option_d": "cos^2(30) = 3/4",
    "subject": "Physics"
  },
  {
    "id": "gpqa_002",
    "text": "In the context of organic chemistry, which of the following best describes the stereochemical outcome of an E2 elimination reaction on a meso compound with two leaving groups?",
    "option_a": "A racemic mixture of enantiomers",
    "option_b": "A single achiral alkene product",
    "option_c": "A pair of diastereomeric alkenes",
    "option_d": "No reaction occurs due to symmetry constraints",
    "subject": "Chemistry"
  }
]
// ... and 8 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/gpqa-expert-qa
potato start config.yaml
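Because the config requests two annotations per instance (`annotation_per_instance: 2`) and writes JSON to `annotation_output/`, raw inter-annotator agreement on `answer_choice` can be computed once annotation finishes. The exact layout of Potato's output files varies by version, so this sketch assumes you have already flattened the output into `(instance_id, annotator, answer_choice)` triples:

```python
from collections import defaultdict

def percent_agreement(annotations):
    """annotations: iterable of (instance_id, annotator, answer_choice) triples.
    Returns the fraction of doubly-annotated instances whose two labels match."""
    by_instance = defaultdict(list)
    for instance_id, annotator, label in annotations:
        by_instance[instance_id].append(label)
    # Keep only instances that received exactly the configured two annotations.
    pairs = [labels for labels in by_instance.values() if len(labels) == 2]
    if not pairs:
        return 0.0
    agree = sum(1 for a, b in pairs if a == b)
    return agree / len(pairs)
```

Percent agreement is a simple first check; for a four-way choice with expert annotators, a chance-corrected statistic such as Cohen's kappa is the usual follow-up.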
Found an issue or want to improve this design? Open an issue on the repository.

Related Designs
BIG-Bench Task Evaluation
Evaluate language model responses on diverse reasoning tasks from the BIG-Bench benchmark. Annotators assess correctness, provide reasoning explanations, and rate confidence for model outputs across multiple task categories.
HumanEval Code Correctness Evaluation
Evaluation of code generation correctness based on the HumanEval benchmark (Chen et al., arXiv 2021). Annotators assess whether AI-generated code solutions are correct, provide code review comments, and rate code quality on a numeric scale.
FActScore: Fine-grained Atomic Evaluation of Factual Precision
Atomic fact evaluation in LLM-generated text. Annotators decompose generated text into atomic facts and verify each fact as supported, not-supported, or irrelevant against a reference source. Based on the FActScore framework for evaluating factual precision in long-form text generation.