PRM800K Step-by-Step Verification

Step-by-step verification of math reasoning chains, based on OpenAI's PRM800K dataset. Annotators mark each reasoning step as Correct, Neutral, or Incorrect, stopping at the first error they find, then judge whether the final answer is correct and explain any error.
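
The stop-at-first-error rule can be sketched in Python as follows. This is only an illustration of the rule described above, not code from Potato or PRM800K; the function name `truncate_at_first_error` and the lowercase label strings are assumptions made for the example.

```python
from typing import List

# Assumed label spellings; the config's radio labels are longer strings.
VALID_LABELS = {"correct", "neutral", "incorrect"}


def truncate_at_first_error(labels: List[str]) -> List[str]:
    """Keep step labels up to and including the first 'incorrect' one."""
    kept = []
    for label in labels:
        if label not in VALID_LABELS:
            raise ValueError(f"unknown label: {label!r}")
        kept.append(label)
        if label == "incorrect":
            break  # annotation stops at the first error
    return kept


print(truncate_at_first_error(["correct", "neutral", "incorrect", "correct"]))
# ['correct', 'neutral', 'incorrect']
```

Steps after the first error carry no label, which is why the sketch drops them rather than filling in a default.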

Configuration File: config.yaml

This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.

yaml

# PRM800K Step-by-Step Verification
# Based on "Let's Verify Step by Step" (Lightman et al., ICLR 2024)

annotation_task_name: "PRM800K Step-by-Step Verification"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

html_layout: |
  <div class="container" style="max-width: 800px; margin: 0 auto; font-family: 'Segoe UI', Arial, sans-serif;">
    <div style="background: #e8f4fd; border: 1px solid #b3d9f2; border-radius: 8px; padding: 18px 22px; margin-bottom: 20px;">
      <h3 style="margin: 0 0 10px 0; color: #1a5276; font-size: 16px;">Math Problem</h3>
      <div style="font-size: 15px; color: #2c3e50; line-height: 1.6;">{{text}}</div>
    </div>
    <div style="margin-bottom: 20px;">
      <h3 style="margin: 0 0 12px 0; color: #2c3e50; font-size: 15px;">Reasoning Steps</h3>
      <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.7;">{{steps}}</div>
    </div>
    <div style="background: #fef9e7; border: 2px solid #f9e79f; border-radius: 8px; padding: 14px 20px; margin-bottom: 10px;">
      <h3 style="margin: 0 0 6px 0; color: #7d6608; font-size: 15px;">Final Answer</h3>
      <div style="font-size: 16px; font-weight: 600; color: #1a5276;">{{final_answer}}</div>
    </div>
  </div>

annotation_schemes:
  - name: "step_correctness"
    annotation_type: radio
    description: "Rate each reasoning step. Stop and mark 'Incorrect' at the first error you find."
    labels:
      - "Correct — this step is logically valid"
      - "Neutral — this step doesn't clearly advance or harm"
      - "Incorrect — this step contains an error"
    keyboard_shortcuts:
      "Correct — this step is logically valid": "1"
      "Neutral — this step doesn't clearly advance or harm": "2"
      "Incorrect — this step contains an error": "3"

  - name: "solution_correctness"
    annotation_type: radio
    description: "Is the final answer correct?"
    labels:
      - "Correct — final answer is right"
      - "Incorrect — final answer is wrong"
    keyboard_shortcuts:
      "Correct — final answer is right": "4"
      "Incorrect — final answer is wrong": "5"

  - name: "error_explanation"
    annotation_type: text
    description: "If you found an error, explain what went wrong in that step."

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
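
The html_layout above references three placeholders ({{text}}, {{steps}}, {{final_answer}}), so each data item presumably needs a matching key. A minimal Python sketch of that check follows. It assumes placeholders map directly onto item keys, as the sample data suggests; the helper name `missing_fields` and the abbreviated layout string are hypothetical and not part of Potato.

```python
import re
from typing import Dict, List

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def missing_fields(layout: str, item: Dict[str, str]) -> List[str]:
    """Return placeholder names used in the layout that the item lacks."""
    names = PLACEHOLDER.findall(layout)
    unique = list(dict.fromkeys(names))  # drop duplicates, keep order
    return [name for name in unique if name not in item]


# Abbreviated stand-in for the html_layout in config.yaml.
layout = "<div>{{text}}</div><div>{{steps}}</div><div>{{final_answer}}</div>"

complete = {"id": "prm800k-002", "text": "...", "steps": "...", "final_answer": "..."}
incomplete = {"id": "hypothetical-item", "text": "..."}

print(missing_fields(layout, complete))    # []
print(missing_fields(layout, incomplete))  # ['steps', 'final_answer']
```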

Sample Data: sample-data.json

json

[
  {
    "id": "prm800k-001",
    "text": "Find the sum of all positive integers n such that n^2 + 12n - 2007 is a perfect square.",
    "steps": "Step 1: We need n^2 + 12n - 2007 = k^2 for some non-negative integer k.\nStep 2: Complete the square on the left: (n + 6)^2 - 36 - 2007 = k^2, so (n + 6)^2 - 2043 = k^2.\nStep 3: Rearrange to get (n + 6)^2 - k^2 = 2043, which factors as (n + 6 - k)(n + 6 + k) = 2043.\nStep 4: Factor 2043 = 3 × 681 = 3 × 3 × 227. So 2043 = 1 × 2043 = 3 × 681 = 9 × 227.\nStep 5: For each factorization 2043 = a × b with a ≤ b and a, b same parity, we get n + 6 - k = a and n + 6 + k = b, so n = (a + b)/2 - 6.\nStep 6: From (1, 2043): n = (1 + 2043)/2 - 6 = 1022 - 6 = 1016. From (3, 681): n = (3 + 681)/2 - 6 = 342 - 6 = 336. From (9, 227): n = (9 + 227)/2 - 6 = 118 - 6 = 112.",
    "final_answer": "The sum is 1016 + 336 + 112 = 1464."
  },
  {
    "id": "prm800k-002",
    "text": "A bag contains 4 red balls and 6 blue balls. Two balls are drawn without replacement. What is the probability that both balls are red?",
    "steps": "Step 1: The total number of balls is 4 + 6 = 10.\nStep 2: The probability of drawing a red ball first is 4/10 = 2/5.\nStep 3: After drawing one red ball, there are 3 red balls and 6 blue balls left, for 9 total.\nStep 4: The probability of drawing a second red ball is 3/9 = 1/3.\nStep 5: The probability of both events is (2/5) × (1/3) = 2/15.",
    "final_answer": "The probability is 2/15."
  }
]

// ... and 6 more items
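
The two items shown above can be checked independently of the annotation tool. The Python sketch below splits a newline-delimited `steps` string into individual steps and recomputes both final answers. The helper names are hypothetical, and the search bound of 5000 is an assumption that is safe here because the factorization in item prm800k-001 gives 1016 as the largest solution.

```python
import math
from fractions import Fraction
from typing import List


def split_steps(steps: str) -> List[str]:
    """Split the newline-delimited 'steps' field into individual steps."""
    return [line for line in steps.split("\n") if line.strip()]


def is_perfect_square(value: int) -> bool:
    """True if value is a non-negative perfect square."""
    if value < 0:
        return False
    root = math.isqrt(value)
    return root * root == value


# prm800k-001: brute-force search for n with n^2 + 12n - 2007 a perfect square.
solutions = [n for n in range(1, 5000) if is_perfect_square(n * n + 12 * n - 2007)]
print(solutions, sum(solutions))  # [112, 336, 1016] 1464

# prm800k-002: probability of two reds from 4 red and 6 blue, without replacement.
both_red = Fraction(4, 10) * Fraction(3, 9)
print(both_red)  # 2/15

print(split_steps("Step 1: first\nStep 2: second"))
# ['Step 1: first', 'Step 2: second']
```

Both recomputed values agree with the `final_answer` fields, so an annotator following the protocol would find no incorrect step in either of the two shown items.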

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/prm800k-step-verification
potato start config.yaml

Dataset & paper

Lightman et al., ICLR 2024

Citation (BibTeX)

bibtex

@inproceedings{lightman2024lets,
  title={Let's Verify Step by Step},
  author={Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}

Details

Annotation Types

radio, text

Domain

Mathematics, Reasoning

Use Cases

Process Reward Models, RLHF, Math Verification

Related Designs

Math Question Answering and Category Classification

Mathematical question answering with category classification, covering algebra, geometry, number theory, and statistics. Based on SemEval-2019 Task 10 (Math QA).

text, radio

AlpacaEval: Instruction-Following Preference Evaluation

Pairwise preference annotation for instruction-following language models. Annotators compare two model responses side by side, select their preferred response, indicate preference strength, and rate individual response quality across diverse instruction categories.

pairwise, radio

CodePRM Code Process Reward

Process reward annotation for step-by-step code generation with execution feedback. Annotators verify each incremental code generation step against problem requirements, check syntactic validity, and provide feedback informed by test execution results.