PRM800K Step-by-Step Verification
Step-by-step verification of math reasoning chains, based on OpenAI's PRM800K dataset. Annotators rate each reasoning step as Correct, Neutral, or Incorrect, stopping at the first error. They then judge whether the final answer is correct and explain any error they found.
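The stop-at-first-error rule can be sketched in a few lines of Python. This is only an illustration and not part of the Potato configuration: the function name `truncate_at_first_error` and the short label strings are hypothetical, chosen to mirror the Correct / Neutral / Incorrect choices described above.

```python
from typing import List

# Short label names assumed for illustration; the config uses longer strings.
VALID_LABELS = {"Correct", "Neutral", "Incorrect"}


def truncate_at_first_error(step_labels: List[str]) -> List[str]:
    """Keep labels up to and including the first 'Incorrect' step."""
    kept = []
    for label in step_labels:
        if label not in VALID_LABELS:
            raise ValueError(f"unknown label: {label!r}")
        kept.append(label)
        if label == "Incorrect":
            break  # steps after the first error are not rated
    return kept


if __name__ == "__main__":
    print(truncate_at_first_error(["Correct", "Neutral", "Incorrect", "Correct"]))
```

A chain with no error keeps all of its labels; a chain with an error ends at the first Incorrect step.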
Configuration File (config.yaml)
# PRM800K Step-by-Step Verification
# Based on "Let's Verify Step by Step" (Lightman et al., ICLR 2024)
annotation_task_name: "PRM800K Step-by-Step Verification"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="max-width: 800px; margin: 0 auto; font-family: 'Segoe UI', Arial, sans-serif;">
    <div style="background: #e8f4fd; border: 1px solid #b3d9f2; border-radius: 8px; padding: 18px 22px; margin-bottom: 20px;">
      <h3 style="margin: 0 0 10px 0; color: #1a5276; font-size: 16px;">Math Problem</h3>
      <div style="font-size: 15px; color: #2c3e50; line-height: 1.6;">{{text}}</div>
    </div>
    <div style="margin-bottom: 20px;">
      <h3 style="margin: 0 0 12px 0; color: #2c3e50; font-size: 15px;">Reasoning Steps</h3>
      <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.7;">{{steps}}</div>
    </div>
    <div style="background: #fef9e7; border: 2px solid #f9e79f; border-radius: 8px; padding: 14px 20px; margin-bottom: 10px;">
      <h3 style="margin: 0 0 6px 0; color: #7d6608; font-size: 15px;">Final Answer</h3>
      <div style="font-size: 16px; font-weight: 600; color: #1a5276;">{{final_answer}}</div>
    </div>
  </div>
annotation_schemes:
  - name: "step_correctness"
    annotation_type: radio
    description: "Rate each reasoning step. Stop and mark 'Incorrect' at the first error you find."
    labels:
      - "Correct — this step is logically valid"
      - "Neutral — this step doesn't clearly advance or harm"
      - "Incorrect — this step contains an error"
    keyboard_shortcuts:
      "Correct — this step is logically valid": "1"
      "Neutral — this step doesn't clearly advance or harm": "2"
      "Incorrect — this step contains an error": "3"
  - name: "solution_correctness"
    annotation_type: radio
    description: "Is the final answer correct?"
    labels:
      - "Correct — final answer is right"
      - "Incorrect — final answer is wrong"
    keyboard_shortcuts:
      "Correct — final answer is right": "4"
      "Incorrect — final answer is wrong": "5"
  - name: "error_explanation"
    annotation_type: text
    description: "If you found an error, explain what went wrong in that step."
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
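The layout above references three item fields through `{{...}}` placeholders, and `id_key` names a fourth. A data item missing one of them would presumably render with a gap. The sketch below checks an item against those names. It assumes each placeholder is filled from the data field of the same name, which is what the sample data suggests; `missing_fields` is a hypothetical helper and not a Potato function, and `LAYOUT` is a shortened stand-in for the real `html_layout`.

```python
import re
from typing import Dict, List

# Shortened stand-in for html_layout; only the placeholders matter here.
LAYOUT = "<div>{{text}}</div><div>{{steps}}</div><div>{{final_answer}}</div>"
ID_KEY = "id"  # mirrors item_properties.id_key in the config


def missing_fields(item: Dict[str, str], layout: str = LAYOUT) -> List[str]:
    """Return field names the layout or id_key needs but the item lacks."""
    placeholders = re.findall(r"\{\{\s*(\w+)\s*\}\}", layout)
    required = [ID_KEY] + placeholders
    return [name for name in required if name not in item]


if __name__ == "__main__":
    incomplete = {"id": "prm800k-002", "text": "A bag contains 4 red balls..."}
    print(missing_fields(incomplete))
```

Running such a check over every item in the data file before starting the server is one way to catch a misspelled key early.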
Sample Data (sample-data.json)
[
  {
    "id": "prm800k-001",
    "text": "Find the sum of all positive integers n such that n^2 + 12n - 2007 is a perfect square.",
    "steps": "Step 1: We need n^2 + 12n - 2007 = k^2 for some non-negative integer k.\nStep 2: Complete the square on the left: (n + 6)^2 - 36 - 2007 = k^2, so (n + 6)^2 - 2043 = k^2.\nStep 3: Rearrange to get (n + 6)^2 - k^2 = 2043, which factors as (n + 6 - k)(n + 6 + k) = 2043.\nStep 4: Factor 2043 = 3 × 681 = 3 × 3 × 227. So 2043 = 1 × 2043 = 3 × 681 = 9 × 227.\nStep 5: For each factorization 2043 = a × b with a ≤ b and a, b same parity, we get n + 6 - k = a and n + 6 + k = b, so n = (a + b)/2 - 6.\nStep 6: From (1, 2043): n = (1 + 2043)/2 - 6 = 1022 - 6 = 1016. From (3, 681): n = (3 + 681)/2 - 6 = 342 - 6 = 336. From (9, 227): n = (9 + 227)/2 - 6 = 118 - 6 = 112.",
    "final_answer": "The sum is 1016 + 336 + 112 = 1464."
  },
  {
    "id": "prm800k-002",
    "text": "A bag contains 4 red balls and 6 blue balls. Two balls are drawn without replacement. What is the probability that both balls are red?",
    "steps": "Step 1: The total number of balls is 4 + 6 = 10.\nStep 2: The probability of drawing a red ball first is 4/10 = 2/5.\nStep 3: After drawing one red ball, there are 3 red balls and 6 blue balls left, for 9 total.\nStep 4: The probability of drawing a second red ball is 3/9 = 1/3.\nStep 5: The probability of both events is (2/5) × (1/3) = 2/15.",
    "final_answer": "The probability is 2/15."
  }
]
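Each item stores its reasoning chain as one string, with steps separated by `\n` and prefixed `Step N:`. For per-step processing it can help to split that string into a list. The following is a minimal sketch under that assumption about the format; `split_steps` is a hypothetical helper, not something shipped with the design.

```python
import re
from typing import List


def split_steps(steps: str) -> List[str]:
    """Split a newline-joined 'Step N: ...' string into step texts."""
    parts = []
    for line in steps.split("\n"):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Drop the leading "Step N:" prefix if present.
        parts.append(re.sub(r"^Step\s+\d+:\s*", "", line))
    return parts


if __name__ == "__main__":
    example = (
        "Step 1: The total number of balls is 4 + 6 = 10.\n"
        "Step 2: The probability of drawing a red ball first is 4/10 = 2/5."
    )
    for number, text in enumerate(split_steps(example), start=1):
        print(number, text)
```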
(6 more items omitted)
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/prm800k-step-verification
potato start config.yaml
Related Designs
Math Question Answering and Category Classification
Mathematical question answering with category classification, covering algebra, geometry, number theory, and statistics. Based on SemEval-2019 Task 10 (Math QA).
AlpacaEval: Instruction-Following Preference Evaluation
Pairwise preference annotation for instruction-following language models. Annotators compare two model responses side by side, select their preferred response, indicate preference strength, and rate individual response quality across diverse instruction categories.
CodePRM Code Process Reward
Process reward annotation for step-by-step code generation with execution feedback. Annotators verify each incremental code generation step against problem requirements, check syntactic validity, and provide feedback informed by test execution results.