SWE-PRM Process Reward Labels for Coding Agents
Step-level process reward dataset for coding agents from the SWE-PRM paper (Gandhi et al., 2025). The Potato config reproduces per-step correctness rating, error typing, and explanation over SWE-bench Verified traces.
About this dataset
SWE-PRM comes from "When Agents go Astray: Course-Correcting SWE Agents with PRMs" by Shubham Gandhi, Jason Tsay, Jatin Ganhotra, Kiran Kate, and Yara Rizk (arXiv:2509.02360, 2025). The work trains a process reward model (PRM) that scores an agent's actions as it resolves GitHub issues, rather than judging only the final patch.
The data is built on SWE-bench Verified, the 500-task human-validated subset of SWE-bench. Each item is a full agent trajectory: the GitHub issue, the ordered sequence of agent actions (file edits, shell commands, test runs), and the test outcome. Steps are labeled individually so reward signal is available before a task finishes.
The labeling task is step-level process reward. Annotators read each action in a trajectory, mark the first step where the agent goes astray, and tag the inefficiency. The paper organizes feedback around a taxonomy of common inefficiencies such as redundant exploration, looping, and failing to terminate after the fix is complete. Using the PRM for mid-run course correction raised the resolution rate on SWE-bench Verified from 40.0% to 50.6%.
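As a rough illustration of "mark where the agent first goes astray", the sketch below locates the first step rated incorrect in a list of per-step ratings. The short label names and both helper functions are hypothetical and chosen for this example; they are not taken from the paper or from Potato.

```python
from typing import Optional, Sequence

# Hypothetical short label names, standing in for the three rating options.
CORRECT, NEUTRAL, INCORRECT = "correct", "neutral", "incorrect"


def first_astray_step(labels: Sequence[str]) -> Optional[int]:
    """Return the 1-based index of the first step rated incorrect, or None."""
    for position, label in enumerate(labels, start=1):
        if label == INCORRECT:
            return position
    return None


def steps_before_error(labels: Sequence[str]) -> int:
    """Count the steps taken before the agent first goes astray."""
    astray = first_astray_step(labels)
    return len(labels) if astray is None else astray - 1


ratings = [CORRECT, CORRECT, NEUTRAL, INCORRECT, CORRECT]
print(first_astray_step(ratings))   # 4
print(steps_before_error(ratings))  # 3
```

Neutral steps are treated here as not yet astray; a stricter reading could count them as the start of the inefficiency instead.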
The Potato config below reproduces this annotation workflow. It shows the issue, the agent step trace, and the test results, then collects a per-step correctness radio rating, a multiselect of error types (logic error, API misuse, wrong file targeted, incomplete edit, redundant action, test error), and a free-text explanation of where the agent went wrong.
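The sample items on this page store the whole step trace as one string, with each step opening on a line of the form "Step N: [Kind] action". A minimal sketch for splitting such a string into individual steps is shown here, assuming that layout holds for every item; the `Step` class and `split_steps` function are illustrative names, not part of Potato or the dataset.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Assumed header layout: "Step <n>: [<Kind>] <action text>".
STEP_HEADER = re.compile(r"^Step (\d+): \[(\w+)\] (.*)$")


@dataclass
class Step:
    index: int
    kind: str
    action: str
    details: List[str] = field(default_factory=list)


def split_steps(trace: str) -> List[Step]:
    """Split a trace string into steps; non-header lines attach to the last step."""
    steps: List[Step] = []
    for line in trace.splitlines():
        match = STEP_HEADER.match(line)
        if match:
            steps.append(Step(int(match.group(1)), match.group(2), match.group(3)))
        elif steps and line.strip():
            steps[-1].details.append(line.strip())
    return steps


trace = (
    "Step 1: [Search] grep -r 'def defer' django/db/models/query.py\n"
    " Found: def defer(self, *fields) at line 1142\n\n"
    "Step 2: [Read] Open django/db/models/query.py lines 1142-1175\n"
    " - defer() adds fields to deferred_loading set"
)
steps = split_steps(trace)
print([(s.index, s.kind) for s in steps])  # [(1, 'Search'), (2, 'Read')]
```

Splitting the trace this way makes it possible to attach one rating per step, which the single free-form string does not allow on its own.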
- Source paper: Gandhi et al., arXiv:2509.02360 (2025)
- Benchmark: SWE-bench Verified
- Tasks in SWE-bench Verified: 500 human-validated
- Annotation unit: Per-step agent action
- Resolution gain with PRM: 40.0% -> 50.6%
- Inference cost: as low as $0.2 per task
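To put the resolution gain in concrete terms, the arithmetic below converts the two reported rates into percentage points, relative improvement, and task counts. It assumes both rates are measured over all 500 SWE-bench Verified tasks, which this page does not state explicitly.

```python
TASKS = 500      # size of SWE-bench Verified
BASELINE = 40.0  # resolution rate without the PRM, in percent
WITH_PRM = 50.6  # resolution rate with PRM course correction, in percent

absolute_gain = WITH_PRM - BASELINE             # percentage points
relative_gain = absolute_gain / BASELINE * 100  # percent of the baseline rate
extra_tasks = round(TASKS * absolute_gain / 100)

print(f"{absolute_gain:.1f} points, {relative_gain:.1f}% relative, {extra_tasks} more tasks")
# 10.6 points, 26.5% relative, 53 more tasks
```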
Configuration File: config.yaml
This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.
# SWE-PRM Coding Process Reward
# Based on "When Agents go Astray: Course-Correcting SWE Agents with PRMs" (Gandhi et al., arXiv 2025)
annotation_task_name: "SWE-PRM Coding Process Reward"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="max-width: 850px; margin: 0 auto; font-family: 'Segoe UI', Arial, sans-serif;">
    <div style="background: #e6f9e6; border: 1px solid #a3d9a3; border-radius: 8px; padding: 18px 22px; margin-bottom: 18px;">
      <h3 style="margin: 0 0 6px 0; color: #1a6b1a; font-size: 15px;">GitHub Issue — <span style="font-weight: normal; color: #555;">{{repo_name}}</span></h3>
      <div style="font-size: 14px; color: #2c3e50; line-height: 1.6;">{{text}}</div>
    </div>
    <div style="margin-bottom: 18px;">
      <h3 style="margin: 0 0 12px 0; color: #2c3e50; font-size: 15px;">Agent Steps</h3>
      <div style="background: #f8f9fa; border: 1px solid #dee2e6; border-radius: 6px; padding: 14px 18px; font-family: 'Courier New', monospace; font-size: 13px; white-space: pre-wrap; line-height: 1.6;">{{steps}}</div>
    </div>
    <div style="background: #1e1e1e; border-radius: 8px; padding: 16px 20px; margin-bottom: 10px;">
      <h3 style="margin: 0 0 8px 0; color: #8bc34a; font-size: 14px; font-family: 'Courier New', monospace;">Test Results</h3>
      <div style="font-family: 'Courier New', monospace; font-size: 13px; color: #4ec94e; white-space: pre-wrap; line-height: 1.5;">{{test_results}}</div>
    </div>
  </div>
annotation_schemes:
  - name: "step_correctness"
    annotation_type: radio
    description: "Rate each coding agent step. Identify the first step where the agent goes astray."
    labels:
      - "Correct — this action is appropriate"
      - "Neutral — action doesn't help or hurt"
      - "Incorrect — this action is wrong or counterproductive"
    keyboard_shortcuts:
      "Correct — this action is appropriate": "1"
      "Neutral — action doesn't help or hurt": "2"
      "Incorrect — this action is wrong or counterproductive": "3"
  - name: "error_type"
    annotation_type: multiselect
    description: "Select all error types that apply to the agent trace."
    labels:
      - "Logic Error"
      - "API Misuse"
      - "Wrong File Targeted"
      - "Incomplete Edit"
      - "Redundant Action"
      - "Test Error"
      - "Correct — No Error"
  - name: "error_explanation"
    annotation_type: text
    description: "Explain where and why the agent went astray."
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
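Two slips are easy to make when adapting a config like this one: a `{{placeholder}}` in `html_layout` with no matching field in the data, and a label with no entry under `keyboard_shortcuts`. The sketch below checks for both, assuming the config and a data item have already been loaded into plain Python dicts (for example with a YAML and a JSON parser). The helper names are hypothetical, and nothing here calls Potato itself or describes what Potato validates on its own.

```python
import re
from typing import Any, Dict, List, Set

# Matches template fields such as {{repo_name}} or {{ steps }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def missing_fields(layout: str, item: Dict[str, Any]) -> Set[str]:
    """Placeholders used in the layout that the data item does not define."""
    return set(PLACEHOLDER.findall(layout)) - set(item)


def labels_without_shortcut(scheme: Dict[str, Any]) -> List[str]:
    """Labels of one annotation scheme that have no keyboard shortcut."""
    shortcuts = scheme.get("keyboard_shortcuts", {})
    return [label for label in scheme.get("labels", []) if label not in shortcuts]


# Illustrative inputs, shortened from the config and sample data on this page.
layout = "<h3>{{repo_name}}</h3><div>{{text}}</div><div>{{steps}}</div><div>{{test_results}}</div>"
item = {"id": "swe-prm-001", "text": "...", "repo_name": "django/django", "steps": "..."}
scheme = {
    "labels": ["Correct", "Neutral", "Incorrect"],
    "keyboard_shortcuts": {"Correct": "1", "Neutral": "2"},
}

print(sorted(missing_fields(layout, item)))  # ['test_results']
print(labels_without_shortcut(scheme))       # ['Incorrect']
```

Schemes without a `keyboard_shortcuts` block, such as `error_type` in the config above, would report every label; whether that counts as a problem depends on how the task is meant to be used.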
Sample Data: sample-data.json
[
  {
    "id": "swe-prm-001",
    "text": "QuerySet.defer() doesn't clear deferred fields when chaining with .only(). Calling Model.objects.defer('field_a').only('field_b') should load only 'field_b', but it loads both 'field_a' and 'field_b'.",
    "repo_name": "django/django",
    "steps": "Step 1: [Search] grep -r 'def defer' django/db/models/query.py\n Found: def defer(self, *fields) at line 1142\n\nStep 2: [Read] Open django/db/models/query.py lines 1142-1175\n - defer() adds fields to deferred_loading set\n - only() sets deferred_loading to all fields minus the specified ones\n\nStep 3: [Edit] Modify django/db/models/query.py line 1158\n- old: existing.update(fields)\n+ new: if self.query.deferred_loading[1]:\n+ existing = set(fields)\n+ else:\n+ existing.update(fields)\n\nStep 4: [Edit] Modify django/db/models/query.py line 1170 — update only() to clear prior defer() fields\n- old: self.query.deferred_loading = (frozenset(fields), True)\n+ new: self.query.deferred_loading = (frozenset(), False)\n+ self.query.add_deferred_loading(fields)",
    "test_results": "$ python -m pytest tests/defer/tests.py -x\nRan 24 tests in 0.45s\nFAILED (failures=2)\ntest_defer_only_chaining ... FAIL\ntest_only_defer_chaining ... FAIL"
  },
  {
    "id": "swe-prm-002",
    "text": "Flask app.url_for() raises BuildError when using SERVER_NAME with subdomains. Setting SERVER_NAME='example.com' and requesting url_for('api.get_users') fails even though the blueprint is registered correctly.",
    "repo_name": "pallets/flask",
    "steps": "Step 1: [Search] grep -r 'def url_for' src/flask/helpers.py\n Found: def url_for(endpoint, **values) at line 187\n\nStep 2: [Read] Open src/flask/helpers.py lines 187-245\n - url_for calls adapter.build(), which delegates to werkzeug's MapAdapter\n\nStep 3: [Search] grep -r 'SERVER_NAME' src/flask/app.py\n Found references at lines 312, 567, 891\n\nStep 4: [Read] Open src/flask/app.py lines 560-580\n - SERVER_NAME is used to set server_name on the Map object\n - Subdomain matching compares request host vs server_name\n\nStep 5: [Edit] Modify src/flask/helpers.py line 221\n- old: rv = appctx.url_adapter.build(endpoint, values)\n+ new: rv = appctx.url_adapter.build(endpoint, values, force_external=False)",
    "test_results": "$ python -m pytest tests/test_helpers.py::test_url_for_with_subdomain -v\ntest_url_for_with_subdomain PASSED\ntest_url_for_external PASSED\ntest_url_for_basic PASSED\nRan 3 tests in 0.12s\nOK"
  }
]
// ... and 6 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/swe-prm-coding-process-reward
potato start config.yaml
Dataset & paper
Gandhi et al., arXiv 2025
Citation (BibTeX)
@article{gandhi2025agents,
  title={When Agents go Astray: Course-Correcting SWE Agents with PRMs},
  author={Gandhi, Shubham and Tsay, Jason and Ganhotra, Jatin and Kate, Kiran and Rizk, Yara},
  journal={arXiv preprint arXiv:2509.02360},
  year={2025}
}