advanced, preference

SWE-PRM Process Reward Labels for Coding Agents

Step-level process reward dataset for coding agents from the SWE-PRM paper (Gandhi et al., 2025). The Potato config reproduces per-step correctness rating, error typing, and explanation over SWE-bench Verified traces.

About this dataset

SWE-PRM comes from "When Agents go Astray: Course-Correcting SWE Agents with PRMs" by Shubham Gandhi, Jason Tsay, Jatin Ganhotra, Kiran Kate, and Yara Rizk (arXiv:2509.02360, 2025). The work introduces a process reward model (PRM) that scores an agent's actions while it is resolving a GitHub issue, rather than judging only the final patch.

The data is built on SWE-bench Verified, the 500-task human-validated subset of SWE-bench. Each item is a full agent trajectory: the GitHub issue, the ordered sequence of agent actions (file edits, shell commands, test runs), and the test outcome. Steps are labeled individually so reward signal is available before a task finishes.
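
One way to picture such an item in code is sketched below. The class and field names (`Step`, `Trajectory`, `label`, `test_outcome`) are illustrative assumptions, not the dataset's actual schema. The point is only that labels attach to individual steps, so some reward signal exists while `test_outcome` is still unknown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    index: int                    # 1-based position in the trajectory
    action: str                   # file edit, shell command, or test run
    label: Optional[str] = None   # step-level label; None until annotated


@dataclass
class Trajectory:
    issue: str
    steps: List[Step] = field(default_factory=list)
    test_outcome: Optional[str] = None  # None while the run is still in progress

    def labeled_so_far(self) -> List[Step]:
        """Return the steps that already carry a label."""
        return [s for s in self.steps if s.label is not None]


# Hypothetical in-progress run: one labeled step, one unlabeled, no test outcome yet.
traj = Trajectory(issue="defer() followed by only() loads both fields")
traj.steps.append(Step(1, "grep -r 'def defer' django/db/models/query.py", label="correct"))
traj.steps.append(Step(2, "edit django/db/models/query.py"))
```

Here `traj.labeled_so_far()` already yields a usable signal for step 1 even though the run has not finished.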

The labeling task is step-level process reward. Annotators read each action in a trajectory, mark where the agent first goes astray, and tag the type of inefficiency. The paper organizes feedback around a taxonomy of common inefficiencies such as redundant exploration, looping, and failing to terminate after the fix is complete. Using the PRM for mid-run course correction raised the resolution rate on SWE-bench Verified from 40.0 percent to 50.6 percent, a gain of 10.6 percentage points.
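
A "first step that goes astray" mark can be expanded into per-step labels in several ways. The sketch below uses one common convention, which is an assumption here and not necessarily what the paper does: steps before the first error count as correct, the error step counts as incorrect, and later steps are left unlabeled because they build on a faulty state. The function name `labels_from_first_error` is hypothetical.

```python
from typing import List, Optional


def labels_from_first_error(num_steps: int, first_error: Optional[int]) -> List[Optional[str]]:
    """Expand a 1-based first-error index into one label per step.

    Assumed convention: steps before the error are "correct", the error
    step is "incorrect", and every later step is None (unlabeled).
    Pass first_error=None when the annotator found no error at all.
    """
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    if first_error is None:
        return ["correct"] * num_steps
    if not 1 <= first_error <= num_steps:
        raise ValueError("first_error must be between 1 and num_steps")
    before = ["correct"] * (first_error - 1)
    after = [None] * (num_steps - first_error)
    return before + ["incorrect"] + after


# Hypothetical four-step trajectory whose first error is at step 3.
labels = labels_from_first_error(4, 3)
```

Leaving later steps unlabeled avoids penalizing actions that would have been reasonable had the earlier mistake not occurred. A stricter scheme could mark them incorrect instead.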

The Potato config below reproduces this annotation workflow. It shows the issue, the agent step trace, and the test results, then collects a per-step correctness radio rating, a multiselect of error types (logic error, API misuse, wrong file targeted, incomplete edit, redundant action, test error, plus a no-error option), and a free-text explanation of where the agent went wrong.
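
Because the three fields are collected independently, their answers can contradict each other. The sketch below is a post-hoc consistency check under assumed rules that the config itself does not enforce: the no-error option excludes every other error type, and an "Incorrect" rating needs at least one error type and a non-empty explanation. The function name and the example values are hypothetical.

```python
from typing import List


def check_annotation(correctness: str, error_types: List[str], explanation: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    # The no-error option is recognized by its suffix so the exact label text can vary.
    no_error = [t for t in error_types if t.endswith("No Error")]
    real_errors = [t for t in error_types if not t.endswith("No Error")]

    if no_error and real_errors:
        problems.append("no-error option selected together with error types")
    if correctness.startswith("Incorrect"):
        if not real_errors:
            problems.append("incorrect rating without an error type")
        if not explanation.strip():
            problems.append("incorrect rating without an explanation")
    return problems


# Illustrative record, not taken from the dataset.
problems = check_annotation(
    correctness="Incorrect",
    error_types=["Logic Error"],
    explanation="Step 4 overwrites state that step 3 had just set.",
)
```

A check like this would run over exported annotations rather than inside the annotation interface.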

Source paper: Gandhi et al., arXiv:2509.02360 (2025)
Benchmark: SWE-bench Verified
Tasks in SWE-bench Verified: 500 human-validated
Annotation unit: Per-step agent action
Resolution rate with PRM: 40.0% -> 50.6% (+10.6 percentage points)
Inference cost: as low as $0.2 per task

Configuration File: config.yaml

This Potato config reproduces the annotation task. Save it as config.yaml in the same directory as sample-data.json, then run potato start config.yaml to try it.

yaml

# SWE-PRM Coding Process Reward
# Based on "When Agents go Astray: Course-Correcting SWE Agents with PRMs" (Gandhi et al., arXiv 2025)

annotation_task_name: "SWE-PRM Coding Process Reward"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

html_layout: |
  <div class="container" style="max-width: 850px; margin: 0 auto; font-family: 'Segoe UI', Arial, sans-serif;">
    <div style="background: #e6f9e6; border: 1px solid #a3d9a3; border-radius: 8px; padding: 18px 22px; margin-bottom: 18px;">
      <h3 style="margin: 0 0 6px 0; color: #1a6b1a; font-size: 15px;">GitHub Issue — <span style="font-weight: normal; color: #555;">{{repo_name}}</span></h3>
      <div style="font-size: 14px; color: #2c3e50; line-height: 1.6;">{{text}}</div>
    </div>
    <div style="margin-bottom: 18px;">
      <h3 style="margin: 0 0 12px 0; color: #2c3e50; font-size: 15px;">Agent Steps</h3>
      <div style="background: #f8f9fa; border: 1px solid #dee2e6; border-radius: 6px; padding: 14px 18px; font-family: 'Courier New', monospace; font-size: 13px; white-space: pre-wrap; line-height: 1.6;">{{steps}}</div>
    </div>
    <div style="background: #1e1e1e; border-radius: 8px; padding: 16px 20px; margin-bottom: 10px;">
      <h3 style="margin: 0 0 8px 0; color: #8bc34a; font-size: 14px; font-family: 'Courier New', monospace;">Test Results</h3>
      <div style="font-family: 'Courier New', monospace; font-size: 13px; color: #4ec94e; white-space: pre-wrap; line-height: 1.5;">{{test_results}}</div>
    </div>
  </div>

annotation_schemes:
  - name: "step_correctness"
    annotation_type: radio
    description: "Rate each coding agent step. Identify the first step where the agent goes astray."
    labels:
      - "Correct — this action is appropriate"
      - "Neutral — action doesn't help or hurt"
      - "Incorrect — this action is wrong or counterproductive"
    keyboard_shortcuts:
      "Correct — this action is appropriate": "1"
      "Neutral — action doesn't help or hurt": "2"
      "Incorrect — this action is wrong or counterproductive": "3"

  - name: "error_type"
    annotation_type: multiselect
    description: "Select all error types that apply to the agent trace."
    labels:
      - "Logic Error"
      - "API Misuse"
      - "Wrong File Targeted"
      - "Incomplete Edit"
      - "Redundant Action"
      - "Test Error"
      - "Correct — No Error"

  - name: "error_explanation"
    annotation_type: text
    description: "Explain where and why the agent went astray."

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
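
With `annotation_per_instance: 2`, each item presumably ends up with two independent `step_correctness` ratings, which makes an agreement check possible. The sketch below computes Cohen's kappa over two aligned label lists. Loading the labels from Potato's output files is left out, and the short label strings are placeholders rather than the exact label text from the config.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Fraction of items on which the two annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Placeholder ratings for four items from two hypothetical annotators.
annotator_1 = ["correct", "correct", "incorrect", "neutral"]
annotator_2 = ["correct", "incorrect", "incorrect", "neutral"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

For this toy input the annotators agree on three of four items, and kappa works out to 7/11 (about 0.64) after the chance correction.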

Sample Data: sample-data.json

json

[
  {
    "id": "swe-prm-001",
    "text": "QuerySet.defer() doesn't clear deferred fields when chaining with .only(). Calling Model.objects.defer('field_a').only('field_b') should load only 'field_b', but it loads both 'field_a' and 'field_b'.",
    "repo_name": "django/django",
    "steps": "Step 1: [Search] grep -r 'def defer' django/db/models/query.py\n  Found: def defer(self, *fields) at line 1142\n\nStep 2: [Read] Open django/db/models/query.py lines 1142-1175\n  - defer() adds fields to deferred_loading set\n  - only() sets deferred_loading to all fields minus the specified ones\n\nStep 3: [Edit] Modify django/db/models/query.py line 1158\n- old: existing.update(fields)\n+ new: if self.query.deferred_loading[1]:\n+          existing = set(fields)\n+      else:\n+          existing.update(fields)\n\nStep 4: [Edit] Modify django/db/models/query.py line 1170 — update only() to clear prior defer() fields\n- old: self.query.deferred_loading = (frozenset(fields), True)\n+ new: self.query.deferred_loading = (frozenset(), False)\n+      self.query.add_deferred_loading(fields)",
    "test_results": "$ python -m pytest tests/defer/tests.py -x\nRan 24 tests in 0.45s\nFAILED (failures=2)\ntest_defer_only_chaining ... FAIL\ntest_only_defer_chaining ... FAIL"
  },
  {
    "id": "swe-prm-002",
    "text": "Flask app.url_for() raises BuildError when using SERVER_NAME with subdomains. Setting SERVER_NAME='example.com' and requesting url_for('api.get_users') fails even though the blueprint is registered correctly.",
    "repo_name": "pallets/flask",
    "steps": "Step 1: [Search] grep -r 'def url_for' src/flask/helpers.py\n  Found: def url_for(endpoint, **values) at line 187\n\nStep 2: [Read] Open src/flask/helpers.py lines 187-245\n  - url_for calls adapter.build(), which delegates to werkzeug's MapAdapter\n\nStep 3: [Search] grep -r 'SERVER_NAME' src/flask/app.py\n  Found references at lines 312, 567, 891\n\nStep 4: [Read] Open src/flask/app.py lines 560-580\n  - SERVER_NAME is used to set server_name on the Map object\n  - Subdomain matching compares request host vs server_name\n\nStep 5: [Edit] Modify src/flask/helpers.py line 221\n- old: rv = appctx.url_adapter.build(endpoint, values)\n+ new: rv = appctx.url_adapter.build(endpoint, values, force_external=False)",
    "test_results": "$ python -m pytest tests/test_helpers.py::test_url_for_with_subdomain -v\ntest_url_for_with_subdomain PASSED\ntest_url_for_external PASSED\ntest_url_for_basic PASSED\nRan 3 tests in 0.12s\nOK"
  }
]

// ... and 6 more items
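
Each item stores its whole trace as one `steps` string, while the annotation unit is a single step. The sketch below splits that string on the `Step N: [Kind]` headers visible in the samples above. It assumes every step follows that header pattern. The helper name `split_steps` and the returned keys are hypothetical.

```python
import re
from typing import Dict, List

# Matches headers such as "Step 3: [Edit] " at the start of a line.
STEP_HEADER = re.compile(r"^Step (\d+): \[(\w+)\] ?", re.MULTILINE)


def split_steps(steps: str) -> List[Dict[str, object]]:
    """Split a trace string into one record per step: index, kind, and body text."""
    matches = list(STEP_HEADER.finditer(steps))
    parsed = []
    for i, match in enumerate(matches):
        # A step's body runs until the next header, or to the end of the string.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(steps)
        parsed.append({
            "index": int(match.group(1)),
            "kind": match.group(2),
            "body": steps[match.end():end].strip(),
        })
    return parsed


# First two steps of the swe-prm-001 sample, shortened.
sample = (
    "Step 1: [Search] grep -r 'def defer' django/db/models/query.py\n"
    "  Found: def defer(self, *fields) at line 1142\n\n"
    "Step 2: [Read] Open django/db/models/query.py lines 1142-1175\n"
    "  - defer() adds fields to deferred_loading set"
)
parsed = split_steps(sample)
```

Splitting the trace this way gives each step its own record, which is the granularity a per-step rating needs.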

Get This Design

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/swe-prm-coding-process-reward
potato start config.yaml

Dataset & paper

Gandhi et al., arXiv 2025

Citation (BibTeX)

bibtex

@article{gandhi2025agents,
  title={When Agents go Astray: Course-Correcting SWE Agents with PRMs},
  author={Gandhi, Shubham and Tsay, Jason and Ganhotra, Jatin and Kate, Kiran and Rizk, Yara},
  journal={arXiv preprint arXiv:2509.02360},
  year={2025}
}

Details

Annotation Types

radio, multiselect, text

Domain

Software Engineering, Code

Use Cases

Process Reward Models, RLHF, Agent Evaluation

Related Designs

RefactorBench Multi-File Evaluation

Evaluate multi-file refactoring operations generated by coding agents. Annotators assess whether refactorings preserve behavior, identify the types of refactoring applied, rate code improvement, and provide detailed review comments.

radio, multiselect

SWE-Bench+ Patch Screening

Screen and compare model-generated patches against gold patches for SWE-Bench+ instances. Annotators evaluate correctness, identify specific issues, and compare model vs. gold solutions side-by-side.