SWE-PRM Coding Process Reward
Process reward annotation for software engineering agent traces. Annotators verify each coding action step taken by an SWE agent when resolving GitHub issues, identifying the first step where the agent goes astray and classifying the error type.
Configuration File: config.yaml
# SWE-PRM Coding Process Reward
# Based on "When Agents go Astray: Course-Correcting SWE Agents with PRMs" (Gandhi et al., arXiv 2025)
annotation_task_name: "SWE-PRM Coding Process Reward"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="max-width: 850px; margin: 0 auto; font-family: 'Segoe UI', Arial, sans-serif;">
    <div style="background: #e6f9e6; border: 1px solid #a3d9a3; border-radius: 8px; padding: 18px 22px; margin-bottom: 18px;">
      <h3 style="margin: 0 0 6px 0; color: #1a6b1a; font-size: 15px;">GitHub Issue — <span style="font-weight: normal; color: #555;">{{repo_name}}</span></h3>
      <div style="font-size: 14px; color: #2c3e50; line-height: 1.6;">{{text}}</div>
    </div>
    <div style="margin-bottom: 18px;">
      <h3 style="margin: 0 0 12px 0; color: #2c3e50; font-size: 15px;">Agent Steps</h3>
      <div style="background: #f8f9fa; border: 1px solid #dee2e6; border-radius: 6px; padding: 14px 18px; font-family: 'Courier New', monospace; font-size: 13px; white-space: pre-wrap; line-height: 1.6;">{{steps}}</div>
    </div>
    <div style="background: #1e1e1e; border-radius: 8px; padding: 16px 20px; margin-bottom: 10px;">
      <h3 style="margin: 0 0 8px 0; color: #8bc34a; font-size: 14px; font-family: 'Courier New', monospace;">Test Results</h3>
      <div style="font-family: 'Courier New', monospace; font-size: 13px; color: #4ec94e; white-space: pre-wrap; line-height: 1.5;">{{test_results}}</div>
    </div>
  </div>
annotation_schemes:
  - name: "step_correctness"
    annotation_type: radio
    description: "Rate each coding agent step. Identify the first step where the agent goes astray."
    labels:
      - "Correct — this action is appropriate"
      - "Neutral — action doesn't help or hurt"
      - "Incorrect — this action is wrong or counterproductive"
    keyboard_shortcuts:
      "Correct — this action is appropriate": "1"
      "Neutral — action doesn't help or hurt": "2"
      "Incorrect — this action is wrong or counterproductive": "3"
  - name: "error_type"
    annotation_type: multiselect
    description: "Select all error types that apply to the agent trace."
    labels:
      - "Logic Error"
      - "API Misuse"
      - "Wrong File Targeted"
      - "Incomplete Edit"
      - "Redundant Action"
      - "Test Error"
      - "Correct — No Error"
  - name: "error_explanation"
    annotation_type: text
    description: "Explain where and why the agent went astray."
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
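Since `annotation_per_instance: 2` assigns each trace to two annotators, step-correctness labels can be cross-checked for agreement before the data is used for reward modeling. A minimal post-processing sketch in Python; the record shape below (one dict per item per annotator with the chosen `step_correctness` label) is an assumption for illustration, not Potato's exact on-disk output format:

```python
from collections import defaultdict

# Assumed simplified output shape: one record per (item, annotator) pair
# carrying the selected step_correctness label. Potato's real JSON output
# may be structured differently.
def pairwise_agreement(records):
    """Fraction of doubly-annotated items where both annotators agree."""
    by_item = defaultdict(list)
    for rec in records:
        by_item[rec["id"]].append(rec["step_correctness"])
    pairs = [labels for labels in by_item.values() if len(labels) == 2]
    if not pairs:
        return 0.0
    agree = sum(1 for a, b in pairs if a == b)
    return agree / len(pairs)

records = [
    {"id": "swe-prm-001", "annotator": "a1",
     "step_correctness": "Incorrect — this action is wrong or counterproductive"},
    {"id": "swe-prm-001", "annotator": "a2",
     "step_correctness": "Incorrect — this action is wrong or counterproductive"},
    {"id": "swe-prm-002", "annotator": "a1",
     "step_correctness": "Correct — this action is appropriate"},
    {"id": "swe-prm-002", "annotator": "a2",
     "step_correctness": "Incorrect — this action is wrong or counterproductive"},
]
print(pairwise_agreement(records))  # 0.5
```

Raw percent agreement is the simplest check; a chance-corrected statistic such as Cohen's kappa would be the natural next step for reporting.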
Sample Data: sample-data.json
[
{
"id": "swe-prm-001",
"text": "QuerySet.defer() doesn't clear deferred fields when chaining with .only(). Calling Model.objects.defer('field_a').only('field_b') should load only 'field_b', but it loads both 'field_a' and 'field_b'.",
"repo_name": "django/django",
"steps": "Step 1: [Search] grep -r 'def defer' django/db/models/query.py\n Found: def defer(self, *fields) at line 1142\n\nStep 2: [Read] Open django/db/models/query.py lines 1142-1175\n - defer() adds fields to deferred_loading set\n - only() sets deferred_loading to all fields minus the specified ones\n\nStep 3: [Edit] Modify django/db/models/query.py line 1158\n- old: existing.update(fields)\n+ new: if self.query.deferred_loading[1]:\n+ existing = set(fields)\n+ else:\n+ existing.update(fields)\n\nStep 4: [Edit] Modify django/db/models/query.py line 1170 — update only() to clear prior defer() fields\n- old: self.query.deferred_loading = (frozenset(fields), True)\n+ new: self.query.deferred_loading = (frozenset(), False)\n+ self.query.add_deferred_loading(fields)",
"test_results": "$ python -m pytest tests/defer/tests.py -x\nRan 24 tests in 0.45s\nFAILED (failures=2)\ntest_defer_only_chaining ... FAIL\ntest_only_defer_chaining ... FAIL"
},
{
"id": "swe-prm-002",
"text": "Flask app.url_for() raises BuildError when using SERVER_NAME with subdomains. Setting SERVER_NAME='example.com' and requesting url_for('api.get_users') fails even though the blueprint is registered correctly.",
"repo_name": "pallets/flask",
"steps": "Step 1: [Search] grep -r 'def url_for' src/flask/helpers.py\n Found: def url_for(endpoint, **values) at line 187\n\nStep 2: [Read] Open src/flask/helpers.py lines 187-245\n - url_for calls adapter.build(), which delegates to werkzeug's MapAdapter\n\nStep 3: [Search] grep -r 'SERVER_NAME' src/flask/app.py\n Found references at lines 312, 567, 891\n\nStep 4: [Read] Open src/flask/app.py lines 560-580\n - SERVER_NAME is used to set server_name on the Map object\n - Subdomain matching compares request host vs server_name\n\nStep 5: [Edit] Modify src/flask/helpers.py line 221\n- old: rv = appctx.url_adapter.build(endpoint, values)\n+ new: rv = appctx.url_adapter.build(endpoint, values, force_external=False)",
"test_results": "$ python -m pytest tests/test_helpers.py::test_url_for_with_subdomain -v\ntest_url_for_with_subdomain PASSED\ntest_url_for_external PASSED\ntest_url_for_basic PASSED\nRan 3 tests in 0.12s\nOK"
}
]
// ... and 6 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/swe-prm-coding-process-reward
potato start config.yaml
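Before launching, it can help to confirm that every item in sample-data.json carries the `id_key` plus all of the fields the `html_layout` template interpolates (`{{text}}`, `{{repo_name}}`, `{{steps}}`, `{{test_results}}`). A small validation sketch; `validate_items` is an illustrative helper, not part of Potato:

```python
import json

# Keys required by the config above: the configured id_key plus every
# {{...}} variable referenced in html_layout.
REQUIRED_KEYS = {"id", "text", "repo_name", "steps", "test_results"}

def validate_items(items):
    """Return a list of (item_id, missing_keys) for malformed items."""
    problems = []
    for item in items:
        missing = sorted(REQUIRED_KEYS - item.keys())
        if missing:
            problems.append((item.get("id", "<no id>"), missing))
    return problems

items = json.loads("""[
  {"id": "swe-prm-001", "text": "...", "repo_name": "django/django",
   "steps": "...", "test_results": "..."},
  {"id": "swe-prm-bad", "text": "..."}
]""")
print(validate_items(items))
# [('swe-prm-bad', ['repo_name', 'steps', 'test_results'])]
```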
Related Designs
RefactorBench Multi-File Evaluation
Evaluate multi-file refactoring operations generated by coding agents. Annotators assess whether refactorings preserve behavior, identify the types of refactoring applied, rate code improvement, and provide detailed review comments.
SWE-Bench+ Patch Screening
Screen and compare model-generated patches against gold patches for SWE-Bench+ instances. Annotators evaluate correctness, identify specific issues, and compare model vs. gold solutions side-by-side.
Pairwise Preference with Rationale
Compare two AI responses and select the better one while providing a written justification. Used for reward model training with interpretable preference signals.