advanced, preference

SWE-PRM Coding Process Reward

Process reward annotation for software engineering (SWE) agent traces. Annotators review each action an SWE agent takes while resolving a GitHub issue, identify the first step where the agent goes astray, and classify the error type.
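The quantity annotators are asked to locate can be pictured as the index of the first incorrect step. A minimal Python sketch, assuming per-step judgments are available as a list of short strings ("Correct", "Neutral", "Incorrect"); the function name and this label encoding are illustrative only and are not part of Potato or the referenced paper:

```python
from typing import Optional, Sequence


def first_astray_step(labels: Sequence[str]) -> Optional[int]:
    """Return the 1-based index of the first step labeled "Incorrect", or None."""
    for index, label in enumerate(labels, start=1):
        if label == "Incorrect":
            return index
    return None


# Hypothetical labels for a four-step trace (not taken from real annotations).
print(first_astray_step(["Correct", "Correct", "Neutral", "Incorrect"]))
```

Neutral steps are skipped here on the assumption that only an outright incorrect action counts as going astray; a stricter project could treat them differently.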


Configuration File: config.yaml

# SWE-PRM Coding Process Reward
# Based on "When Agents go Astray: Course-Correcting SWE Agents with PRMs" (Gandhi et al., arXiv 2025)

annotation_task_name: "SWE-PRM Coding Process Reward"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

html_layout: |
  <div class="container" style="max-width: 850px; margin: 0 auto; font-family: 'Segoe UI', Arial, sans-serif;">
    <div style="background: #e6f9e6; border: 1px solid #a3d9a3; border-radius: 8px; padding: 18px 22px; margin-bottom: 18px;">
      <h3 style="margin: 0 0 6px 0; color: #1a6b1a; font-size: 15px;">GitHub Issue — <span style="font-weight: normal; color: #555;">{{repo_name}}</span></h3>
      <div style="font-size: 14px; color: #2c3e50; line-height: 1.6;">{{text}}</div>
    </div>
    <div style="margin-bottom: 18px;">
      <h3 style="margin: 0 0 12px 0; color: #2c3e50; font-size: 15px;">Agent Steps</h3>
      <div style="background: #f8f9fa; border: 1px solid #dee2e6; border-radius: 6px; padding: 14px 18px; font-family: 'Courier New', monospace; font-size: 13px; white-space: pre-wrap; line-height: 1.6;">{{steps}}</div>
    </div>
    <div style="background: #1e1e1e; border-radius: 8px; padding: 16px 20px; margin-bottom: 10px;">
      <h3 style="margin: 0 0 8px 0; color: #8bc34a; font-size: 14px; font-family: 'Courier New', monospace;">Test Results</h3>
      <div style="font-family: 'Courier New', monospace; font-size: 13px; color: #4ec94e; white-space: pre-wrap; line-height: 1.5;">{{test_results}}</div>
    </div>
  </div>

annotation_schemes:
  - name: "step_correctness"
    annotation_type: radio
    description: "Rate each coding agent step. Identify the first step where the agent goes astray."
    labels:
      - "Correct — this action is appropriate"
      - "Neutral — action doesn't help or hurt"
      - "Incorrect — this action is wrong or counterproductive"
    keyboard_shortcuts:
      "Correct — this action is appropriate": "1"
      "Neutral — action doesn't help or hurt": "2"
      "Incorrect — this action is wrong or counterproductive": "3"

  - name: "error_type"
    annotation_type: multiselect
    description: "Select all error types that apply to the agent trace."
    labels:
      - "Logic Error"
      - "API Misuse"
      - "Wrong File Targeted"
      - "Incomplete Edit"
      - "Redundant Action"
      - "Test Error"
      - "Correct — No Error"

  - name: "error_explanation"
    annotation_type: text
    description: "Explain where and why the agent went astray."

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
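The html_layout in this config references four fields through {{...}} placeholders (repo_name, text, steps, test_results), so each data item needs to supply them. The following Python sketch checks one item against a layout string. It is a standalone illustration, not a Potato feature; the helper name missing_fields and the abbreviated layout and item are assumptions made for the example:

```python
import re

# Only the placeholders from the html_layout above, with the markup omitted.
HTML_LAYOUT = "{{repo_name}} {{text}} {{steps}} {{test_results}}"


def missing_fields(layout: str, item: dict) -> list:
    """List placeholder names used in the layout that the data item does not define."""
    placeholders = re.findall(r"\{\{\s*(\w+)\s*\}\}", layout)
    return [name for name in placeholders if name not in item]


# Hypothetical item that lacks the "test_results" field.
item = {"id": "example-001", "text": "issue text", "repo_name": "org/repo", "steps": "Step 1: ..."}
print(missing_fields(HTML_LAYOUT, item))  # ['test_results']
```

Running a check like this over every entry in the data file before starting an annotation session can catch items that would render with an empty panel.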

Sample Data: sample-data.json

[
  {
    "id": "swe-prm-001",
    "text": "QuerySet.defer() doesn't clear deferred fields when chaining with .only(). Calling Model.objects.defer('field_a').only('field_b') should load only 'field_b', but it loads both 'field_a' and 'field_b'.",
    "repo_name": "django/django",
    "steps": "Step 1: [Search] grep -r 'def defer' django/db/models/query.py\n  Found: def defer(self, *fields) at line 1142\n\nStep 2: [Read] Open django/db/models/query.py lines 1142-1175\n  - defer() adds fields to deferred_loading set\n  - only() sets deferred_loading to all fields minus the specified ones\n\nStep 3: [Edit] Modify django/db/models/query.py line 1158\n- old: existing.update(fields)\n+ new: if self.query.deferred_loading[1]:\n+          existing = set(fields)\n+      else:\n+          existing.update(fields)\n\nStep 4: [Edit] Modify django/db/models/query.py line 1170 — update only() to clear prior defer() fields\n- old: self.query.deferred_loading = (frozenset(fields), True)\n+ new: self.query.deferred_loading = (frozenset(), False)\n+      self.query.add_deferred_loading(fields)",
    "test_results": "$ python -m pytest tests/defer/tests.py -x\nRan 24 tests in 0.45s\nFAILED (failures=2)\ntest_defer_only_chaining ... FAIL\ntest_only_defer_chaining ... FAIL"
  },
  {
    "id": "swe-prm-002",
    "text": "Flask app.url_for() raises BuildError when using SERVER_NAME with subdomains. Setting SERVER_NAME='example.com' and requesting url_for('api.get_users') fails even though the blueprint is registered correctly.",
    "repo_name": "pallets/flask",
    "steps": "Step 1: [Search] grep -r 'def url_for' src/flask/helpers.py\n  Found: def url_for(endpoint, **values) at line 187\n\nStep 2: [Read] Open src/flask/helpers.py lines 187-245\n  - url_for calls adapter.build(), which delegates to werkzeug's MapAdapter\n\nStep 3: [Search] grep -r 'SERVER_NAME' src/flask/app.py\n  Found references at lines 312, 567, 891\n\nStep 4: [Read] Open src/flask/app.py lines 560-580\n  - SERVER_NAME is used to set server_name on the Map object\n  - Subdomain matching compares request host vs server_name\n\nStep 5: [Edit] Modify src/flask/helpers.py line 221\n- old: rv = appctx.url_adapter.build(endpoint, values)\n+ new: rv = appctx.url_adapter.build(endpoint, values, force_external=False)",
    "test_results": "$ python -m pytest tests/test_helpers.py::test_url_for_with_subdomain -v\ntest_url_for_with_subdomain PASSED\ntest_url_for_external PASSED\ntest_url_for_basic PASSED\nRan 3 tests in 0.12s\nOK"
  }
]

// ... and 6 more items
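Each "steps" value above is a single string in which steps are separated by blank lines and begin with "Step N: [Action]". For analysis outside the annotation interface, that string can be split into individual steps. This is a minimal Python sketch that assumes every item follows the same convention as the two samples shown; parse_steps and STEP_PATTERN are hypothetical names, not part of the showcase repository:

```python
import re

# Matches "Step N: [Action] body", where the body runs until the next step or the end.
STEP_PATTERN = re.compile(
    r"Step (\d+): \[(\w+)\] (.*?)(?=\n\nStep \d+: |\Z)", re.DOTALL
)


def parse_steps(steps: str) -> list:
    """Split a "steps" string into (number, action, body) tuples."""
    return [(int(n), action, body) for n, action, body in STEP_PATTERN.findall(steps)]


# Shortened from the first sample item above.
steps = (
    "Step 1: [Search] grep -r 'def defer' django/db/models/query.py\n"
    "  Found: def defer(self, *fields) at line 1142\n\n"
    "Step 2: [Read] Open django/db/models/query.py lines 1142-1175"
)
for number, action, body in parse_steps(steps):
    print(number, action, body.splitlines()[0])
```

Keeping the step number and action type separate from the body makes it straightforward to attach a per-step label to each tuple later.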

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/swe-prm-coding-process-reward
potato start config.yaml

Details

Annotation Types

radio, multiselect, text

Domain

Software Engineering, Code

Use Cases

Process Reward Models, RLHF, Agent Evaluation

Tags

process-reward-model, SWE-agent, coding, bug-fixing, agent-evaluation
