Guides · 10 min read

How to Collect Process Reward Data for Training Better Coding Agents

Step-by-step guide to collecting per-step reward signals for PRM training using Potato. Covers first-error mode, per-step annotation, and export to training pipelines.

Potato Team

What Are Process Reward Models?

Traditional outcome reward models (ORMs) evaluate only the final result of a coding agent's trajectory: did the code compile, did the tests pass, was the issue resolved? Process reward models (PRMs) go further by assigning a reward signal to each intermediate step. This fine-grained supervision enables training methods that can identify exactly where agents go wrong, leading to more sample-efficient learning and better generalization.
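The contrast can be sketched in a few lines (illustrative values only, not a real scorer):

```python
# Sketch of ORM vs. PRM supervision over one trajectory.
trajectory = ["read file", "locate bug", "apply wrong fix", "run tests"]

# An ORM sees only the outcome: one scalar for the whole trajectory.
orm_reward = 0.0  # tests failed

# A PRM scores every step, pinpointing where things went wrong.
prm_rewards = [1.0, 1.0, 0.0, 0.0]  # error introduced at step 2

# The step-level signal localizes the first mistake directly.
first_error = next((i for i, r in enumerate(prm_rewards) if r < 1.0), None)
```

With outcome-only supervision, all four steps share the blame for the failed tests; the per-step rewards isolate step 2 as the point of failure.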

Recent research has demonstrated the power of this approach. AgentPRM showed that step-level reward signals improve agent performance on SWE-bench by 12-18% compared to outcome-only supervision. ToolRM demonstrated that per-tool-call rewards help agents learn which tools to use in which contexts. DeepSWE combined process rewards with Monte Carlo tree search to achieve state-of-the-art results on complex software engineering tasks.

The bottleneck for all of these methods is high-quality, step-level human annotations. Potato's process reward annotation schemas are designed to make this data collection as efficient as possible.

Two Annotation Modes

Potato supports two PRM annotation modes that trade off speed against granularity. Choose the mode that matches your data budget and research goals.

First-Error Mode

In first-error mode, the annotator reviews the trajectory from top to bottom and clicks on the first step where the agent makes a mistake. Potato then automatically labels all preceding steps as correct and all subsequent steps (including the clicked one) as incorrect.

This mode is fast because the annotator only needs to identify a single decision point. It works well when errors tend to cascade, meaning that once an agent goes off track, the remaining steps are unlikely to recover. This is the most common pattern in practice.

yaml
annotation_schemes:
  - annotation_type: process_reward
    name: prm_first_error
    mode: "first_error"
    labels:
      correct: "Correct"
      incorrect: "Incorrect"
    description: >
      Review the agent's steps from top to bottom. Click on the
      first step where the agent makes a mistake. All steps before
      your selection will be marked correct; all steps after
      (including the selected step) will be marked incorrect.
    allow_all_correct: true
    allow_all_incorrect: true
    highlight_clicked_step: true
    auto_scroll_on_click: true
    show_step_numbers: true
    confirmation_dialog: true     # Confirm before submitting

The first-error annotation workflow looks like this:

  1. The annotator opens a trace and sees all steps rendered with the CodingTraceDisplay component.
  2. They read through the steps sequentially, examining diffs, terminal outputs, and reasoning.
  3. When they find the first incorrect step, they click the error marker next to it.
  4. Steps 0 through N-1 turn green (correct), steps N through the end turn red (incorrect).
  5. The annotator reviews the automatic labeling and clicks "Submit" to confirm.

If the entire trace is correct (the agent solved the task perfectly), the annotator clicks "All Correct." If the very first step is already wrong, they click step 0 or use "All Incorrect."
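The auto-labeling rule described above can be expressed as a small helper (an illustrative sketch, not part of Potato itself):

```python
def expand_first_error(num_steps, first_error):
    """Expand a first-error click into per-step labels: steps before
    the clicked step are 'correct'; the clicked step and everything
    after it are 'incorrect'. first_error=None means 'All Correct'."""
    if first_error is None:
        return ["correct"] * num_steps
    return (["correct"] * first_error
            + ["incorrect"] * (num_steps - first_error))
```

Clicking step 3 of a 5-step trace yields three correct labels followed by two incorrect ones; clicking step 0 reproduces the "All Incorrect" case.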

Per-Step Mode

In per-step mode, every step in the trajectory receives an independent label. This produces richer training data that captures cases where agents partially recover from errors, take unnecessary but not harmful detours, or make steps that are correct in isolation but wrong in context.

yaml
annotation_schemes:
  - annotation_type: process_reward
    name: prm_per_step
    mode: "per_step"
    labels:
      correct:
        text: "Correct"
        description: >
          This step is logically sound, makes progress toward the goal,
          and does not introduce bugs or unnecessary complexity.
        keyboard_shortcut: "1"
        color: "#22c55e"
      partially_correct:
        text: "Partially Correct"
        description: >
          This step is in the right direction but has flaws: incomplete
          fix, unnecessary side effects, suboptimal approach, or missing
          edge cases.
        keyboard_shortcut: "2"
        color: "#eab308"
      incorrect:
        text: "Incorrect"
        description: >
          This step is wrong, counterproductive, or introduces new bugs.
          The agent should not have taken this action.
        keyboard_shortcut: "3"
        color: "#ef4444"
      neutral:
        text: "Neutral"
        description: >
          This step neither helps nor hurts. Includes reading files for
          context or running diagnostic commands.
        keyboard_shortcut: "4"
        color: "#94a3b8"
    require_all_steps: true
    show_progress_bar: true
    enable_keyboard_navigation: true
    step_navigation:
      next_step: "j"
      previous_step: "k"
      next_unlabeled: "n"
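For training, the four labels must be mapped to scalar rewards. One reasonable mapping is sketched below; the specific values are an assumption, not Potato's documented export behavior, and teams often exclude neutral steps from the loss instead:

```python
# Hypothetical label-to-reward mapping for the four-label scheme.
# These values are an assumption, not Potato's documented behavior.
LABEL_REWARDS = {
    "correct": 1.0,
    "partially_correct": 0.5,
    "neutral": 0.5,  # alternatively, drop neutral steps from the loss
    "incorrect": 0.0,
}

def labels_to_rewards(step_labels):
    """Convert per-step labels into scalar rewards for PRM training."""
    return [LABEL_REWARDS[label] for label in step_labels]
```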

Setting Up a PRM Annotation Project

Step 1: Prepare Your Trace Data

Your input data should be a JSONL file where each line is a JSON object containing the agent's trajectory. The key fields are id, issue_description, and structured_turns:

json
{
  "id": "trace_001",
  "issue_description": "Fix the TypeError in parse_config() when config file is empty",
  "repo": "myorg/myproject",
  "base_commit": "a1b2c3d",
  "structured_turns": [
    {
      "step_idx": 0,
      "type": "thinking",
      "content": "I need to understand what parse_config does and where the TypeError occurs. Let me read the file first."
    },
    {
      "step_idx": 1,
      "type": "file_read",
      "path": "src/config.py",
      "content": "def parse_config(path):\n    with open(path) as f:\n        data = json.load(f)\n    return data['settings']",
      "start_line": 1,
      "end_line": 4
    },
    {
      "step_idx": 2,
      "type": "thinking",
      "content": "The bug is on line 4. When the file is empty, json.load returns None, and None['settings'] raises TypeError. I should add a check."
    },
    {
      "step_idx": 3,
      "type": "file_edit",
      "path": "src/config.py",
      "diff": "--- a/src/config.py\n+++ b/src/config.py\n@@ -1,4 +1,6 @@\n def parse_config(path):\n     with open(path) as f:\n         data = json.load(f)\n+    if data is None:\n+        return {}\n     return data['settings']"
    },
    {
      "step_idx": 4,
      "type": "bash_command",
      "command": "python -m pytest tests/test_config.py -v",
      "output": "tests/test_config.py::test_parse_config_empty PASSED\ntests/test_config.py::test_parse_config_valid PASSED\n\n2 passed in 0.12s",
      "exit_code": 0
    }
  ]
}
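Before launching annotation, it can save time to sanity-check the JSONL. This minimal checker only validates the key fields listed above plus contiguous step_idx values; it is a sketch, not a Potato command:

```python
import json

REQUIRED_FIELDS = {"id", "issue_description", "structured_turns"}

def validate_traces(path):
    """Check each line of a trace JSONL file for the required fields
    and contiguous step_idx values. Returns a list of error messages."""
    errors = []
    with open(path) as f:
        for line_no, line in enumerate(f, 1):
            trace = json.loads(line)
            missing = REQUIRED_FIELDS - trace.keys()
            if missing:
                errors.append(f"line {line_no}: missing {sorted(missing)}")
                continue
            idxs = [t.get("step_idx") for t in trace["structured_turns"]]
            if idxs != list(range(len(idxs))):
                errors.append(f"line {line_no}: step_idx not contiguous")
    return errors
```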

If you are converting from an existing agent format, use the trace converter tool:

bash
# Convert Claude Code traces
potato convert-traces \
  --format claude_code \
  --input ./raw_traces/ \
  --output ./data/traces.jsonl
 
# Convert SWE-Agent trajectories
potato convert-traces \
  --format swe_agent \
  --input ./swe_agent_output/ \
  --output ./data/traces.jsonl

Step 2: Create Your Configuration

Here is a complete project configuration for PRM annotation using first-error mode:

yaml
# config.yaml
project_name: "PRM Data Collection - SWE-bench Traces"
port: 8000
 
data:
  source: "local"
  input_path: "./data/traces.jsonl"
  data_format: "coding_trace"
 
coding_agent:
  display:
    diff_style: "unified"
    context_lines: 3
    syntax_highlighting: true
    terminal_theme: "dark"
    file_tree:
      enabled: true
      position: "left"
    collapsible:
      auto_collapse_thinking: true
      auto_collapse_long_output: true
      long_output_threshold: 50
 
annotation_schemes:
  - annotation_type: process_reward
    name: step_reward
    mode: "first_error"
    labels:
      correct: "Correct Step"
      incorrect: "Incorrect Step"
    allow_all_correct: true
    allow_all_incorrect: true
    description: >
      Review the agent's trajectory step by step. Click the first
      step where the agent makes an error. If the entire trajectory
      is correct, click "All Correct."
    highlight_clicked_step: true
    confirmation_dialog: true
 
  - annotation_type: radio
    name: outcome
    label: "Did the agent resolve the issue?"
    options:
      - value: "resolved"
        text: "Fully Resolved"
      - value: "partial"
        text: "Partially Resolved"
      - value: "not_resolved"
        text: "Not Resolved"
 
  - annotation_type: text_input
    name: error_description
    label: "If incorrect, briefly describe the error"
    placeholder: "e.g., Agent edited the wrong file..."
    required: false
    show_if:
      field: "step_reward"
      condition: "has_error"
 
output:
  path: "./output/"
  format: "jsonl"
 
quality_control:
  inter_annotator_agreement: true
  overlap_percentage: 15
  minimum_time_per_instance: 20
 
annotators:
  - username: "reviewer1"
    password: "pw_reviewer1"
  - username: "reviewer2"
    password: "pw_reviewer2"
  - username: "reviewer3"
    password: "pw_reviewer3"

Step 3: Launch the Annotation Server

bash
# Start the annotation server
potato start config.yaml -p 8000
 
# Or run in the background
nohup potato start config.yaml -p 8000 > potato.log 2>&1 &

Navigate to http://localhost:8000, log in with one of the configured annotator accounts, and begin reviewing traces.

Step 4: Monitor Progress

While annotation is underway, track completion and inter-annotator agreement:

bash
# Check annotation progress
potato status config.yaml
 
# View inter-annotator agreement
potato agreement config.yaml --metric krippendorff_alpha

Exporting to Training Formats

Once annotation is complete, export the data in the format your training pipeline expects.

PRM Format for Reward Model Training

The PRM export format produces one JSON object per trace with step-level labels:

bash
potato export \
  --format prm \
  --project ./output/ \
  --output ./training_data/prm_labels.jsonl

The output looks like this:

json
{
  "trace_id": "trace_001",
  "issue_description": "Fix the TypeError in parse_config() when config file is empty",
  "total_steps": 5,
  "first_error_step": null,
  "all_correct": true,
  "steps": [
    {"step_idx": 0, "type": "thinking", "label": "correct", "reward": 1.0},
    {"step_idx": 1, "type": "file_read", "label": "correct", "reward": 1.0},
    {"step_idx": 2, "type": "thinking", "label": "correct", "reward": 1.0},
    {"step_idx": 3, "type": "file_edit", "label": "correct", "reward": 1.0},
    {"step_idx": 4, "type": "bash_command", "label": "correct", "reward": 1.0}
  ]
}
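Loading this export into a training pipeline usually means flattening it into one example per labeled step. A minimal sketch, assuming only the fields shown in the export above:

```python
import json

def prm_training_examples(export_path):
    """Flatten Potato's PRM export into (trace_id, step_idx, reward)
    tuples, one per labeled step -- a common input shape for
    step-level reward model training."""
    examples = []
    with open(export_path) as f:
        for line in f:
            trace = json.loads(line)
            for step in trace["steps"]:
                examples.append(
                    (trace["trace_id"], step["step_idx"], step["reward"])
                )
    return examples
```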

DPO/RLHF Preference Pairs

When you have multiple traces for the same issue (e.g., from different agents or different runs), Potato can generate preference pairs based on PRM labels:

bash
potato export \
  --format preference_pairs \
  --project ./output/ \
  --output ./training_data/preferences.jsonl \
  --pair_by "issue_id"

The preference pair export compares traces that attempted the same task and selects the better one based on step-level labels:

json
{
  "prompt": "Fix the TypeError in parse_config() when config file is empty",
  "chosen_trace_id": "trace_001",
  "rejected_trace_id": "trace_002",
  "chosen_first_error": null,
  "rejected_first_error": 3,
  "chosen_steps": 5,
  "rejected_steps": 7,
  "margin": 0.8
}
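The exact ranking rule behind chosen/rejected is not specified here, but one plausible rule (an assumption for illustration, not Potato's documented logic) prefers the trace with no error, and otherwise the trace whose first error occurs later as a fraction of its length:

```python
def rank_pair(trace_a, trace_b):
    """Pick (chosen, rejected) between two traces for the same issue.
    Illustrative rule: a fully correct trace beats one with an error;
    otherwise the trace with the longer correct prefix (normalized by
    trace length) wins."""
    def score(t):
        fe = t["first_error_step"]
        if fe is None:
            return 1.0  # fully correct trajectory
        return fe / t["total_steps"]  # fraction of steps before first error
    return ((trace_a, trace_b) if score(trace_a) >= score(trace_b)
            else (trace_b, trace_a))
```

Applied to the example above, the error-free trace_001 (score 1.0) is chosen over trace_002, whose first error at step 3 of 7 gives a score of about 0.43.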

SWE-bench Compatible Results

Export in SWE-bench format for benchmarking:

bash
potato export \
  --format swe_bench \
  --project ./output/ \
  --output ./training_data/swe_bench_results.json

Analysis Examples

After collecting annotations, use these Python snippets to analyze the data and identify patterns.

Step-Level Accuracy by Step Type

python
import json
from collections import defaultdict
 
# Load PRM annotations
with open("training_data/prm_labels.jsonl") as f:
    traces = [json.loads(line) for line in f]
 
# Compute accuracy by step type
type_stats = defaultdict(lambda: {"correct": 0, "total": 0})
 
for trace in traces:
    for step in trace["steps"]:
        step_type = step["type"]
        type_stats[step_type]["total"] += 1
        if step["label"] == "correct":
            type_stats[step_type]["correct"] += 1
 
print("Step-Level Accuracy by Type:")
print("-" * 45)
for step_type, stats in sorted(type_stats.items()):
    acc = stats["correct"] / stats["total"] * 100
    print(f"  {step_type:<20} {acc:5.1f}%  ({stats['correct']}/{stats['total']})")

Finding Common Failure Points

python
import json
from collections import Counter
 
with open("training_data/prm_labels.jsonl") as f:
    traces = [json.loads(line) for line in f]
 
# Analyze where errors first occur
error_positions = []
error_types_at_first_error = Counter()
 
for trace in traces:
    if trace["first_error_step"] is not None:
        pos = trace["first_error_step"]
        total = trace["total_steps"]
        # Normalize position to 0-1 range
        error_positions.append(pos / total)
        # Track what type of step caused the first error
        error_step = trace["steps"][pos]
        error_types_at_first_error[error_step["type"]] += 1
 
if error_positions:
    avg_pos = sum(error_positions) / len(error_positions)
    print(f"Average first-error position: {avg_pos:.2f} (0=start, 1=end)")
    print(f"Traces with errors: {len(error_positions)}/{len(traces)}")
    print()
    print("Most common step types at first error:")
    for step_type, count in error_types_at_first_error.most_common(5):
        print(f"  {step_type}: {count}")

Computing Inter-Annotator Agreement on PRM Labels

python
import json
from sklearn.metrics import cohen_kappa_score
 
def load_annotations(annotator_file):
    """Load annotations from a single annotator's output file."""
    with open(annotator_file) as f:
        data = {item["trace_id"]: item for item in
                (json.loads(line) for line in f)}
    return data
 
ann1 = load_annotations("output/reviewer1/annotations.jsonl")
ann2 = load_annotations("output/reviewer2/annotations.jsonl")
 
# Find overlapping traces
overlap_ids = set(ann1.keys()) & set(ann2.keys())
print(f"Overlapping traces: {len(overlap_ids)}")
 
# Compare first-error step labels
labels1 = []
labels2 = []
for trace_id in overlap_ids:
    fe1 = ann1[trace_id].get("first_error_step", -1)
    fe2 = ann2[trace_id].get("first_error_step", -1)
    # Bin into: all_correct, early_error (first half), late_error (second half)
    total = ann1[trace_id]["total_steps"]
    for fe, labels in [(fe1, labels1), (fe2, labels2)]:
        if fe is None or fe == -1:
            labels.append("all_correct")
        elif fe < total / 2:
            labels.append("early_error")
        else:
            labels.append("late_error")
 
kappa = cohen_kappa_score(labels1, labels2)
print(f"Cohen's kappa (binned first-error): {kappa:.3f}")

Tips for Efficient PRM Data Collection

Use first-error mode for speed. If your primary goal is training a PRM to guide search (e.g., for MCTS or best-of-N sampling), first-error mode provides sufficient signal at 2-3x the annotation speed of per-step mode. Most agents exhibit cascading errors where the first mistake leads to a chain of subsequent failures.

Use per-step mode for fine-grained analysis. When you need to understand partial recovery, unnecessary-but-harmless steps, or when building datasets for step-level reward model training with more than two labels, per-step mode is worth the extra time.

Combine PRM with pairwise comparison. Annotate traces individually with PRM labels, then also run a pairwise comparison task on traces that attempted the same issue. This gives you both step-level rewards and preference pairs from a single annotation effort.

Start with experienced annotators. PRM annotation requires understanding code, diffs, and terminal output. Begin with a small group of experienced developers, measure agreement, calibrate with examples, then scale to a larger team.

Set a minimum time per instance. Traces can be complex. A 20-30 second minimum (the example configuration above uses 20) prevents annotators from rushing through without carefully reading the code changes. Adjust based on your average trace length.

Provide calibration examples. Before starting production annotation, have all annotators label the same 10-20 traces and discuss disagreements. This dramatically improves consistency.