
Process Reward Annotation

Collect per-step reward signals for training process reward models with first-error and per-step annotation modes. Export directly to PRM, DPO, and SWE-bench training formats.


New in v2.4.0

Process Reward Models (PRMs) require per-step correctness labels rather than a single outcome-level score. Training an effective PRM means collecting annotations that identify exactly where in a multi-step trace the agent went wrong, what kind of error occurred, and whether recovery was possible. This is fundamentally different from outcome-based annotation where you only judge the final result.

Potato provides two annotation modes optimized for different speed/detail tradeoffs when collecting process reward data. First-error mode is designed for fast binary labeling: the annotator clicks the first incorrect step, and all subsequent steps are automatically marked as tainted. Per-step mode asks the annotator to independently rate every step, producing richer signals at the cost of more annotation time.

Both modes integrate with the coding trace display, agent trace display, and web agent display, so you can collect process rewards for any type of agent trace.

First-Error Mode

In first-error mode, the annotator reads through the trace sequentially and clicks on the first step where the agent made an error. All steps before the clicked step are automatically labeled as correct. The clicked step and all subsequent steps are labeled as incorrect (the clicked step as the "first error" and the rest as "downstream of error").

This produces the exact label format needed for training binary PRMs: a sequence of +1 labels followed by a -1 at the error point and -1 for all remaining steps.
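Concretely, expanding a first-error click into the label sequence can be sketched as follows (`first_error_labels` is a hypothetical helper for illustration, not part of Potato's API):

```python
def first_error_labels(total_steps, first_error_step=None):
    """Expand a first-error annotation into per-step binary labels.

    Steps before the first error get +1; the error step and every step
    after it get -1. A first_error_step of None means the annotator
    marked the whole trace correct.
    """
    if first_error_step is None:
        return [1] * total_steps
    return [1 if i < first_error_step else -1 for i in range(total_steps)]
```

For example, an error clicked at step 4 of an 8-step trace yields four +1 labels followed by four -1 labels.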

Configuration

yaml
annotation_schemes:
  - name: process_reward
    annotation_type: process_reward
    mode: first_error
    description: "Click the first step where the agent made a mistake"
 
    first_error:
      # Visual styling
      correct_color: "#22c55e"     # green for steps before the error
      error_color: "#ef4444"       # red for the first error step
      downstream_color: "#f97316"  # orange for steps after the error
      unmarked_color: "#6b7280"    # gray for steps not yet reviewed
 
      # Behavior
      require_confirmation: true   # ask "Are you sure?" before marking
      allow_no_error: true         # allow annotator to mark all steps correct
      show_step_content: true      # show step content in the annotation panel
 
      # Labels applied automatically
      labels:
        correct: "+1"
        first_error: "-1 (first error)"
        downstream: "-1 (downstream)"
        all_correct: "+1 (all correct)"

Annotation Workflow

  1. The annotator reads the trace from top to bottom
  2. Steps are initially unmarked (gray)
  3. The annotator clicks the first incorrect step
  4. Steps 0 through N-1 turn green (correct)
  5. Step N turns red (first error)
  6. Steps N+1 through the end turn orange (downstream)
  7. If the entire trace is correct, the annotator clicks "All Steps Correct"

Output Format

json
{
  "id": "trace_042",
  "annotations": {
    "process_reward": {
      "mode": "first_error",
      "first_error_step": 4,
      "total_steps": 8,
      "labels": [1, 1, 1, 1, -1, -1, -1, -1]
    }
  }
}

When the annotator marks all steps correct, first_error_step is null and the labels array contains a 1 for every step.

Per-Step Mode

In per-step mode, the annotator independently rates every step in the trace. This produces richer signals: a step can be "partially correct" or "unnecessary" rather than just correct/incorrect. It also captures cases where the agent recovers from an error, which first-error mode cannot represent.

Configuration

yaml
annotation_schemes:
  - name: process_reward
    annotation_type: process_reward
    mode: per_step
    description: "Rate each step independently"
 
    per_step:
      # Rating options
      labels:
        - value: "correct"
          display: "Correct"
          color: "#22c55e"
          score: 1.0
        - value: "partially_correct"
          display: "Partially Correct"
          color: "#eab308"
          score: 0.5
        - value: "incorrect"
          display: "Incorrect"
          color: "#ef4444"
          score: -1.0
        - value: "unnecessary"
          display: "Unnecessary"
          color: "#f97316"
          score: -0.5
        - value: "recovery"
          display: "Recovery from Error"
          color: "#3b82f6"
          score: 0.25
 
      # Optional error categorization for incorrect/partially correct steps
      error_categories:
        enabled: true
        categories:
          - "Wrong tool selected"
          - "Correct tool, wrong arguments"
          - "Hallucinated information"
          - "Repeated previous step"
          - "Logic error"
          - "Syntax error"
          - "Missed edge case"
          - "Unnecessary step"
          - "Other"
 
      # Behavior
      require_all_steps: true     # all steps must be rated before submission
      allow_notes: true           # optional text field per step
      show_running_score: true    # show cumulative reward score

Annotation Workflow

  1. Each step in the trace has a rating widget beside it
  2. The annotator selects a label for each step
  3. If the step is rated "Incorrect" or "Partially Correct" and error categories are enabled, a dropdown appears to select the error type
  4. An optional notes field allows free-text explanation
  5. A running score at the top shows the cumulative reward

Output Format

json
{
  "id": "trace_042",
  "annotations": {
    "process_reward": {
      "mode": "per_step",
      "total_steps": 6,
      "labels": [1.0, 1.0, -1.0, 0.25, 1.0, 1.0],
      "step_details": {
        "0": {"label": "correct"},
        "1": {"label": "correct"},
        "2": {
          "label": "incorrect",
          "error_category": "Wrong tool selected",
          "notes": "Agent used grep when it should have read the file directly"
        },
        "3": {
          "label": "recovery",
          "notes": "Agent recognized the mistake and tried a different approach"
        },
        "4": {"label": "correct"},
        "5": {"label": "correct"}
      },
      "cumulative_score": 2.75
    }
  }
}
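The cumulative score is the sum of the per-label scores defined in the scheme configuration. As a sketch (the score mapping below mirrors the example config; `cumulative_score` is a hypothetical helper, not Potato's actual implementation):

```python
# Score mapping taken from the per_step labels in the example config.
LABEL_SCORES = {
    "correct": 1.0,
    "partially_correct": 0.5,
    "incorrect": -1.0,
    "unnecessary": -0.5,
    "recovery": 0.25,
}

def cumulative_score(step_details):
    """Sum the per-step scores for a per-step annotation's step_details dict."""
    return sum(LABEL_SCORES[d["label"]] for d in step_details.values())
```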

Configuration Reference

Full configuration options for the process reward annotation scheme:

yaml
annotation_schemes:
  - name: process_reward
    annotation_type: process_reward
    mode: first_error              # "first_error" or "per_step"
    description: "Process reward annotation"
 
    # Required for mode: first_error
    first_error:
      correct_color: "#22c55e"
      error_color: "#ef4444"
      downstream_color: "#f97316"
      unmarked_color: "#6b7280"
      require_confirmation: true
      allow_no_error: true
      show_step_content: true
 
    # Required for mode: per_step
    per_step:
      labels:
        - value: "correct"
          display: "Correct"
          color: "#22c55e"
          score: 1.0
        - value: "incorrect"
          display: "Incorrect"
          color: "#ef4444"
          score: -1.0
      error_categories:
        enabled: false
        categories: []
      require_all_steps: true
      allow_notes: false
      show_running_score: false
 
    # Common options
    target: agentic_steps          # bind to trace steps
    keyboard_shortcuts:
      enabled: true
      correct: "1"
      incorrect: "2"
      partially_correct: "3"
      unnecessary: "4"
      next_step: "j"
      prev_step: "k"
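As an illustration, a minimal sanity check over a loaded scheme might look like this (assuming the YAML has already been parsed into a dict, e.g. with PyYAML; `validate_scheme` is a hypothetical helper, not part of Potato):

```python
def validate_scheme(scheme):
    """Check that a process_reward scheme names a valid mode and carries
    the matching mode-specific section."""
    mode = scheme.get("mode")
    if mode not in ("first_error", "per_step"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode not in scheme:
        raise ValueError(f"mode {mode!r} requires a {mode!r} section")
    if mode == "per_step" and not scheme["per_step"].get("labels"):
        raise ValueError("per_step mode requires at least one label")
    return True
```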

Export to Training Formats

Potato can export process reward annotations directly to formats used by common PRM training pipelines.

PRM Training Format

Export binary step-level labels for PRM training:

bash
python -m potato.export \
  -i output/ \
  -f prm \
  -o results/prm_training_data.jsonl

Output format:

json
{
  "trace_id": "trace_042",
  "steps": [
    {"content": "Search for Tokyo population", "label": 1},
    {"content": "Parse search results", "label": 1},
    {"content": "Search for NYC population", "label": -1},
    {"content": "Compare populations", "label": -1}
  ]
}
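Assembling such a record from an annotation's labels and the trace's step contents can be sketched as follows (a hypothetical helper for illustration, not the actual exporter code):

```python
def to_prm_record(trace_id, step_contents, labels):
    """Pair each step's content with its annotated label in the
    step-level JSON shape used by the PRM export format."""
    return {
        "trace_id": trace_id,
        "steps": [
            {"content": content, "label": label}
            for content, label in zip(step_contents, labels)
        ],
    }
```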

DPO / RLHF Preference Pairs

When you have multiple annotated traces for the same task, export pairwise preferences for DPO or RLHF training. The exporter pairs traces where one has a higher cumulative reward than the other:

bash
python -m potato.export \
  -i output/ \
  -f dpo \
  -o results/dpo_pairs.jsonl \
  --min-score-gap 0.5

Output format:

json
{
  "prompt": "Fix the failing test in test_parser.py",
  "chosen": [
    {"role": "assistant", "content": "Step 1: Read the test file..."},
    {"role": "assistant", "content": "Step 2: Identify the bug..."}
  ],
  "rejected": [
    {"role": "assistant", "content": "Step 1: Run all tests..."},
    {"role": "assistant", "content": "Step 2: Edit a random file..."}
  ]
}
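The pairing rule can be sketched as follows (a simplification using a hypothetical trace dict shape; the real exporter reads Potato's output files):

```python
from itertools import combinations

def dpo_pairs(traces, min_score_gap=0.5):
    """Pair traces for the same prompt whose cumulative scores differ by
    at least min_score_gap; the higher-scoring trace becomes 'chosen'.

    Each trace is a dict with 'prompt', 'steps', and 'score' keys
    (an illustrative shape, not Potato's internal representation).
    """
    pairs = []
    for a, b in combinations(traces, 2):
        if a["prompt"] != b["prompt"]:
            continue
        if abs(a["score"] - b["score"]) < min_score_gap:
            continue
        hi, lo = (a, b) if a["score"] > b["score"] else (b, a)
        pairs.append({
            "prompt": hi["prompt"],
            "chosen": hi["steps"],
            "rejected": lo["steps"],
        })
    return pairs
```

Raising --min-score-gap filters out near-tied pairs, which trades dataset size for cleaner preference signal.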

SWE-bench Compatible Results

Export evaluation results in a format compatible with SWE-bench leaderboard submissions:

bash
python -m potato.export \
  -i output/ \
  -f swebench \
  -o results/swebench_results.json

This generates the standard SWE-bench evaluation JSON with instance IDs, model patches, and resolution status derived from annotator judgments.

Analysis

Potato provides utility functions for analyzing process reward annotations:

python
from potato.analysis import load_annotations, process_reward_stats
 
# Load annotations
annotations = load_annotations("output/")
 
# Step-level accuracy statistics
stats = process_reward_stats(annotations)
 
print(f"Total traces annotated: {stats['total_traces']}")
print(f"Traces with no errors: {stats['all_correct_count']} ({stats['all_correct_pct']:.1f}%)")
print(f"Average first-error position: step {stats['avg_first_error_step']:.1f}")
print(f"Average steps before error: {stats['avg_correct_prefix_length']:.1f}")
 
# Error distribution by step position
for position, count in stats['error_by_position'].items():
    print(f"  Step {position}: {count} errors")
 
# Error category distribution (per-step mode only)
if 'error_categories' in stats:
    for category, count in stats['error_categories'].items():
        print(f"  {category}: {count}")
 
# Inter-annotator agreement on first-error step
if stats['multi_annotator']:
    print(f"First-error agreement (exact): {stats['first_error_exact_agreement']:.2f}")
    print(f"First-error agreement (within 1): {stats['first_error_near_agreement']:.2f}")
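The two agreement numbers correspond to exact-match and within-one-step agreement on the first-error position. A sketch of that computation, for two annotators' first-error positions keyed by trace id (`first_error_agreement` is a hypothetical stand-in for Potato's internal calculation):

```python
def first_error_agreement(ann_a, ann_b, tolerance=0):
    """Fraction of shared traces where two annotators' first-error steps
    agree within `tolerance` positions. None (no error marked) only
    matches None.
    """
    shared = ann_a.keys() & ann_b.keys()
    if not shared:
        return 0.0

    def match(x, y):
        if x is None or y is None:
            return x == y
        return abs(x - y) <= tolerance

    return sum(match(ann_a[t], ann_b[t]) for t in shared) / len(shared)
```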

Visualization

python
from potato.analysis import plot_error_distribution
 
# Plot error position distribution across all traces
plot_error_distribution(
    annotations,
    output_path="figures/error_distribution.png",
    normalize_by_trace_length=True,
    title="Where Do Agents First Go Wrong?"
)
 
# Plot per-step reward curves
from potato.analysis import plot_reward_curves
 
plot_reward_curves(
    annotations,
    output_path="figures/reward_curves.png",
    group_by="agent_model",
    title="Cumulative Reward by Model"
)

Research Context

Process reward annotation in Potato is designed to support research on training and evaluating reward models for agentic systems. Several recent lines of work motivate this feature:

  • AgentPRM demonstrates that process reward models trained on step-level labels significantly outperform outcome reward models for guiding coding agents during search.
  • ToolRM and ToolRL show that reward models specialized for tool-use steps can improve agent performance on API-calling and code generation tasks.
  • DeepSWE applies process reward models to SWE-bench-scale software engineering tasks, using per-step labels to train verifiers that guide agent tree search.
  • Step-level RLHF research shows that per-step human feedback produces more sample-efficient reward models than episode-level feedback.

Potato's first-error and per-step modes map directly to the label formats used by these approaches. The export pipeline produces training-ready data without additional preprocessing.

See Also

For implementation details, see the source documentation.