Process Reward Models and Step-Level Labeling

How to collect process reward model (PRM) training data with Potato by labeling agent steps as correct or incorrect, in first-error or per-step mode.

A process reward model (PRM) scores the reasoning steps an agent takes, not just its final answer. Training one needs step-level labels: for each step in a trajectory, was it correct? This data is what lets a model learn to reason well, not just to land on the right answer by luck.

PRMs contrast with outcome reward models, which score only the final result. Labeling at the step level catches the case where a model reaches the right answer through flawed reasoning. For the feature reference see Process Reward Annotation.
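To make the contrast concrete, here is a minimal Python sketch. The `Trajectory` record and the function names are hypothetical, chosen for illustration; they are not Potato's export format or API. It shows how an outcome label and step-level labels can disagree on the same trajectory.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    # Hypothetical record layout for illustration; not Potato's export format.
    step_correct: List[bool]  # one step-level label per reasoning step
    outcome_correct: bool     # single label for the final answer


def outcome_reward(t: Trajectory) -> float:
    """Outcome-style signal: looks only at the final result."""
    return 1.0 if t.outcome_correct else 0.0


def process_rewards(t: Trajectory) -> List[float]:
    """Process-style signal: one value per step."""
    return [1.0 if ok else 0.0 for ok in t.step_correct]


def right_answer_flawed_reasoning(t: Trajectory) -> bool:
    """True when the answer is right but at least one step is wrong."""
    return t.outcome_correct and not all(t.step_correct)


# A trajectory that lands on the right answer despite a bad second step.
lucky = Trajectory(step_correct=[True, False, True], outcome_correct=True)
```

For `lucky`, the outcome signal alone reports full credit, while the step-level signal exposes the faulty second step.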

Two labeling modes

Potato's process_reward type supports the two standard schemes:

  • First-error mode: the annotator marks the first step that goes wrong; every step after it is automatically treated as compromised. Fast, and well-matched to how reasoning failures cascade.
  • Per-step mode: the annotator judges every step independently as correct or incorrect. More granular, more effort.
yaml
annotation_schemes:
  - annotation_type: process_reward
    name: step_rewards
    description: "Mark the first incorrect step. Steps after it are flagged automatically."
    steps_key: structured_turns
    mode: first_error
    first_error:
      correct_color: "#22c55e"
      error_color: "#ef4444"
      downstream_color: "#f97316"
      require_confirmation: true

The colors make the cascade visible: green steps are good, the red step is the first error, and orange marks the now-suspect downstream steps.
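The cascade can also be sketched in Python for downstream processing. This is an illustrative helper, not part of Potato: the function name, the zero-based index, and the status strings (which mirror the `correct_color`, `error_color`, and `downstream_color` keys above) are assumptions.

```python
from typing import List, Optional


def expand_first_error(num_steps: int, first_error_index: Optional[int]) -> List[str]:
    """Expand one first-error annotation into a status for every step.

    first_error_index is the zero-based position of the first incorrect
    step, or None when the annotator found no error (an assumed encoding).
    """
    if first_error_index is None:
        return ["correct"] * num_steps
    if not 0 <= first_error_index < num_steps:
        raise ValueError("first_error_index is outside the trajectory")

    statuses = []
    for i in range(num_steps):
        if i < first_error_index:
            statuses.append("correct")      # before the first error
        elif i == first_error_index:
            statuses.append("error")        # the step the annotator marked
        else:
            statuses.append("downstream")   # suspect because it follows an error
    return statuses
```

For a five-step trajectory whose third step is marked, `expand_first_error(5, 2)` yields two `correct` steps, one `error`, and two `downstream` steps. Whether `downstream` steps are trained on as negatives or simply masked out is a modeling decision this sketch leaves open.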

When to use which mode

  • First-error for math, coding, and chained reasoning where one mistake invalidates the rest. Cheaper and usually sufficient.
  • Per-step when steps are independent, or when you need a dense reward signal for every step.

Quality considerations

  • Define "correct step" precisely: correct and useful, or merely not-wrong? A redundant-but-harmless step needs a rule.
  • Reasoning is subjective at the margins, so have more than one annotator label an overlapping sample and check their agreement.
  • Pair with a trajectory-level outcome label so you can study where good outcomes hide bad reasoning. See Annotating Agent Trajectories.
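One common way to check agreement on the overlap sample is Cohen's kappa, which corrects raw agreement for chance. The sketch below is a plain-Python illustration for two annotators; the function name and the flat list-of-labels input are assumptions, not a Potato feature.

```python
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labeled the same steps in the same order."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty set of steps")
    n = len(a)
    a, b = list(a), list(b)

    # Observed agreement: fraction of steps with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n

    # Chance agreement: from each annotator's own label frequencies.
    labels = set(a) | set(b)
    expected = sum((a.count(lbl) / n) * (b.count(lbl) / n) for lbl in labels)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Step-level judgments (True = correct) from two annotators on eight shared steps.
annotator_1 = [True, True, False, True, False, True, True, False]
annotator_2 = [True, True, False, False, False, True, True, True]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on six of eight steps (0.75 raw agreement), but kappa comes out near 0.47 once chance agreement is removed. A low value usually points back to the first bullet: the definition of a "correct step" needs tightening.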

Further reading