# How to Get Reliable Labels on Agent Trajectories

Source: https://www.potatoannotator.com/blog/annotating-agent-trajectories-reliably

Labeling a tweet is one decision. Labeling an agent trajectory is dozens: every step in the run is a small judgment, and the judgments depend on each other. That is what makes trajectory annotation its own problem. Two careful people will usually agree on whether a run succeeded, then disagree about which of its twelve steps was the one that went wrong. If you do not design for that gap, your "labeled" trajectory data is quietly unreliable, and the reward model or the debugging analysis you build on it inherits the noise.

**A trajectory is the full trace of an agent's run: its goal, and then each step's reasoning, tool call, and observation. Annotating one means judging the run overall and marking where individual steps went wrong. The whole-run label is easy to agree on; the step-level labels are not, and that is exactly where the useful signal lives. Reliable trajectory data comes from a tight error taxonomy, per-step agreement measured on purpose, and a way to resolve disagreement.** This post is about getting those three things right.

## What makes a trajectory hard to annotate

The difficulty is not the volume of steps. It is that the steps are not independent.

- **Error attribution.** When a run fails, the visible failure is often several steps downstream of the real mistake. The agent made a bad plan at step 3, and it surfaced as a wrong answer at step 11. Two annotators watching the same run can both be right that "this step is wrong" while disagreeing about which step to blame.
- **Cascading effects.** Once a step goes wrong, everything after it is contaminated. Is a later step "wrong" on its own terms, or only wrong because it inherited a bad state? Annotators need a rule for this, or they will split.
- **Hidden state.** The agent's reasoning is not always in the trace, and its tools have side effects you cannot see. [τ-bench (Yao et al., 2024)](https://arxiv.org/abs/2406.12045) handles this by checking the database state at the end of a run against an annotated goal state, because you often cannot judge correctness from the transcript alone.
- **Subjective "necessity."** Whether a step was *unnecessary* rather than *wrong* is a judgment call, and it is one of the least reliable labels in practice. A redundant search is not an error, but it is not clean either.

## Design the taxonomy before you label

The taxonomy is the part that decides your data quality, and it is worth building deliberately. The strongest evidence for this is that the good agent-failure datasets were built exactly this way. [MAST, the Multi-Agent System Failure Taxonomy (Cemri et al., 2025)](https://arxiv.org/abs/2503.13657), came out of expert annotators studying 150 traces, iterating on categories until they reached a Cohen's kappa of 0.88, and only then scaling to 1600-plus traces across its 14 failure modes. The reliability came from the taxonomy work, not from more annotators.

![Anatomy of an agent trajectory: a goal, then a sequence of steps each with a thought, a tool call, and an observation, ending in an outcome, with per-step judgment, error category, and severity layered on top.](/images/blog/anatomy-of-a-trajectory.svg "A trajectory is a goal, a chain of thought-action-observation steps, and an outcome, each step carrying its own labels")

A workable taxonomy is small, close to exhaustive, and mutually exclusive. Three top-level categories cover most agent failures:

- **Reasoning errors**: a wrong conclusion, ignored evidence, a bad plan.
- **Execution errors**: the wrong tool, a malformed call, a mishandled result.
- **Safety errors**: an unsafe action, out-of-scope behavior, data exposure.

Give annotators a free-text "other" so a novel failure has somewhere to go instead of being crammed into the nearest category, then watch the "other" notes and promote recurring ones into named categories. [AgentRewardBench (Lù et al., 2025)](https://arxiv.org/abs/2504.08942) is a useful model for what to capture at the run level: its expert reviewers judged each of 1302 trajectories on success, side effects, and repetitiveness, three axes that a single success flag would have collapsed.

![An agent error taxonomy tree: reasoning, execution, and safety categories, each splitting into named subtypes, with an open other branch.](/images/blog/agent-error-taxonomy.svg "A small, mutually exclusive taxonomy with an escape hatch for novel failures")
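Watching the "other" notes can start as a simple count. The sketch below assumes the free-text notes have been exported as a list of strings; `promotion_candidates` and the `min_count` threshold are hypothetical, and exact-match counting after whitespace and case normalization is a deliberate simplification (real notes will need a human to group paraphrases).

```python
from collections import Counter


def promotion_candidates(other_notes, min_count=3):
    """Return (note, count) pairs for 'other' notes that recur at least min_count times."""
    # Normalize case and collapse whitespace so trivial variants count together.
    normalized = (
        " ".join(note.lower().split()) for note in other_notes if note.strip()
    )
    counts = Counter(normalized)
    return [(note, n) for note, n in counts.most_common() if n >= min_count]


notes = [
    "Looped on same search",
    "looped on  same search",
    "looped on same search ",
    "hallucinated file path",
]
candidates = promotion_candidates(notes)
```

Anything that clears the threshold is a candidate for a named category in the next guideline revision, not an automatic promotion.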

## Measuring agreement on multi-step labels

Overall success is the easy label. Two people watch a run and mostly agree on whether it worked. If that is the only number you report, your data looks more reliable than it is.

Measure agreement where it is actually hard. Compute [inter-annotator agreement](/docs/guides/inter-annotator-agreement) separately for step correctness and for error category, because they behave differently: people agree on *whether* a step is wrong far more than on *why*. Also compare where each annotator places the first wrong step, since that single step matters most when training a process reward model in the sense of [PRM800K / "Let's Verify Step by Step" (Lightman et al., 2023)](https://arxiv.org/abs/2305.20050). And treat automatic evaluation with suspicion: AgentRewardBench found that the rule-based checks that common benchmarks rely on tend to underreport agent success, so a cheap automatic label is only a first pass, not a substitute for the human one.
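As a rough illustration of reporting the two numbers separately, here is a minimal sketch for two annotators using a hand-rolled Cohen's kappa. The label lists are made up, and restricting the category comparison to steps both annotators flagged as wrong is one possible design choice, not the only one.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty, equal-length label lists")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both used one identical label throughout; kappa is undefined
    return (observed - expected) / (1 - expected)


def first_wrong_step(correct_flags):
    """Index of the first step marked incorrect, or None if every step passed."""
    for i, ok in enumerate(correct_flags):
        if not ok:
            return i
    return None


# Hypothetical labels from two annotators on the same 10-step run.
a_correct = [True, True, False, False, True, False, True, True, False, True]
b_correct = [True, True, False, True, True, False, True, True, False, False]

# Error categories, compared only on steps both flagged as wrong (2, 5, 8).
a_category = ["reasoning", "execution", "execution"]
b_category = ["reasoning", "reasoning", "execution"]

step_kappa = cohen_kappa(a_correct, b_correct)
category_kappa = cohen_kappa(a_category, b_category)
same_first_error = first_wrong_step(a_correct) == first_wrong_step(b_correct)

print(f"step correctness kappa: {step_kappa:.2f}")
print(f"error category kappa:   {category_kappa:.2f}")
print(f"same first wrong step:  {same_first_error}")
```

On this toy data the category kappa comes out lower than the step-correctness kappa, which is the pattern described above. A real estimate needs labels pooled over many runs; three steps are far too few to trust.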

## Adjudicating disagreement and onboarding annotators

Disagreement on trajectories is not noise to average away. It is information, usually a sign that the taxonomy has a soft spot. When two annotators split on which step to blame, that pair of labels tells you the cascade rule is underspecified, and the fix goes back into the guidelines.

Two practices carry most of the weight. First, adjudicate disagreements rather than settling them by majority vote, because on a trajectory the "why did you pick that step" conversation is where the real rule gets written; see [Adjudication and Resolving Disagreement](/docs/guides/adjudication-and-disagreement). Second, onboard annotators on long traces slowly. Trajectories are fatiguing, and a tired annotator on step 40 is not the same instrument as a fresh one on step 2. Cap or paginate long runs, and calibrate everyone on a shared set of traces before they work independently.
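Routing disagreements to adjudication instead of voting can be as simple as pulling out every run where the first-wrong-step calls differ. This is a sketch under assumed inputs: the dictionary shape, the run and annotator ids, and `adjudication_queue` are all hypothetical, and `None` stands for "no wrong step found".

```python
def adjudication_queue(first_error_calls):
    """Return the runs whose annotators disagree on the first wrong step.

    first_error_calls maps run_id -> {annotator_id: step index or None}.
    """
    queue = []
    for run_id, calls in first_error_calls.items():
        # More than one distinct call means the annotators split on this run.
        if len(set(calls.values())) > 1:
            queue.append({"run_id": run_id, "calls": dict(calls)})
    return queue


calls = {
    "run-1": {"ann_a": 3, "ann_b": 3},
    "run-2": {"ann_a": 3, "ann_b": 11},    # split on which step to blame
    "run-3": {"ann_a": None, "ann_b": 7},  # one annotator saw no error at all
}
to_adjudicate = adjudication_queue(calls)
```

Keeping each annotator's original call in the queue entry matters: the adjudication conversation is about why the calls differ, so the resolved label should not overwrite them.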

## Doing it in Potato

Potato's `trajectory_eval` type renders each step as a card and attaches a per-step error taxonomy with severity weights, so the labels above become a config rather than a spreadsheet convention.

```yaml
annotation_schemes:
  - annotation_type: trajectory_eval
    name: step_evaluation
    description: "Evaluate each step for correctness and mark any errors."
    steps_key: steps
    error_types:
      - {name: reasoning,  subtypes: [logical_error, factual_error, planning_error]}
      - {name: execution,  subtypes: [wrong_tool, wrong_args, api_error]}
      - {name: safety,     subtypes: [harmful_action, data_leak, scope_violation]}
    severities:
      - {name: minor,    weight: -1}
      - {name: major,    weight: -5}
      - {name: critical, weight: -10}
    show_score: true
```

The severity weights roll up into a trajectory score, so you can rank runs and track regressions across model versions. When the goal is specifically the first wrong step for reward-model training, the [`process_reward`](/docs/guides/process-reward-models) type has a first-error mode built for it. Potato imports traces from 13 formats into a common step view, so you can annotate a run whatever framework produced it; see [Agentic Annotation](/docs/features/agentic-annotation).
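The exact roll-up formula is not spelled out here, so the following is only a sketch that assumes the trajectory score is a plain sum of the severity weights from the config above. `trajectory_score` is a hypothetical helper, not Potato's API, and the per-step severities are made up.

```python
# Weights copied from the `severities` block in the config above.
SEVERITY_WEIGHTS = {"minor": -1, "major": -5, "critical": -10}


def trajectory_score(step_severities):
    """Sum severity weights over a run; None marks a step with no error.

    Assumption: the score is an unnormalized sum, so longer runs can
    accumulate more penalty than shorter ones.
    """
    return sum(SEVERITY_WEIGHTS[s] for s in step_severities if s is not None)


# Hypothetical labels for the same task under two model versions.
old_version = [None, "minor", None, "major", "minor"]
new_version = [None, None, "minor", None, None]

old_score = trajectory_score(old_version)
new_score = trajectory_score(new_version)
regressed = new_score < old_score
```

Under this assumption a higher (less negative) score is better, so a regression shows up as the newer version scoring lower than the older one on the same task.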

## Further reading

- [Annotating Agent Trajectories](/docs/guides/agent-trajectory-annotation), the step-by-step feature reference.
- [Process Reward Models and Step-Level Labeling](/docs/guides/process-reward-models), for first-error and per-step reward data.
- [Evaluating Tool Use and Function Calling](/docs/guides/tool-use-evaluation), for judging individual tool calls.
- [Inter-Annotator Agreement Explained](/docs/guides/inter-annotator-agreement), for the reliability statistics this all rests on.

The showcase pages built from real agent benchmarks show the schemes in context: [WebArena](/showcase/webarena-web-agent-eval), [τ-bench](/showcase/tau-bench-agent-eval), and [AgentRewardBench](/showcase/agentrewardbench-trajectory-scoring).
