Annotating Agent Trajectories
How to annotate AI agent trajectories with Potato's trajectory evaluation: step-by-step judgments, error taxonomies, severity scoring, and trajectory-level success.
A trajectory is the full sequence of steps an agent took: its thoughts, tool calls, and observations. Annotating a trajectory means judging the run as a whole and marking where individual steps went wrong, with a category and a severity for each error. This is the richest form of agent evaluation, and it supplies the data behind reward models and targeted debugging.
For the feature reference, see Agentic Annotation.
What you collect
- Overall outcome: success, partial success, or failure.
- Per-step judgments: for each step, was it correct, unnecessary, or wrong?
- Error categories: why a step was wrong (wrong tool, bad arguments, hallucination, looping, unsafe action…).
- Severity: how bad each error was, often weighted into a score.
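As a rough illustration of the four items above, one annotated trajectory could be held in a record like the following. This is a minimal sketch, not Potato's output format: the class names, field names, and label strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StepJudgment:
    """One annotator's judgment of a single step (hypothetical layout)."""
    step_index: int
    verdict: str                          # "correct", "unnecessary", or "wrong"
    error_category: Optional[str] = None  # set only when verdict == "wrong"
    severity: Optional[str] = None        # e.g. "minor", "major", "critical"


@dataclass
class TrajectoryAnnotation:
    """Overall outcome plus the per-step judgments for one run."""
    trajectory_id: str
    outcome: str                          # "success", "partial_success", or "failure"
    steps: List[StepJudgment] = field(default_factory=list)

    def wrong_steps(self) -> List[StepJudgment]:
        # Only steps judged "wrong" carry an error category and a severity.
        return [s for s in self.steps if s.verdict == "wrong"]


annotation = TrajectoryAnnotation(
    trajectory_id="run-001",  # made-up identifier
    outcome="partial_success",
    steps=[
        StepJudgment(0, "correct"),
        StepJudgment(1, "wrong", error_category="wrong_tool", severity="major"),
        StepJudgment(2, "unnecessary"),
    ],
)
```

Keeping the verdict separate from the error category means an "unnecessary" step can be recorded without forcing the annotator to pick an error type.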
Setting up trajectory evaluation
Potato's trajectory_eval type renders each step as a card and attaches a per-step error taxonomy with severity weights:
annotation_schemes:
  - annotation_type: trajectory_eval
    name: step_evaluation
    description: "Evaluate each step for correctness and mark any errors."
    steps_key: steps
    error_types:
      - {name: reasoning, subtypes: [logical_error, factual_error, planning_error]}
      - {name: execution, subtypes: [wrong_tool, wrong_args, api_error]}
      - {name: safety, subtypes: [harmful_action, data_leak, scope_violation]}
    severities:
      - {name: minor, weight: -1}
      - {name: major, weight: -5}
      - {name: critical, weight: -10}
    show_score: true
The severity weights roll up into a trajectory score, so you can rank runs and track regressions across model versions.
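The text does not say exactly how the weights are aggregated. The sketch below assumes the simplest rule, a plain sum of the severity weights of all marked errors, and uses it to rank two runs. The function name, the error-record layout, and the example data are hypothetical; only the weights come from the config above.

```python
# Severity weights copied from the config above.
SEVERITY_WEIGHTS = {"minor": -1, "major": -5, "critical": -10}


def trajectory_score(step_errors):
    """Sum severity weights over all marked errors (assumed aggregation rule).

    Steps with no marked error contribute 0, so a clean run scores 0.
    """
    return sum(SEVERITY_WEIGHTS[e["severity"]] for e in step_errors)


# Made-up error annotations for the same task run by two model versions.
errors_v1 = [
    {"step": 2, "subtype": "wrong_args", "severity": "minor"},
    {"step": 4, "subtype": "data_leak", "severity": "critical"},
]
errors_v2 = [
    {"step": 3, "subtype": "planning_error", "severity": "major"},
]

scores = {"v1": trajectory_score(errors_v1), "v2": trajectory_score(errors_v2)}

# Higher (closer to zero) is better because every weight is negative.
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

Under this assumption v1 scores -11 and v2 scores -5, so v2 ranks first. A drop in the score between model versions on the same task set would flag a regression.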
Designing a good error taxonomy
The taxonomy is the heart of the task. Keep it small, exhaustive, and mutually exclusive. A practical starting set:
- Reasoning errors: wrong conclusion, ignored evidence, bad plan.
- Execution errors: wrong tool, malformed call, mishandled result.
- Safety errors: unsafe action, out-of-scope behavior, data exposure.
Add a free-text "other" so annotators aren't forced to misfile novel failures, then promote recurring "other" notes into named categories.
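One way to spot "other" notes worth promoting is to count how often the same note recurs. The sketch below is a crude first pass that assumes exact matching after lowercasing and trimming whitespace; real free-text notes usually need manual grouping on top of this. The function name, the threshold, and the sample notes are hypothetical.

```python
from collections import Counter


def promotion_candidates(other_notes, min_count=3):
    """Return (note, count) pairs for notes that recur at least min_count times."""
    # Normalize case and whitespace; skip empty notes.
    normalized = (note.strip().lower() for note in other_notes if note.strip())
    counts = Counter(normalized)
    return [(note, n) for note, n in counts.most_common() if n >= min_count]


# Made-up free-text notes entered under "other".
notes = [
    "Ignored user constraint",
    "ignored user constraint ",
    "timeout",
    "Ignored user constraint",
    "",
]

candidates = promotion_candidates(notes)
print(candidates)
```

Here only "ignored user constraint" reaches the threshold, which makes it a candidate for a named category in the next revision of the taxonomy.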
Quality considerations
- Agreement on step correctness is usually high; agreement on error category is lower. Measure both (see Inter-Annotator Agreement).
- Long trajectories are fatiguing; cap length or paginate.
- The "first wrong step" is often what matters most for training (see Process Reward Models).
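Two of these checks are easy to sketch in code. The block below assumes two annotators and uses Cohen's kappa as the agreement measure, which is one common choice and not necessarily what Potato reports; it also locates the first wrong step in a list of per-step verdicts. Function names, label strings, and the example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def first_wrong_step(verdicts):
    """Index of the first step judged "wrong", or None if there is none."""
    for index, verdict in enumerate(verdicts):
        if verdict == "wrong":
            return index
    return None


# Made-up per-step verdicts from two annotators on one four-step trajectory.
annotator_a = ["correct", "correct", "wrong", "wrong"]
annotator_b = ["correct", "correct", "wrong", "correct"]

print(cohens_kappa(annotator_a, annotator_b))
print(first_wrong_step(annotator_a))
```

The same `cohens_kappa` function can be run a second time on the error-category labels, restricted to steps both annotators marked wrong, to compare the two agreement levels.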