# Annotating Agent Trajectories

Source: https://www.potatoannotator.com/docs/guides/agent-trajectory-annotation

**A trajectory is the full sequence of steps an agent took: its thoughts, tool calls, and observations. Annotating a trajectory means judging the run as a whole and marking where individual steps went wrong, with a category and a severity for each error.** This is the richest form of agent evaluation, and it produces the data behind reward models and targeted debugging.

For the feature reference, see [Agentic Annotation](/docs/features/agentic-annotation).

## What you collect

- **Overall outcome**: success, partial success, or failure.
- **Per-step judgments**: for each step, was it correct, unnecessary, or wrong?
- **Error categories**: *why* a step was wrong (wrong tool, bad arguments, hallucination, looping, unsafe action…).
- **Severity**: how bad each error was, often weighted into a score.
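
To make the four pieces concrete, here is a minimal Python sketch of what one annotated trajectory might look like once exported. The class and field names (`StepJudgment`, `TrajectoryAnnotation`, `verdict`, and so on) are hypothetical and chosen for illustration; they are not Potato's export format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StepJudgment:
    """One annotator's judgment of a single step (hypothetical schema)."""
    step_index: int
    verdict: str                           # "correct", "unnecessary", or "wrong"
    error_category: Optional[str] = None   # e.g. "wrong_tool"; set only for wrong steps
    severity: Optional[str] = None         # e.g. "minor", "major", "critical"


@dataclass
class TrajectoryAnnotation:
    """Overall outcome plus per-step judgments for one run (hypothetical schema)."""
    outcome: str                           # "success", "partial_success", or "failure"
    steps: List[StepJudgment] = field(default_factory=list)

    def errors(self) -> List[StepJudgment]:
        # Only steps judged "wrong" carry an error category and severity.
        return [s for s in self.steps if s.verdict == "wrong"]


annotation = TrajectoryAnnotation(
    outcome="partial_success",
    steps=[
        StepJudgment(0, "correct"),
        StepJudgment(1, "wrong", error_category="wrong_tool", severity="major"),
        StepJudgment(2, "unnecessary"),
    ],
)
```

Keeping the verdict separate from the error category lets you analyze "was it wrong?" and "why was it wrong?" independently.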

## Setting up trajectory evaluation

Potato's `trajectory_eval` type renders each step as a card and attaches a per-step error taxonomy with severity weights:

```yaml
annotation_schemes:
  - annotation_type: trajectory_eval
    name: step_evaluation
    description: "Evaluate each step for correctness and mark any errors."
    steps_key: steps
    error_types:
      - {name: reasoning,  subtypes: [logical_error, factual_error, planning_error]}
      - {name: execution,  subtypes: [wrong_tool, wrong_args, api_error]}
      - {name: safety,     subtypes: [harmful_action, data_leak, scope_violation]}
    severities:
      - {name: minor,    weight: -1}
      - {name: major,    weight: -5}
      - {name: critical, weight: -10}
    show_score: true
```

The severity weights roll up into a trajectory score, so you can rank runs and track regressions across model versions.
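
As a rough illustration of how such a roll-up can work, the sketch below assumes the score is a plain sum of the severity weights of all marked errors. That aggregation rule is an assumption, not a description of Potato's scoring; the `trajectory_score` helper and the run names are hypothetical.

```python
# Weights copied from the config above.
SEVERITY_WEIGHTS = {"minor": -1, "major": -5, "critical": -10}


def trajectory_score(error_severities):
    """Sum the weights of every marked error (assumed aggregation rule)."""
    return sum(SEVERITY_WEIGHTS[s] for s in error_severities)


# Severities of the errors marked in three hypothetical runs.
runs = {
    "run_a": ["minor", "major"],
    "run_b": ["critical"],
    "run_c": [],
}

# Rank runs from best (highest score) to worst.
ranked = sorted(runs, key=lambda name: trajectory_score(runs[name]), reverse=True)
```

Under this assumption a clean run scores 0 and every error pulls the score down, so a single critical error outweighs a minor and a major error combined.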

## Designing a good error taxonomy

The taxonomy is the heart of the task. Keep it small, exhaustive, and mutually exclusive. A practical starting set:

- **Reasoning errors**: wrong conclusion, ignored evidence, bad plan.
- **Execution errors**: wrong tool, malformed call, mishandled result.
- **Safety errors**: unsafe action, out-of-scope behavior, data exposure.

Add a free-text "other" so annotators aren't forced to misfile novel failures, then promote recurring "other" notes into named categories.
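
One simple way to spot candidates for promotion is to count how often the same "other" note recurs. The sketch below is a minimal example that assumes the notes are available as a list of strings; the `recurring_other_notes` helper and the `min_count` threshold are hypothetical. It only matches notes that are identical after lowercasing and whitespace cleanup, so paraphrases still need a manual pass.

```python
from collections import Counter


def recurring_other_notes(notes, min_count=3):
    """Return (note, count) pairs for notes seen at least min_count times."""
    # Normalize case and whitespace so trivially different notes match.
    normalized = [" ".join(n.lower().split()) for n in notes if n.strip()]
    counts = Counter(normalized)
    return [(note, c) for note, c in counts.most_common() if c >= min_count]


notes = [
    "Retried same call",
    "retried  same call",
    "retried same call ",
    "ignored user constraint",
]
candidates = recurring_other_notes(notes)
```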

## Quality considerations

- Agreement on *step correctness* is usually high; agreement on *error category* is lower. Measure both separately; see [Inter-Annotator Agreement](/docs/guides/inter-annotator-agreement).
- Long trajectories are fatiguing; cap length or paginate.
- The "first wrong step" is often what matters most for training; see [Process Reward Models](/docs/guides/process-reward-models).
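
The first and last points can be sketched in a few lines of Python. The example below computes Cohen's kappa for two annotators, which you can run once on step verdicts and once on error categories, and finds the index of the first step marked wrong. It assumes each annotator's labels arrive as parallel lists; the function names and label strings are hypothetical, and Cohen's kappa is one common choice of agreement metric rather than the one this guide prescribes.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Fraction of items where the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def first_wrong_step(verdicts):
    """Index of the first step judged "wrong", or None if there is none."""
    return next((i for i, v in enumerate(verdicts) if v == "wrong"), None)


annotator_1 = ["correct", "wrong", "correct", "wrong"]
annotator_2 = ["correct", "wrong", "wrong", "wrong"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Running the same function on the two annotators' error-category labels, restricted to steps both marked wrong, gives the second agreement number.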

## Further reading

- [How to Evaluate AI Agents](/docs/guides/evaluating-ai-agents)
- [Process Reward Models](/docs/guides/process-reward-models)
- [Evaluating Tool Use](/docs/guides/tool-use-evaluation)
