Annotating Agent Trajectories

How to annotate AI agent trajectories: step-by-step judgments, error taxonomies, severity scoring, and trajectory-level success, using Potato's trajectory evaluation.

A trajectory is the full sequence of steps an agent took: its thoughts, tool calls, and observations. Annotating a trajectory means judging the run as a whole and marking where individual steps went wrong, with a category and a severity for each error. This is the richest form of agent evaluation, and it produces the data behind reward models and targeted debugging.

For the feature reference, see Agentic Annotation.

What you collect

  • Overall outcome: success, partial success, or failure.
  • Per-step judgments: for each step, was it correct, unnecessary, or wrong?
  • Error categories: why a step was wrong (wrong tool, bad arguments, hallucination, looping, unsafe action…).
  • Severity: how bad each error was, often weighted into a score.
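To make the four pieces concrete, here is a minimal sketch of how one annotated trajectory could be represented once it has been exported. The class and field names (TrajectoryAnnotation, StepAnnotation, step_index, and so on) are hypothetical and chosen for illustration; they are not Potato's output format.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

Outcome = Literal["success", "partial_success", "failure"]
StepJudgment = Literal["correct", "unnecessary", "wrong"]


@dataclass
class StepAnnotation:
    """One annotator's judgment of a single step (hypothetical schema)."""
    step_index: int
    judgment: StepJudgment
    error_category: Optional[str] = None  # e.g. "wrong_tool"; only for wrong steps
    severity: Optional[str] = None        # e.g. "minor", "major", "critical"


@dataclass
class TrajectoryAnnotation:
    """Overall outcome plus per-step judgments for one agent run."""
    trajectory_id: str
    outcome: Outcome
    steps: list[StepAnnotation] = field(default_factory=list)

    def wrong_steps(self) -> list[StepAnnotation]:
        # Steps the annotator marked as wrong, in trajectory order.
        return [s for s in self.steps if s.judgment == "wrong"]


example = TrajectoryAnnotation(
    trajectory_id="run-001",
    outcome="partial_success",
    steps=[
        StepAnnotation(0, "correct"),
        StepAnnotation(1, "wrong", error_category="wrong_tool", severity="major"),
        StepAnnotation(2, "unnecessary"),
    ],
)
```

Keeping the category and severity optional reflects that they only apply to steps judged wrong.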

Setting up trajectory evaluation

Potato's trajectory_eval type renders each step as a card and attaches a per-step error taxonomy with severity weights:

yaml
annotation_schemes:
  - annotation_type: trajectory_eval
    name: step_evaluation
    description: "Evaluate each step for correctness and mark any errors."
    steps_key: steps
    error_types:
      - {name: reasoning,  subtypes: [logical_error, factual_error, planning_error]}
      - {name: execution,  subtypes: [wrong_tool, wrong_args, api_error]}
      - {name: safety,     subtypes: [harmful_action, data_leak, scope_violation]}
    severities:
      - {name: minor,    weight: -1}
      - {name: major,    weight: -5}
      - {name: critical, weight: -10}
    show_score: true

The severity weights roll up into a trajectory score, so you can rank runs and track regressions across model versions.
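As an illustration of the roll-up, the sketch below sums the severity weights of every error marked in a run and ranks runs by the result. Plain summation is an assumption made here for clarity; how Potato aggregates the weights internally is not specified in this guide, and trajectory_score is a hypothetical helper, not part of Potato.

```python
# Severity weights copied from the configuration above.
SEVERITY_WEIGHTS = {"minor": -1, "major": -5, "critical": -10}


def trajectory_score(error_severities, weights=SEVERITY_WEIGHTS):
    """Sum the weight of each annotated error (assumed aggregation rule).

    A run with no marked errors scores 0; more or worse errors push the
    score further below zero.
    """
    return sum(weights[severity] for severity in error_severities)


# Severities of the errors annotated in three hypothetical runs.
runs = {
    "run_a": ["minor", "major"],
    "run_b": ["critical"],
    "run_c": [],
}

scores = {run_id: trajectory_score(errors) for run_id, errors in runs.items()}

# Best run first: higher (closer to zero) is better.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Comparing the same aggregate across model versions is one way to spot regressions, provided the taxonomy and weights stay fixed between rounds.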

Designing a good error taxonomy

The taxonomy is the heart of the task. Keep it small, exhaustive, and mutually exclusive. A practical starting set:

  • Reasoning errors: wrong conclusion, ignored evidence, bad plan.
  • Execution errors: wrong tool, malformed call, mishandled result.
  • Safety errors: unsafe action, out-of-scope behavior, data exposure.

Add a free-text "other" option so annotators aren't forced to misfile novel failures, then promote recurring "other" notes into named categories.
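One rough way to find promotion candidates is to count how often the same "other" note recurs. The sketch below only normalizes case and whitespace before counting, which is a deliberate simplification; in practice, free-text notes usually need manual grouping. The function name and the min_count threshold are hypothetical.

```python
from collections import Counter


def promotion_candidates(other_notes, min_count=3):
    """Return (note, count) pairs for 'other' notes seen at least min_count times.

    Notes are lowercased and have runs of whitespace collapsed so that
    trivially different spellings are counted together.
    """
    normalized = (
        " ".join(note.lower().split()) for note in other_notes if note.strip()
    )
    counts = Counter(normalized)
    return [(note, n) for note, n in counts.most_common() if n >= min_count]


notes = [
    "Repeated the same search",
    "repeated the same  search",
    "Repeated the same search ",
    "ignored user constraint",
]
candidates = promotion_candidates(notes)
```

Here the three variants of the first note collapse into one candidate, while the single "ignored user constraint" note stays below the threshold.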

Quality considerations

  • Agreement on step correctness is usually high; agreement on error category is lower. Measure both; see Inter-Annotator Agreement.
  • Long trajectories are fatiguing; cap length or paginate.
  • The "first wrong step" is often what matters most for training; see Process Reward Models.
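Two of the points above are easy to check on exported labels. The sketch below computes Cohen's kappa between two annotators' per-step judgments and locates the first wrong step in a trajectory. Cohen's kappa is one common agreement statistic, chosen here as an assumption; this guide does not say which metric to use, and both function names are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same steps."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of steps where the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def first_wrong_step(judgments):
    """Index of the first step judged 'wrong', or None if there is none."""
    return next((i for i, j in enumerate(judgments) if j == "wrong"), None)


annotator_a = ["correct", "correct", "wrong", "unnecessary", "wrong"]
annotator_b = ["correct", "correct", "wrong", "wrong", "wrong"]

kappa = cohen_kappa(annotator_a, annotator_b)
first_error = first_wrong_step(annotator_a)
```

The same kappa function can be run on error-category labels for the steps both annotators marked wrong, which gives the second agreement number.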

Further reading