How to Evaluate AI Agents
An overview of evaluating AI agents and LLMs with human annotation, trajectory, step, span, and comparison-level evaluation, and which Potato tool fits each.
Evaluating an AI agent means judging not just its final answer but the path it took, the reasoning, tool calls, and actions along the way. Human annotation is still the gold standard for this, because many agent failures (a plausible-but-wrong step, an unsafe action) only a person can reliably catch. Potato provides purpose-built tools for every level of agent evaluation.
An AI agent here means an LLM-driven system that takes multi-step actions, calling tools, browsing, or writing code, to complete a task. See the agent evaluation overview and the Agentic Annotation reference.
Four levels of evaluation
Pick the level that matches the question you're asking:
- Trajectory level: judge the whole run. Did it succeed? Was it efficient and safe? See Annotating Agent Trajectories.
- Step level: judge each action. Was this tool call correct? Was this step necessary? This is the data behind process reward models.
- Span level: highlight specific problems inside outputs, such as a hallucinated claim or an unsafe instruction. See Detecting Hallucinations.
- Comparison level: judge two agents or two runs head-to-head. See Pairwise Model Comparison.
What Potato ingests
Potato reads agent traces from 13 formats, including OpenAI and Anthropic tool calls, ReAct, LangChain, LangFuse, WebArena, SWE-bench, MCP, and OpenTelemetry, and renders them in displays tuned for the kind of agent:
- Agent trace display for reasoning/tool traces.
- Web agent display with screenshots and action overlays, see Web-Agent Evaluation.
- Coding trace display with diffs and terminal output, see Coding-Agent Evaluation.
- Live agent display to watch and steer an agent in real time, see Live Agent Evaluation.
Choosing an approach
| Your question | Approach |
|---|---|
| "Did the agent complete the task?" | Trajectory success label |
| "Where exactly did it go wrong?" | Step-level error taxonomy |
| "Which version is better?" | Pairwise comparison |
| "How good is it on several axes?" | Rubric evaluation |
| "Is the retrieved-context answer faithful?" | RAG evaluation |