# How to Evaluate AI Agents

Source: https://www.potatoannotator.com/docs/guides/evaluating-ai-agents

**Evaluating an AI agent means judging not just its final answer but the *path* it took, the reasoning, tool calls, and actions along the way. Human annotation is still the gold standard for this, because many agent failures (a plausible-but-wrong step, an unsafe action) only a person can reliably catch.** Potato provides purpose-built tools for every level of agent evaluation.

An [AI agent](https://en.wikipedia.org/wiki/Intelligent_agent) here means an LLM-driven system that takes multi-step actions, calling tools, browsing, or writing code, to complete a task. See the [agent evaluation overview](/agent-evaluation) and the [Agentic Annotation](/docs/features/agentic-annotation) reference.

## Four levels of evaluation

Pick the level that matches the question you're asking:

- **Trajectory level**: judge the whole run. Did it succeed? Was it efficient and safe? See [Annotating Agent Trajectories](/docs/guides/agent-trajectory-annotation).
- **Step level**: judge each action. Was this tool call correct? Was this step necessary? This is the data behind [process reward models](/docs/guides/process-reward-models).
- **Span level**: highlight specific problems inside outputs, such as a hallucinated claim or an unsafe instruction. See [Detecting Hallucinations](/docs/guides/detecting-hallucinations).
- **Comparison level**: judge two agents or two runs head-to-head. See [Pairwise Model Comparison](/docs/guides/pairwise-model-comparison).

## What Potato ingests

Potato reads agent traces from 13 formats, including OpenAI and Anthropic tool calls, [ReAct](https://arxiv.org/abs/2210.03629), LangChain, LangFuse, WebArena, SWE-bench, MCP, and OpenTelemetry, and renders them in displays tuned for the kind of agent:

- **Agent trace display** for reasoning/tool traces.
- **Web agent display** with screenshots and action overlays, see [Web-Agent Evaluation](/docs/guides/web-agent-evaluation).
- **Coding trace display** with diffs and terminal output, see [Coding-Agent Evaluation](/docs/guides/coding-agent-evaluation).
- **Live agent display** to watch and steer an agent in real time, see [Live Agent Evaluation](/docs/guides/live-agent-evaluation).

## Choosing an approach

| Your question | Approach |
|---|---|
| "Did the agent complete the task?" | Trajectory success label |
| "Where exactly did it go wrong?" | Step-level error taxonomy |
| "Which version is better?" | Pairwise comparison |
| "How good is it on several axes?" | Rubric evaluation |
| "Is the retrieved-context answer faithful?" | RAG evaluation |

## Further reading

- [Agent evaluation overview](/agent-evaluation)
- [Annotating Agent Trajectories](/docs/guides/agent-trajectory-annotation)
- [Rubric-Based LLM Evaluation](/docs/guides/rubric-based-llm-evaluation)
