Questa pagina non è ancora disponibile nella tua lingua. Viene mostrata la versione in inglese.

How to Evaluate AI Agents

An overview of evaluating AI agents and LLMs with human annotation, trajectory, step, span, and comparison-level evaluation, and which Potato tool fits each.

Evaluating an AI agent means judging not just its final answer but the path it took, the reasoning, tool calls, and actions along the way. Human annotation is still the gold standard for this, because many agent failures (a plausible-but-wrong step, an unsafe action) only a person can reliably catch. Potato is an open-source tool for human annotation of LLM agent trajectories, with a purpose-built display for each level of evaluation.

An AI agent here means an LLM-driven system that takes multi-step actions, calling tools, browsing, or writing code, to complete a task. See the agent evaluation overview and the Agentic Annotation reference.

What are the levels of AI agent evaluation?

Pick the level that matches the question you're asking:

Trajectory level: judge the whole run. Did it succeed? Was it efficient and safe? See Annotating Agent Trajectories.
Step level: judge each action. Was this tool call correct? Was this step necessary? This is the data behind process reward models.
Span level: highlight specific problems inside outputs, such as a hallucinated claim or an unsafe instruction. See Detecting Hallucinations.
Comparison level: judge two agents or two runs head-to-head. See Pairwise Model Comparison.
Team level: for multi-agent systems, attribute a failure to the responsible agent, step, and handoff. See How to Evaluate Multi-Agent Systems.

What agent trace formats does Potato support?

Potato reads agent traces from 13 formats, including OpenAI and Anthropic tool calls, ReAct, LangChain, LangFuse, WebArena, SWE-bench, MCP, and OpenTelemetry, and renders them in displays tuned for the kind of agent:

Agent trace display for reasoning/tool traces.
Web agent display with screenshots and action overlays, see Web-Agent Evaluation.
Coding trace display with diffs and terminal output, see Coding-Agent Evaluation.
Live agent display to watch and steer an agent in real time, see Live Agent Evaluation.
Multimodal agent displays for computer-use, voice, and video agents, see Evaluating Computer-Use and Multimodal Agents.

Which agent evaluation method should I choose?

Your question	Approach
"Did the agent complete the task?"	Trajectory success label
"Where exactly did it go wrong?"	Step-level error taxonomy
"Which version is better?"	Pairwise comparison
"How good is it on several axes?"	Rubric evaluation
"Is the retrieved-context answer faithful?"	RAG evaluation
"Which agent in the team caused the failure?"	Multi-agent attribution
"Did the computer-use agent click the right thing?"	GUI trajectory review

How to Evaluate AI Agents

What are the levels of AI agent evaluation?

What agent trace formats does Potato support?

Which agent evaluation method should I choose?

Further reading