How to Evaluate AI Agents

An overview of evaluating AI agents and LLMs with human annotation at the trajectory, step, span, and comparison levels, and which Potato tool fits each.

Evaluating an AI agent means judging not just its final answer but the path it took: the reasoning, tool calls, and actions along the way. Human annotation is still the gold standard for this, because many agent failures (a plausible-but-wrong step, an unsafe action) can only be caught reliably by a person. Potato provides purpose-built tools for every level of agent evaluation.

An AI agent here means an LLM-driven system that takes multi-step actions (calling tools, browsing, or writing code) to complete a task. See the agent evaluation overview and the Agentic Annotation reference.
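
To make "the path it took" concrete, here is a minimal Python sketch of how a trajectory and its human labels might be represented. The `Step` and `Trajectory` classes, their field names, and the `wrong_reasoning` label are illustrative assumptions, not Potato's schema or any of the trace formats it reads.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One agent step: reasoning, an optional tool call, and what came back."""
    thought: str
    tool: Optional[str] = None          # name of the tool called, if any
    observation: Optional[str] = None   # tool output or environment response
    error_label: Optional[str] = None   # human step-level label; None if the step looks fine


@dataclass
class Trajectory:
    """A full agent run plus a human trajectory-level judgment (hypothetical layout)."""
    task: str
    steps: List[Step] = field(default_factory=list)
    final_answer: str = ""
    success: Optional[bool] = None      # human trajectory-level label

    def first_error(self) -> Optional[int]:
        """Return the index of the first step an annotator flagged, or None."""
        for i, step in enumerate(self.steps):
            if step.error_label is not None:
                return i
        return None


traj = Trajectory(
    task="Find the release year of Python 3.0",
    steps=[
        Step(thought="Search the web", tool="search",
             observation="Python 3.0 was released in 2008"),
        # Plausible-looking step that contradicts the observation above.
        Step(thought="The answer is 2009", error_label="wrong_reasoning"),
    ],
    final_answer="2009",
    success=False,
)
print(traj.first_error())
```

The trajectory-level `success` flag answers whether the task was completed, while the per-step `error_label` records where the run went wrong; keeping both on one object is only one possible design.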

Four levels of evaluation

Pick the level that matches the question you're asking:

What Potato ingests

Potato reads agent traces from 13 formats, including OpenAI and Anthropic tool calls, ReAct, LangChain, LangFuse, WebArena, SWE-bench, MCP, and OpenTelemetry, and renders them in displays tuned for the kind of agent:

Choosing an approach

| Your question | Approach |
| --- | --- |
| "Did the agent complete the task?" | Trajectory success label |
| "Where exactly did it go wrong?" | Step-level error taxonomy |
| "Which version is better?" | Pairwise comparison |
| "How good is it on several axes?" | Rubric evaluation |
| "Is the retrieved-context answer faithful?" | RAG evaluation |
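
Once annotators have produced labels, several of these approaches reduce to a simple aggregate. The following is a rough sketch, assuming the labels are already available as plain Python lists; the function names, variable names, and label values are hypothetical and do not reflect a Potato export format.

```python
from collections import Counter


def success_rate(labels):
    """Fraction of trajectories that annotators marked successful."""
    return sum(labels) / len(labels) if labels else 0.0


def error_breakdown(step_labels):
    """Count step-level error categories, ignoring unflagged steps (None)."""
    return Counter(label for label in step_labels if label is not None)


def win_rate(votes, system="A"):
    """Share of pairwise votes won by `system`; a tie counts as half a win."""
    if not votes:
        return 0.0
    wins = sum(1.0 if v == system else 0.5 if v == "tie" else 0.0 for v in votes)
    return wins / len(votes)


# Hypothetical annotation results, one entry per trajectory, step, or comparison.
trajectory_labels = [True, False, True, True]
step_labels = [None, "wrong_tool", None, "hallucinated_argument", "wrong_tool"]
pairwise_votes = ["A", "B", "A", "tie"]

print(success_rate(trajectory_labels))               # 0.75
print(error_breakdown(step_labels).most_common(1))   # [('wrong_tool', 2)]
print(win_rate(pairwise_votes))                      # 0.625
```

Counting ties as half a win is one common convention for pairwise comparison; dropping ties from the denominator is an equally reasonable alternative.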

Further reading