Programmatic Evaluators
Score agent trajectories and text outputs automatically with Potato's Flask-free evaluator library — deterministic trajectory match, tool-use correctness, reference-free LLM-as-judge, and heuristics (exact match, edit distance, JSON, embeddings).
Potato ships a dependency-light evaluator library that scores agent trajectories and text outputs automatically — the deterministic and LLM-as-judge checks that complement human annotation. The same evaluators run inside experiments, the automation engine, and the CI plugin, and they work standalone.
Every evaluator returns one normalized result — a score (conventionally 0.0–1.0, higher is better), a value, a comment, and metadata — so deterministic, heuristic, and LLM-judge evaluators are interchangeable. Trajectories may be passed as OpenAI-style message lists, Potato's canonical conversation turns, or a CanonicalTrace; normalization is automatic.
Trajectory match (deterministic)
Compares an agent's tool-call sequence to a reference.
from potato.evaluators import TrajectoryMatchEvaluator
ev = TrajectoryMatchEvaluator(
mode="unordered", # strict | unordered | subset | superset
tool_args_match_mode="subset", # exact | ignore | subset | superset
tool_args_match_overrides={"search": "ignore"},
)
result = ev.evaluate(outputs=agent_trace, reference_outputs=gold_trace)mode | Passes when… |
|---|---|
strict | identical tool calls, same order |
unordered | same multiset of tool calls, any order |
subset | the agent called only tools that appear in the reference |
superset | the agent called at least the reference tools (extras allowed) |
Argument comparison is independently configurable (exact / ignore / subset / superset), with per-tool overrides.
Tool-use correctness
from potato.evaluators import ToolUseEvaluator, ToolCallAccuracyEvaluator
# Did the agent call a specific tool (optionally with expected args)?
ToolUseEvaluator(expected_tool="submit", expected_args={"id": 1}).evaluate(outputs=trace)
# What fraction of reference tool calls did the agent reproduce? (partial credit)
ToolCallAccuracyEvaluator(args_match_mode="exact").evaluate(outputs=trace, reference_outputs=gold)LLM-as-judge (reference-free)
Scores trajectory quality without a gold reference, since many valid agent paths exist. Reuses the same ai_support endpoint config as the rest of Potato (OpenAI, Anthropic, Ollama, vLLM, …).
from potato.evaluators import LLMTrajectoryJudge
judge = LLMTrajectoryJudge(config=task_config, continuous=True) # 0.0–1.0 score
result = judge.evaluate(outputs=agent_trace, inputs=task_prompt)Heuristic / code evaluators
ExactMatch, Contains, RegexMatch, EditDistance, JSONValid, JSONSchemaMatch, and EmbeddingDistance (lazy ML import or an injected embedding function). Importing the library never pulls the ML stack.
Graph-trajectory eval (LangGraph)
For LangGraph node/transition evaluation, Potato reuses the MIT-licensed agentevals package through a lazy adapter — install it only if you need it.
Configuring evaluators declaratively
A registry maps names → evaluators so they can be configured in YAML (used by the experiment runner and automation engine):
from potato.evaluators import build_evaluator, list_evaluators
ev = build_evaluator("trajectory_match", {"mode": "unordered"})
result = ev.evaluate(outputs=trace, reference_outputs=gold)Related
- Full reference on Read the Docs — every evaluator and option, version-matched
- Datasets & Experiments — run evaluators over a dataset and track scores over time
- CI Evaluation — gate your build on evaluator scores
- Trajectory Evaluation — the human counterpart