
Programmatic Evaluators

Score agent trajectories and text outputs automatically with Potato's Flask-free evaluator library — deterministic trajectory match, tool-use correctness, reference-free LLM-as-judge, and heuristics (exact match, edit distance, JSON, embeddings).

Potato ships a dependency-light evaluator library that scores agent trajectories and text outputs automatically — the deterministic and LLM-as-judge checks that complement human annotation. The same evaluators run inside experiments, the automation engine, and the CI plugin, and they work standalone.

Every evaluator returns one normalized result — a score (conventionally 0.0–1.0, higher is better), a value, a comment, and metadata — so deterministic, heuristic, and LLM-judge evaluators are interchangeable. Trajectories may be passed as OpenAI-style message lists, Potato's canonical conversation turns, or a CanonicalTrace; normalization is automatic.
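
To make the interchangeability concrete, here is a minimal sketch of what a normalized result could look like. The `EvalResult` dataclass and the `mean_score` helper are hypothetical names used for illustration only, not Potato's actual classes; only the four fields (score, value, comment, metadata) come from the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class EvalResult:
    """Hypothetical stand-in for a normalized evaluator result."""
    score: Optional[float] = None   # conventionally 0.0-1.0, higher is better
    value: Any = None               # raw value, e.g. a label or a distance
    comment: str = ""               # human-readable explanation
    metadata: dict = field(default_factory=dict)


def mean_score(results):
    """Average the scores of all results that carry one, whatever produced them."""
    scores = [r.score for r in results if r.score is not None]
    return sum(scores) / len(scores) if scores else None


results = [
    EvalResult(score=1.0, value=True, comment="exact match"),
    EvalResult(score=0.5, comment="LLM judge: partially correct"),
]
print(mean_score(results))  # 0.75
```

Because every result has the same shape, downstream code such as this averaging helper does not need to know whether a score came from a deterministic check, a heuristic, or an LLM judge.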

Trajectory match (deterministic)

Compares an agent's tool-call sequence to a reference.

python
from potato.evaluators import TrajectoryMatchEvaluator
 
ev = TrajectoryMatchEvaluator(
    mode="unordered",               # strict | unordered | subset | superset
    tool_args_match_mode="subset",  # exact | ignore | subset | superset
    tool_args_match_overrides={"search": "ignore"},
)
result = ev.evaluate(outputs=agent_trace, reference_outputs=gold_trace)

| mode | Passes when… |
| --- | --- |
| strict | identical tool calls, same order |
| unordered | same multiset of tool calls, any order |
| subset | the agent called only tools that appear in the reference |
| superset | the agent called at least the reference tools (extras allowed) |

Argument comparison is independently configurable (exact / ignore / subset / superset), with per-tool overrides.
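
The four modes described above can be illustrated with a small standalone sketch. This is not Potato's implementation: `tool_names_match` is a hypothetical helper that compares tool names only (arguments are ignored), and it assumes that subset and superset are checked on sets of names, which the description above does not spell out.

```python
from collections import Counter


def tool_names_match(agent, reference, mode="strict"):
    """Compare two sequences of tool names under one of the four modes."""
    if mode == "strict":
        # Same calls in the same order.
        return list(agent) == list(reference)
    if mode == "unordered":
        # Same multiset of calls, order ignored.
        return Counter(agent) == Counter(reference)
    if mode == "subset":
        # Assumption: every tool the agent used appears in the reference.
        return set(agent) <= set(reference)
    if mode == "superset":
        # Assumption: every reference tool was used; extras are allowed.
        return set(agent) >= set(reference)
    raise ValueError(f"unknown mode: {mode}")


gold = ["search", "open", "submit"]
print(tool_names_match(["open", "search", "submit"], gold, "strict"))     # False
print(tool_names_match(["open", "search", "submit"], gold, "unordered"))  # True
print(tool_names_match(["search", "submit"], gold, "subset"))             # True
print(tool_names_match(["search", "open", "lookup", "submit"], gold, "superset"))  # True
```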

Tool-use correctness

python
from potato.evaluators import ToolUseEvaluator, ToolCallAccuracyEvaluator

# `trace` is the agent trajectory and `gold` the reference trajectory,
# in any supported format (for example, OpenAI-style message lists).

# Did the agent call a specific tool (optionally with expected args)?
tool_use = ToolUseEvaluator(expected_tool="submit", expected_args={"id": 1}).evaluate(outputs=trace)

# What fraction of reference tool calls did the agent reproduce? (partial credit)
accuracy = ToolCallAccuracyEvaluator(args_match_mode="exact").evaluate(outputs=trace, reference_outputs=gold)
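
The partial-credit idea can be sketched without Potato at all. The `tool_call_accuracy` function below is a hypothetical illustration, not the library's code: it assumes each call is a `(name, args)` tuple, that arguments must match exactly, and that an empty reference earns full credit.

```python
def tool_call_accuracy(agent_calls, reference_calls):
    """Fraction of reference calls matched by a distinct agent call (exact name and args)."""
    if not reference_calls:
        return 1.0  # assumption: nothing to reproduce counts as full credit
    remaining = list(agent_calls)
    matched = 0
    for ref in reference_calls:
        if ref in remaining:
            remaining.remove(ref)  # each agent call can satisfy only one reference call
            matched += 1
    return matched / len(reference_calls)


gold = [("search", {"q": "potato"}), ("submit", {"id": 1})]
agent = [("search", {"q": "potato"}), ("submit", {"id": 2})]
print(tool_call_accuracy(agent, gold))  # 0.5
```

Here the agent reproduced the `search` call but submitted the wrong `id`, so it earns half credit rather than failing outright.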

LLM-as-judge (reference-free)

Scores trajectory quality without a gold reference, since many valid agent paths exist. Reuses the same ai_support endpoint config as the rest of Potato (OpenAI, Anthropic, Ollama, vLLM, …).

python
from potato.evaluators import LLMTrajectoryJudge
 
judge = LLMTrajectoryJudge(config=task_config, continuous=True)  # 0.0–1.0 score
result = judge.evaluate(outputs=agent_trace, inputs=task_prompt)

Heuristic / code evaluators

The heuristic evaluators are ExactMatch, Contains, RegexMatch, EditDistance, JSONValid, JSONSchemaMatch, and EmbeddingDistance. EmbeddingDistance either imports its ML dependencies lazily or uses an embedding function you inject, so importing the library never pulls in the ML stack.
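
As an illustration of how an edit-distance heuristic can be mapped onto the 0.0 to 1.0 score convention, here is a minimal pure-Python sketch. The `levenshtein` and `edit_similarity` functions are hypothetical helpers, and dividing by the longer string's length is an assumed normalization; Potato's EditDistance evaluator may normalize differently.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(output, reference):
    """Map the distance to a score in [0, 1], where 1.0 means identical strings."""
    longest = max(len(output), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(output, reference) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(edit_similarity("kitten", "sitting"))
```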

Graph-trajectory eval (LangGraph)

For LangGraph node/transition evaluation, Potato reuses the MIT-licensed agentevals package through a lazy adapter — install it only if you need it.

Configuring evaluators declaratively

A registry maps names → evaluators so they can be configured in YAML (used by the experiment runner and automation engine):

python
from potato.evaluators import build_evaluator, list_evaluators
 
ev = build_evaluator("trajectory_match", {"mode": "unordered"})
result = ev.evaluate(outputs=trace, reference_outputs=gold)