此页面尚未提供您所选语言的版本，当前显示英文版本。

Programmatic Evaluators

Score agent trajectories and text outputs automatically with Potato's Flask-free evaluator library — deterministic trajectory match, tool-use correctness, reference-free LLM-as-judge, and heuristics (exact match, edit distance, JSON, embeddings).

Potato ships a dependency-light evaluator library that scores agent trajectories and text outputs automatically — the deterministic and LLM-as-judge checks that complement human annotation. The same evaluators run inside experiments, the automation engine, and the CI plugin, and they work standalone.

Every evaluator returns one normalized result — a score (conventionally 0.0–1.0, higher is better), a value, a comment, and metadata — so deterministic, heuristic, and LLM-judge evaluators are interchangeable. Trajectories may be passed as OpenAI-style message lists, Potato's canonical conversation turns, or a CanonicalTrace; normalization is automatic.

Trajectory match (deterministic)

Compares an agent's tool-call sequence to a reference.

python

from potato.evaluators import TrajectoryMatchEvaluator
 
ev = TrajectoryMatchEvaluator(
    mode="unordered",               # strict | unordered | subset | superset
    tool_args_match_mode="subset",  # exact | ignore | subset | superset
    tool_args_match_overrides={"search": "ignore"},
)
result = ev.evaluate(outputs=agent_trace, reference_outputs=gold_trace)

`mode`	Passes when…
`strict`	identical tool calls, same order
`unordered`	same multiset of tool calls, any order
`subset`	the agent called only tools that appear in the reference
`superset`	the agent called at least the reference tools (extras allowed)

Argument comparison is independently configurable (exact / ignore / subset / superset), with per-tool overrides.

Tool-use correctness

python

from potato.evaluators import ToolUseEvaluator, ToolCallAccuracyEvaluator
 
# Did the agent call a specific tool (optionally with expected args)?
ToolUseEvaluator(expected_tool="submit", expected_args={"id": 1}).evaluate(outputs=trace)
 
# What fraction of reference tool calls did the agent reproduce? (partial credit)
ToolCallAccuracyEvaluator(args_match_mode="exact").evaluate(outputs=trace, reference_outputs=gold)

LLM-as-judge (reference-free)

Scores trajectory quality without a gold reference, since many valid agent paths exist. Reuses the same ai_support endpoint config as the rest of Potato (OpenAI, Anthropic, Ollama, vLLM, …).

python

from potato.evaluators import LLMTrajectoryJudge
 
judge = LLMTrajectoryJudge(config=task_config, continuous=True)  # 0.0–1.0 score
result = judge.evaluate(outputs=agent_trace, inputs=task_prompt)

Heuristic / code evaluators

ExactMatch, Contains, RegexMatch, EditDistance, JSONValid, JSONSchemaMatch, and EmbeddingDistance (lazy ML import or an injected embedding function). Importing the library never pulls the ML stack.

Graph-trajectory eval (LangGraph)

For LangGraph node/transition evaluation, Potato reuses the MIT-licensed agentevals package through a lazy adapter — install it only if you need it.

Configuring evaluators declaratively

A registry maps names → evaluators so they can be configured in YAML (used by the experiment runner and automation engine):

python

from potato.evaluators import build_evaluator, list_evaluators
 
ev = build_evaluator("trajectory_match", {"mode": "unordered"})
result = ev.evaluate(outputs=trace, reference_outputs=gold)

Full reference on Read the Docs — every evaluator and option, version-matched
Datasets & Experiments — run evaluators over a dataset and track scores over time
CI Evaluation — gate your build on evaluator scores
Trajectory Evaluation — the human counterpart

Programmatic Evaluators

Trajectory match (deterministic)

Tool-use correctness

LLM-as-judge (reference-free)

Heuristic / code evaluators

Graph-trajectory eval (LangGraph)

Configuring evaluators declaratively

Related