# Programmatic Evaluators

Source: https://www.potatoannotator.com/docs/agent-evaluation/programmatic-evaluators

**Potato ships a dependency-light evaluator library that scores agent trajectories and text outputs automatically** — the deterministic and LLM-as-judge checks that complement human annotation. The same evaluators run inside experiments, the automation engine, and the [CI plugin](/docs/agent-evaluation/ci-evaluation), and they work standalone.

Every evaluator returns one normalized result — a `score` (conventionally 0.0–1.0, higher is better), a `value`, a `comment`, and `metadata` — so deterministic, heuristic, and LLM-judge evaluators are interchangeable. Trajectories may be passed as OpenAI-style message lists, Potato's canonical `conversation` turns, or a `CanonicalTrace`; normalization is automatic.

## Trajectory match (deterministic)

Compares an agent's tool-call sequence to a reference.

```python
from potato.evaluators import TrajectoryMatchEvaluator

ev = TrajectoryMatchEvaluator(
    mode="unordered",               # strict | unordered | subset | superset
    tool_args_match_mode="subset",  # exact | ignore | subset | superset
    tool_args_match_overrides={"search": "ignore"},
)
result = ev.evaluate(outputs=agent_trace, reference_outputs=gold_trace)
```

| `mode` | Passes when… |
|--------|--------------|
| `strict` | identical tool calls, same order |
| `unordered` | same multiset of tool calls, any order |
| `subset` | the agent called only tools that appear in the reference |
| `superset` | the agent called at least the reference tools (extras allowed) |

Argument comparison is independently configurable (`exact` / `ignore` / `subset` / `superset`), with per-tool overrides.

## Tool-use correctness

```python
from potato.evaluators import ToolUseEvaluator, ToolCallAccuracyEvaluator

# Did the agent call a specific tool (optionally with expected args)?
ToolUseEvaluator(expected_tool="submit", expected_args={"id": 1}).evaluate(outputs=trace)

# What fraction of reference tool calls did the agent reproduce? (partial credit)
ToolCallAccuracyEvaluator(args_match_mode="exact").evaluate(outputs=trace, reference_outputs=gold)
```

## LLM-as-judge (reference-free)

Scores trajectory *quality* without a gold reference, since many valid agent paths exist. Reuses the same `ai_support` endpoint config as the rest of Potato (OpenAI, Anthropic, Ollama, vLLM, …).

```python
from potato.evaluators import LLMTrajectoryJudge

judge = LLMTrajectoryJudge(config=task_config, continuous=True)  # 0.0–1.0 score
result = judge.evaluate(outputs=agent_trace, inputs=task_prompt)
```

## Heuristic / code evaluators

`ExactMatch`, `Contains`, `RegexMatch`, `EditDistance`, `JSONValid`, `JSONSchemaMatch`, and `EmbeddingDistance` (lazy ML import or an injected embedding function). Importing the library never pulls the ML stack.

## Graph-trajectory eval (LangGraph)

For LangGraph node/transition evaluation, Potato reuses the MIT-licensed [`agentevals`](https://github.com/langchain-ai/agentevals) package through a lazy adapter — install it only if you need it.

## Configuring evaluators declaratively

A registry maps names → evaluators so they can be configured in YAML (used by the experiment runner and automation engine):

```python
from potato.evaluators import build_evaluator, list_evaluators

ev = build_evaluator("trajectory_match", {"mode": "unordered"})
result = ev.evaluate(outputs=trace, reference_outputs=gold)
```

## Related

- [Full reference on Read the Docs](https://potatoannotator.readthedocs.io/en/latest/agent-evaluation/evaluators/) — every evaluator and option, version-matched
- [Datasets & Experiments](/docs/agent-evaluation/datasets-and-experiments) — run evaluators over a dataset and track scores over time
- [CI Evaluation](/docs/agent-evaluation/ci-evaluation) — gate your build on evaluator scores
- [Trajectory Evaluation](/docs/agent-evaluation/trajectory-correction) — the human counterpart
