CI Evaluation

Run Potato evaluations inside your own pytest suite and gate CI on score thresholds, so a prompt or model change that regresses agent quality fails the build like a unit test. Includes a fluent expect() API and a ready-to-copy GitHub Actions workflow.

Potato evaluations run inside your own pytest suite, and CI is gated on aggregate score thresholds, so a prompt or model change that regresses quality fails the build the same way a failing unit test does. This is the "operationalize eval" layer on top of programmatic evaluators and datasets & experiments.

Install

The plugin ships with Potato and auto-loads once installed:

bash
pip install -e .          # registers the `potato_eval` pytest plugin

Without installing, load it explicitly: pytest -p potato.testing.pytest_plugin.

Write eval tests

Mark a test @pytest.mark.potato_eval and request the potato_eval fixture:

python
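import pytest
from potato.testing import expect

# CASES and my_agent come from your own project: CASES is a list of dicts
# with "q" (the question) and "expected" (the reference answer), and
# my_agent is the function under test.
 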
@pytest.mark.potato_eval
@pytest.mark.parametrize("case", CASES, ids=[c["q"] for c in CASES])
def test_agent(case, potato_eval):
    out = my_agent(case["q"])
    potato_eval.log_inputs({"question": case["q"]})
    potato_eval.log_outputs(out)
    potato_eval.log_reference_outputs(case["expected"])
 
    potato_eval.log_feedback("correct", 1.0 if out == case["expected"] else 0.0)
    potato_eval.log_feedback("similarity", 1.0 - expect.edit_distance(out, case["expected"]).value)
 
    expect(out).to_contain(case["expected"])   # per-case hard assertion

expect(...) offers .to_equal, .to_contain, .to_be_less_than, .to_be_greater_than, .to_be_between, .to_be_close_to, plus expect.edit_distance(a, b) and expect.embedding_distance(a, b). log_feedback scores are aggregated (mean per key) across all eval tests.
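To make "mean per key" concrete, the aggregation can be pictured as grouping every logged score by its key and averaging each group. The sketch below is illustrative only: aggregate_feedback and the records list are hypothetical names, not part of Potato's API, and the plugin's internals may differ.

```python
from collections import defaultdict
from statistics import mean


def aggregate_feedback(records):
    """Return the mean score per feedback key.

    records: iterable of (key, score) pairs, one per log_feedback call.
    """
    by_key = defaultdict(list)
    for key, score in records:
        by_key[key].append(float(score))
    return {key: mean(scores) for key, scores in by_key.items()}


# Three hypothetical test cases, each logging "correct" and "similarity".
records = [
    ("correct", 1.0), ("similarity", 0.75),
    ("correct", 0.0), ("similarity", 0.5),
    ("correct", 1.0), ("similarity", 1.0),
]
means = aggregate_feedback(records)
```

With these numbers the mean for correct is 2/3 and the mean for similarity is 0.75, so one wrong answer out of three is enough to fall below a correct threshold of 0.8.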

Gate the build

bash
pytest tests/eval/ \
  --potato-threshold correct=0.8 \
  --potato-threshold similarity=0.7 \
  --potato-experiment agent-regression

| Option | Effect |
| --- | --- |
| --potato-threshold KEY=MIN | Fail the run if mean(KEY) < MIN. Repeatable. |
| --potato-experiment DATASET | Record the run as an Experiment. |
| --potato-no-sync | Skip experiment recording. |

If a threshold is violated, the run exits non-zero (failing the CI job) and prints THRESHOLD FAILED: <key> = <actual> < <min>. Recorded experiments are plain files (location set by $POTATO_EVAL_STORE, default ./eval_store) that you can upload as a CI artifact.
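The gating rule itself is small: parse each KEY=MIN pair, compare it with the aggregated mean, and fail if the mean is lower. The following is a rough sketch of that logic, not the plugin's actual code. parse_threshold and check_thresholds are hypothetical helpers, and treating a threshold on a key that was never logged as a failure is an assumption made here.

```python
def parse_threshold(spec):
    """Split a 'KEY=MIN' string into (key, float(min))."""
    key, sep, raw_min = spec.partition("=")
    if not sep or not key:
        raise ValueError(f"expected KEY=MIN, got {spec!r}")
    return key, float(raw_min)


def check_thresholds(means, specs):
    """Return one failure message per key whose mean is below its minimum."""
    failures = []
    for spec in specs:
        key, minimum = parse_threshold(spec)
        actual = means.get(key)
        if actual is None:
            # Assumption: a threshold on a key with no scores counts as a failure.
            failures.append(f"THRESHOLD FAILED: {key} has no recorded scores")
        elif actual < minimum:
            failures.append(f"THRESHOLD FAILED: {key} = {actual} < {minimum}")
    return failures


failures = check_thresholds(
    {"correct": 0.75, "similarity": 0.9},
    ["correct=0.8", "similarity=0.7"],
)
exit_code = 1 if failures else 0  # non-zero fails the CI job
```

Here only correct falls short, so failures holds a single message and exit_code is 1. A mean exactly equal to MIN passes, matching the strict mean(KEY) < MIN rule in the table.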

GitHub Actions

A ready-to-copy workflow lives at examples/agent-traces/ci-eval/ci_workflow_example.yml. It runs the suite on every PR, gates on thresholds, and uploads the experiment records as an artifact.