CI Evaluation
Run Potato evaluations inside your own pytest suite and gate CI on aggregate score thresholds, so a prompt or model change that regresses agent quality fails the build the same way a unit test does. This is the "operationalize eval" layer on top of programmatic evaluators and datasets & experiments. It includes a fluent expect() API and a ready-to-copy GitHub Actions workflow.
Install
The plugin ships with Potato and auto-loads once installed:

    pip install -e .  # registers the `potato_eval` pytest plugin

Without installing, load it explicitly: `pytest -p potato.testing.pytest_plugin`.
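To confirm that the plugin was registered, one option is to list the pytest plugins that installed packages advertise. The sketch below assumes Potato registers its plugin through pytest's standard `pytest11` entry-point group, which this page does not state explicitly; `registered_pytest_plugins` is a hypothetical helper, not part of Potato.

```python
from importlib.metadata import entry_points


def registered_pytest_plugins():
    """Return the names of pytest plugins advertised by installed packages."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry-point collection
        group = eps.select(group="pytest11")
    else:
        # Python 3.8/3.9: plain dict keyed by group name
        group = eps.get("pytest11", [])
    return sorted(ep.name for ep in group)


plugins = registered_pytest_plugins()
# Assumption: the plugin shows up under the name `potato_eval` after install.
print("potato_eval registered:", "potato_eval" in plugins)
```

If the name is missing, loading the plugin explicitly with `-p` as described above remains an option.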
Write eval tests
Mark a test @pytest.mark.potato_eval and request the potato_eval fixture:

    import pytest
    from potato.testing import expect

    # CASES is your own list of {"q": ..., "expected": ...} dicts;
    # my_agent is the callable under test.

    @pytest.mark.potato_eval
    @pytest.mark.parametrize("case", CASES, ids=[c["q"] for c in CASES])
    def test_agent(case, potato_eval):
        out = my_agent(case["q"])
        potato_eval.log_inputs({"question": case["q"]})
        potato_eval.log_outputs(out)
        potato_eval.log_reference_outputs(case["expected"])
        potato_eval.log_feedback("correct", 1.0 if out == case["expected"] else 0.0)
        potato_eval.log_feedback("similarity", 1.0 - expect.edit_distance(out, case["expected"]).value)
        expect(out).to_contain(case["expected"])  # per-case hard assertion

`expect(...)` offers `.to_equal`, `.to_contain`, `.to_be_less_than`, `.to_be_greater_than`, `.to_be_between`, `.to_be_close_to`, plus `expect.edit_distance(a, b)` and `expect.embedding_distance(a, b)`. `log_feedback` scores are aggregated (mean per key) across all eval tests.
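The "mean per key" aggregation can be pictured with a few lines of plain Python. This is only an illustrative sketch of the arithmetic, not Potato's implementation; `aggregate_feedback` and the sample scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate_feedback(records):
    """Group (key, score) pairs and return the mean score per key."""
    by_key = defaultdict(list)
    for key, score in records:
        by_key[key].append(float(score))
    return {key: mean(scores) for key, scores in by_key.items()}


# Hypothetical feedback logged by three parametrized cases
records = [
    ("correct", 1.0), ("similarity", 1.0),
    ("correct", 0.0), ("similarity", 0.5),
    ("correct", 1.0), ("similarity", 0.75),
]
means = aggregate_feedback(records)
print(means)
```

Here two of three cases are correct, so the mean for `correct` is about 0.667, while `similarity` averages 0.75. A single failing case lowers the aggregate without necessarily failing the run.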
Gate the build

    pytest tests/eval/ \
      --potato-threshold correct=0.8 \
      --potato-threshold similarity=0.7 \
      --potato-experiment agent-regression

| Option | Effect |
|---|---|
| `--potato-threshold KEY=MIN` | Fail the run if mean(KEY) < MIN. Repeatable. |
| `--potato-experiment DATASET` | Record the run as an Experiment. |
| `--potato-no-sync` | Skip experiment recording. |
If a threshold is violated, the run exits non-zero (failing the CI job) and prints `THRESHOLD FAILED: <key> = <actual> < <min>`. Recorded experiments are plain files (`$POTATO_EVAL_STORE`, default `./eval_store`) that you can upload as a CI artifact.
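The gating rule itself is simple enough to sketch: parse each `KEY=MIN` pair, compare it with the aggregated mean, and exit non-zero if any comparison fails. The helpers below are hypothetical and only illustrate the logic. The number formatting and the handling of a key with no recorded scores are assumptions, not documented Potato behavior.

```python
def parse_threshold(spec):
    """Split a 'KEY=MIN' string into (key, minimum)."""
    key, sep, minimum = spec.partition("=")
    if not sep or not key:
        raise ValueError(f"expected KEY=MIN, got {spec!r}")
    return key, float(minimum)


def check_thresholds(means, specs):
    """Return one failure message per key whose mean is below its minimum."""
    failures = []
    for spec in specs:
        key, minimum = parse_threshold(spec)
        actual = means.get(key)
        if actual is None:
            # Assumption: a key with no recorded scores counts as a failure.
            failures.append(f"THRESHOLD FAILED: {key} has no recorded scores")
        elif actual < minimum:
            failures.append(f"THRESHOLD FAILED: {key} = {actual:.3f} < {minimum}")
    return failures


# Hypothetical aggregated scores checked against the thresholds shown above
means = {"correct": 0.75, "similarity": 0.82}
failures = check_thresholds(means, ["correct=0.8", "similarity=0.7"])
exit_code = 1 if failures else 0
for message in failures:
    print(message)
```

Because the comparison is strictly `mean < MIN`, a mean exactly equal to the threshold passes.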
GitHub Actions
A ready-to-copy workflow lives at examples/agent-traces/ci-eval/ci_workflow_example.yml — it runs the suite on every PR, gates on thresholds, and uploads the experiment records as an artifact.
Related
- Full reference on Read the Docs — all pytest options and environment variables, version-matched
- Programmatic Evaluators
- Datasets & Experiments