# CI Evaluation

Source: https://www.potatoannotator.com/docs/agent-evaluation/ci-evaluation

**Run Potato evaluations inside your own pytest suite and gate CI on aggregate score thresholds**, so that a prompt or model change that regresses quality fails the build, the same way a unit test does. This is the "operationalize eval" layer on top of [programmatic evaluators](/docs/agent-evaluation/programmatic-evaluators) and [datasets & experiments](/docs/agent-evaluation/datasets-and-experiments).

## Install

The plugin ships with Potato and auto-loads once installed:

```bash
pip install -e .   # registers the `potato_eval` pytest plugin
```

Without installing, load it explicitly: `pytest -p potato.testing.pytest_plugin`.

## Write eval tests

Mark a test `@pytest.mark.potato_eval` and request the `potato_eval` fixture:

```python
import pytest
from potato.testing import expect

# CASES (a list of {"q": ..., "expected": ...} dicts) and my_agent
# (the callable under test) come from your own project.

@pytest.mark.potato_eval
@pytest.mark.parametrize("case", CASES, ids=[c["q"] for c in CASES])
def test_agent(case, potato_eval):
    out = my_agent(case["q"])
    potato_eval.log_inputs({"question": case["q"]})
    potato_eval.log_outputs(out)
    potato_eval.log_reference_outputs(case["expected"])
    potato_eval.log_feedback("correct", 1.0 if out == case["expected"] else 0.0)
    potato_eval.log_feedback("similarity", 1.0 - expect.edit_distance(out, case["expected"]).value)
    expect(out).to_contain(case["expected"])  # per-case hard assertion
```

`expect(...)` offers `.to_equal`, `.to_contain`, `.to_be_less_than`, `.to_be_greater_than`, `.to_be_between`, `.to_be_close_to`, plus `expect.edit_distance(a, b)` and `expect.embedding_distance(a, b)`. `log_feedback` scores are aggregated (mean per key) across all eval tests.

## Gate the build

```bash
pytest tests/eval/ \
  --potato-threshold correct=0.8 \
  --potato-threshold similarity=0.7 \
  --potato-experiment agent-regression
```

| Option | Effect |
|--------|--------|
| `--potato-threshold KEY=MIN` | Fail the run if `mean(KEY) < MIN`. Repeatable. |
| `--potato-experiment DATASET` | Record the run as an [Experiment](/docs/agent-evaluation/datasets-and-experiments). |
| `--potato-no-sync` | Skip experiment recording. |

If a threshold is violated, the run exits non-zero (failing the CI job) and prints `THRESHOLD FAILED: = < `. Recorded experiments are plain files (`$POTATO_EVAL_STORE`, default `./eval_store`) that you can upload as a CI artifact.

## GitHub Actions

A ready-to-copy workflow lives at `examples/agent-traces/ci-eval/ci_workflow_example.yml`. It runs the suite on every PR, gates on thresholds, and uploads the experiment records as an artifact.

## Related

- [Full reference on Read the Docs](https://potatoannotator.readthedocs.io/en/latest/agent-evaluation/ci_evaluation/): all pytest options and environment variables, version-matched
- [Programmatic Evaluators](/docs/agent-evaluation/programmatic-evaluators)
- [Datasets & Experiments](/docs/agent-evaluation/datasets-and-experiments)
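
## Sketch: threshold aggregation

To make the gating rule from "Gate the build" concrete, the following is a minimal, standalone sketch of the arithmetic: scores are grouped by feedback key, averaged, and compared against each `KEY=MIN` threshold. It is not Potato's implementation. The names `parse_threshold` and `failed_thresholds` are hypothetical, and treating a key with no logged scores as a failure is an assumption of this sketch rather than documented plugin behavior.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import mean


def parse_threshold(spec: str) -> tuple[str, float]:
    """Split a 'KEY=MIN' string into (key, minimum)."""
    key, _, minimum = spec.partition("=")
    if not key or not minimum:
        raise ValueError(f"expected KEY=MIN, got {spec!r}")
    return key, float(minimum)


def failed_thresholds(
    feedback: list[tuple[str, float]],
    specs: list[str],
) -> list[tuple[str, float | None, float]]:
    """Return (key, observed_mean, minimum) for each violated threshold."""
    # Group every logged score under its feedback key.
    scores: dict[str, list[float]] = defaultdict(list)
    for key, value in feedback:
        scores[key].append(value)

    failures = []
    for spec in specs:
        key, minimum = parse_threshold(spec)
        if key not in scores:
            # Assumption of this sketch: a key nobody logged fails the gate.
            failures.append((key, None, minimum))
            continue
        observed = mean(scores[key])
        if observed < minimum:
            failures.append((key, observed, minimum))
    return failures


# Hypothetical scores, as if four tests logged "correct" and two logged "similarity".
logged = [
    ("correct", 1.0), ("correct", 0.0), ("correct", 1.0), ("correct", 1.0),
    ("similarity", 0.5), ("similarity", 1.0),
]
for key, observed, minimum in failed_thresholds(logged, ["correct=0.8", "similarity=0.7"]):
    print(f"{key}: mean {observed} is below minimum {minimum}")
```

With these made-up scores, both keys average 0.75, so only `correct` (minimum 0.8) is reported. A CI wrapper would turn a non-empty failure list into a non-zero exit code.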