# CI Evaluation

Source: https://www.potatoannotator.com/docs/agent-evaluation/ci-evaluation

**Run Potato evaluations inside your own pytest suite and gate CI on aggregate score thresholds** — so a prompt/model change that regresses quality fails the build, the same way a unit test does. This is the "operationalize eval" layer on top of [programmatic evaluators](/docs/agent-evaluation/programmatic-evaluators) and [datasets & experiments](/docs/agent-evaluation/datasets-and-experiments).

## Install

The plugin ships with Potato, and pytest loads it automatically once the package is installed. From the root of a Potato checkout:

```bash
pip install -e .          # registers the `potato_eval` pytest plugin
```

Without installing, load it explicitly: `pytest -p potato.testing.pytest_plugin`.

## Write eval tests

Mark a test with `@pytest.mark.potato_eval` and request the `potato_eval` fixture:

```python
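import pytest
from potato.testing import expect

# Hypothetical placeholders so the example is self-contained; replace both
# with your real evaluation cases and the agent under test.
CASES = [{"q": "What is the capital of France?", "expected": "Paris"}]

def my_agent(question: str) -> str:
    return "Paris"

@pytest.mark.potato_eval
@pytest.mark.parametrize("case", CASES, ids=[c["q"] for c in CASES])
def test_agent(case, potato_eval):
    out = my_agent(case["q"])
    potato_eval.log_inputs({"question": case["q"]})
    potato_eval.log_outputs(out)
    potato_eval.log_reference_outputs(case["expected"])

    # Numeric scores are aggregated across the whole run, keyed by name.
    potato_eval.log_feedback("correct", 1.0 if out == case["expected"] else 0.0)
    potato_eval.log_feedback("similarity", 1.0 - expect.edit_distance(out, case["expected"]).value)

    expect(out).to_contain(case["expected"])   # per-case hard assertion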
```

`expect(...)` offers `.to_equal`, `.to_contain`, `.to_be_less_than`, `.to_be_greater_than`, `.to_be_between`, `.to_be_close_to`, plus `expect.edit_distance(a, b)` and `expect.embedding_distance(a, b)`. Scores logged with `log_feedback` are averaged per key across all eval tests in the run.
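
To make the per-key mean concrete, here is a minimal sketch of that aggregation in plain Python. It is an illustration, not the plugin's code: `aggregate_feedback` and the `(key, score)` record shape are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def aggregate_feedback(records):
    """Average scores per key, given an iterable of (key, score) pairs."""
    by_key = defaultdict(list)
    for key, score in records:
        by_key[key].append(score)
    return {key: mean(scores) for key, scores in by_key.items()}

# Two eval tests, each logging both keys.
records = [
    ("correct", 1.0), ("similarity", 1.0),
    ("correct", 0.0), ("similarity", 0.5),
]
print(aggregate_feedback(records))  # {'correct': 0.5, 'similarity': 0.75}
```

Because the mean is taken over the whole run, a single failing case lowers the aggregate without necessarily breaking a threshold.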

## Gate the build

```bash
pytest tests/eval/ \
  --potato-threshold correct=0.8 \
  --potato-threshold similarity=0.7 \
  --potato-experiment agent-regression
```

| Option | Effect |
|--------|--------|
| `--potato-threshold KEY=MIN` | Fail the run if `mean(KEY) < MIN`. Repeatable. |
| `--potato-experiment DATASET` | Record the run as an [Experiment](/docs/agent-evaluation/datasets-and-experiments). |
| `--potato-no-sync` | Skip experiment recording. |

If any threshold is violated, the run exits non-zero (failing the CI job) and prints `THRESHOLD FAILED: <key> = <actual> < <min>`. Recorded experiments are plain files under `$POTATO_EVAL_STORE` (default `./eval_store`), which you can upload as a CI artifact.
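
The gating rule can be sketched as follows. This is an illustration under stated assumptions, not the plugin's implementation: `parse_threshold` and `check_thresholds` are hypothetical names, the number formatting in the message is a guess, and every thresholded key is assumed to have been logged at least once.

```python
def parse_threshold(spec):
    """Split a 'KEY=MIN' string into (key, float minimum)."""
    key, _, minimum = spec.partition("=")
    return key, float(minimum)

def check_thresholds(means, specs):
    """Return one failure message per key whose mean is below its minimum."""
    failures = []
    for spec in specs:
        key, minimum = parse_threshold(spec)
        actual = means[key]  # assumes the key was logged during the run
        if actual < minimum:
            failures.append(f"THRESHOLD FAILED: {key} = {actual} < {minimum}")
    return failures

means = {"correct": 0.75, "similarity": 0.9}
failures = check_thresholds(means, ["correct=0.8", "similarity=0.7"])
print(failures)  # ['THRESHOLD FAILED: correct = 0.75 < 0.8']
```

A runner built on this would exit with a non-zero status whenever `failures` is non-empty, which is what makes the CI job fail.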

## GitHub Actions

A ready-to-copy workflow lives at `examples/agent-traces/ci-eval/ci_workflow_example.yml` — it runs the suite on every PR, gates on thresholds, and uploads the experiment records as an artifact.

## Related

- [Full reference on Read the Docs](https://potatoannotator.readthedocs.io/en/latest/agent-evaluation/ci_evaluation/) — all pytest options and environment variables, version-matched
- [Programmatic Evaluators](/docs/agent-evaluation/programmatic-evaluators)
- [Datasets & Experiments](/docs/agent-evaluation/datasets-and-experiments)
