CI Evaluation

Run Potato evaluations inside your own pytest suite and gate CI on score thresholds, so a prompt or model change that regresses agent quality fails the build like a unit test. Includes a fluent expect() API and a ready-to-copy GitHub Actions workflow.

The thresholds apply to aggregate scores (the mean of each feedback key across all eval tests), not to individual cases, so a prompt or model change that lowers overall quality fails the build the same way a failing unit test does. This is the "operationalize eval" layer on top of programmatic evaluators and datasets & experiments.

Install

The plugin ships with Potato and auto-loads once installed:

bash
pip install -e .          # registers the `potato_eval` pytest plugin

Without installing, load it explicitly: pytest -p potato.testing.pytest_plugin.

Write eval tests

Mark a test @pytest.mark.potato_eval and request the potato_eval fixture:

python
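import pytest
from potato.testing import expect

# CASES (a list of {"q": ..., "expected": ...} dicts) and my_agent (the callable
# under test) come from your own project.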
@pytest.mark.potato_eval
@pytest.mark.parametrize("case", CASES, ids=[c["q"] for c in CASES])
def test_agent(case, potato_eval):
    out = my_agent(case["q"])
    potato_eval.log_inputs({"question": case["q"]})
    potato_eval.log_outputs(out)
    potato_eval.log_reference_outputs(case["expected"])
 
    potato_eval.log_feedback("correct", 1.0 if out == case["expected"] else 0.0)
    potato_eval.log_feedback("similarity", 1.0 - expect.edit_distance(out, case["expected"]).value)
 
    expect(out).to_contain(case["expected"])   # per-case hard assertion

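The test above relies on CASES and my_agent, which come from your own project. As a rough illustration only, the following sketch shows one shape they could take; the questions, the answers, and the canned-lookup agent are all hypothetical stand-ins, not part of Potato:

```python
# Hypothetical test data: each case pairs a question with its expected answer.
CASES = [
    {"q": "What is the capital of France?", "expected": "Paris"},
    {"q": "What is 2 + 2?", "expected": "4"},
]

# Hypothetical canned answers so the stub agent can run offline.
_CANNED_ANSWERS = {
    "What is the capital of France?": "Paris",
    "What is 2 + 2?": "4",
}


def my_agent(question: str) -> str:
    """Stand-in for a real agent call; returns "" for unknown questions."""
    return _CANNED_ANSWERS.get(question, "")
```

In a real suite, my_agent would call your model or agent, and CASES would typically be loaded from a dataset file.
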
expect(...) offers .to_equal, .to_contain, .to_be_less_than, .to_be_greater_than, .to_be_between, .to_be_close_to, plus expect.edit_distance(a, b) and expect.embedding_distance(a, b). log_feedback scores are aggregated (mean per key) across all eval tests.
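
To make "mean per key" concrete, here is a small standalone sketch of that aggregation. It does not use Potato's internals; the function name aggregate_feedback and the (key, score) record shape are assumptions made for illustration:

```python
from collections import defaultdict
from statistics import mean


def aggregate_feedback(records):
    """Return {key: mean score} for an iterable of (key, score) pairs."""
    scores_by_key = defaultdict(list)
    for key, score in records:
        scores_by_key[key].append(float(score))
    return {key: mean(scores) for key, scores in scores_by_key.items()}


# Scores as they might be logged across four parametrized cases (made-up values).
logged = [
    ("correct", 1.0), ("correct", 0.0), ("correct", 1.0), ("correct", 1.0),
    ("similarity", 0.75), ("similarity", 0.25),
]
means = aggregate_feedback(logged)
print(means)  # {'correct': 0.75, 'similarity': 0.5}
```

One failing case out of four lowers the "correct" mean to 0.75, and it is this aggregate that a threshold is compared against.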

Gate the build

bash
pytest tests/eval/ \
  --potato-threshold correct=0.8 \
  --potato-threshold similarity=0.7 \
  --potato-experiment agent-regression

| Option | Effect |
| --- | --- |
| --potato-threshold KEY=MIN | Fail the run if mean(KEY) < MIN. Repeatable. |
| --potato-experiment DATASET | Record the run as an Experiment. |
| --potato-no-sync | Skip experiment recording. |

If a threshold is violated, the run exits non-zero (failing the CI job) and prints THRESHOLD FAILED: <key> = <actual> < <min>. Recorded experiments are plain files, stored at $POTATO_EVAL_STORE (default ./eval_store), that you can upload as a CI artifact.
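
The gating rule can be sketched in a few lines of plain Python. This is an illustration of the logic described above, not Potato's implementation: parse_threshold and check_thresholds are hypothetical names, the sketch assumes every thresholded key was actually logged, and the exact number formatting of the real message may differ:

```python
def parse_threshold(spec):
    """Split a 'KEY=MIN' string into (key, minimum as float)."""
    key, sep, minimum = spec.partition("=")
    if not sep or not key:
        raise ValueError(f"expected KEY=MIN, got {spec!r}")
    return key, float(minimum)


def check_thresholds(means, specs):
    """Return one failure message per key whose mean is below its minimum."""
    failures = []
    for spec in specs:
        key, minimum = parse_threshold(spec)
        actual = means[key]  # assumption: the key was logged at least once
        if actual < minimum:
            failures.append(f"THRESHOLD FAILED: {key} = {actual} < {minimum}")
    return failures


means = {"correct": 0.75, "similarity": 0.9}
failures = check_thresholds(means, ["correct=0.8", "similarity=0.7"])
exit_code = 1 if failures else 0  # a non-zero exit code fails the CI job
for message in failures:
    print(message)  # THRESHOLD FAILED: correct = 0.75 < 0.8
```

Here "similarity" clears its minimum but "correct" does not, so the run as a whole would be reported as failed.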

GitHub Actions

A ready-to-copy workflow lives at examples/agent-traces/ci-eval/ci_workflow_example.yml. It runs the suite on every PR, gates on thresholds, and uploads the experiment records as an artifact.