CI Evaluation

Run Potato evaluations inside your own pytest suite and gate CI on aggregate score thresholds, so a prompt or model change that regresses agent quality fails the build the same way a unit test does. This is the "operationalize eval" layer on top of programmatic evaluators and datasets & experiments. It includes a fluent expect() API and a ready-to-copy GitHub Actions workflow.

Install

The plugin ships with Potato and auto-loads once installed:

bash

pip install -e .          # registers the `potato_eval` pytest plugin

Without installing, load it explicitly: pytest -p potato.testing.pytest_plugin.

Write eval tests

Mark a test @pytest.mark.potato_eval and request the potato_eval fixture:

python

import pytest
from potato.testing import expect

# CASES and my_agent are your own code: CASES is a list of dicts with
# "q" and "expected" keys, and my_agent is the callable under test.

@pytest.mark.potato_eval
@pytest.mark.parametrize("case", CASES, ids=[c["q"] for c in CASES])
def test_agent(case, potato_eval):
    out = my_agent(case["q"])

    # Record what went in, what came out, and what was expected.
    potato_eval.log_inputs({"question": case["q"]})
    potato_eval.log_outputs(out)
    potato_eval.log_reference_outputs(case["expected"])

    # Per-case scores; these are averaged per key across the whole run.
    potato_eval.log_feedback("correct", 1.0 if out == case["expected"] else 0.0)
    potato_eval.log_feedback("similarity", 1.0 - expect.edit_distance(out, case["expected"]).value)

    expect(out).to_contain(case["expected"])   # per-case hard assertion

expect(...) offers .to_equal, .to_contain, .to_be_less_than, .to_be_greater_than, .to_be_between, .to_be_close_to, plus expect.edit_distance(a, b) and expect.embedding_distance(a, b). log_feedback scores are aggregated (mean per key) across all eval tests.
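To make the aggregation rule concrete, here is a minimal sketch of mean-per-key averaging in plain Python. It is not Potato's implementation: `aggregate_feedback` and the `(key, score)` record shape are hypothetical names used only for illustration, and the scores are made up.

```python
from collections import defaultdict
from statistics import mean


def aggregate_feedback(records):
    """Return {key: mean score} for an iterable of (key, score) pairs.

    Hypothetical helper: it illustrates the "mean per key" rule and is
    not taken from Potato's source.
    """
    by_key = defaultdict(list)
    for key, score in records:
        by_key[key].append(float(score))
    return {key: mean(scores) for key, scores in by_key.items()}


# Feedback as three eval cases might log it (illustrative values).
records = [
    ("correct", 1.0), ("similarity", 0.75),
    ("correct", 0.0), ("similarity", 0.5),
    ("correct", 1.0), ("similarity", 1.0),
]
summary = aggregate_feedback(records)
print(summary)
```

Because each key is averaged independently, one failing case lowers `correct` without affecting `similarity`, which is why thresholds are set per key.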

Gate the build

bash

pytest tests/eval/ \
  --potato-threshold correct=0.8 \
  --potato-threshold similarity=0.7 \
  --potato-experiment agent-regression

| Option | Effect |
| --- | --- |
| `--potato-threshold KEY=MIN` | Fail the run if `mean(KEY) < MIN`. Repeatable. |
| `--potato-experiment DATASET` | Record the run as an Experiment. |
| `--potato-no-sync` | Skip experiment recording. |

If any threshold is violated, the run exits non-zero, which fails the CI job, and prints THRESHOLD FAILED: <key> = <actual> < <min>. Recorded experiments are plain files stored under $POTATO_EVAL_STORE (default ./eval_store), which you can upload as a CI artifact.
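The gating rule can be sketched in a few lines of Python. This is an illustration of the `mean(KEY) < MIN` check and the failure message format described here, not the plugin's actual code: `parse_threshold` and `check_thresholds` are hypothetical helpers, and treating a key with no logged scores as a failure is an assumption of this sketch.

```python
def parse_threshold(spec):
    """Split a 'KEY=MIN' string into (key, minimum). Hypothetical helper."""
    key, _, minimum = spec.partition("=")
    if not key or not minimum:
        raise ValueError(f"expected KEY=MIN, got {spec!r}")
    return key, float(minimum)


def check_thresholds(means, specs):
    """Return one failure message per key whose mean is below its minimum."""
    failures = []
    for spec in specs:
        key, minimum = parse_threshold(spec)
        actual = means.get(key)
        # Assumption (not from the docs): a key with no scores fails the gate.
        if actual is None or actual < minimum:
            failures.append(f"THRESHOLD FAILED: {key} = {actual} < {minimum}")
    return failures


means = {"correct": 0.75, "similarity": 0.9}  # illustrative aggregate scores
failures = check_thresholds(means, ["correct=0.8", "similarity=0.7"])
for line in failures:
    print(line)
exit_code = 1 if failures else 0  # a non-zero exit code fails the CI job
```

Here `correct` averages 0.75 against a minimum of 0.8, so the run would fail even though `similarity` passes: every threshold must hold.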

GitHub Actions

A ready-to-copy workflow lives at examples/agent-traces/ci-eval/ci_workflow_example.yml. It runs the suite on every PR, gates on thresholds, and uploads the experiment records as an artifact.

Related

Full reference on Read the Docs: all pytest options and environment variables, version-matched
Programmatic Evaluators
Datasets & Experiments
