Datasets & Experiments
Build versioned evaluation datasets and run experiments that score agent outputs over time. Potato's eval backbone — file or SQLite storage, tagged versions, splits, SFT/DPO export, and a side-by-side experiment comparison with regression deltas.
Datasets & Experiments are Potato's evaluation backbone: versioned collections of evaluation examples, and experiment runs that score them with programmatic evaluators. Together they turn Potato from "annotate once" into "evaluate continuously" — curate a test set, run evaluators, and track scores across prompt or model versions.
Enabling
datasets:
  enabled: true
  storage: file  # "file" (default, git-diffable JSONL) | "sqlite"
Datasets
A dataset is a named collection of examples (id, inputs, optional reference_outputs, metadata, split). Every add/update/delete creates a new immutable version (v0001, v0002, …); versions can be tagged (e.g. prod), and reads pin a version with as_of (latest, a tag, or a version id).
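The versioning rules above can be pictured with a small in-memory model. This is only an illustrative sketch, not Potato's storage code: the `VersionedDataset` class and its `commit`, `tag`, and `read` methods are hypothetical names, and it assumes the first change produces `v0001`.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedDataset:
    """Hypothetical in-memory model of an append-only, versioned dataset."""

    name: str
    versions: list = field(default_factory=list)  # (version_id, examples) snapshots
    tags: dict = field(default_factory=dict)      # tag name -> version_id

    def commit(self, examples):
        """Record a new immutable snapshot and return its version id."""
        version_id = f"v{len(self.versions) + 1:04d}"
        # Copy each example so later edits by the caller cannot alter the snapshot.
        self.versions.append((version_id, tuple(dict(e) for e in examples)))
        return version_id

    def tag(self, tag, version_id):
        """Point a tag such as "prod" at an existing version."""
        if version_id not in dict(self.versions):
            raise KeyError(version_id)
        self.tags[tag] = version_id

    def read(self, as_of="latest"):
        """Resolve as_of ("latest", a tag, or a version id) to a snapshot."""
        if not self.versions:
            raise LookupError("dataset has no versions yet")
        if as_of == "latest":
            return self.versions[-1]
        version_id = self.tags.get(as_of, as_of)
        for vid, examples in self.versions:
            if vid == version_id:
                return vid, examples
        raise KeyError(as_of)
```

Because `read("prod")` keeps returning the tagged snapshot after later commits, an experiment pinned to a tag stays reproducible while the dataset continues to grow.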
Curate examples three ways:
- From the live task: Import loaded instances turns the task's instances into examples.
- From ingested traces only: Import ingested traces (webhook / LangSmith / Langfuse).
- With human annotations as references: aggregate the majority human label per scheme into each example's reference_outputs.
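The third option, majority aggregation, might look roughly like the sketch below. The function name `majority_references` and the input shape (a mapping from scheme name to the list of labels annotators gave) are assumptions made for illustration; how Potato breaks ties is not stated here, so this sketch simply keeps the label seen first.

```python
from collections import Counter


def majority_references(annotations):
    """Map {scheme: [label, ...]} to {scheme: majority label}.

    Hypothetical helper: ties go to the label encountered first,
    which is an assumption rather than documented behaviour.
    """
    references = {}
    for scheme, labels in annotations.items():
        if not labels:
            continue  # no human labels for this scheme, so no reference
        references[scheme] = Counter(labels).most_common(1)[0][0]
    return references
```

The resulting mapping is what would be stored as an example's `reference_outputs`, one entry per annotation scheme.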
Experiments
An experiment runs evaluators against a dataset version and records per-example results plus aggregate scores. To start one, pick a dataset and evaluators on the overview page and click Run, or send POST /datasets/api/experiments/run. To compare runs, select two or more experiments and click Compare: aggregate scores appear side by side, with deltas against the baseline so regressions stand out.
LLM-judge evaluators call your configured AI endpoint and may take a while on large datasets.
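To make the comparison concrete, here is a minimal sketch of how aggregate scores and baseline deltas could be computed from per-example results. It assumes each result is a mapping from evaluator name to a numeric score and that the aggregate is a plain mean; the function names `aggregate` and `deltas_vs_baseline` are hypothetical, and Potato's own aggregation may differ.

```python
from statistics import mean


def aggregate(per_example):
    """Average each evaluator's score over all examples that have one."""
    names = sorted({name for row in per_example for name in row})
    return {
        name: mean(row[name] for row in per_example if name in row)
        for name in names
    }


def deltas_vs_baseline(baseline, candidate):
    """Return candidate minus baseline for evaluators present in both runs."""
    return {
        name: candidate[name] - baseline[name]
        for name in baseline
        if name in candidate
    }


baseline = aggregate([
    {"exact_match": 1.0, "judge": 0.75},
    {"exact_match": 0.0, "judge": 0.25},
])
candidate = aggregate([
    {"exact_match": 1.0, "judge": 0.5},
    {"exact_match": 1.0, "judge": 0.25},
])
deltas = deltas_vs_baseline(baseline, candidate)
```

A negative delta (here for `judge`) is the kind of regression the Compare view is meant to surface, even when another evaluator improves.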
Export to fine-tuning data
Any dataset version exports to JSONL for fine-tuning:
- SFT: {"prompt": <inputs>, "completion": <reference_outputs>}
- DPO: {"prompt": <inputs>, "chosen": <reference_outputs>, "rejected": <metadata.rejected | outputs>}
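The two row shapes can be reproduced locally with a short sketch. It reads `<metadata.rejected | outputs>` as "use `metadata.rejected` if present, otherwise fall back to an `outputs` field on the example", which is an interpretation rather than confirmed behaviour; the helper names `to_sft`, `to_dpo`, and `export_jsonl` are hypothetical.

```python
import json


def to_sft(example):
    """Build one SFT row from an example."""
    return {"prompt": example["inputs"], "completion": example["reference_outputs"]}


def to_dpo(example):
    """Build one DPO row; 'rejected' prefers metadata.rejected (assumed fallback: outputs)."""
    metadata = example.get("metadata") or {}
    rejected = metadata["rejected"] if "rejected" in metadata else example.get("outputs")
    return {
        "prompt": example["inputs"],
        "chosen": example["reference_outputs"],
        "rejected": rejected,
    }


def export_jsonl(examples, fmt="sft"):
    """Serialize examples as JSONL: one JSON object per line."""
    convert = {"sft": to_sft, "dpo": to_dpo}[fmt]
    return "\n".join(json.dumps(convert(example)) for example in examples)
```

Keeping one JSON object per line means the export can be streamed or diffed line by line, which fits the git-diffable file storage option.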
curl -OJ "http://localhost:8000/datasets/api/datasets/agent-eval-v1/export?format=sft"
API
All endpoints require admin auth (X-API-Key header or admin session).
| Method | Path | Purpose |
|---|---|---|
| POST | /datasets/api/datasets | Create a dataset |
| POST | /datasets/api/datasets/<name>/examples | Add examples (new version) |
| POST | /datasets/api/datasets/<name>/tag | Tag a version |
| GET | /datasets/api/datasets/<name>/export?format=sft\|dpo | Export JSONL |
| POST | /datasets/api/experiments/run | Run an experiment |
| GET | /datasets/api/experiments | List experiments |
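As a rough illustration of calling these endpoints from a script, the sketch below builds an authenticated request for the export route using only the Python standard library. The helper name `build_export_request` is hypothetical, the base URL and key are placeholders, and only the `X-API-Key` header and the export path come from this page.

```python
import urllib.parse
import urllib.request


def build_export_request(base_url, dataset, fmt, api_key):
    """Build a GET request for the export endpoint with admin auth attached."""
    name = urllib.parse.quote(dataset, safe="")  # escape unusual characters in the name
    query = urllib.parse.urlencode({"format": fmt})
    url = f"{base_url}/datasets/api/datasets/{name}/export?{query}"
    return urllib.request.Request(url, headers={"X-API-Key": api_key}, method="GET")


request = build_export_request(
    "http://localhost:8000", "agent-eval-v1", "sft", "YOUR_ADMIN_KEY"
)
# To actually send it against a running server:
# with urllib.request.urlopen(request) as response:
#     jsonl_text = response.read().decode("utf-8")
```

Building the request separately from sending it keeps the auth header in one place and makes the URL easy to inspect before any network call is made.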
The eval-admin API (/admin/eval/...) inspects and controls the annotation process for these tasks: dataset/experiment status, per-instance annotation progress, ingested-trace counts, and assignment pause/resume.
Related
- Full reference on Read the Docs — full API and storage details, version-matched
- Programmatic Evaluators
- Automation Rules — auto-curate incoming traces into datasets
- Semantic Curation — find traces to add by similarity