# Datasets & Experiments

Source: https://www.potatoannotator.com/docs/agent-evaluation/datasets-and-experiments

**Datasets & Experiments are Potato's evaluation backbone: versioned collections of evaluation examples, and experiment runs that score them with [programmatic evaluators](/docs/agent-evaluation/programmatic-evaluators).** Together they turn Potato from "annotate once" into "evaluate continuously" — curate a test set, run evaluators, and track scores across prompt or model versions.

## Enabling

```yaml
datasets:
  enabled: true
  storage: file   # "file" (default, git-diffable JSONL) | "sqlite"
```

## Datasets

A dataset is a named collection of **examples** (`id`, `inputs`, optional `reference_outputs`, `metadata`, `split`). Every add/update/delete creates a new immutable **version** (`v0001`, `v0002`, …); versions can be **tagged** (e.g. `prod`), and reads pin a version with `as_of` (`latest`, a tag, or a version id).
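The versioning rules above can be modeled in a few lines. This is a toy in-memory sketch, not Potato's storage code: the `VersionedDataset` class and its method names are hypothetical, and it assumes the first write produces `v0001` and that a tag simply points at a version id.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedDataset:
    """Toy model of dataset versioning (hypothetical, for illustration only)."""
    name: str
    versions: list = field(default_factory=list)  # immutable snapshots: {example_id: example}
    tags: dict = field(default_factory=dict)      # tag name -> version id

    def _commit(self, examples: dict) -> str:
        # Every write appends a full snapshot and never mutates an old one.
        self.versions.append(dict(examples))
        return f"v{len(self.versions):04d}"

    def add(self, example: dict) -> str:
        current = self.versions[-1] if self.versions else {}
        return self._commit({**current, example["id"]: example})

    def delete(self, example_id: str) -> str:
        current = self.versions[-1]
        return self._commit({k: v for k, v in current.items() if k != example_id})

    def tag(self, tag: str, version_id: str) -> None:
        self.tags[tag] = version_id

    def read(self, as_of: str = "latest") -> list:
        # as_of is "latest", a tag, or a version id such as "v0002".
        if as_of == "latest":
            index = len(self.versions) - 1
        else:
            version_id = self.tags.get(as_of, as_of)
            index = int(version_id[1:]) - 1
        return list(self.versions[index].values())


ds = VersionedDataset("agent-eval-v1")
v1 = ds.add({"id": "ex1", "inputs": {"question": "2+2?"}, "reference_outputs": {"answer": "4"}})
v2 = ds.add({"id": "ex2", "inputs": {"question": "3+3?"}, "reference_outputs": {"answer": "6"}})
ds.tag("prod", v1)
v3 = ds.delete("ex1")

print(v1, v2, v3)                            # v0001 v0002 v0003
print([e["id"] for e in ds.read("prod")])    # ['ex1']
print([e["id"] for e in ds.read("latest")])  # ['ex2']
```

Because `prod` stays pinned to `v0001`, a read with `as_of="prod"` keeps returning the same examples even after later deletes, which is what makes scores comparable across runs.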

Curate examples three ways:

- **From the live task** — *Import loaded instances* turns the task's instances into examples.
- **From ingested traces only** — *Import ingested traces* (webhook / LangSmith / Langfuse).
- **With human annotations as references** — aggregate the majority human label per scheme into each example's `reference_outputs`.
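The third option, majority aggregation, might look roughly like the sketch below. The function name and the shape of the annotation records (one `{scheme: label}` dict per annotator) are assumptions made for illustration; how Potato breaks ties is not stated here, so this sketch simply keeps the label that was seen first.

```python
from collections import Counter


def majority_references(annotations: list) -> dict:
    """Return {scheme: majority label} from per-annotator {scheme: label} dicts."""
    votes = {}
    for annotation in annotations:
        for scheme, label in annotation.items():
            votes.setdefault(scheme, Counter())[label] += 1
    # most_common(1) yields the top label; equal counts keep first-seen order.
    return {scheme: counts.most_common(1)[0][0] for scheme, counts in votes.items()}


# Hypothetical labels from three annotators on one instance.
annotations = [
    {"correctness": "pass", "tone": "polite"},
    {"correctness": "pass", "tone": "rude"},
    {"correctness": "fail", "tone": "polite"},
]
example = {
    "id": "ex1",
    "inputs": {"question": "2+2?"},
    "reference_outputs": majority_references(annotations),
}
print(example["reference_outputs"])  # {'correctness': 'pass', 'tone': 'polite'}
```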

## Experiments

An experiment runs evaluators against a dataset version and records per-example results plus aggregate scores. To start one, pick a dataset and evaluators on the overview page and click **Run**, or send a request to `POST /datasets/api/experiments/run`. To compare runs, select two or more experiments and click **Compare**: aggregate scores appear side by side, with deltas against the baseline so regressions stand out.
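As a rough illustration of how per-example results roll up into aggregates and baseline deltas, consider the sketch below. It assumes each evaluator returns a numeric score per example and that the aggregate is a plain mean; the function names, run names, and scores are all hypothetical, and Potato's actual aggregation may differ.

```python
from statistics import mean


def aggregate(per_example: list) -> dict:
    """Average each evaluator's score over all examples (assumed aggregation)."""
    evaluators = per_example[0].keys()
    return {name: mean(row[name] for row in per_example) for name in evaluators}


def deltas_vs_baseline(aggregates: dict, baseline: str) -> dict:
    """Return {experiment: {evaluator: score minus baseline score}}."""
    base = aggregates[baseline]
    return {
        run: {name: round(score - base[name], 4) for name, score in scores.items()}
        for run, scores in aggregates.items()
        if run != baseline
    }


# Hypothetical per-example results for two prompt versions on the same dataset version.
runs = {
    "prompt-v1": [{"exact_match": 1.0, "judge": 0.5}, {"exact_match": 0.0, "judge": 0.5}],
    "prompt-v2": [{"exact_match": 1.0, "judge": 0.25}, {"exact_match": 1.0, "judge": 0.25}],
}
aggregates = {run: aggregate(rows) for run, rows in runs.items()}
deltas = deltas_vs_baseline(aggregates, baseline="prompt-v1")
print(deltas)  # {'prompt-v2': {'exact_match': 0.5, 'judge': -0.25}}
```

A negative delta, such as `judge` here, is the kind of regression the side-by-side comparison is meant to surface.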

> LLM-judge evaluators call your configured AI endpoint and may take a while on large datasets.

## Export to fine-tuning data

Any dataset version exports to JSONL for fine-tuning:

- **SFT** — `{"prompt": <inputs>, "completion": <reference_outputs>}`
- **DPO** — `{"prompt": <inputs>, "chosen": <reference_outputs>, "rejected": <metadata.rejected | outputs>}`

```bash
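# Admin auth is required (see API below); POTATO_API_KEY is a placeholder for your key.
curl -OJ -H "X-API-Key: $POTATO_API_KEY" "http://localhost:8000/datasets/api/datasets/agent-eval-v1/export?format=sft"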
```
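To make the two record shapes concrete, here is a minimal sketch of the mapping from an example to SFT and DPO lines. It is not the exporter itself: the function names are hypothetical, and it assumes that `metadata.rejected | outputs` means "use `metadata.rejected` if present, otherwise fall back to the example's `outputs`".

```python
import json


def to_sft(example: dict) -> dict:
    return {"prompt": example["inputs"], "completion": example["reference_outputs"]}


def to_dpo(example: dict) -> dict:
    # Assumption: prefer metadata.rejected, else fall back to the example's outputs.
    rejected = example.get("metadata", {}).get("rejected", example.get("outputs"))
    return {
        "prompt": example["inputs"],
        "chosen": example["reference_outputs"],
        "rejected": rejected,
    }


def to_jsonl(examples: list, convert) -> str:
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(convert(example)) for example in examples)


# Hypothetical examples; the second one has no metadata.rejected, so outputs is used.
examples = [
    {"id": "ex1", "inputs": "2+2?", "reference_outputs": "4", "metadata": {"rejected": "5"}},
    {"id": "ex2", "inputs": "3+3?", "reference_outputs": "6", "outputs": "7"},
]
sft_lines = to_jsonl(examples, to_sft).splitlines()
dpo_lines = to_jsonl(examples, to_dpo).splitlines()
print(sft_lines[0])  # {"prompt": "2+2?", "completion": "4"}
print(dpo_lines[1])  # {"prompt": "3+3?", "chosen": "6", "rejected": "7"}
```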

## API

All endpoints require admin auth (`X-API-Key` header or admin session).

| Method | Path | Purpose |
|--------|------|---------|
| POST | `/datasets/api/datasets` | Create a dataset |
| POST | `/datasets/api/datasets/<name>/examples` | Add examples (new version) |
| POST | `/datasets/api/datasets/<name>/tag` | Tag a version |
| GET | `/datasets/api/datasets/<name>/export?format=sft\|dpo` | Export JSONL |
| POST | `/datasets/api/experiments/run` | Run an experiment |
| GET | `/datasets/api/experiments` | List experiments |
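A minimal client-side sketch for calling one of these endpoints with the `X-API-Key` header, using only the Python standard library. The base URL, the key value, and the helper names are placeholders, and the assumption that responses are JSON is not confirmed by this page.

```python
import json
import urllib.request


def build_request(base_url: str, path: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated GET request without sending it."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        headers={"X-API-Key": api_key},
        method="GET",
    )


def fetch_json(request: urllib.request.Request):
    """Send the request and decode the body, assuming the server returns JSON."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder base URL and key; substitute your own deployment's values.
request = build_request("http://localhost:8000", "/datasets/api/experiments", "my-admin-key")
print(request.get_method(), request.full_url)

# With a running server, list experiments with:
#     experiments = fetch_json(request)
```

Keeping request construction separate from sending makes the auth header easy to check in tests without a live server.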

The eval-admin API (`/admin/eval/...`) inspects and controls the annotation process for these tasks: dataset/experiment status, per-instance annotation progress, ingested-trace counts, and assignment pause/resume.

## Related

- [Full reference on Read the Docs](https://potatoannotator.readthedocs.io/en/latest/agent-evaluation/datasets_and_experiments/) — full API and storage details, version-matched
- [Programmatic Evaluators](/docs/agent-evaluation/programmatic-evaluators)
- [Automation Rules](/docs/agent-evaluation/automation-rules) — auto-curate incoming traces into datasets
- [Semantic Curation](/docs/agent-evaluation/semantic-curation) — find traces to add by similarity
