Datasets & Experiments

Build versioned evaluation datasets and run experiments that score agent outputs over time. This is Potato's evaluation backbone: file or SQLite storage, tagged versions, splits, SFT/DPO export, and side-by-side experiment comparison with regression deltas.

Datasets & Experiments are Potato's evaluation backbone: versioned collections of evaluation examples, and experiment runs that score them with programmatic evaluators. Together they turn Potato from "annotate once" into "evaluate continuously" — curate a test set, run evaluators, and track scores across prompt or model versions.

Enabling

yaml

datasets:
  enabled: true
  storage: file   # "file" (default, git-diffable JSONL) | "sqlite"

Datasets

A dataset is a named collection of examples (id, inputs, optional reference_outputs, metadata, split). Every add/update/delete creates a new immutable version (v0001, v0002, …); versions can be tagged (e.g. prod), and reads pin a version with as_of (latest, a tag, or a version id).
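The versioning rules above can be pictured with a small in-memory model. This is only an illustrative sketch, not Potato's storage code: the `VersionedDataset` class and its method names are hypothetical, and the one behavior taken from the text is that every change produces a new immutable version that reads can pin with `as_of`.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedDataset:
    """Toy model (hypothetical): immutable versions, tags, and as_of reads."""
    name: str
    versions: list = field(default_factory=list)  # [(version_id, {example_id: example})]
    tags: dict = field(default_factory=dict)      # tag -> version_id

    def _latest_snapshot(self):
        # Copy the newest snapshot so earlier versions are never mutated.
        return dict(self.versions[-1][1]) if self.versions else {}

    def _commit(self, examples):
        version_id = f"v{len(self.versions) + 1:04d}"  # v0001, v0002, ...
        self.versions.append((version_id, examples))
        return version_id

    def add(self, example):
        examples = self._latest_snapshot()
        examples[example["id"]] = example
        return self._commit(examples)

    def delete(self, example_id):
        examples = self._latest_snapshot()
        examples.pop(example_id, None)
        return self._commit(examples)

    def tag(self, tag, version_id):
        if version_id not in dict(self.versions):
            raise KeyError(f"unknown version: {version_id}")
        self.tags[tag] = version_id

    def read(self, as_of="latest"):
        if not self.versions:
            return []
        if as_of == "latest":
            version_id = self.versions[-1][0]
        else:
            # A tag resolves to its version id; anything else is treated as an id.
            version_id = self.tags.get(as_of, as_of)
        return list(dict(self.versions)[version_id].values())


ds = VersionedDataset("agent-eval-v1")
ds.add({"id": "ex1", "inputs": {"q": "2+2?"}, "reference_outputs": {"a": "4"}})
ds.add({"id": "ex2", "inputs": {"q": "3+3?"}, "reference_outputs": {"a": "6"}})
ds.tag("prod", "v0001")
ds.delete("ex1")  # creates v0003; v0001 and v0002 stay readable
```

Pinning `as_of` to a tag such as `prod` keeps an evaluation reproducible even while the dataset keeps changing underneath it.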

Curate examples three ways:

From the live task — Import loaded instances turns the task's instances into examples.
From ingested traces only — Import ingested traces (webhook / LangSmith / Langfuse).
With human annotations as references — aggregate the majority human label per scheme into each example's reference_outputs.
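The third option, majority aggregation per scheme, can be sketched as follows. The function names, the annotation shape (one `{scheme: label}` dict per annotator), and the tie-breaking rule are assumptions made for illustration; Potato's actual aggregation may differ.

```python
from collections import Counter


def majority_labels(annotations):
    """Return the most frequent label per scheme.

    `annotations` is a list of {scheme: label} dicts, one per annotator
    (an assumed shape, not Potato's internal format).
    """
    counts_by_scheme = {}
    for annotation in annotations:
        for scheme, label in annotation.items():
            counts_by_scheme.setdefault(scheme, Counter())[label] += 1
    # Counter.most_common orders ties by first occurrence, so ties go to
    # the label seen first (an arbitrary choice for this sketch).
    return {
        scheme: counts.most_common(1)[0][0]
        for scheme, counts in counts_by_scheme.items()
    }


def with_references(example, annotations):
    """Copy an example, setting reference_outputs to the majority labels."""
    return {**example, "reference_outputs": majority_labels(annotations)}


annotations = [
    {"correct": "yes", "tone": "polite"},
    {"correct": "yes", "tone": "rude"},
    {"correct": "no", "tone": "polite"},
]
example = with_references({"id": "ex1", "inputs": {"q": "2+2?"}}, annotations)
```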

Experiments

An experiment runs evaluators against a dataset version and records per-example results plus aggregate scores. To start one, pick a dataset and evaluators on the overview page and choose Run, or send POST /datasets/api/experiments/run. To compare runs, select two or more experiments and choose Compare: aggregate scores are shown side by side, with deltas against the baseline so regressions stand out.

LLM-judge evaluators call your configured AI endpoint and may take a while on large datasets.
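To make the aggregate scores and baseline deltas concrete, here is a minimal sketch of the arithmetic. It assumes aggregates are per-evaluator means and that a delta is simply the score minus the baseline score; the function names and the `exact_match` evaluator name are hypothetical, and Potato may aggregate differently.

```python
from statistics import mean


def aggregate(per_example):
    """Mean score per evaluator over a list of {evaluator: score} dicts."""
    names = sorted({name for result in per_example for name in result})
    return {
        name: mean(result[name] for result in per_example if name in result)
        for name in names
    }


def compare(experiments, baseline):
    """Delta of each experiment's aggregate scores against the baseline.

    `experiments` maps experiment name -> aggregate scores. A negative
    delta means the experiment scored lower than the baseline.
    """
    base = experiments[baseline]
    return {
        name: {
            metric: round(score - base[metric], 6)
            for metric, score in scores.items()
            if metric in base
        }
        for name, scores in experiments.items()
        if name != baseline
    }


runs = {
    "prompt-v1": aggregate([{"exact_match": 1.0}, {"exact_match": 0.5}]),
    "prompt-v2": aggregate([{"exact_match": 0.5}, {"exact_match": 0.5}]),
}
deltas = compare(runs, baseline="prompt-v1")
```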

Export to fine-tuning data

Any dataset version exports to JSONL for fine-tuning:

SFT — {"prompt": <inputs>, "completion": <reference_outputs>}
DPO — {"prompt": <inputs>, "chosen": <reference_outputs>, "rejected": <metadata.rejected | outputs>}

bash

curl -OJ "http://localhost:8000/datasets/api/datasets/agent-eval-v1/export?format=sft"
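The two record shapes can be sketched in a few lines. This is an illustration of the formats listed above, not Potato's exporter: the function names are hypothetical, and `metadata.rejected | outputs` is read here as "use `metadata.rejected` if present, otherwise fall back to the example's `outputs`", which is an assumption about the intended precedence.

```python
import json


def to_sft(example):
    """SFT record: inputs become the prompt, reference outputs the completion."""
    return {
        "prompt": example["inputs"],
        "completion": example["reference_outputs"],
    }


def to_dpo(example):
    """DPO record: reference outputs are 'chosen'; 'rejected' comes from
    metadata.rejected, falling back to outputs (assumed precedence)."""
    metadata = example.get("metadata") or {}
    rejected = metadata.get("rejected", example.get("outputs"))
    return {
        "prompt": example["inputs"],
        "chosen": example["reference_outputs"],
        "rejected": rejected,
    }


def to_jsonl(examples, fmt="sft"):
    """Serialize examples as JSONL, one record per line."""
    convert = {"sft": to_sft, "dpo": to_dpo}[fmt]
    return "\n".join(json.dumps(convert(example)) for example in examples)


example = {
    "id": "ex1",
    "inputs": "What is 2+2?",
    "reference_outputs": "4",
    "metadata": {"rejected": "5"},
}
sft_line = to_jsonl([example], fmt="sft")
dpo_line = to_jsonl([example], fmt="dpo")
```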

API

All endpoints require admin auth (X-API-Key header or admin session).

| Method | Path | Purpose |
| --- | --- | --- |
| POST | `/datasets/api/datasets` | Create a dataset |
| POST | `/datasets/api/datasets/<name>/examples` | Add examples (new version) |
| POST | `/datasets/api/datasets/<name>/tag` | Tag a version |
| GET | `/datasets/api/datasets/<name>/export?format=sft\|dpo` | Export JSONL |
| POST | `/datasets/api/experiments/run` | Run an experiment |
| GET | `/datasets/api/experiments` | List experiments |
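As a sketch of calling the export endpoint with the admin header, the snippet below builds the request using only Python's standard library. The base URL is taken from the curl example and will differ per deployment; the `export_request` helper and the placeholder key are hypothetical. Request bodies for the POST endpoints are not documented here, so they are not shown.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"  # from the curl example; adjust to your server


def export_request(dataset, fmt, api_key, base_url=BASE_URL):
    """Build (but do not send) an authenticated GET for the export endpoint."""
    if fmt not in ("sft", "dpo"):
        raise ValueError("format must be 'sft' or 'dpo'")
    url = (
        f"{base_url}/datasets/api/datasets/{quote(dataset, safe='')}"
        f"/export?{urlencode({'format': fmt})}"
    )
    return Request(url, headers={"X-API-Key": api_key}, method="GET")


req = export_request("agent-eval-v1", "sft", api_key="YOUR_ADMIN_KEY")

# Sending requires a running Potato server with datasets enabled:
# with urlopen(req) as response:
#     jsonl_text = response.read().decode("utf-8")
```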

The eval-admin API (/admin/eval/...) inspects and controls the annotation process for these tasks: dataset/experiment status, per-instance annotation progress, ingested-trace counts, and assignment pause/resume.

Related

Full reference on Read the Docs: full API and storage details, version-matched
Programmatic Evaluators
Automation Rules: auto-curate incoming traces into datasets
Semantic Curation: find traces to add by similarity
