Datasets & Experiments

Build versioned evaluation datasets and run experiments that score agent outputs over time. Potato's eval backbone — file or SQLite storage, tagged versions, splits, SFT/DPO export, and a side-by-side experiment comparison with regression deltas.

Datasets & Experiments are Potato's evaluation backbone: versioned collections of evaluation examples, and experiment runs that score them with programmatic evaluators. Together they turn Potato from "annotate once" into "evaluate continuously" — curate a test set, run evaluators, and track scores across prompt or model versions.

Enabling

yaml
datasets:
  enabled: true
  storage: file   # "file" (default, git-diffable JSONL) | "sqlite"

Datasets

A dataset is a named collection of examples (id, inputs, optional reference_outputs, metadata, split). Every add/update/delete creates a new immutable version (v0001, v0002, …); versions can be tagged (e.g. prod), and reads pin a version with as_of (latest, a tag, or a version id).
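The versioning model described above can be illustrated with a small in-memory sketch. This is not Potato's implementation or API: the class `VersionedDataset` and its methods are hypothetical names chosen for illustration, and only the `v0001`-style ids, tags, and the `as_of` resolution order (latest, a tag, or a version id) come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedDataset:
    """Toy model of an append-only version history (hypothetical, not Potato's code)."""
    name: str
    versions: list = field(default_factory=list)  # [(version_id, {example_id: example})]
    tags: dict = field(default_factory=dict)      # tag -> version_id

    def _commit(self, examples: dict) -> str:
        # Every change appends a new snapshot; earlier snapshots are never modified.
        version_id = f"v{len(self.versions) + 1:04d}"
        self.versions.append((version_id, dict(examples)))
        return version_id

    def add(self, example: dict) -> str:
        current = dict(self.versions[-1][1]) if self.versions else {}
        current[example["id"]] = example
        return self._commit(current)

    def delete(self, example_id: str) -> str:
        current = dict(self.versions[-1][1])
        del current[example_id]
        return self._commit(current)

    def tag(self, tag: str, version_id: str) -> None:
        self.tags[tag] = version_id

    def read(self, as_of: str = "latest") -> list:
        if not self.versions:
            raise KeyError("dataset has no versions yet")
        if as_of == "latest":
            version_id = self.versions[-1][0]
        else:
            # A tag resolves to its version id; anything else is treated as a version id.
            version_id = self.tags.get(as_of, as_of)
        for vid, examples in self.versions:
            if vid == version_id:
                return list(examples.values())
        raise KeyError(f"unknown version or tag: {as_of}")


ds = VersionedDataset("agent-eval-v1")
v1 = ds.add({"id": "ex1", "inputs": {"question": "2+2?"}, "split": "test"})
v2 = ds.add({"id": "ex2", "inputs": {"question": "3+3?"}, "split": "test"})
ds.tag("prod", v1)
prod_examples = ds.read(as_of="prod")   # pinned to the tagged version
latest_examples = ds.read()             # follows the newest version
```

Pinning reads to a tag such as `prod` keeps an experiment's inputs stable even while newer versions are being added.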

Curate examples three ways:

  • From the live task: Import loaded instances turns the task's instances into examples.
  • From ingested traces only: Import ingested traces (webhook / LangSmith / Langfuse).
  • With human annotations as references: aggregate the majority human label per scheme into each example's reference_outputs.
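The third option, majority-label aggregation, can be sketched as follows. The input shape (a list of `(instance_id, scheme, label)` tuples) and the function name `majority_references` are assumptions made for illustration; how Potato stores annotations or breaks ties is not specified here. In this sketch a tie goes to the label seen first.

```python
from collections import Counter, defaultdict


def majority_references(annotations):
    """Return {instance_id: {scheme: majority_label}} from (instance_id, scheme, label) tuples.

    Hypothetical helper: the tuple layout is an assumption, not Potato's storage format.
    """
    votes = defaultdict(Counter)
    for instance_id, scheme, label in annotations:
        votes[(instance_id, scheme)][label] += 1

    references = defaultdict(dict)
    for (instance_id, scheme), counts in votes.items():
        # most_common keeps first-seen order among equal counts, so ties go to the first label.
        label, _count = counts.most_common(1)[0]
        references[instance_id][scheme] = label
    return dict(references)


annotations = [
    ("ex1", "sentiment", "positive"),
    ("ex1", "sentiment", "positive"),
    ("ex1", "sentiment", "negative"),
    ("ex1", "toxicity", "no"),
    ("ex2", "sentiment", "negative"),
]
reference_outputs = majority_references(annotations)
```

Each inner dictionary would then serve as one example's reference_outputs.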

Experiments

An experiment runs evaluators against a dataset version and records per-example results plus aggregate scores. To start one, pick a dataset and evaluators on the overview page and click Run, or send a POST request to /datasets/api/experiments/run. To compare runs, select two or more experiments and click Compare: aggregate scores appear side by side, with deltas against the baseline so regressions stand out.
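The comparison amounts to subtracting the baseline's aggregate score from each other experiment's score, per evaluator. A minimal sketch, assuming aggregate scores are plain numbers where higher is better; the function names, evaluator names, and the dictionary layout are hypothetical and not Potato's data model:

```python
def compare_to_baseline(baseline, candidates):
    """Return {experiment: {evaluator: candidate_score - baseline_score}}.

    baseline:   {evaluator: score}
    candidates: {experiment_name: {evaluator: score}}
    Only evaluators present in both runs are compared.
    """
    deltas = {}
    for name, scores in candidates.items():
        deltas[name] = {
            metric: round(scores[metric] - baseline[metric], 4)
            for metric in baseline
            if metric in scores
        }
    return deltas


def regressions(deltas, tolerance=0.0):
    """List evaluators whose score dropped by more than `tolerance` (higher is assumed better)."""
    return {
        name: [metric for metric, delta in metric_deltas.items() if delta < -tolerance]
        for name, metric_deltas in deltas.items()
    }


baseline = {"exact_match": 0.80, "judge": 0.70}
candidates = {"prompt-v2": {"exact_match": 0.85, "judge": 0.65}}
deltas = compare_to_baseline(baseline, candidates)
flagged = regressions(deltas)
```

A nonzero `tolerance` is one way to avoid flagging small fluctuations, for example from LLM-judge evaluators whose scores can vary between runs.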

LLM-judge evaluators call your configured AI endpoint and may take a while on large datasets.

Export to fine-tuning data

Any dataset version exports to JSONL for fine-tuning:

  • SFT: {"prompt": <inputs>, "completion": <reference_outputs>}
  • DPO: {"prompt": <inputs>, "chosen": <reference_outputs>, "rejected": <metadata.rejected | outputs>}
bash
curl -OJ -H "X-API-Key: <your-api-key>" "http://localhost:8000/datasets/api/datasets/agent-eval-v1/export?format=sft"
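To make the two record shapes concrete, here is a rough sketch of the mapping from an example to an SFT or DPO line. It is not the export code itself. It reads `<metadata.rejected | outputs>` as "use metadata.rejected when present, otherwise the example's outputs", which is an interpretation, and the function names are hypothetical.

```python
import json


def to_sft(example):
    """Map one example to an SFT record."""
    return {
        "prompt": example["inputs"],
        "completion": example.get("reference_outputs"),
    }


def to_dpo(example):
    """Map one example to a DPO record.

    Assumption: the rejected answer comes from metadata["rejected"] if set,
    else from the example's "outputs" field.
    """
    metadata = example.get("metadata") or {}
    rejected = metadata.get("rejected", example.get("outputs"))
    return {
        "prompt": example["inputs"],
        "chosen": example.get("reference_outputs"),
        "rejected": rejected,
    }


def to_jsonl(examples, fmt="sft"):
    """Serialize examples as JSONL, one record per line."""
    converters = {"sft": to_sft, "dpo": to_dpo}
    if fmt not in converters:
        raise ValueError(f"unsupported format: {fmt}")
    return "\n".join(json.dumps(converters[fmt](e)) for e in examples)


example = {
    "id": "ex1",
    "inputs": "What is 2+2?",
    "reference_outputs": "4",
    "metadata": {"rejected": "5"},
}
sft_line = to_jsonl([example], fmt="sft")
dpo_line = to_jsonl([example], fmt="dpo")
```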

API

All endpoints require admin auth (X-API-Key header or admin session).

Method  Path                                                 Purpose
POST    /datasets/api/datasets                               Create a dataset
POST    /datasets/api/datasets/<name>/examples               Add examples (new version)
POST    /datasets/api/datasets/<name>/tag                    Tag a version
GET     /datasets/api/datasets/<name>/export?format=sft|dpo  Export JSONL
POST    /datasets/api/experiments/run                        Run an experiment
GET     /datasets/api/experiments                            List experiments
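Calling these endpoints from a script could look roughly like the sketch below, which uses only the Python standard library. The base URL is taken from the curl example above, and the helper `build_request` and the key value are placeholders. Request body fields for the POST endpoints are not documented in this section, so the sketch only builds the authenticated GET for listing experiments and leaves sending it as a comment.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host as the curl example; adjust for your deployment


def build_request(method, path, api_key, payload=None):
    """Build an authenticated request; `payload`, if given, is sent as a JSON body (assumed)."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data, method=method)
    req.add_header("X-API-Key", api_key)  # admin auth header named in this section
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


# "changeme" is a placeholder, not a real key.
list_req = build_request("GET", "/datasets/api/experiments", api_key="changeme")

# To actually send it against a running server:
# with urllib.request.urlopen(list_req) as resp:
#     body = resp.read()
```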

The eval-admin API (/admin/eval/...) inspects and controls the annotation process for these tasks: dataset/experiment status, per-instance annotation progress, ingested-trace counts, and assignment pause/resume.