Potato 2.3: Agentic Annotation, Solo Mode, and the Future of Human Evaluation
Potato 2.3.0 introduces agentic annotation with 12 trace format converters, Solo Mode for human-LLM collaborative labeling, Best-Worst Scaling, SSO/OAuth, Parquet export, and 15 demo projects.
Note: This post describes Potato 2.3 as it was at release. Some configuration keys and features have been updated in later versions. See the current documentation for up-to-date configuration syntax.
Potato 2.3.0 is our largest release so far. It brings two new systems, agentic annotation and Solo Mode, plus Best-Worst Scaling, SSO/OAuth authentication, Parquet export, and 15 new demo projects.
The reason for most of this is that what people annotate has changed. A lot of researchers are no longer just labeling sentiment and named entities. They are picking apart multi-step agent traces, comparing LLM outputs in bulk, and building datasets for tasks that did not exist a couple of years ago. Potato 2.3 is aimed squarely at that work.
Agentic annotation
The headline feature in 2.3 is a full system for evaluating AI agents through human annotation.
Agents, meaning systems that take multiple steps to get something done, are everywhere now, and they are genuinely hard to evaluate. One run might rack up dozens of tool calls, reasoning steps, page navigations, and intermediate outputs. Most annotation tools flatten all of that into plain text, which throws away exactly the structure an evaluator needs to see.
Potato's agentic annotation system has three parts.
12 trace format converters
Agent traces look different depending on the framework that produced them. Potato normalizes them into one representation:
| Converter | Source |
|---|---|
| openai | OpenAI Assistants API / function calling |
| anthropic | Anthropic Claude tool_use / Messages API |
| swebench | SWE-bench task traces |
| opentelemetry | OpenTelemetry span exports |
| mcp | Model Context Protocol sessions |
| multi_agent | CrewAI / AutoGen / LangGraph |
| langchain | LangChain callback traces |
| langfuse | LangFuse observation exports |
| react | ReAct Thought/Action/Observation |
| webarena | WebArena / VisualWebArena |
| atif | Agent Trace Interchange Format |
| raw_web | Raw browser recordings (HAR + screenshots) |
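To make the idea of a shared representation concrete, here is a minimal sketch of what normalizing a ReAct-style transcript might involve. The `parse_react_trace` function and the `{"type", "content"}` step shape are assumptions made for illustration; they are not Potato's actual converter or its internal schema.

```python
import re

# Hypothetical normalized step: {"type": "thought" | "action" | "observation", "content": str}
STEP_PATTERN = re.compile(r"^(Thought|Action|Observation)\s*:\s*(.*)$")


def parse_react_trace(text):
    """Split a ReAct transcript into ordered, typed steps."""
    steps = []
    for line in text.splitlines():
        stripped = line.strip()
        match = STEP_PATTERN.match(stripped)
        if match:
            # A Thought/Action/Observation marker starts a new step.
            steps.append({"type": match.group(1).lower(), "content": match.group(2)})
        elif steps and stripped:
            # Unmarked lines continue the previous step (e.g. multi-line observations).
            steps[-1]["content"] += "\n" + stripped
    return steps


trace = """Thought: I need the current weather.
Action: get_weather(city="Ann Arbor")
Observation: 12C, cloudy
with light wind
Thought: I can answer now."""

steps = parse_react_trace(trace)
for number, step in enumerate(steps, start=1):
    print(number, step["type"], repr(step["content"]))
```

Once every framework's output is reduced to a list like this, a single display and a single set of annotation schemes can serve all of them.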
The config is short:
```yaml
agentic:
  enabled: true
  trace_converter: react
  trace_file: "data/agent_traces.jsonl"
```
If you are pulling traces from several sources at once, let it auto-detect:
```yaml
agentic:
  enabled: true
  trace_converter: auto
```
Three display types
Different kinds of agents call for different visualizations.
Agent Trace Display renders tool-using agent traces as color-coded step cards, with collapsible observations, pretty-printed JSON, and a timeline sidebar:
```yaml
agentic:
  display_type: agent_trace
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    show_step_numbers: true
```
Web Agent Trace Display handles browsing agents: full screenshots, SVG overlays marking click targets and input fields, and a filmstrip for jumping around quickly:
```yaml
agentic:
  display_type: web_agent
  web_agent_display:
    screenshot_max_width: 900
    overlay:
      enabled: true
      click_marker: "circle"
      click_color: "#ef4444"
    filmstrip:
      enabled: true
```
Interactive Chat Display covers two cases: trace review, where you evaluate a recorded conversation, and live chat, where annotators talk to an agent in real time and then rate the conversation:
```yaml
agentic:
  display_type: interactive_chat
  interactive_chat_display:
    mode: trace_review
    trace_review:
      show_token_counts: true
      show_latency: true
```
Per-turn ratings
With any display type, annotators can rate individual steps as well as the trace overall:
```yaml
annotation_schemes:
  - annotation_type: likert
    name: overall_quality
    min: 1
    max: 5
  - annotation_type: per_turn_rating
    name: step_correctness
    target: agentic_steps
    rating_type: radio
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
```
Pre-built schemas
Nine ready-made schemas cover the usual agent evaluation dimensions out of the box:
```yaml
annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_error_taxonomy
  - preset: agent_safety
```
The presets are agent_task_success, agent_step_correctness, agent_error_taxonomy, agent_safety, agent_efficiency, agent_instruction_following, agent_explanation_quality, agent_web_action_correctness, and agent_conversation_quality.
Read the agentic annotation documentation →
Solo Mode
The second big feature in 2.3 is Solo Mode, a 12-phase workflow that swaps the usual crowd of annotators for one human expert working alongside an LLM.
The problem
You normally need several annotators to get reliable labels. Hiring, training, and coordinating that team is slow and expensive. On a lot of research projects the bottleneck is not the annotation interface at all; it is the logistics of running a team.
How Solo Mode handles it
One domain expert labels a carefully chosen slice of the data. An LLM learns from those labels, proposes labels for everything else, and the human only steps back in where the LLM is unsure. A 12-phase workflow runs the whole loop.
In our internal benchmarks, Solo Mode reached 95% or better agreement with full multi-annotator pipelines while using only 10 to 15% of the human labels.
The 12 phases
- Seed annotation: the human labels 50 diverse instances.
- Initial LLM calibration: the LLM labels a calibration batch using those seed examples.
- Confusion analysis: find the patterns where human and LLM systematically disagree.
- Guideline refinement: the LLM proposes better guidelines and the human approves them.
- Labeling function generation: ALCHEmist-inspired programmatic rules for the easy instances.
- Active labeling: the human labels the most informative instances that remain.
- Automated refinement loop: re-label iteratively as the guidelines improve.
- Disagreement exploration: the human resolves cases where the LLM and the labeling functions clash.
- Edge case synthesis: the LLM invents ambiguous examples for the human to label.
- Cascaded confidence escalation: the human reviews the LLM's lowest-confidence labels.
- Prompt optimization: a DSPy-inspired automated prompt search.
- Final validation: review a random sample, then pass or cycle back.
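As an illustration of the confusion analysis phase, the sketch below counts which label pairs the human and the LLM disagree on most often. The `confusion_pairs` helper and the sample labels are hypothetical and only show the general idea; Potato's own analysis may work differently.

```python
from collections import Counter


def confusion_pairs(human_labels, llm_labels):
    """Count (human, llm) label pairs where the two disagree, most frequent first."""
    if len(human_labels) != len(llm_labels):
        raise ValueError("label lists must be the same length")
    pairs = Counter(
        (h, m) for h, m in zip(human_labels, llm_labels) if h != m
    )
    return pairs.most_common()


# Made-up calibration batch: the human's seed labels vs. the LLM's labels.
human = ["Positive", "Neutral", "Negative", "Neutral", "Neutral", "Positive"]
llm = ["Positive", "Positive", "Negative", "Positive", "Negative", "Positive"]

top_confusions = confusion_pairs(human, llm)
agreement = sum(h == m for h, m in zip(human, llm)) / len(human)

print(f"agreement: {agreement:.2f}")
for (human_label, llm_label), count in top_confusions:
    print(f"human={human_label} llm={llm_label}: {count}")
```

A pair that dominates the list, such as Neutral being read as Positive here, is the kind of systematic disagreement that guideline refinement is meant to address.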
Quick start
```yaml
solo_mode:
  enabled: true
  llm:
    endpoint_type: openai
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
  seed_count: 50
  accuracy_threshold: 0.92
  confidence_threshold: 0.85

annotation_schemes:
  - annotation_type: radio
    name: sentiment
    labels: [Positive, Neutral, Negative]
```
Multi-signal instance prioritization
Solo Mode draws from six weighted pools to decide which instances are worth a human's time:
```yaml
solo_mode:
  prioritization:
    pools:
      - name: uncertain
        weight: 0.30
      - name: disagreement
        weight: 0.25
      - name: boundary
        weight: 0.20
      - name: novel
        weight: 0.10
      - name: error_pattern
        weight: 0.10
      - name: random
        weight: 0.05
```
Read the Solo Mode documentation →
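One way to read the pool weights in the prioritization config is as shares of a fixed human-labeling budget. The sketch below takes that view and splits a batch across the six pools with a largest-remainder rule. This is an assumption about how weighted pools could be used, not a description of Potato's scheduler, and `allocate_budget` is a hypothetical helper.

```python
from fractions import Fraction

# Pool weights from the prioritization config, kept as strings so that
# Fraction can parse them exactly (no floating-point rounding).
POOL_WEIGHTS = {
    "uncertain": "0.30",
    "disagreement": "0.25",
    "boundary": "0.20",
    "novel": "0.10",
    "error_pattern": "0.10",
    "random": "0.05",
}


def allocate_budget(weights, budget):
    """Split `budget` instances across pools in proportion to their weights."""
    exact = {name: Fraction(w) * budget for name, w in weights.items()}
    counts = {name: int(share) for name, share in exact.items()}  # round down
    leftover = budget - sum(counts.values())
    # Give any leftover slots to the pools with the largest fractional remainders.
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


allocation = allocate_budget(POOL_WEIGHTS, 40)
print(allocation)
```

The small `random` share is worth keeping in any such scheme, since it gives an unbiased sample for checking the other pools.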
Best-Worst Scaling
Potato 2.3 adds Best-Worst Scaling (BWS), sometimes called Maximum Difference Scaling. Annotators see a tuple of items, usually four, and pick the best and the worst against some criterion. From those simple binary judgments BWS works out reliable scalar scores, and it gets there with far fewer annotations than a Likert scale would need for the same statistical power.
```yaml
annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    description: "Select the BEST and WORST translation"
    items_key: "translations"
    tuple_size: 4
    best_label: "Most Fluent"
    worst_label: "Least Fluent"
    randomize_order: true
    tuple_generation:
      method: balanced_incomplete
      tuples_per_item: 5
    scoring:
      method: bradley_terry
      auto_compute: true
      include_confidence: true
```
There are three scoring methods. Counting is the simple one: (best_count - worst_count) / appearances. Bradley-Terry is a pairwise comparison model and the recommended default. Plackett-Luce is a full ranking model when you want to squeeze the most out of your data.
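The counting method is simple enough to work through by hand. Below is a minimal sketch of the formula (best_count - worst_count) / appearances; the judgment record shape (`items`, `best`, `worst`) is a hypothetical one chosen for this example, not Potato's export format.

```python
from collections import defaultdict


def counting_scores(judgments):
    """Score each item as (best_count - worst_count) / appearances."""
    best = defaultdict(int)
    worst = defaultdict(int)
    appearances = defaultdict(int)
    for judgment in judgments:
        for item in judgment["items"]:
            appearances[item] += 1
        best[judgment["best"]] += 1
        worst[judgment["worst"]] += 1
    return {
        item: (best[item] - worst[item]) / count
        for item, count in appearances.items()
    }


# Three made-up judgments over tuples of four items each.
judgments = [
    {"items": ["A", "B", "C", "D"], "best": "A", "worst": "D"},
    {"items": ["A", "B", "C", "E"], "best": "A", "worst": "C"},
    {"items": ["B", "C", "D", "E"], "best": "B", "worst": "D"},
]

scores = counting_scores(judgments)
for item, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(item, round(score, 3))
```

Scores fall between -1 (always picked worst) and 1 (always picked best). An item that is never picked either way, like E here, lands at 0.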
Score from the CLI:
```bash
python -m potato.bws score --config config.yaml --method bradley_terry --output scores.csv
```
The admin dashboard has a BWS tab showing score distributions, convergence charts, and split-half reliability.
Read the Best-Worst Scaling documentation →
SSO and OAuth authentication
A production annotation deployment needs real authentication. Potato 2.3 supports three OAuth methods.
Google OAuth
```yaml
authentication:
  method: google_oauth
  google_oauth:
    client_id: ${GOOGLE_CLIENT_ID}
    client_secret: ${GOOGLE_CLIENT_SECRET}
    redirect_uri: "https://annotation.example.com/auth/google/callback"
    allowed_domains:
      - "umich.edu"
    auto_register: true
```
GitHub OAuth with Organization Restriction
```yaml
authentication:
  method: github_oauth
  github_oauth:
    client_id: ${GITHUB_CLIENT_ID}
    client_secret: ${GITHUB_CLIENT_SECRET}
    redirect_uri: "https://annotation.example.com/auth/github/callback"
    allowed_organizations:
      - "my-research-lab"
    scopes:
      - "read:user"
      - "read:org"
```
Generic OIDC
Connect to Okta, Azure AD, Auth0, Keycloak, or anything else that speaks OIDC:
```yaml
authentication:
  method: oidc
  oidc:
    discovery_url: "https://accounts.example.com/.well-known/openid-configuration"
    client_id: ${OIDC_CLIENT_ID}
    client_secret: ${OIDC_CLIENT_SECRET}
    redirect_uri: "https://annotation.example.com/auth/oidc/callback"
```
All three support domain restriction, auto-registration, and mixed mode, where several auth methods share one login page.
Read the SSO and OAuth documentation →
Parquet export
More and more, annotation data ends up in data science tools that want columnar formats. Potato 2.3 can export straight to Apache Parquet as three files:
- annotations.parquet, one row per (instance, annotator, schema) with values, timestamps, and durations
- spans.parquet, one row per annotated span with offsets, labels, and links
- items.parquet, instance metadata with annotation counts and status
```yaml
parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  compression: zstd
  auto_export: true
```
Load directly in pandas, DuckDB, PyArrow, Polars, or Hugging Face Datasets:
```python
import pandas as pd

annotations = pd.read_parquet("output/parquet/annotations.parquet")

# Or with DuckDB for SQL queries
import duckdb

duckdb.sql("""
    SELECT instance_id, value, COUNT(*) as n
    FROM 'output/parquet/annotations.parquet'
    WHERE schema_name = 'sentiment'
    GROUP BY instance_id, value
""")
```
It handles snappy, gzip, zstd, lz4, and brotli compression, incremental export partitioned by date or annotator, and dictionary encoding for string columns.
Read the Parquet export documentation →
15 new demo projects
Potato 2.3 ships 15 new demos in the project-hub/ directory, one or more for each new feature.
For agentic annotation:
1. react-agent-eval, evaluating ReAct agent traces with step-level ratings
2. web-agent-eval, WebArena trace evaluation with screenshot overlays
3. chatbot-eval, interactive chat evaluation with a live agent proxy
4. multi-agent-eval, evaluating CrewAI multi-agent systems
5. swebench-eval, SWE-bench trace evaluation for coding agents
For Solo Mode:
6. solo-sentiment, sentiment classification on product reviews
7. solo-ner, named entity recognition
8. solo-toxicity, toxicity detection with edge case synthesis
For Best-Worst Scaling:
9. bws-translation, machine translation quality ranking
10. bws-summarization, summary quality comparison
11. bws-image-quality, image generation quality ranking
For authentication:
12. google-oauth-demo, a Google OAuth setup example
13. github-oauth-demo, GitHub OAuth with org restriction
For export:
14. parquet-export-demo, Parquet export with a DuckDB analysis script
15. huggingface-upload, export to Parquet and push to the Hugging Face Hub
Each demo comes with a full config.yaml, sample data, and a README. Start any of them with:
```bash
cd project-hub/react-agent-eval
potato start config.yaml
```
Security hardening
A handful of security improvements landed in 2.3:
- Session tokens now use cryptographically secure random generation with configurable expiration
- CSRF protection is on by default for every form submission
- Rate limiting on the authentication endpoints (configurable, 10 attempts per minute by default)
- Input sanitization for any user-provided content shown in the annotation interface
- A dependency audit that brought all Python and JavaScript dependencies up to current secure versions
- Content Security Policy headers to head off XSS
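To show what "10 attempts per minute" can mean in practice, here is a minimal sliding-window limiter. It is a generic sketch of the technique, not Potato's implementation; the `SlidingWindowLimiter` class and its injectable `clock` parameter are assumptions made so the example can run without real waiting.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_attempts` per `window_seconds` for each key (e.g. an IP)."""

    def __init__(self, max_attempts=10, window_seconds=60.0, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the example can simulate time
        self._attempts = defaultdict(deque)

    def allow(self, key):
        now = self.clock()
        attempts = self._attempts[key]
        # Drop timestamps that have fallen out of the window.
        while attempts and now - attempts[0] >= self.window_seconds:
            attempts.popleft()
        if len(attempts) >= self.max_attempts:
            return False
        attempts.append(now)
        return True


# Simulated clock: 11 attempts at the same instant, then one a minute later.
fake_now = [0.0]
limiter = SlidingWindowLimiter(clock=lambda: fake_now[0])
results = [limiter.allow("203.0.113.7") for _ in range(11)]
fake_now[0] = 60.0
allowed_after_window = limiter.allow("203.0.113.7")

print(results.count(True), "allowed,", results.count(False), "rejected")
print("allowed after the window passed:", allowed_after_window)
```

A sliding window avoids the burst that fixed one-minute buckets allow at a bucket boundary, at the cost of storing one timestamp per recent attempt.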
The corresponding configuration:
```yaml
security:
  csrf_protection: true
  rate_limiting:
    auth_attempts: 10   # per minute
    api_requests: 100   # per minute
  session:
    token_length: 64
    lifetime_hours: 24
  content_security_policy: true
```
Upgrading
From Potato 2.2.x
```bash
pip install --upgrade potato-annotation
```
Your v2.2 configs are fully backward-compatible, so nothing needs to change.
New dependencies
Parquet export needs PyArrow:
```bash
pip install potato-annotation[parquet]
```
Solo Mode requires an LLM provider SDK:
```bash
pip install potato-annotation[solo]  # installs openai + anthropic SDKs
```
Or install everything:
```bash
pip install potato-annotation[all]
```
What's next
A few things we are already working on for the next release:
- Annotation diffing, to compare annotations across rounds and annotators with visual diffs
- Federated annotation, to coordinate work across multiple Potato instances
- Streaming data sources, to annotate from Kafka, Pub/Sub, and similar systems
- A mobile-friendly interface for annotating on tablets and phones
We would genuinely like to hear what you think. File issues on GitHub, start a thread in GitHub Discussions, or just reach out to the team.
For the full changelog, including any config keys that changed, see the v2.3.0 release notes in the repository.