Potato 2.3: Agentic Annotation, Solo Mode, and the Future of Human Evaluation
Potato 2.3.0 introduces agentic annotation with 12 trace format converters, Solo Mode for human-LLM collaborative labeling, Best-Worst Scaling, SSO/OAuth, Parquet export, and 15 demo projects.
We are excited to announce Potato 2.3.0, the largest release in Potato's history. This update introduces two major new systems -- agentic annotation and Solo Mode -- alongside Best-Worst Scaling, SSO/OAuth authentication, Parquet export, and 15 new demo projects.
The theme of this release is simple: the things we need to annotate have changed, and our tools need to keep up. Researchers are no longer just labeling text sentiment and named entities. They are evaluating multi-step AI agent traces, comparing LLM outputs at scale, and building datasets for increasingly complex tasks. Potato 2.3 is built for this new reality.
Agentic Annotation
The headline feature of Potato 2.3 is a complete system for evaluating AI agents through human annotation.
AI agents -- systems that take multi-step actions to accomplish tasks -- are proliferating rapidly. But evaluating them is hard. A single agent run might involve dozens of tool calls, reasoning steps, web page navigations, and intermediate outputs. Existing annotation tools show agents' outputs as flat text, losing the rich structure that evaluators need to see.
Potato's agentic annotation system solves this with three components: trace format converters, specialized trace displays, and pre-built evaluation schemas.
12 Trace Format Converters
Agent traces come in different formats depending on the framework. Potato normalizes them all into a unified representation:
| Converter | Source |
|---|---|
| openai | OpenAI Assistants API / function calling |
| anthropic | Anthropic Claude tool_use / Messages API |
| swebench | SWE-bench task traces |
| opentelemetry | OpenTelemetry span exports |
| mcp | Model Context Protocol sessions |
| multi_agent | CrewAI / AutoGen / LangGraph |
| langchain | LangChain callback traces |
| langfuse | LangFuse observation exports |
| react | ReAct Thought/Action/Observation |
| webarena | WebArena / VisualWebArena |
| atif | Agent Trace Interchange Format |
| raw_web | Raw browser recordings (HAR + screenshots) |
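To make the unified representation concrete, here is a sketch of what a ReAct-style trace file might contain. The field names (`trace_id`, `steps`, `step_type`, `content`) are illustrative assumptions, not Potato's documented schema; consult the converter docs for the exact format each source uses.

```python
import json

# A minimal ReAct-style trace as a list of Thought/Action/Observation steps.
# Field names here are illustrative, not Potato's actual schema.
trace = [
    {"step_type": "thought", "content": "I should search for the population of Ann Arbor."},
    {"step_type": "action", "content": "search('Ann Arbor population')"},
    {"step_type": "observation", "content": "Ann Arbor has about 123,000 residents."},
]

# Write one trace per line, matching the .jsonl convention used by trace_file.
with open("agent_traces.jsonl", "w") as f:
    f.write(json.dumps({"trace_id": "demo-001", "steps": trace}) + "\n")
```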
Configuration is straightforward:
```yaml
agentic:
  enabled: true
  trace_converter: react
  trace_file: "data/agent_traces.jsonl"
```

Auto-detection is available for projects that need to ingest traces from multiple sources:

```yaml
agentic:
  enabled: true
  trace_converter: auto
```

Three Display Types
Different agent modalities need different visualizations.
Agent Trace Display renders tool-using agent traces as color-coded step cards with collapsible observations, JSON pretty-printing, and a timeline sidebar:
```yaml
agentic:
  display_type: agent_trace
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    show_step_numbers: true
```

Web Agent Trace Display renders browsing agent traces with full screenshots, SVG overlays showing click targets and input fields, and a filmstrip for quick navigation:
```yaml
agentic:
  display_type: web_agent
  web_agent_display:
    screenshot_max_width: 900
    overlay:
      enabled: true
      click_marker: "circle"
      click_color: "#ef4444"
    filmstrip:
      enabled: true
```

Interactive Chat Display supports both trace review (evaluating a recorded conversation) and live chat (annotators interact with an agent in real time, then evaluate the conversation):
```yaml
agentic:
  display_type: interactive_chat
  interactive_chat_display:
    mode: trace_review
    trace_review:
      show_token_counts: true
      show_latency: true
```

Per-Turn Ratings
For any display type, annotators can rate individual steps alongside the overall trace:
```yaml
annotation_schemes:
  - annotation_type: likert
    name: overall_quality
    min: 1
    max: 5
  - annotation_type: per_turn_rating
    name: step_correctness
    target: agentic_steps
    rating_type: radio
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
```

Pre-Built Schemas
Nine annotation schemas cover common agent evaluation dimensions out of the box:
```yaml
annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_error_taxonomy
  - preset: agent_safety
```

Available presets: agent_task_success, agent_step_correctness, agent_error_taxonomy, agent_safety, agent_efficiency, agent_instruction_following, agent_explanation_quality, agent_web_action_correctness, agent_conversation_quality.
Read the Agentic Annotation documentation →
Solo Mode
The second major feature of Potato 2.3 is Solo Mode: a 12-phase workflow that replaces the traditional multi-annotator paradigm with a single human expert collaborating with an LLM.
The Problem
Traditional annotation requires multiple annotators for reliability. But hiring, training, and coordinating a team is expensive and slow. For many research projects, the annotation bottleneck is not the interface -- it is the logistics.
The Solution
Solo Mode lets one domain expert label a strategically selected subset of the data. An LLM learns from those labels, proposes labels for the remaining instances, and the human reviews only the cases where the LLM struggles. A 12-phase workflow orchestrates this automatically.
In internal benchmarks, Solo Mode achieved 95%+ agreement with full multi-annotator pipelines while requiring only 10-15% of the total human labels.
The 12 Phases
- Seed Annotation -- human labels 50 diverse instances
- Initial LLM Calibration -- LLM labels a calibration batch using seed examples
- Confusion Analysis -- identify systematic human-LLM disagreement patterns
- Guideline Refinement -- LLM proposes improved guidelines; human approves
- Labeling Function Generation -- ALCHEmist-inspired programmatic rules for easy instances
- Active Labeling -- human labels the most informative remaining instances
- Automated Refinement Loop -- iterative re-labeling with updated guidelines
- Disagreement Exploration -- human resolves cases where LLM and labeling functions conflict
- Edge Case Synthesis -- LLM generates synthetic ambiguous examples for human labeling
- Cascaded Confidence Escalation -- human reviews lowest-confidence LLM labels
- Prompt Optimization -- DSPy-inspired automated prompt search
- Final Validation -- random sample review; pass or cycle back
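To illustrate the idea behind phase 10 (Cascaded Confidence Escalation): the LLM attaches a confidence score to each proposed label, and anything below the threshold is escalated for human review. The data structures below are illustrative sketches, not Potato's internal API.

```python
# Split LLM-proposed labels by confidence: confident labels are accepted,
# low-confidence labels are escalated to the human annotator.
def split_by_confidence(llm_labels, threshold=0.85):
    accepted, escalated = [], []
    for item in llm_labels:
        (accepted if item["confidence"] >= threshold else escalated).append(item)
    return accepted, escalated

labels = [
    {"id": 1, "label": "Positive", "confidence": 0.97},
    {"id": 2, "label": "Neutral", "confidence": 0.62},
    {"id": 3, "label": "Negative", "confidence": 0.91},
]
accepted, escalated = split_by_confidence(labels)
# Item 2 falls below 0.85 and goes back to the human for review.
```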
Quick Start
```yaml
solo_mode:
  enabled: true
  llm:
    endpoint_type: openai
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
  seed_count: 50
  accuracy_threshold: 0.92
  confidence_threshold: 0.85

annotation_schemes:
  - annotation_type: radio
    name: sentiment
    labels: [Positive, Neutral, Negative]
```

Multi-Signal Instance Prioritization
Solo Mode uses six weighted pools to select the most valuable instances for human labeling:
```yaml
solo_mode:
  prioritization:
    pools:
      - name: uncertain
        weight: 0.30
      - name: disagreement
        weight: 0.25
      - name: boundary
        weight: 0.20
      - name: novel
        weight: 0.10
      - name: error_pattern
        weight: 0.10
      - name: random
        weight: 0.05
```

Read the Solo Mode documentation →
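One way to read the pool weights: each pool receives a share of the next human-labeling batch proportional to its weight. The quota logic below is a sketch of that interpretation, not Potato's actual selection code.

```python
# Allocate a human-labeling batch across pools by weight: floor each share,
# then hand leftover slots to the heaviest pools. Weights mirror the config.
POOL_WEIGHTS = {
    "uncertain": 0.30, "disagreement": 0.25, "boundary": 0.20,
    "novel": 0.10, "error_pattern": 0.10, "random": 0.05,
}

def batch_quotas(batch_size, weights=POOL_WEIGHTS):
    quotas = {name: int(batch_size * w) for name, w in weights.items()}
    leftover = batch_size - sum(quotas.values())
    # Distribute any rounding remainder to the highest-weight pools first.
    for name in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        quotas[name] += 1
    return quotas

quotas = batch_quotas(20)
# A batch of 20 yields 6 uncertain, 5 disagreement, 4 boundary, 2 novel,
# 2 error_pattern, and 1 random instance.
```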
Best-Worst Scaling
Potato 2.3 adds Best-Worst Scaling (BWS), also known as Maximum Difference Scaling. Annotators see a tuple of items (typically 4) and select the best and worst according to some criterion. BWS produces reliable scalar scores from simple binary judgments, requiring far fewer annotations than Likert scales for the same statistical power.
```yaml
annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    description: "Select the BEST and WORST translation"
    items_key: "translations"
    tuple_size: 4
    best_label: "Most Fluent"
    worst_label: "Least Fluent"
    randomize_order: true
    tuple_generation:
      method: balanced_incomplete
      tuples_per_item: 5
    scoring:
      method: bradley_terry
      auto_compute: true
      include_confidence: true
```

Three scoring methods are available:
- Counting -- simple (best_count - worst_count) / appearances
- Bradley-Terry -- pairwise comparison model (recommended default)
- Plackett-Luce -- full ranking model for maximum data efficiency
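The counting method is simple enough to sketch directly from its formula, (best_count - worst_count) / appearances. The judgment tuples below are made-up example data:

```python
from collections import Counter

# Each judgment is (tuple_of_items, best_pick, worst_pick).
judgments = [
    (("A", "B", "C", "D"), "A", "D"),
    (("A", "B", "C", "D"), "A", "C"),
    (("A", "B", "C", "D"), "B", "D"),
]

best, worst, appearances = Counter(), Counter(), Counter()
for items, b, w in judgments:
    best[b] += 1
    worst[w] += 1
    for item in items:
        appearances[item] += 1

# Counting score: (best_count - worst_count) / appearances, in [-1, 1].
scores = {i: (best[i] - worst[i]) / appearances[i] for i in appearances}
# A: (2 - 0) / 3 = 0.67; D: (0 - 2) / 3 = -0.67
```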
Score from the CLI:
```shell
python -m potato.bws score --config config.yaml --method bradley_terry --output scores.csv
```

The admin dashboard includes a BWS tab with score distributions, convergence charts, and split-half reliability metrics.
Read the Best-Worst Scaling documentation →
SSO & OAuth Authentication
Production annotation deployments need proper authentication. Potato 2.3 supports three OAuth methods:
Google OAuth
```yaml
authentication:
  method: google_oauth
  google_oauth:
    client_id: ${GOOGLE_CLIENT_ID}
    client_secret: ${GOOGLE_CLIENT_SECRET}
    redirect_uri: "https://annotation.example.com/auth/google/callback"
    allowed_domains:
      - "umich.edu"
    auto_register: true
```

GitHub OAuth with Organization Restriction
```yaml
authentication:
  method: github_oauth
  github_oauth:
    client_id: ${GITHUB_CLIENT_ID}
    client_secret: ${GITHUB_CLIENT_SECRET}
    redirect_uri: "https://annotation.example.com/auth/github/callback"
    allowed_organizations:
      - "my-research-lab"
    scopes:
      - "read:user"
      - "read:org"
```

Generic OIDC
Connect to Okta, Azure AD, Auth0, Keycloak, or any OIDC-compliant provider:
```yaml
authentication:
  method: oidc
  oidc:
    discovery_url: "https://accounts.example.com/.well-known/openid-configuration"
    client_id: ${OIDC_CLIENT_ID}
    client_secret: ${OIDC_CLIENT_SECRET}
    redirect_uri: "https://annotation.example.com/auth/oidc/callback"
```

All methods support domain restriction, auto-registration, and mixed mode (multiple auth methods on one login page).
Read the SSO & OAuth documentation →
Parquet Export
Annotation data is increasingly consumed by data science tools that expect columnar formats. Potato 2.3 can export annotations directly to Apache Parquet, producing three structured files:
- annotations.parquet -- one row per (instance, annotator, schema) with values, timestamps, and durations
- spans.parquet -- one row per annotated span with offsets, labels, and links
- items.parquet -- instance metadata with annotation counts and status
```yaml
parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  compression: zstd
  auto_export: true
```

Load directly in pandas, DuckDB, PyArrow, Polars, or Hugging Face Datasets:
```python
import pandas as pd

annotations = pd.read_parquet("output/parquet/annotations.parquet")

# Or with DuckDB for SQL queries
import duckdb

duckdb.sql("""
    SELECT instance_id, value, COUNT(*) AS n
    FROM 'output/parquet/annotations.parquet'
    WHERE schema_name = 'sentiment'
    GROUP BY instance_id, value
""")
```

Supports snappy, gzip, zstd, lz4, and brotli compression, incremental export with date/annotator partitioning, and dictionary encoding for string columns.
Read the Parquet Export documentation →
15 New Demo Projects
Potato 2.3 ships with 15 new demo projects in the project-hub/ directory, covering the new features:
Agentic Annotation Demos
- react-agent-eval -- Evaluate ReAct agent traces with step-level ratings
- web-agent-eval -- WebArena trace evaluation with screenshot overlays
- chatbot-eval -- Interactive chat evaluation with live agent proxy
- multi-agent-eval -- Evaluate CrewAI multi-agent systems
- swebench-eval -- SWE-bench trace evaluation for coding agents
Solo Mode Demos
- solo-sentiment -- Solo Mode sentiment classification on product reviews
- solo-ner -- Solo Mode named entity recognition
- solo-toxicity -- Solo Mode toxicity detection with edge case synthesis
Best-Worst Scaling Demos
- bws-translation -- Machine translation quality ranking
- bws-summarization -- Summary quality comparison
- bws-image-quality -- Image generation quality ranking
Authentication Demos
- google-oauth-demo -- Google OAuth setup example
- github-oauth-demo -- GitHub OAuth with org restriction
Export Demos
- parquet-export-demo -- Parquet export with DuckDB analysis script
- huggingface-upload -- Export to Parquet and push to Hugging Face Hub
Each demo includes a complete config.yaml, sample data, and a README with setup instructions. Start any demo with:
```shell
cd project-hub/react-agent-eval
potato start config.yaml
```

Security Hardening
Potato 2.3 includes several security improvements:
- Session tokens use cryptographically secure random generation with configurable expiration
- CSRF protection is enabled by default for all form submissions
- Rate limiting on authentication endpoints (configurable, default 10 attempts per minute)
- Input sanitization for all user-provided content displayed in the annotation interface
- Dependency audit -- all Python and JavaScript dependencies updated to latest secure versions
- Content Security Policy headers added to prevent XSS
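The token-generation point can be sketched with Python's standard `secrets` module. This is an illustration of the general technique, not Potato's actual session code, and the exact meaning of `token_length` (bytes vs. characters) is an assumption here:

```python
import secrets
import time

# Sketch of cryptographically secure session tokens with expiration.
def new_session(token_length=64, lifetime_hours=24):
    return {
        # token_urlsafe(n) draws n random bytes, yielding ~1.3 * n characters.
        "token": secrets.token_urlsafe(token_length),
        "expires_at": time.time() + lifetime_hours * 3600,
    }

def is_valid(session):
    return time.time() < session["expires_at"]
```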
```yaml
security:
  csrf_protection: true
  rate_limiting:
    auth_attempts: 10   # per minute
    api_requests: 100   # per minute
  session:
    token_length: 64
    lifetime_hours: 24
  content_security_policy: true
```

Upgrading
From Potato 2.2.x
```shell
pip install --upgrade potato-annotation
```

All v2.2 configurations are fully backward-compatible. No changes to existing configs are required.
New Dependencies
Parquet export requires PyArrow:
```shell
pip install potato-annotation[parquet]
```

Solo Mode requires an LLM provider SDK:

```shell
pip install potato-annotation[solo]  # installs openai + anthropic SDKs
```

Or install everything:

```shell
pip install potato-annotation[all]
```

What's Next
Potato 2.3 represents a significant expansion of what annotation tools can do. We are already working on the next set of features:
- Annotation diffing -- compare annotations across rounds and annotators with visual diffs
- Federated annotation -- coordinate annotation across multiple Potato instances
- Streaming data sources -- annotate data from Kafka, Pub/Sub, and other streaming systems
- Mobile-optimized interface -- responsive annotation for tablets and phones
We would love to hear your feedback. File issues on GitHub, join the discussion in GitHub Discussions, or reach out to the team directly.