Note: This post describes Potato 2.3 as it was at release. Some configuration keys and features have been updated in later versions. See the current documentation for up-to-date configuration syntax.

Potato 2.3.0 is our largest release so far. It brings two new systems, agentic annotation and Solo Mode, plus Best-Worst Scaling, SSO/OAuth authentication, Parquet export, and 15 new demo projects.

Most of this release responds to a shift in what people annotate. Many researchers are no longer just labeling sentiment and named entities. They are picking apart multi-step agent traces, comparing LLM outputs in bulk, and building datasets for tasks that did not exist a couple of years ago. Potato 2.3 is aimed squarely at that work.

Agentic annotation

The headline feature in 2.3 is a full system for evaluating AI agents through human annotation.

Agents, meaning systems that take multiple steps to get something done, are everywhere now, and they are genuinely hard to evaluate. One run might rack up dozens of tool calls, reasoning steps, page navigations, and intermediate outputs. Most annotation tools flatten all of that into plain text, which throws away exactly the structure an evaluator needs to see.

Potato's agentic annotation system has three parts.

12 trace format converters

Agent traces look different depending on the framework that produced them. Potato normalizes them into one representation:

Converter	Source
`openai`	OpenAI Assistants API / function calling
`anthropic`	Anthropic Claude tool_use / Messages API
`swebench`	SWE-bench task traces
`opentelemetry`	OpenTelemetry span exports
`mcp`	Model Context Protocol sessions
`multi_agent`	CrewAI / AutoGen / LangGraph
`langchain`	LangChain callback traces
`langfuse`	LangFuse observation exports
`react`	ReAct Thought/Action/Observation
`webarena`	WebArena / VisualWebArena
`atif`	Agent Trace Interchange Format
`raw_web`	Raw browser recordings (HAR + screenshots)

The config is short:

yaml

agentic:
  enabled: true
  trace_converter: react
  trace_file: "data/agent_traces.jsonl"
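
To make "one representation" concrete, here is a minimal sketch of what normalizing a ReAct-style trace could look like. It is not Potato's converter: the `Step` shape and the `parse_react` function are hypothetical names chosen for illustration, and the sketch assumes each step starts on its own line with `Thought`, `Action`, or `Observation`, optionally followed by a step number.

```python
import re
from typing import TypedDict


class Step(TypedDict):
    """Hypothetical normalized step; not Potato's internal schema."""
    index: int
    kind: str      # "thought", "action", or "observation"
    content: str


# Matches lines such as "Thought 1: ..." or "Action: ..."
_STEP_LINE = re.compile(r"^(Thought|Action|Observation)\s*\d*\s*:\s*(.*)$")


def parse_react(text: str) -> list[Step]:
    """Split a ReAct transcript into an ordered list of typed steps."""
    steps: list[Step] = []
    for raw in text.splitlines():
        line = raw.strip()
        match = _STEP_LINE.match(line)
        if match:
            steps.append({
                "index": len(steps),
                "kind": match.group(1).lower(),
                "content": match.group(2),
            })
        elif steps and line:
            # A continuation line belongs to the step that precedes it.
            steps[-1]["content"] += "\n" + line
    return steps
```

Once every framework's output is reduced to a list like this, the display and rating layers only have to understand one shape.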

If you are pulling traces from several sources at once, let it auto-detect:

yaml

agentic:
  enabled: true
  trace_converter: auto

Three display types

Different kinds of agents call for different visualizations.

Agent Trace Display renders tool-using agent traces as color-coded step cards, with collapsible observations, pretty-printed JSON, and a timeline sidebar:

yaml

agentic:
  display_type: agent_trace
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    show_step_numbers: true

Web Agent Trace Display handles browsing agents: full screenshots, SVG overlays marking click targets and input fields, and a filmstrip for jumping around quickly:

yaml

agentic:
  display_type: web_agent
  web_agent_display:
    screenshot_max_width: 900
    overlay:
      enabled: true
      click_marker: "circle"
      click_color: "#ef4444"
    filmstrip:
      enabled: true

Interactive Chat Display covers two cases: trace review, where you evaluate a recorded conversation, and live chat, where annotators talk to an agent in real time and then rate the conversation:

yaml

agentic:
  display_type: interactive_chat
  interactive_chat_display:
    mode: trace_review
    trace_review:
      show_token_counts: true
      show_latency: true

Per-turn ratings

With any display type, annotators can rate individual steps as well as the trace overall:

yaml

annotation_schemes:
  - annotation_type: likert
    name: overall_quality
    min: 1
    max: 5
 
  - annotation_type: per_turn_rating
    name: step_correctness
    target: agentic_steps
    rating_type: radio
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
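
At analysis time, step-level ratings are usually rolled up per trace. The sketch below assumes you have already pulled the `step_correctness` labels for one trace into an ordered list of strings; that input shape and the `summarize_steps` helper are assumptions for illustration, not part of Potato's export format.

```python
from collections import Counter
from typing import Optional, TypedDict


class StepSummary(TypedDict):
    counts: dict
    fraction_correct: float
    first_incorrect_step: Optional[int]


def summarize_steps(step_labels: list[str]) -> StepSummary:
    """Roll up ordered per-step labels for a single trace."""
    counts = Counter(step_labels)
    total = len(step_labels)
    # Index of the first step rated "Incorrect", or None if there is none.
    first_bad = next(
        (i for i, label in enumerate(step_labels) if label == "Incorrect"),
        None,
    )
    return {
        "counts": dict(counts),
        "fraction_correct": counts["Correct"] / total if total else 0.0,
        "first_incorrect_step": first_bad,
    }
```

The position of the first incorrect step is often more useful than the raw fraction, because it shows where a run started to go wrong.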

Pre-built schemas

Nine ready-made schemas cover the usual agent evaluation dimensions out of the box:

yaml

annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_error_taxonomy
  - preset: agent_safety

The presets are agent_task_success, agent_step_correctness, agent_error_taxonomy, agent_safety, agent_efficiency, agent_instruction_following, agent_explanation_quality, agent_web_action_correctness, and agent_conversation_quality.

Read the agentic annotation documentation →

Solo Mode

The second big feature in 2.3 is Solo Mode, a 12-phase workflow that swaps the usual crowd of annotators for one human expert working alongside an LLM.

The problem

You normally need several annotators to get reliable labels. Hiring, training, and coordinating that team is slow and expensive. On a lot of research projects the bottleneck is not the annotation interface at all, it is the logistics of running a team.

How Solo Mode handles it

One domain expert labels a carefully chosen slice of the data. An LLM learns from those labels, proposes labels for everything else, and the human only steps back in where the LLM is unsure. A 12-phase workflow runs the whole loop.

In our internal benchmarks, Solo Mode matched full multi-annotator pipelines at 95% agreement or better, using only 10 to 15% of the human labels.

The 12 phases

Seed annotation: the human labels 50 diverse instances.
Initial LLM calibration: the LLM labels a calibration batch using those seed examples.
Confusion analysis: find the patterns where human and LLM systematically disagree.
Guideline refinement: the LLM proposes better guidelines and the human approves them.
Labeling function generation: ALCHEmist-inspired programmatic rules for the easy instances.
Active labeling: the human labels the most informative instances that remain.
Automated refinement loop: re-label iteratively as the guidelines improve.
Disagreement exploration: the human resolves cases where the LLM and the labeling functions clash.
Edge case synthesis: the LLM invents ambiguous examples for the human to label.
Cascaded confidence escalation: the human reviews the LLM's lowest-confidence labels.
Prompt optimization: a DSPy-inspired automated prompt search.
Final validation: review a random sample, then pass or cycle back.

Quick start

yaml

solo_mode:
  enabled: true
  llm:
    endpoint_type: openai
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
  seed_count: 50
  accuracy_threshold: 0.92
  confidence_threshold: 0.85
 
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    labels: [Positive, Neutral, Negative]
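
The `confidence_threshold` setting suggests a simple routing rule: keep LLM labels at or above the threshold and send the rest to the human. The sketch below illustrates that idea only. The `Prediction` class, the `route` function, and the choice of `>=` over `>` are assumptions, not Potato's implementation.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical LLM output for one instance."""
    instance_id: str
    label: str
    confidence: float  # assumed to lie in [0, 1]


def route(predictions, confidence_threshold=0.85):
    """Split predictions into auto-accepted and human-review lists."""
    accepted, needs_review = [], []
    for pred in predictions:
        if pred.confidence >= confidence_threshold:
            accepted.append(pred)
        else:
            needs_review.append(pred)
    # Least confident first, so the human sees the hardest cases early.
    needs_review.sort(key=lambda p: p.confidence)
    return accepted, needs_review
```

Raising the threshold trades more human effort for fewer unreviewed LLM labels.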

Multi-signal instance prioritization

Solo Mode draws from six weighted pools to decide which instances are worth a human's time:

yaml

solo_mode:
  prioritization:
    pools:
      - name: uncertain
        weight: 0.30
      - name: disagreement
        weight: 0.25
      - name: boundary
        weight: 0.20
      - name: novel
        weight: 0.10
      - name: error_pattern
        weight: 0.10
      - name: random
        weight: 0.05
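
One way to read these weights is as sampling probabilities: each time the human needs a new instance, a pool is chosen in proportion to its weight. The sketch below shows that reading with the standard library. The `pick_pool` helper and the rule of skipping empty pools are assumptions, not a description of Potato's scheduler.

```python
import random

# Weights copied from the config above; they sum to 1.0.
POOL_WEIGHTS = {
    "uncertain": 0.30,
    "disagreement": 0.25,
    "boundary": 0.20,
    "novel": 0.10,
    "error_pattern": 0.10,
    "random": 0.05,
}


def pick_pool(rng: random.Random, nonempty: set) -> str:
    """Choose the pool to draw the next instance from.

    Pools with no remaining instances are skipped; random.choices
    works with relative weights, so the rest keep their proportions.
    """
    names = [name for name in POOL_WEIGHTS if name in nonempty]
    if not names:
        raise ValueError("all pools are empty")
    weights = [POOL_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

The small `random` pool matters: it keeps a slice of unbiased samples in the human's queue, which guards against the other signals all missing the same kind of error.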

Read the Solo Mode documentation →

Best-Worst Scaling

Potato 2.3 adds Best-Worst Scaling (BWS), sometimes called Maximum Difference Scaling. Annotators see a tuple of items, usually four, and pick the best and the worst against some criterion. From those simple binary judgments BWS works out reliable scalar scores, and it gets there with far fewer annotations than a Likert scale would need for the same statistical power.

yaml

annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    description: "Select the BEST and WORST translation"
    items_key: "translations"
    tuple_size: 4
    best_label: "Most Fluent"
    worst_label: "Least Fluent"
    randomize_order: true
 
    tuple_generation:
      method: balanced_incomplete
      tuples_per_item: 5
 
    scoring:
      method: bradley_terry
      auto_compute: true
      include_confidence: true

There are three scoring methods. Counting is the simple one: (best_count - worst_count) / appearances. Bradley-Terry is a pairwise comparison model and the recommended default. Plackett-Luce is a full ranking model when you want to squeeze the most out of your data.
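
The counting formula is easy to reproduce by hand. The sketch below is a standalone illustration of (best_count - worst_count) / appearances, not Potato's scoring code; the `counting_scores` function and the `(items, best, worst)` input shape are assumptions.

```python
from collections import defaultdict


def counting_scores(judgments):
    """Compute BWS counting scores.

    judgments: iterable of (items, best, worst), where items is the
    tuple shown to the annotator and best/worst are the chosen items.
    Returns a dict mapping item -> score in [-1, 1].
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    appearances = defaultdict(int)
    for items, chosen_best, chosen_worst in judgments:
        if chosen_best not in items or chosen_worst not in items:
            raise ValueError("best and worst must come from the shown tuple")
        for item in items:
            appearances[item] += 1
        best[chosen_best] += 1
        worst[chosen_worst] += 1
    return {
        item: (best[item] - worst[item]) / appearances[item]
        for item in appearances
    }
```

An item picked as best every time it appears scores 1.0, one picked as worst every time scores -1.0, and an item never picked either way sits at 0.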

Score from the CLI:

bash

python -m potato.bws score --config config.yaml --method bradley_terry --output scores.csv

The admin dashboard has a BWS tab showing score distributions, convergence charts, and split-half reliability.

Read the Best-Worst Scaling documentation →

SSO and OAuth authentication

A production annotation deployment needs real authentication. Potato 2.3 supports three OAuth methods.

Google OAuth

yaml

authentication:
  method: google_oauth
  google_oauth:
    client_id: ${GOOGLE_CLIENT_ID}
    client_secret: ${GOOGLE_CLIENT_SECRET}
    redirect_uri: "https://annotation.example.com/auth/google/callback"
    allowed_domains:
      - "umich.edu"
    auto_register: true

GitHub OAuth with Organization Restriction

yaml

authentication:
  method: github_oauth
  github_oauth:
    client_id: ${GITHUB_CLIENT_ID}
    client_secret: ${GITHUB_CLIENT_SECRET}
    redirect_uri: "https://annotation.example.com/auth/github/callback"
    allowed_organizations:
      - "my-research-lab"
    scopes:
      - "read:user"
      - "read:org"

Generic OIDC

Connect to Okta, Azure AD, Auth0, Keycloak, or anything else that speaks OIDC:

yaml

authentication:
  method: oidc
  oidc:
    discovery_url: "https://accounts.example.com/.well-known/openid-configuration"
    client_id: ${OIDC_CLIENT_ID}
    client_secret: ${OIDC_CLIENT_SECRET}
    redirect_uri: "https://annotation.example.com/auth/oidc/callback"

All three support domain restriction, auto-registration, and mixed mode, where several auth methods share one login page.
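
Domain restriction comes down to comparing the domain of the verified email address against an allow-list. The sketch below shows one strict way to do that. The `email_domain_allowed` helper is hypothetical, and exact matching (no subdomains) is an assumption rather than documented Potato behavior.

```python
def email_domain_allowed(email: str, allowed_domains: list[str]) -> bool:
    """Return True if the email's domain is exactly one of allowed_domains.

    Assumes the address comes from the identity provider's verified
    email claim; this is not a general-purpose email validator.
    """
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain or "@" in local:
        return False
    allowed = {d.lower() for d in allowed_domains}
    # Exact match only, so "umich.edu.evil.com" does not pass as "umich.edu".
    return domain.lower() in allowed
```

Matching on the full domain rather than a suffix is the safer default; allow subdomains only if you list them explicitly.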

Read the SSO and OAuth documentation →

Parquet export

More and more, annotation data ends up in data science tools that want columnar formats. Potato 2.3 can export straight to Apache Parquet as three files:

annotations.parquet, one row per (instance, annotator, schema) with values, timestamps, and durations
spans.parquet, one row per annotated span with offsets, labels, and links
items.parquet, instance metadata with annotation counts and status

yaml

parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  compression: zstd
  auto_export: true

Load directly in pandas, DuckDB, PyArrow, Polars, or Hugging Face Datasets:

python

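import pandas as pd

# Load the annotation table into a DataFrame
annotations = pd.read_parquet("output/parquet/annotations.parquet")

# Or with DuckDB for SQL queries: count how often each value was chosen per instance
import duckdb

duckdb.sql("""
  SELECT instance_id, value, COUNT(*) AS n
  FROM 'output/parquet/annotations.parquet'
  WHERE schema_name = 'sentiment'
  GROUP BY instance_id, value
""").show()  # duckdb.sql() builds a lazy relation; .show() runs it and prints the result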

The exporter supports snappy, gzip, zstd, lz4, and brotli compression, incremental export partitioned by date or annotator, and dictionary encoding for string columns.

Read the Parquet export documentation →

15 new demo projects

Potato 2.3 ships 15 new demos in the project-hub/ directory, one or more for each new feature.

For agentic annotation:

react-agent-eval, evaluating ReAct agent traces with step-level ratings
web-agent-eval, WebArena trace evaluation with screenshot overlays
chatbot-eval, interactive chat evaluation with a live agent proxy
multi-agent-eval, evaluating CrewAI multi-agent systems
swebench-eval, SWE-bench trace evaluation for coding agents

For Solo Mode:

solo-sentiment, sentiment classification on product reviews
solo-ner, named entity recognition
solo-toxicity, toxicity detection with edge case synthesis

For Best-Worst Scaling:

bws-translation, machine translation quality ranking
bws-summarization, summary quality comparison
bws-image-quality, image generation quality ranking

For authentication:

google-oauth-demo, a Google OAuth setup example
github-oauth-demo, GitHub OAuth with org restriction

For export:

parquet-export-demo, Parquet export with a DuckDB analysis script
huggingface-upload, export to Parquet and push to the Hugging Face Hub

Each demo comes with a full config.yaml, sample data, and a README. Start any of them with:

bash

cd project-hub/react-agent-eval
potato start config.yaml

Security hardening

A handful of security improvements landed in 2.3:

Session tokens now use cryptographically secure random generation with configurable expiration
CSRF protection is on by default for every form submission
Authentication endpoints are now rate limited (configurable, 10 attempts per minute by default)
Input sanitization for any user-provided content shown in the annotation interface
A dependency audit that brought all Python and JavaScript dependencies up to current secure versions
Content Security Policy headers to head off XSS

yaml

security:
  csrf_protection: true
  rate_limiting:
    auth_attempts: 10            # per minute
    api_requests: 100            # per minute
  session:
    token_length: 64
    lifetime_hours: 24
  content_security_policy: true
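
For readers curious how a limit like "10 attempts per minute" can be enforced, here is a minimal sliding-window sketch. It illustrates the general technique only and is not Potato's implementation: the `SlidingWindowLimiter` class is hypothetical, and it keeps state in memory, so it would not be shared across multiple server processes.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_attempts per key within window_seconds."""

    def __init__(self, max_attempts=10, window_seconds=60.0,
                 clock=time.monotonic):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)

    def allow(self, key: str) -> bool:
        """Record an attempt for key and report whether it is allowed."""
        now = self.clock()
        hits = self._hits[key]
        # Drop attempts that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_attempts:
            return False
        hits.append(now)
        return True
```

The key would typically be a client IP address or a username, depending on whether you want to throttle a source or protect an account.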

Upgrading

From Potato 2.2.x

bash

pip install --upgrade potato-annotation

Potato 2.3 is fully backward-compatible with v2.2 configs, so your existing configuration files need no changes.

New dependencies

Parquet export needs PyArrow:

bash

pip install "potato-annotation[parquet]"

Solo Mode requires an LLM provider SDK:

bash

pip install "potato-annotation[solo]"    # installs openai + anthropic SDKs

Or install everything:

bash

pip install "potato-annotation[all]"

What's next

A few things we are already working on for the next release:

Annotation diffing, to compare annotations across rounds and annotators with visual diffs
Federated annotation, to coordinate work across multiple Potato instances
Streaming data sources, to annotate from Kafka, Pub/Sub, and similar systems
A mobile-friendly interface for annotating on tablets and phones

We would genuinely like to hear what you think. File issues on GitHub, start a thread in GitHub Discussions, or just reach out to the team.

For the full changelog, including any config keys that changed, see the v2.3.0 release notes in the repository.

Potato 2.3: Agentic Annotation, Solo Mode, and the Future of Human Evaluation
