Announcements

Potato 2.3: Agentic Annotation, Solo Mode, and the Future of Human Evaluation

Potato 2.3.0 introduces agentic annotation with 12 trace format converters, Solo Mode for human-LLM collaborative labeling, Best-Worst Scaling, SSO/OAuth, Parquet export, and 15 demo projects.

By Potato Team


We are excited to announce Potato 2.3.0, the largest release in Potato's history. This update introduces two major new systems -- agentic annotation and Solo Mode -- alongside Best-Worst Scaling, SSO/OAuth authentication, Parquet export, and 15 new demo projects.

The theme of this release is simple: the things we need to annotate have changed, and our tools need to keep up. Researchers are no longer just labeling text sentiment and named entities. They are evaluating multi-step AI agent traces, comparing LLM outputs at scale, and building datasets for increasingly complex tasks. Potato 2.3 is built for this new reality.


Agentic Annotation

The headline feature of Potato 2.3 is a complete system for evaluating AI agents through human annotation.

AI agents -- systems that take multi-step actions to accomplish tasks -- are proliferating rapidly. But evaluating them is hard. A single agent run might involve dozens of tool calls, reasoning steps, web page navigations, and intermediate outputs. Existing annotation tools show agents' outputs as flat text, losing the rich structure that evaluators need to see.

Potato's agentic annotation system solves this with three components.

12 Trace Format Converters

Agent traces come in different formats depending on the framework. Potato normalizes them all into a unified representation:

  • openai -- OpenAI Assistants API / function calling
  • anthropic -- Anthropic Claude tool_use / Messages API
  • swebench -- SWE-bench task traces
  • opentelemetry -- OpenTelemetry span exports
  • mcp -- Model Context Protocol sessions
  • multi_agent -- CrewAI / AutoGen / LangGraph
  • langchain -- LangChain callback traces
  • langfuse -- LangFuse observation exports
  • react -- ReAct Thought/Action/Observation
  • webarena -- WebArena / VisualWebArena
  • atif -- Agent Trace Interchange Format
  • raw_web -- Raw browser recordings (HAR + screenshots)
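Whatever the source format, every converter maps a trace to the same normalized step sequence. The sketch below illustrates the general idea with a minimal ReAct-style parser; the `TraceStep` fields and the parsing rules are illustrative assumptions, not Potato's actual internal schema.

```python
from dataclasses import dataclass

@dataclass
class TraceStep:
    index: int
    kind: str      # e.g. "thought" | "action" | "observation"
    content: str

def from_react_line(index: int, line: str) -> TraceStep:
    """Map a ReAct 'Thought:/Action:/Observation:' line to a normalized step."""
    prefixes = {"Thought:": "thought", "Action:": "action", "Observation:": "observation"}
    for prefix, kind in prefixes.items():
        if line.startswith(prefix):
            return TraceStep(index, kind, line[len(prefix):].strip())
    # Fallback: treat unprefixed lines as observations.
    return TraceStep(index, "observation", line.strip())

steps = [from_react_line(i, line) for i, line in enumerate([
    "Thought: I should search for the file.",
    "Action: search('config.yaml')",
    "Observation: Found 1 match.",
])]
print([s.kind for s in steps])  # -> ['thought', 'action', 'observation']
```

Once traces share this shape, the display layer and per-step rating machinery can stay format-agnostic.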

Configuration is straightforward:

yaml
agentic:
  enabled: true
  trace_converter: react
  trace_file: "data/agent_traces.jsonl"

Auto-detection is available for projects that need to ingest traces from multiple sources:

yaml
agentic:
  enabled: true
  trace_converter: auto
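Auto-detection presumably works by sniffing each record for format-specific markers. The heuristic below is a hypothetical sketch of that idea, not Potato's actual detection logic, and it covers only three of the twelve formats:

```python
def guess_converter(record: dict) -> str:
    """Guess the source format of one trace record (illustrative rules only)."""
    if "tool_calls" in record or "function_call" in record:
        return "openai"
    if record.get("type") == "tool_use" or "stop_reason" in record:
        return "anthropic"
    if record.get("content", "").startswith(("Thought:", "Action:", "Observation:")):
        return "react"
    # Real auto-detection would need to distinguish all 12 formats.
    return "unknown"

print(guess_converter({"content": "Thought: check the page"}))  # -> react
```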

Three Display Types

Different agent modalities need different visualizations.

Agent Trace Display renders tool-using agent traces as color-coded step cards with collapsible observations, JSON pretty-printing, and a timeline sidebar:

yaml
agentic:
  display_type: agent_trace
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    show_step_numbers: true

Web Agent Trace Display renders browsing agent traces with full screenshots, SVG overlays showing click targets and input fields, and a filmstrip for quick navigation:

yaml
agentic:
  display_type: web_agent
  web_agent_display:
    screenshot_max_width: 900
    overlay:
      enabled: true
      click_marker: "circle"
      click_color: "#ef4444"
    filmstrip:
      enabled: true

Interactive Chat Display supports both trace review (evaluating a recorded conversation) and live chat (annotators interact with an agent in real time, then evaluate the conversation):

yaml
agentic:
  display_type: interactive_chat
  interactive_chat_display:
    mode: trace_review
    trace_review:
      show_token_counts: true
      show_latency: true

Per-Turn Ratings

For any display type, annotators can rate individual steps alongside the overall trace:

yaml
annotation_schemes:
  - annotation_type: likert
    name: overall_quality
    min: 1
    max: 5
 
  - annotation_type: per_turn_rating
    name: step_correctness
    target: agentic_steps
    rating_type: radio
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"

Pre-Built Schemas

Nine annotation schemas cover common agent evaluation dimensions out of the box:

yaml
annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_error_taxonomy
  - preset: agent_safety

Available presets: agent_task_success, agent_step_correctness, agent_error_taxonomy, agent_safety, agent_efficiency, agent_instruction_following, agent_explanation_quality, agent_web_action_correctness, agent_conversation_quality.

Read the Agentic Annotation documentation →


Solo Mode

The second major feature of Potato 2.3 is Solo Mode: a 12-phase workflow that replaces the traditional multi-annotator paradigm with a single human expert collaborating with an LLM.

The Problem

Traditional annotation requires multiple annotators for reliability. But hiring, training, and coordinating a team is expensive and slow. For many research projects, the annotation bottleneck is not the interface -- it is the logistics.

The Solution

Solo Mode lets one domain expert label a strategically selected subset of the data. An LLM learns from those labels, proposes labels for the remaining instances, and the human reviews only the cases where the LLM struggles. A 12-phase workflow orchestrates this automatically.

In internal benchmarks, Solo Mode achieved 95%+ agreement with full multi-annotator pipelines while requiring only 10-15% of the total human labels.

The 12 Phases

  1. Seed Annotation -- human labels 50 diverse instances
  2. Initial LLM Calibration -- LLM labels a calibration batch using seed examples
  3. Confusion Analysis -- identify systematic human-LLM disagreement patterns
  4. Guideline Refinement -- LLM proposes improved guidelines; human approves
  5. Labeling Function Generation -- ALCHEmist-inspired programmatic rules for easy instances
  6. Active Labeling -- human labels the most informative remaining instances
  7. Automated Refinement Loop -- iterative re-labeling with updated guidelines
  8. Disagreement Exploration -- human resolves cases where LLM and labeling functions conflict
  9. Edge Case Synthesis -- LLM generates synthetic ambiguous examples for human labeling
  10. Cascaded Confidence Escalation -- human reviews lowest-confidence LLM labels
  11. Prompt Optimization -- DSPy-inspired automated prompt search
  12. Final Validation -- random sample review; pass or cycle back
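The escalation logic running through phases 6-10 boils down to splitting LLM-proposed labels by confidence and sending the weakest ones to the human first. A minimal sketch of that idea, using Solo Mode's `confidence_threshold` knob (the data shape and function name here are illustrative, not Potato's implementation):

```python
def triage(llm_labels, confidence_threshold=0.85):
    """Split LLM-proposed labels into auto-accepted and human-review queues."""
    accepted, escalate = [], []
    for item in llm_labels:
        (accepted if item["confidence"] >= confidence_threshold else escalate).append(item)
    # Lowest-confidence items reach the human first (phase 10).
    escalate.sort(key=lambda item: item["confidence"])
    return accepted, escalate

labels = [
    {"id": 1, "label": "Positive", "confidence": 0.97},
    {"id": 2, "label": "Neutral",  "confidence": 0.55},
    {"id": 3, "label": "Negative", "confidence": 0.70},
]
accepted, escalate = triage(labels)
print([i["id"] for i in accepted], [i["id"] for i in escalate])  # -> [1] [2, 3]
```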

Quick Start

yaml
solo_mode:
  enabled: true
  llm:
    endpoint_type: openai
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
  seed_count: 50
  accuracy_threshold: 0.92
  confidence_threshold: 0.85
 
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    labels: [Positive, Neutral, Negative]

Multi-Signal Instance Prioritization

Solo Mode uses six weighted pools to select the most valuable instances for human labeling:

yaml
solo_mode:
  prioritization:
    pools:
      - name: uncertain
        weight: 0.30
      - name: disagreement
        weight: 0.25
      - name: boundary
        weight: 0.20
      - name: novel
        weight: 0.10
      - name: error_pattern
        weight: 0.10
      - name: random
        weight: 0.05
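In spirit, the selection step draws the next instance from one of these pools in proportion to its weight. A hand-rolled sketch of weighted-pool sampling, with made-up instance ids (this mirrors the config above conceptually; it is not Potato's internal code):

```python
import random

pools = {
    "uncertain":     {"weight": 0.30, "items": [101, 102]},
    "disagreement":  {"weight": 0.25, "items": [201]},
    "boundary":      {"weight": 0.20, "items": [301]},
    "novel":         {"weight": 0.10, "items": [401]},
    "error_pattern": {"weight": 0.10, "items": [501]},
    "random":        {"weight": 0.05, "items": [601]},
}

def next_instance(pools, rng=random):
    """Pick a non-empty pool by weight, then pop its next candidate."""
    names = [name for name, pool in pools.items() if pool["items"]]
    weights = [pools[name]["weight"] for name in names]
    chosen = rng.choices(names, weights=weights, k=1)[0]
    return pools[chosen]["items"].pop(0)

random.seed(0)
batch = [next_instance(pools) for _ in range(3)]
print(batch)
```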

Read the Solo Mode documentation →


Best-Worst Scaling

Potato 2.3 adds Best-Worst Scaling (BWS), also known as Maximum Difference Scaling. Annotators see a tuple of items (typically 4) and select the best and worst according to some criterion. BWS produces reliable scalar scores from simple binary judgments, requiring far fewer annotations than Likert scales for the same statistical power.

yaml
annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    description: "Select the BEST and WORST translation"
    items_key: "translations"
    tuple_size: 4
    best_label: "Most Fluent"
    worst_label: "Least Fluent"
    randomize_order: true
 
    tuple_generation:
      method: balanced_incomplete
      tuples_per_item: 5
 
    scoring:
      method: bradley_terry
      auto_compute: true
      include_confidence: true

Three scoring methods are available:

  • Counting -- simple (best_count - worst_count) / appearances
  • Bradley-Terry -- pairwise comparison model (recommended default)
  • Plackett-Luce -- full ranking model for maximum data efficiency
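The counting method is simple enough to reproduce by hand. A minimal sketch (not Potato's implementation) over a toy set of judgments:

```python
from collections import defaultdict

def bws_counting_scores(judgments):
    """judgments: dicts with 'tuple' (items shown), 'best', and 'worst'.
    Returns (best_count - worst_count) / appearances for each item."""
    best, worst, seen = defaultdict(int), defaultdict(int), defaultdict(int)
    for j in judgments:
        for item in j["tuple"]:
            seen[item] += 1
        best[j["best"]] += 1
        worst[j["worst"]] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}

judgments = [
    {"tuple": ["a", "b", "c", "d"], "best": "a", "worst": "d"},
    {"tuple": ["a", "b", "c", "d"], "best": "b", "worst": "d"},
]
print(bws_counting_scores(judgments))
# -> {'a': 0.5, 'b': 0.5, 'c': 0.0, 'd': -1.0}
```

Scores land in [-1, 1]: always-best items score 1, always-worst items score -1.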

Score from the CLI:

bash
python -m potato.bws score --config config.yaml --method bradley_terry --output scores.csv

The admin dashboard includes a BWS tab with score distributions, convergence charts, and split-half reliability metrics.

Read the Best-Worst Scaling documentation →


SSO & OAuth Authentication

Production annotation deployments need proper authentication. Potato 2.3 supports three OAuth methods:

Google OAuth

yaml
authentication:
  method: google_oauth
  google_oauth:
    client_id: ${GOOGLE_CLIENT_ID}
    client_secret: ${GOOGLE_CLIENT_SECRET}
    redirect_uri: "https://annotation.example.com/auth/google/callback"
    allowed_domains:
      - "umich.edu"
    auto_register: true

GitHub OAuth with Organization Restriction

yaml
authentication:
  method: github_oauth
  github_oauth:
    client_id: ${GITHUB_CLIENT_ID}
    client_secret: ${GITHUB_CLIENT_SECRET}
    redirect_uri: "https://annotation.example.com/auth/github/callback"
    allowed_organizations:
      - "my-research-lab"
    scopes:
      - "read:user"
      - "read:org"

Generic OIDC

Connect to Okta, Azure AD, Auth0, Keycloak, or any OIDC-compliant provider:

yaml
authentication:
  method: oidc
  oidc:
    discovery_url: "https://accounts.example.com/.well-known/openid-configuration"
    client_id: ${OIDC_CLIENT_ID}
    client_secret: ${OIDC_CLIENT_SECRET}
    redirect_uri: "https://annotation.example.com/auth/oidc/callback"

All methods support domain restriction, auto-registration, and mixed mode (multiple auth methods on one login page).

Read the SSO & OAuth documentation →


Parquet Export

Annotation data is increasingly consumed by data science tools that expect columnar formats. Potato 2.3 can export annotations directly to Apache Parquet, producing three structured files:

  • annotations.parquet -- one row per (instance, annotator, schema) with values, timestamps, and durations
  • spans.parquet -- one row per annotated span with offsets, labels, and links
  • items.parquet -- instance metadata with annotation counts and status

Enable it in your config:

yaml
parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  compression: zstd
  auto_export: true

Load directly in pandas, DuckDB, PyArrow, Polars, or Hugging Face Datasets:

python
import pandas as pd
annotations = pd.read_parquet("output/parquet/annotations.parquet")
 
# Or with DuckDB for SQL queries
import duckdb
counts = duckdb.sql("""
  SELECT instance_id, value, COUNT(*) AS n
  FROM 'output/parquet/annotations.parquet'
  WHERE schema_name = 'sentiment'
  GROUP BY instance_id, value
""").df()

Supports snappy, gzip, zstd, lz4, and brotli compression, incremental export with date/annotator partitioning, and dictionary encoding for string columns.

Read the Parquet Export documentation →


15 New Demo Projects

Potato 2.3 ships with 15 new demo projects in the project-hub/ directory, covering the new features:

Agentic Annotation Demos

  1. react-agent-eval -- Evaluate ReAct agent traces with step-level ratings
  2. web-agent-eval -- WebArena trace evaluation with screenshot overlays
  3. chatbot-eval -- Interactive chat evaluation with live agent proxy
  4. multi-agent-eval -- Evaluate CrewAI multi-agent systems
  5. swebench-eval -- SWE-bench trace evaluation for coding agents

Solo Mode Demos

  1. solo-sentiment -- Solo Mode sentiment classification on product reviews
  2. solo-ner -- Solo Mode named entity recognition
  3. solo-toxicity -- Solo Mode toxicity detection with edge case synthesis

Best-Worst Scaling Demos

  1. bws-translation -- Machine translation quality ranking
  2. bws-summarization -- Summary quality comparison
  3. bws-image-quality -- Image generation quality ranking

Authentication Demos

  1. google-oauth-demo -- Google OAuth setup example
  2. github-oauth-demo -- GitHub OAuth with org restriction

Export Demos

  1. parquet-export-demo -- Parquet export with DuckDB analysis script
  2. huggingface-upload -- Export to Parquet and push to Hugging Face Hub

Each demo includes a complete config.yaml, sample data, and a README with setup instructions. Start any demo with:

bash
cd project-hub/react-agent-eval
potato start config.yaml

Security Hardening

Potato 2.3 includes several security improvements:

  • Session tokens use cryptographically secure random generation with configurable expiration
  • CSRF protection is enabled by default for all form submissions
  • Rate limiting on authentication endpoints (configurable, default 10 attempts per minute)
  • Input sanitization for all user-provided content displayed in the annotation interface
  • Dependency audit -- all Python and JavaScript dependencies updated to latest secure versions
  • Content Security Policy headers added to prevent XSS

All of these are configurable:

yaml
security:
  csrf_protection: true
  rate_limiting:
    auth_attempts: 10            # per minute
    api_requests: 100            # per minute
  session:
    token_length: 64
    lifetime_hours: 24
  content_security_policy: true
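For intuition on the `token_length: 64` setting: a 64-character token backed by a cryptographically secure source can be produced with Python's standard `secrets` module. This is purely illustrative; Potato's internals may differ:

```python
import secrets

# 32 random bytes from the OS CSPRNG -> 64 hexadecimal characters.
token = secrets.token_hex(32)
print(len(token))  # -> 64
```

Unlike `random`, the `secrets` module is designed for security-sensitive values and is not predictable from prior outputs.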

Upgrading

From Potato 2.2.x

bash
pip install --upgrade potato-annotation

All v2.2 configurations are fully backward-compatible. No changes to existing configs are required.

New Dependencies

Parquet export requires PyArrow:

bash
pip install potato-annotation[parquet]

Solo Mode requires an LLM provider SDK:

bash
pip install potato-annotation[solo]    # installs openai + anthropic SDKs

Or install everything:

bash
pip install potato-annotation[all]

What's Next

Potato 2.3 represents a significant expansion of what annotation tools can do. We are already working on the next set of features:

  • Annotation diffing -- compare annotations across rounds and annotators with visual diffs
  • Federated annotation -- coordinate annotation across multiple Potato instances
  • Streaming data sources -- annotate data from Kafka, Pub/Sub, and other streaming systems
  • Mobile-optimized interface -- responsive annotation for tablets and phones

We would love to hear your feedback. File issues on GitHub, join the discussion in GitHub Discussions, or reach out to the team directly.