# Potato 2.3: Agentic Annotation, Solo Mode, and the Future of Human Evaluation

Source: https://www.potatoannotator.com/blog/potato-2-3-release

> **Note:** This post describes Potato 2.3 as it was at release. Some configuration keys and features have been updated in later versions. See the [current documentation](/docs) for up-to-date configuration syntax.

Potato 2.3.0 is our largest release so far. It brings two new systems, agentic annotation and Solo Mode, plus Best-Worst Scaling, SSO/OAuth authentication, Parquet export, and 15 new demo projects.

The reason for most of this is that what people annotate has changed. A lot of researchers are no longer just labeling sentiment and named entities. They are picking apart multi-step agent traces, comparing LLM outputs in bulk, and building datasets for tasks that did not exist a couple of years ago. Potato 2.3 is aimed squarely at that work.

---

## Agentic annotation

The headline feature in 2.3 is a full system for evaluating AI agents through human annotation.

Agents, meaning systems that take multiple steps to get something done, are everywhere now, and they are genuinely hard to evaluate. One run might rack up dozens of tool calls, reasoning steps, page navigations, and intermediate outputs. Most annotation tools flatten all of that into plain text, which throws away exactly the structure an evaluator needs to see.

Potato's agentic annotation system has three parts.

### 12 trace format converters

Agent traces look different depending on the framework that produced them. Potato normalizes them into one representation:

| Converter | Source |
|-----------|--------|
| `openai` | OpenAI Assistants API / function calling |
| `anthropic` | Anthropic Claude tool_use / Messages API |
| `swebench` | SWE-bench task traces |
| `opentelemetry` | OpenTelemetry span exports |
| `mcp` | Model Context Protocol sessions |
| `multi_agent` | CrewAI / AutoGen / LangGraph |
| `langchain` | LangChain callback traces |
| `langfuse` | LangFuse observation exports |
| `react` | ReAct Thought/Action/Observation |
| `webarena` | WebArena / VisualWebArena |
| `atif` | Agent Trace Interchange Format |
| `raw_web` | Raw browser recordings (HAR + screenshots) |

The config is short:

```yaml
agentic:
  enabled: true
  trace_converter: react
  trace_file: "data/agent_traces.jsonl"
```

If you are pulling traces from several sources at once, let it auto-detect:

```yaml
agentic:
  enabled: true
  trace_converter: auto
```
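To make "normalizes them into one representation" a little more concrete, here is a minimal sketch of what a ReAct-style converter might do. It is not Potato's converter: the `Step` structure, the `parse_react` function, and the sample trace are all hypothetical, and the real internal schema may look quite different.

```python
import re
from dataclasses import dataclass

# Hypothetical normalized step; Potato's real internal schema may differ.
@dataclass
class Step:
    kind: str      # "thought", "action", or "observation"
    content: str

# Matches a line that opens a new step, e.g. "Thought: ..." or "Action 2: ...".
_STEP_RE = re.compile(r"^(Thought|Action|Observation)(?:\s+\d+)?:\s*(.*)$")

def parse_react(text: str) -> list[Step]:
    """Split ReAct-style text into ordered steps.

    Non-empty lines without a step label are appended to the previous step.
    """
    steps: list[Step] = []
    for line in text.splitlines():
        stripped = line.strip()
        match = _STEP_RE.match(stripped)
        if match:
            steps.append(Step(match.group(1).lower(), match.group(2)))
        elif steps and stripped:
            steps[-1].content += "\n" + stripped
    return steps

# Made-up trace, for illustration only.
trace = """Thought: I should look up the order status.
Action: lookup_order[1234]
Observation: Order 1234 has shipped.
Thought: I can answer now."""

steps = parse_react(trace)
print([s.kind for s in steps])
```

Once every framework's output is reduced to a flat list of typed steps like this, the display layer only has to understand one shape.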

### Three display types

Different kinds of agents call for different visualizations.

Agent Trace Display renders tool-using agent traces as color-coded step cards, with collapsible observations, pretty-printed JSON, and a timeline sidebar:

```yaml
agentic:
  display_type: agent_trace
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    show_step_numbers: true
```

Web Agent Trace Display handles browsing agents: full screenshots, SVG overlays marking click targets and input fields, and a filmstrip for jumping around quickly:

```yaml
agentic:
  display_type: web_agent
  web_agent_display:
    screenshot_max_width: 900
    overlay:
      enabled: true
      click_marker: "circle"
      click_color: "#ef4444"
    filmstrip:
      enabled: true
```

Interactive Chat Display covers two cases: trace review, where you evaluate a recorded conversation, and live chat, where annotators talk to an agent in real time and then rate the conversation:

```yaml
agentic:
  display_type: interactive_chat
  interactive_chat_display:
    mode: trace_review
    trace_review:
      show_token_counts: true
      show_latency: true
```

### Per-turn ratings

With any display type, annotators can rate individual steps as well as the trace overall:

```yaml
annotation_schemes:
  - annotation_type: likert
    name: overall_quality
    min: 1
    max: 5

  - annotation_type: per_turn_rating
    name: step_correctness
    target: agentic_steps
    rating_type: radio
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
```

### Pre-built schemas

Nine ready-made schemas cover the usual agent evaluation dimensions out of the box:

```yaml
annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_error_taxonomy
  - preset: agent_safety
```

The presets are `agent_task_success`, `agent_step_correctness`, `agent_error_taxonomy`, `agent_safety`, `agent_efficiency`, `agent_instruction_following`, `agent_explanation_quality`, `agent_web_action_correctness`, and `agent_conversation_quality`.

[Read the agentic annotation documentation →](/docs/features/agentic-annotation)

---

## Solo Mode

The second big feature in 2.3 is Solo Mode, a 12-phase workflow that swaps the usual crowd of annotators for one human expert working alongside an LLM.

### The problem

You normally need several annotators to get reliable labels. Hiring, training, and coordinating that team is slow and expensive. On a lot of research projects the bottleneck is not the annotation interface at all; it is the logistics of running a team.

### How Solo Mode handles it

One domain expert labels a carefully chosen slice of the data. An LLM learns from those labels, proposes labels for everything else, and the human only steps back in where the LLM is unsure. A 12-phase workflow runs the whole loop.

In our internal benchmarks, Solo Mode matched full multi-annotator pipelines at 95% agreement or better, using only 10 to 15% of the human labels.

### The 12 phases

1. Seed annotation: the human labels 50 diverse instances.
2. Initial LLM calibration: the LLM labels a calibration batch using those seed examples.
3. Confusion analysis: find the patterns where human and LLM systematically disagree.
4. Guideline refinement: the LLM proposes better guidelines and the human approves them.
5. Labeling function generation: ALCHEmist-inspired programmatic rules for the easy instances.
6. Active labeling: the human labels the most informative instances that remain.
7. Automated refinement loop: re-label iteratively as the guidelines improve.
8. Disagreement exploration: the human resolves cases where the LLM and the labeling functions clash.
9. Edge case synthesis: the LLM invents ambiguous examples for the human to label.
10. Cascaded confidence escalation: the human reviews the LLM's lowest-confidence labels.
11. Prompt optimization: a DSPy-inspired automated prompt search.
12. Final validation: review a random sample, then pass or cycle back.
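Phase 10 is the easiest one to picture in code. The sketch below is one plausible reading of confidence escalation, not Potato's implementation: the `Prediction` class, the `split_by_confidence` helper, and the 0.85 cutoff are assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical record of one LLM label; not a Potato class.
@dataclass
class Prediction:
    instance_id: str
    label: str
    confidence: float  # assumed to lie in [0, 1]

def split_by_confidence(predictions, threshold=0.85):
    """Return (auto_accepted, needs_review).

    needs_review is sorted least confident first, so the human
    sees the shakiest labels at the top of the queue.
    """
    accepted = [p for p in predictions if p.confidence >= threshold]
    review = sorted(
        (p for p in predictions if p.confidence < threshold),
        key=lambda p: p.confidence,
    )
    return accepted, review

preds = [
    Prediction("a", "Positive", 0.97),
    Prediction("b", "Neutral", 0.60),
    Prediction("c", "Negative", 0.84),
]
accepted, review = split_by_confidence(preds)
print([p.instance_id for p in accepted])
print([p.instance_id for p in review])
```

Sorting the review queue by confidence means that if the human runs out of time, the labels left unreviewed are the ones the LLM was least unsure about.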

### Quick start

```yaml
solo_mode:
  enabled: true
  llm:
    endpoint_type: openai
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
  seed_count: 50
  accuracy_threshold: 0.92
  confidence_threshold: 0.85

annotation_schemes:
  - annotation_type: radio
    name: sentiment
    labels: [Positive, Neutral, Negative]
```

### Multi-signal instance prioritization

Solo Mode draws from six weighted pools to decide which instances are worth a human's time:

```yaml
solo_mode:
  prioritization:
    pools:
      - name: uncertain
        weight: 0.30
      - name: disagreement
        weight: 0.25
      - name: boundary
        weight: 0.20
      - name: novel
        weight: 0.10
      - name: error_pattern
        weight: 0.10
      - name: random
        weight: 0.05
```
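As a rough illustration of how weighted pools can drive selection, the sketch below picks a pool in proportion to its weight and then takes one instance from it. The weights are copied from the config above; the `pick_next` helper, the rule of skipping empty pools, and the sample instance IDs are assumptions, not Potato's actual scheduler.

```python
import random

# Pool weights copied from the config above.
POOL_WEIGHTS = {
    "uncertain": 0.30,
    "disagreement": 0.25,
    "boundary": 0.20,
    "novel": 0.10,
    "error_pattern": 0.10,
    "random": 0.05,
}

def pick_next(pools, weights, rng):
    """Choose a non-empty pool in proportion to its weight, then pop one instance.

    Returns (pool_name, instance), or None when every pool is empty.
    """
    names = [name for name in weights if pools.get(name)]
    if not names:
        return None
    chosen = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    return chosen, pools[chosen].pop(0)

rng = random.Random(0)  # seeded so repeated runs give the same order
# Hypothetical pool contents; only three of the six pools have candidates here.
pools = {"uncertain": ["u1", "u2"], "boundary": ["b1"], "random": ["r1"]}

queue = []
while (item := pick_next(pools, POOL_WEIGHTS, rng)) is not None:
    queue.append(item)
print(queue)
```

Keeping a small `random` pool in the mix guards against the other signals all missing the same kind of instance.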

[Read the Solo Mode documentation →](/docs/features/solo-mode)

---

## Best-Worst Scaling

Potato 2.3 adds Best-Worst Scaling (BWS), sometimes called Maximum Difference Scaling. Annotators see a tuple of items, usually four, and pick the best and the worst against some criterion. From those simple binary judgments BWS works out reliable scalar scores, and it gets there with far fewer annotations than a Likert scale would need for the same statistical power.

```yaml
annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    description: "Select the BEST and WORST translation"
    items_key: "translations"
    tuple_size: 4
    best_label: "Most Fluent"
    worst_label: "Least Fluent"
    randomize_order: true

    tuple_generation:
      method: balanced_incomplete
      tuples_per_item: 5

    scoring:
      method: bradley_terry
      auto_compute: true
      include_confidence: true
```

There are three scoring methods. Counting is the simple one: `(best_count - worst_count) / appearances`. Bradley-Terry is a pairwise comparison model and the recommended default. Plackett-Luce is a full ranking model, for when you want to squeeze the most out of your data.
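The counting method is simple enough to write out by hand. This is a minimal sketch of the formula quoted above, not the code behind `potato.bws`; the `bws_counting_scores` function and the judgment tuples are made up for illustration.

```python
from collections import Counter

def bws_counting_scores(judgments):
    """Score each item as (best_count - worst_count) / appearances.

    judgments: iterable of (items_shown, best_item, worst_item).
    Scores fall between -1 (always worst) and 1 (always best).
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, best_item, worst_item in judgments:
        seen.update(items)
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}

# Three hypothetical judgments over the same 4-tuple.
judgments = [
    (("A", "B", "C", "D"), "A", "D"),
    (("A", "B", "C", "D"), "A", "C"),
    (("A", "B", "C", "D"), "B", "D"),
]
scores = bws_counting_scores(judgments)
print(scores)
```

Here `A` was picked best twice in three appearances and never worst, so it scores 2/3, while `D` lands at -2/3.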

Score from the CLI:

```bash
python -m potato.bws score --config config.yaml --method bradley_terry --output scores.csv
```

The admin dashboard has a BWS tab showing score distributions, convergence charts, and split-half reliability.

[Read the Best-Worst Scaling documentation →](/docs/annotation-types/best-worst-scaling)

---

## SSO and OAuth authentication

A production annotation deployment needs real authentication. Potato 2.3 supports three sign-in methods: Google OAuth, GitHub OAuth, and generic OpenID Connect (OIDC).

### Google OAuth

```yaml
authentication:
  method: google_oauth
  google_oauth:
    client_id: ${GOOGLE_CLIENT_ID}
    client_secret: ${GOOGLE_CLIENT_SECRET}
    redirect_uri: "https://annotation.example.com/auth/google/callback"
    allowed_domains:
      - "umich.edu"
    auto_register: true
```
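Conceptually, `allowed_domains` boils down to a check on the signed-in user's email address. The sketch below shows one way such a check could work; the `is_allowed_email` helper is hypothetical, and whether Potato matches domains exactly or also accepts subdomains is an assumption we make here (exact match only).

```python
def is_allowed_email(email: str, allowed_domains: list[str]) -> bool:
    """Return True if the address's domain exactly matches an allowed domain.

    Assumption: matching is case-insensitive and does not include subdomains.
    """
    local, sep, domain = email.strip().rpartition("@")
    if not sep or not local or not domain:
        return False  # no "@", or nothing on one side of it
    return domain.lower() in {d.lower() for d in allowed_domains}

allowed = ["umich.edu"]
print(is_allowed_email("student@umich.edu", allowed))
print(is_allowed_email("student@umich.edu.evil.example", allowed))
```

Splitting on the last `@` and comparing the whole domain, rather than using a suffix or substring test, is what keeps look-alike domains such as `umich.edu.evil.example` out.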

### GitHub OAuth with organization restriction

```yaml
authentication:
  method: github_oauth
  github_oauth:
    client_id: ${GITHUB_CLIENT_ID}
    client_secret: ${GITHUB_CLIENT_SECRET}
    redirect_uri: "https://annotation.example.com/auth/github/callback"
    allowed_organizations:
      - "my-research-lab"
    scopes:
      - "read:user"
      - "read:org"
```

### Generic OIDC

Connect to Okta, Azure AD, Auth0, Keycloak, or anything else that speaks OIDC:

```yaml
authentication:
  method: oidc
  oidc:
    discovery_url: "https://accounts.example.com/.well-known/openid-configuration"
    client_id: ${OIDC_CLIENT_ID}
    client_secret: ${OIDC_CLIENT_SECRET}
    redirect_uri: "https://annotation.example.com/auth/oidc/callback"
```

All three support domain restriction, auto-registration, and mixed mode, where several auth methods share one login page.

[Read the SSO and OAuth documentation →](/docs/deployment/sso-oauth)

---

## Parquet export

More and more, annotation data ends up in data science tools that want columnar formats. Potato 2.3 can export straight to Apache Parquet as three files:

- `annotations.parquet`, one row per (instance, annotator, schema) with values, timestamps, and durations
- `spans.parquet`, one row per annotated span with offsets, labels, and links
- `items.parquet`, instance metadata with annotation counts and status

```yaml
parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  compression: zstd
  auto_export: true
```

Load directly in pandas, DuckDB, PyArrow, Polars, or Hugging Face Datasets:

```python
import pandas as pd

# Load every annotation row into a DataFrame
annotations = pd.read_parquet("output/parquet/annotations.parquet")

# Or with DuckDB for SQL queries
import duckdb

# Count annotations per (instance, value) for one schema and print the result
duckdb.sql("""
  SELECT instance_id, value, COUNT(*) AS n
  FROM 'output/parquet/annotations.parquet'
  WHERE schema_name = 'sentiment'
  GROUP BY instance_id, value
""").show()
```

It handles snappy, gzip, zstd, lz4, and brotli compression, incremental export partitioned by date or annotator, and dictionary encoding for string columns.

[Read the Parquet export documentation →](/docs/features/parquet-export)

---

## 15 new demo projects

Potato 2.3 ships 15 new demos in the `project-hub/` directory, one or more for each new feature.

For agentic annotation:

1. `react-agent-eval`, evaluating ReAct agent traces with step-level ratings
2. `web-agent-eval`, WebArena trace evaluation with screenshot overlays
3. `chatbot-eval`, interactive chat evaluation with a live agent proxy
4. `multi-agent-eval`, evaluating CrewAI multi-agent systems
5. `swebench-eval`, SWE-bench trace evaluation for coding agents

For Solo Mode:

6. `solo-sentiment`, sentiment classification on product reviews
7. `solo-ner`, named entity recognition
8. `solo-toxicity`, toxicity detection with edge case synthesis

For Best-Worst Scaling:

9. `bws-translation`, machine translation quality ranking
10. `bws-summarization`, summary quality comparison
11. `bws-image-quality`, image generation quality ranking

For authentication:

12. `google-oauth-demo`, a Google OAuth setup example
13. `github-oauth-demo`, GitHub OAuth with org restriction

For export:

14. `parquet-export-demo`, Parquet export with a DuckDB analysis script
15. `huggingface-upload`, export to Parquet and push to the Hugging Face Hub

Each demo comes with a full `config.yaml`, sample data, and a README. Start any of them with:

```bash
cd project-hub/react-agent-eval
potato start config.yaml
```

---

## Security hardening

A handful of security improvements landed in 2.3:

- Session tokens now use cryptographically secure random generation with configurable expiration
- CSRF protection is on by default for every form submission
- Rate limiting on the authentication endpoints (configurable, 10 attempts per minute by default)
- Input sanitization for any user-provided content shown in the annotation interface
- A dependency audit that brought all Python and JavaScript dependencies up to current secure versions
- Content Security Policy headers to head off XSS
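For readers curious what "10 attempts per minute" can mean in practice, here is a small sliding-window limiter. It is a generic sketch of the technique, not Potato's implementation; the `SlidingWindowLimiter` class, the per-key bookkeeping, and the injected clock are all assumptions made so the example can run on its own.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` events per `window` seconds for each key."""

    def __init__(self, limit=10, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock              # injectable so tests can fake time
        self._events = defaultdict(deque)

    def allow(self, key):
        now = self.clock()
        events = self._events[key]
        # Drop timestamps that have fallen out of the window.
        while events and now - events[0] >= self.window:
            events.popleft()
        if len(events) >= self.limit:
            return False
        events.append(now)
        return True

# Fake clock so the example does not have to wait a real minute.
fake_now = [0.0]
limiter = SlidingWindowLimiter(limit=10, window=60.0, clock=lambda: fake_now[0])

results = [limiter.allow("203.0.113.7") for _ in range(11)]
fake_now[0] = 60.0  # one full window later
after_window = limiter.allow("203.0.113.7")
print(results.count(True), results[-1], after_window)
```

The eleventh attempt inside the window is refused, and the same client is let back in once the earlier attempts age out.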

```yaml
security:
  csrf_protection: true
  rate_limiting:
    auth_attempts: 10            # per minute
    api_requests: 100            # per minute
  session:
    token_length: 64
    lifetime_hours: 24
  content_security_policy: true
```
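The `session` settings map naturally onto Python's standard `secrets` module. The sketch below is illustrative only: the `new_session` and `is_expired` helpers are hypothetical, and reading `token_length` as a count of hex characters (rather than bytes) is our assumption.

```python
import secrets
from datetime import datetime, timedelta, timezone

def new_session(token_length=64, lifetime_hours=24):
    """Create a random hex token of `token_length` characters plus an expiry time.

    Assumption: token_length counts hex characters, not bytes.
    """
    # token_hex(n) returns 2*n hex characters, so halve the requested length.
    token = secrets.token_hex(token_length // 2)
    expires_at = datetime.now(timezone.utc) + timedelta(hours=lifetime_hours)
    return token, expires_at

def is_expired(expires_at, now=None):
    """Return True once the current time has reached the expiry time."""
    return (now or datetime.now(timezone.utc)) >= expires_at

token, expires_at = new_session()
print(len(token), is_expired(expires_at))
```

`secrets` draws from the operating system's cryptographically secure random source, which is the property that matters for session tokens; the general-purpose `random` module does not offer it.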

---

## Upgrading

### From Potato 2.2.x

```bash
pip install --upgrade potato-annotation
```

Your v2.2 configs are fully backward-compatible, so nothing needs to change.

### New dependencies

Parquet export needs PyArrow:

```bash
pip install "potato-annotation[parquet]"
```

Solo Mode requires an LLM provider SDK:

```bash
pip install "potato-annotation[solo]"    # installs openai + anthropic SDKs
```

Or install everything:

```bash
pip install "potato-annotation[all]"
```

---

## What's next

A few things we are already working on for the next release:

- Annotation diffing, to compare annotations across rounds and annotators with visual diffs
- Federated annotation, to coordinate work across multiple Potato instances
- Streaming data sources, to annotate from Kafka, Pub/Sub, and similar systems
- A mobile-friendly interface for annotating on tablets and phones

We would genuinely like to hear what you think. File issues on [GitHub](https://github.com/davidjurgens/potato/issues), start a thread in [GitHub Discussions](https://github.com/davidjurgens/potato/discussions), or just reach out to the team.

For the full changelog, including any config keys that changed, see the [v2.3.0 release notes](https://github.com/davidjurgens/potato/blob/master/docs/releasenotes/v2.3.0.md) in the repository.

---

## Links

- [Installation Guide](/docs/getting-started/installation)
- [What's New in v2.3](/docs/getting-started/whats-new-v2)
- [Agentic Annotation](/docs/features/agentic-annotation)
- [Solo Mode](/docs/features/solo-mode)
- [Best-Worst Scaling](/docs/annotation-types/best-worst-scaling)
- [SSO & OAuth](/docs/deployment/sso-oauth)
- [Parquet Export](/docs/features/parquet-export)
- [GitHub Repository](https://github.com/davidjurgens/potato)
- [PyPI Package](https://pypi.org/project/potato-annotation/)
