Note: The feature counts in this post reflect the state at the v2.4.0 release. Potato now supports 50+ annotation types. See the annotation types documentation for the full list.

Potato 2.4.0 is out. It's our biggest update since agentic annotation landed in 2.3, and it adds the agent-evaluation features people kept asking for, plus a batch of enterprise and integration work.

Web Agent Annotation

Evaluating web-browsing agents is hard. You need to see what the agent saw, where it clicked, how it scrolled, and whether each step made sense. Potato 2.4 adds a Web Agent Trace Viewer for this.

Review Mode lets annotators step through pre-recorded screenshots in a filmstrip view. SVG overlays mark click targets, bounding boxes, mouse paths, and scroll positions, so evaluators can follow each action, with annotation controls inline.

Creation Mode flips the interface around. Annotators browse a live website inside an iframe, and Potato records every interaction as an annotation-ready trace. You can import existing traces from WebArena, Mind2Web, and Anthropic Computer Use formats, or record new ones as you go.

yaml

display:
  type: web_agent_trace
  mode: review          # or "creation"
  show_overlays: true
  keyboard_shortcuts: true

Live Agent Evaluation

Sometimes you need to evaluate agents while they run, not after the fact. The new Live Agent Evaluation system lets annotators watch AI agents execute tasks in real time and annotate their behavior mid-execution.

Potato runs agents in parallel through the Agent Runner Manager, captures traces as they arrive via a webhook receiver, and shows annotators a real-time evaluation interface. It tracks step-level inter-annotator agreement automatically.
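To make "step-level agreement" concrete, here is a minimal sketch that scores each step of a trace by the fraction of annotator pairs who gave it the same label. This is raw pairwise agreement with no chance correction, and it is only an illustration: the post does not say which agreement metric Potato computes, and the function names and sample labels below are hypothetical.

```python
from itertools import combinations


def step_agreement(step_labels):
    """Fraction of annotator pairs that gave the same label to one step."""
    pairs = list(combinations(step_labels, 2))
    if not pairs:
        raise ValueError("need at least two annotators per step")
    return sum(a == b for a, b in pairs) / len(pairs)


def trace_agreement(trace):
    """Map each step index to its pairwise agreement score."""
    return {step: step_agreement(labels) for step, labels in enumerate(trace)}


# Hypothetical data: three annotators judge each of three agent steps.
trace = [
    ["correct", "correct", "correct"],
    ["correct", "incorrect", "correct"],
    ["incorrect", "incorrect", "correct"],
]
scores = trace_agreement(trace)
for step, score in scores.items():
    print(f"step {step}: agreement {score:.2f}")
```

Low-scoring steps are the ones worth a second look, since they are where annotators disagree about what the agent did.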

LLM Chat Sidebar

Hard annotation calls benefit from a second opinion. The new LLM Chat Sidebar gives annotators an AI assistant panel they can consult mid-task without leaving the interface.

The sidebar handles multi-turn conversations and injects the full task context automatically. It works with OpenAI, Anthropic, and Ollama endpoints, and it logs every conversation as behavioral data, which is handy if you want to study how annotators lean on AI assistance.

yaml

llm_sidebar:
  enabled: true
  provider: anthropic
  model: claude-3-5-sonnet-20241022
  system_prompt: "You are a helpful annotation assistant for this {task_name} task."
  collapsible: true

HuggingFace Ecosystem Integration

Potato now connects to HuggingFace in a few ways. You can push annotations straight to Hub datasets with auto-generated DatasetCards, load them back as datasets.Dataset objects without a round trip, deploy a Potato instance to HuggingFace Spaces, and ingest traces automatically when you run LangChain agents through the LangChain callback.

bash

pip install "potato-annotation[huggingface]"

python

from potato import PotatoDataset

# Load the local annotation output directory
ds = PotatoDataset.from_output("annotations/")
# Publish it to the Hub as a dataset repository
ds.push_to_hub("my-org/my-annotation-dataset")

Webhook System

Potato 2.4 ships a full webhook system for event-driven integrations. It emits five event types, each signed with HMAC-SHA256 per the Standard Webhooks spec:

Event	Triggers when
`annotation.created`	An annotator submits a label
`item.fully_annotated`	An item reaches its required overlap count
`task.completed`	All items in a task are annotated
`user.phase_completed`	A user finishes a phase (Solo Mode)
`quality.attention_check_failed`	An annotator fails an attention check

Delivery is non-blocking with configurable retries, and webhooks are managed through the admin API.

yaml

webhooks:
  - url: https://your-system.example.com/potato-events
    secret: your-signing-secret
    events: [annotation.created, item.fully_annotated]
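On the receiving end, you'll want to check the signature before trusting a payload. The sketch below follows the Standard Webhooks scheme as we understand it: the `webhook-id`, `webhook-timestamp`, and `webhook-signature` headers, with a `v1,` signature computed over `id.timestamp.body`. Treat the details as assumptions rather than Potato's confirmed behavior. In particular, using a secret without a `whsec_` prefix as raw UTF-8 bytes, the lower-cased header keys, and the 300-second tolerance are choices made for this example.

```python
import base64
import hashlib
import hmac
import time


def _key_bytes(secret):
    # Standard Webhooks secrets are usually "whsec_" + base64. Treating any
    # other string as raw UTF-8 bytes is an assumption of this sketch.
    if secret.startswith("whsec_"):
        return base64.b64decode(secret[len("whsec_"):])
    return secret.encode("utf-8")


def sign(secret, msg_id, timestamp, body):
    """Return a 'v1,<base64>' signature over 'msg_id.timestamp.body'."""
    signed = f"{msg_id}.{timestamp}.".encode("utf-8") + body
    digest = hmac.new(_key_bytes(secret), signed, hashlib.sha256).digest()
    return "v1," + base64.b64encode(digest).decode("ascii")


def verify(secret, headers, body, tolerance_s=300, now=None):
    """Check timestamp freshness and the signature of a raw request body."""
    msg_id = headers["webhook-id"]
    timestamp = int(headers["webhook-timestamp"])
    now = time.time() if now is None else now
    if abs(now - timestamp) > tolerance_s:
        return False  # too old or too far in the future: possible replay
    expected = sign(secret, msg_id, timestamp, body).encode("ascii")
    # The header may carry several space-separated signatures (key rotation).
    candidates = headers["webhook-signature"].split()
    return any(
        hmac.compare_digest(expected, c.encode("utf-8")) for c in candidates
    )
```

Verify against the raw request bytes, not a re-serialized JSON object, since any change in whitespace or key order changes the digest.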

Advanced Active Learning: 5 Strategies + LLM Cold-Start

The active learning system now ships five query strategies:

Uncertainty sampling: Select instances the model is least confident about
Diversity-based selection: Maximize coverage of the input space
BADGE: Batch Active Learning by Diverse Gradient Embeddings
BALD: Bayesian Active Learning by Disagreement
Hybrid ensemble: Combine strategies for robust selection

There's also LLM cold-start, which picks instances before any labels exist. You point a language model at your pool and let it surface the challenging or representative items to seed annotation. CoverICL is new too, for picking diverse in-context learning examples.
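For a feel of what the simplest of these strategies does, here is a rough sketch of uncertainty sampling using predictive entropy: score every unlabeled item by the entropy of the model's class probabilities, then send the highest-scoring ones to annotators. It is not Potato's implementation. The function names, item ids, and probabilities are made up for the example, and entropy is just one common way to measure uncertainty.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(pool_probs, k):
    """Return the ids of the k pool items with the highest entropy."""
    ranked = sorted(
        pool_probs,
        key=lambda item_id: entropy(pool_probs[item_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical model outputs for four unlabeled items, three classes each.
pool_probs = {
    "item-a": [0.98, 0.01, 0.01],  # confident prediction
    "item-b": [0.40, 0.35, 0.25],
    "item-c": [0.70, 0.20, 0.10],
    "item-d": [0.34, 0.33, 0.33],  # close to uniform
}
batch = select_most_uncertain(pool_probs, k=2)
print(batch)
```

Here the near-uniform predictions win the batch, while the confident `item-a` stays in the pool. Pure uncertainty sampling tends to pick near-duplicates, which is one reason the diversity-aware and hybrid strategies exist.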

Password Management and SSO/OAuth

Two authentication features people kept requesting:

Password management uses PBKDF2-SHA256 hashing with per-user salts, supports admin CLI and API password resets, and includes a self-service token-based reset flow backed by SQLite or PostgreSQL.

SSO/OAuth handles single sign-on through Google, GitHub, or any generic OIDC provider via Authlib.

bash

pip install "potato-annotation[auth]"
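If you're curious what PBKDF2-SHA256 with per-user salts looks like in practice, the standard library covers it. This is a minimal sketch, not Potato's code: the iteration count, the 16-byte salt, and the function names are assumptions, since the post doesn't state the parameters Potato uses.

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # assumed work factor; Potato's actual value is not stated


def hash_password(password, salt=None):
    """Derive a PBKDF2-SHA256 digest, generating a fresh salt if none is given."""
    salt = os.urandom(16) if salt is None else salt
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Re-derive the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


# Store the salt and digest per user; never store the password itself.
salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))
```

Because each user gets a random salt, two users with the same password end up with different stored digests.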

Updated Counts

Capability	2.3	2.4
Annotation types	20	21
Display types	15	17+
AI endpoints	7	11
Example projects	15	40+
Active learning strategies	1	5
Webhook event types	0	5
Agent example projects	0	14

Install

bash

pip install potato-annotation                  # core
pip install "potato-annotation[ai]"            # OpenAI, Ollama
pip install "potato-annotation[huggingface]"   # HF Hub + Spaces
pip install "potato-annotation[langchain]"     # LangChain callback
pip install "potato-annotation[auth]"          # SSO/OAuth
pip install "potato-annotation[all]"           # everything

Try It

The fastest way to see 2.4 in action is the live demo on HuggingFace Spaces, with no installation needed. It runs an agent trace evaluation task with radio buttons, likert scales, span annotation, and free-text notes:

Try the live demo →

Or run an example locally:

bash

git clone https://github.com/davidjurgens/potato.git
cd potato
pip install -e .
python potato/flask_server.py start examples/agent-traces/complex-annotation/config.yaml -p 8000

For the complete changelog, see the v2.4.0 release notes. The rest of the documentation lives in the GitHub repository.