# Potato 2.4.0: Web Agent Annotation, Live Evaluation, and HuggingFace Integration
Potato 2.4.0 ships web agent trace review, real-time live agent evaluation, an LLM chat sidebar, HuggingFace Hub export, webhooks, SSO/OAuth, and five active learning strategies.
We're releasing Potato 2.4.0, the biggest update since the agentic annotation launch in 2.3. This release makes Potato the most complete platform for evaluating AI agents and ships a set of long-requested enterprise and integration features.
## Web Agent Annotation
Evaluating web-browsing agents is hard. You need to see exactly what the agent saw, where it clicked, how it scrolled, and whether each step made sense. Potato 2.4 introduces a dedicated Web Agent Trace Viewer built for exactly this.
Review Mode gives annotators a filmstrip navigation view through pre-recorded screenshots. SVG overlays mark click targets, bounding boxes, mouse paths, and scroll positions — so evaluators see what the agent saw, with annotation controls inline.
Creation Mode flips the interface: annotators browse a live website inside an iframe, and Potato automatically records every interaction as an annotation-ready trace. Import existing traces from WebArena, Mind2Web, and Anthropic Computer Use formats, or create new ones on the fly.
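Whichever mode produces it, a trace is a step-level action log tied to screenshots and page coordinates. A sketch of what one recorded step might contain (field names here are illustrative, not Potato's exact schema):

```python
import json

# One step of a hypothetical web-agent trace: the action taken, where it
# happened on the page, and the screenshot it refers to. Field names are
# illustrative, not Potato's exact trace schema.
step = {
    "step": 3,
    "action": "click",
    "target": {"selector": "button#submit", "bbox": [412, 518, 96, 32]},
    "mouse_path": [[200, 300], [350, 480], [460, 534]],
    "screenshot": "screenshots/step_003.png",
    "url": "https://example.com/checkout",
}
print(json.dumps(step, indent=2))
```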
```yaml
display:
  type: web_agent_trace
  mode: review  # or "creation"
  show_overlays: true
  keyboard_shortcuts: true
```

## Live Agent Evaluation
Sometimes you need to evaluate agents while they run, not after the fact. The new Live Agent Evaluation system lets annotators watch AI agents execute tasks in real time and annotate their behavior mid-execution.
Potato manages parallel agent execution through the Agent Runner Manager, captures traces as they arrive via a webhook receiver, and presents annotators with a real-time evaluation interface. Step-level inter-annotator agreement is tracked automatically.
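Per-step agreement on a live trace reduces to a simple computation: for each agent step, what fraction of annotators matched the majority label. A generic sketch of that metric, not Potato's internal implementation:

```python
from collections import Counter

def step_agreement(labels_per_step):
    """Fraction of annotators agreeing with the majority label, per step.

    labels_per_step: one inner list of annotator labels per agent step.
    A generic raw-agreement sketch, not Potato's internal metric.
    """
    scores = []
    for labels in labels_per_step:
        if not labels:
            scores.append(0.0)
            continue
        majority_count = Counter(labels).most_common(1)[0][1]
        scores.append(majority_count / len(labels))
    return scores

# Three annotators judging three agent steps as ok/bad
print(step_agreement([["ok", "ok", "ok"],
                      ["ok", "bad", "ok"],
                      ["bad", "bad", "ok"]]))
```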
## LLM Chat Sidebar
Difficult annotation decisions benefit from a second opinion. The new LLM Chat Sidebar gives annotators an AI assistant panel they can consult mid-task without leaving the interface.
The sidebar supports multi-turn conversations with full task context injected automatically. It works with OpenAI, Anthropic, and Ollama endpoints, and every conversation is logged as behavioral data — useful for studying how annotators use AI assistance.
```yaml
llm_sidebar:
  enabled: true
  provider: anthropic
  model: claude-3-5-sonnet-20241022
  system_prompt: "You are a helpful annotation assistant for this {task_name} task."
  collapsible: true
```

## HuggingFace Ecosystem Integration
Potato now has deep HuggingFace integration:
- Push to Hub: Export annotations directly to HuggingFace Hub datasets with auto-generated DatasetCards
- Load as HF Dataset: Access annotations as `datasets.Dataset` objects with zero round-trips
- One-click Spaces deployment: Deploy your Potato instance to HuggingFace Spaces
- LangChain callback: Automatic trace ingestion when running LangChain agents
```shell
pip install potato-annotation[huggingface]
```

```python
from potato import PotatoDataset

ds = PotatoDataset.from_output("annotations/")
ds.push_to_hub("my-org/my-annotation-dataset")
```

## Webhook System
Potato 2.4 ships a full webhook system for event-driven integrations. Five event types, signed with HMAC-SHA256 per the Standard Webhooks spec:
| Event | Triggers when |
|---|---|
| `annotation.created` | An annotator submits a label |
| `item.fully_annotated` | An item reaches its required overlap count |
| `task.completed` | All items in a task are annotated |
| `user.phase_completed` | A user finishes a phase (Solo Mode) |
| `quality.attention_check_failed` | An annotator fails an attention check |
Webhooks are delivered non-blocking with configurable retry, and managed via the admin API.
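Receivers can verify deliveries in a few lines of standard-library Python. In the Standard Webhooks scheme the signed content is `{id}.{timestamp}.{body}` and the signature header carries one or more space-separated `v1,<base64>` candidates; check the exact header names against your actual deliveries:

```python
import base64
import hashlib
import hmac

def verify_webhook(secret: str, msg_id: str, timestamp: str,
                   body: bytes, signature_header: str) -> bool:
    """Check a Standard Webhooks signature.

    The signature is base64-encoded HMAC-SHA256 over
    '{id}.{timestamp}.{body}'; the header may hold several
    space-separated 'v1,<base64>' candidates.
    """
    signed = f"{msg_id}.{timestamp}.".encode() + body
    expected = base64.b64encode(
        hmac.new(secret.encode(), signed, hashlib.sha256).digest()
    ).decode()
    return any(
        version == "v1" and hmac.compare_digest(sig, expected)
        for version, _, sig in
        (candidate.partition(",") for candidate in signature_header.split())
    )
```

Note that Standard Webhooks secrets are often distributed with a `whsec_` prefix and base64-decoded before keying the HMAC; adjust the key handling to match the secret Potato gives you.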
```yaml
webhooks:
  - url: https://your-system.example.com/potato-events
    secret: your-signing-secret
    events: [annotation.created, item.fully_annotated]
```

## Advanced Active Learning: 5 Strategies + LLM Cold-Start
The active learning system now ships five query strategies:
- Uncertainty sampling — Select instances the model is least confident about
- Diversity-based selection — Maximize coverage of the input space
- BADGE — Batch Active Learning by Diverse Gradient Embeddings
- BALD — Bayesian Active Learning by Disagreement
- Hybrid ensemble — Combine strategies for robust selection
New in 2.4: LLM cold-start for intelligent instance selection before any labels exist. Use a language model to identify challenging or representative instances to seed the annotation process. Also new: CoverICL for selecting diverse in-context learning examples.
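The first strategy is the easiest to picture: rank unlabeled instances by predictive entropy and send the most uncertain ones to annotators. A generic illustration, not Potato's internal strategy code:

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(predictions, k):
    """Return indices of the k instances the model is least sure about.

    predictions: one per-class probability list per unlabeled instance.
    Generic uncertainty sampling, not Potato's implementation.
    """
    ranked = sorted(range(len(predictions)),
                    key=lambda i: entropy(predictions[i]),
                    reverse=True)
    return ranked[:k]

preds = [[0.98, 0.02],   # confident
         [0.55, 0.45],   # very uncertain
         [0.80, 0.20]]   # somewhat uncertain
print(select_most_uncertain(preds, 1))  # → [1]
```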
## Password Management and SSO/OAuth
Two long-requested authentication features:
Password Management: PBKDF2-SHA256 hashing with per-user salts, admin CLI and API password reset, and a self-service token-based reset flow backed by SQLite or PostgreSQL.
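The hash-and-verify flow maps directly onto the Python standard library; a minimal sketch, with an iteration count and salt size chosen for illustration rather than taken from Potato's settings:

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative; production counts are typically higher

def hash_password(password: str):
    """Derive a PBKDF2-SHA256 digest with a fresh per-user salt."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    """Re-derive with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)

salt, digest = hash_password("hunter2")
print(verify_password("hunter2", salt, digest))  # True
print(verify_password("wrong", salt, digest))    # False
```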
SSO/OAuth: Single sign-on via Google, GitHub, or any generic OIDC provider through Authlib.
```shell
pip install potato-annotation[auth]
```

## Updated Counts
| Capability | 2.3 | 2.4 |
|---|---|---|
| Annotation types | 20 | 21 |
| Display types | 15 | 17+ |
| AI endpoints | 7 | 11 |
| Example projects | 15 | 40+ |
| Active learning strategies | 1 | 5 |
| Webhook event types | 0 | 5 |
| Agent example projects | 0 | 14 |
## Install
```shell
pip install potato-annotation                # core
pip install potato-annotation[ai]            # OpenAI, Ollama
pip install potato-annotation[huggingface]   # HF Hub + Spaces
pip install potato-annotation[langchain]     # LangChain callback
pip install potato-annotation[auth]          # SSO/OAuth
pip install potato-annotation[all]           # everything
```

## Try It
The fastest way to see 2.4 in action is the live demo on HuggingFace Spaces — no installation needed. It showcases agent trace evaluation with radio buttons, Likert scales, span annotation, and free-text notes.
Or run an example locally:
```shell
git clone https://github.com/davidjurgens/potato.git
cd potato
pip install -e .
python potato/flask_server.py start examples/agent-traces/complex-annotation/config.yaml -p 8000
```

Full release notes and changelog are in the GitHub repository.