# Agentic Annotation

*New in v2.3.0*

Evaluate AI agents with specialized trace displays, 12 format converters, and purpose-built annotation schemas.
AI agents are increasingly deployed for complex multi-step tasks: browsing the web, writing code, calling APIs, and orchestrating sub-agents. But evaluating whether an agent actually did the right thing requires human judgment at a granularity that traditional annotation tools cannot support. A single agent trace may contain dozens of steps, tool calls, intermediate reasoning, screenshots, and branching decisions. Annotators need to see all of this context, navigate it efficiently, and provide structured evaluations at both the trace level and the individual step level.
Potato's agentic annotation system addresses this with three capabilities:
- 12 trace format converters that normalize agent logs from any major framework into a unified format
- 3 specialized display types optimized for different agent modalities (tool-use, web browsing, chat)
- 9 pre-built annotation schemas covering the most common agent evaluation dimensions
## Trace Format Converters
Agent traces come in wildly different formats depending on the framework. Potato ships 12 converters that normalize these into a unified internal representation. You specify the converter in your config, or let Potato auto-detect the format.
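Internally, every converter emits the same step-based structure (the shape a custom converter must also return; see Custom Converters below). A simplified illustration of a converted trace:

```json
{
  "steps": [
    {"type": "thought", "content": "Need the price first.", "timestamp": null, "metadata": {}},
    {"type": "action", "content": "search", "timestamp": null, "metadata": {}}
  ]
}
```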
### Converter Reference
| Converter | Source Format | Key Fields Extracted |
|---|---|---|
| `openai` | OpenAI Assistants API / function-calling logs | messages, tool_calls, function results |
| `anthropic` | Anthropic Claude tool_use / Messages API | content blocks, tool_use, tool_result |
| `swebench` | SWE-bench task traces | patch, test results, trajectory |
| `opentelemetry` | OpenTelemetry span exports (JSON) | spans, attributes, events, parent-child links |
| `mcp` | Model Context Protocol sessions | tool definitions, call/response pairs |
| `multi_agent` | CrewAI / AutoGen / LangGraph multi-agent logs | agent roles, delegation, message passing |
| `langchain` | LangChain callback traces | chain runs, LLM calls, tool invocations |
| `langfuse` | Langfuse observation exports | generations, spans, scores |
| `react` | ReAct-style Thought/Action/Observation logs | thought, action, action_input, observation |
| `webarena` | WebArena / VisualWebArena trace JSON | actions, screenshots, DOM snapshots, URLs |
| `atif` | Agent Trace Interchange Format (ATIF) | steps, observations, metadata |
| `raw_web` | Raw browser recordings (HAR + screenshots) | requests, responses, screenshots, timings |
### Configuration
Specify the converter in your project config:
```yaml
agentic:
  enabled: true
  trace_converter: react
  trace_file: "data/agent_traces.jsonl"
```

Each line in the trace file should be a JSON object containing the raw agent trace. The converter handles the rest.
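JSON Lines means one JSON object per line. A quick, illustrative way to sanity-check a trace file in Python (here with simulated file contents):

```python
import io
import json

# Simulated trace-file contents; in practice, use open("data/agent_traces.jsonl").
raw = io.StringIO(
    '{"trace": [{"type": "thought", "content": "step 1"}]}\n'
    '{"trace": [{"type": "action", "content": "search"}]}\n'
)
traces = [json.loads(line) for line in raw if line.strip()]
print(len(traces))  # -> 2
```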
For multi-agent traces where different agents use different frameworks, you can specify per-agent converters:
```yaml
agentic:
  enabled: true
  trace_converter: multi_agent
  trace_file: "data/multi_agent_traces.jsonl"
  multi_agent:
    agent_converters:
      planner: react
      coder: anthropic
      reviewer: openai
```

### Auto-Detection
If you are unsure which converter to use, set `trace_converter: auto`:
```yaml
agentic:
  enabled: true
  trace_converter: auto
  trace_file: "data/traces.jsonl"
```

Potato inspects the first 10 traces and selects the best-matching converter based on field signatures. A warning is logged if confidence is below 80%, in which case you should specify the converter explicitly.
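The idea behind field-signature matching can be sketched in a few lines. This is an illustrative simplification, not Potato's internal heuristic — the signature sets and scoring here are hypothetical:

```python
# Hypothetical field signatures for a few formats.
SIGNATURES = {
    "react": {"thought", "action", "observation"},
    "openai": {"messages", "tool_calls"},
    "webarena": {"actions", "screenshots"},
}

def detect(trace: dict) -> tuple[str, float]:
    """Score each known signature by field overlap and return the best match."""
    seen = set(trace.keys())
    # Also collect field names and step types from nested lists of dicts.
    for value in trace.values():
        if isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    seen |= set(item.keys()) | {item.get("type", "")}
    return max(
        ((name, len(sig & seen) / len(sig)) for name, sig in SIGNATURES.items()),
        key=lambda pair: pair[1],
    )

trace = {"trace": [{"type": "thought", "content": "..."},
                   {"type": "action", "content": "search"}]}
print(detect(trace))  # -> ('react', 0.666...)
```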
### Custom Converters
If your agent framework is not listed, you can write a Python converter:
```python
# converters/my_converter.py
from potato.agentic.base_converter import BaseTraceConverter

class MyConverter(BaseTraceConverter):
    name = "my_framework"

    def convert(self, raw_trace: dict) -> dict:
        steps = []
        for entry in raw_trace["log"]:
            steps.append({
                "type": entry.get("kind", "action"),
                "content": entry["text"],
                "timestamp": entry.get("ts"),
                "metadata": entry.get("extra", {}),
            })
        return {"steps": steps}
```

Register it in config:

```yaml
agentic:
  trace_converter: custom
  custom_converter: "converters/my_converter.py:MyConverter"
```

## Display Types
Once traces are converted, Potato renders them using one of three specialized display types. Each is optimized for a different agent modality.
### 1. Agent Trace Display
The default display for tool-using agents (OpenAI function calling, Anthropic tool_use, ReAct, LangChain, etc.). It renders each step as a card with color-coding by step type.
```yaml
agentic:
  enabled: true
  trace_converter: openai
  display_type: agent_trace
  agent_trace_display:
    # Color coding for step types
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
      system: "#6b7280"
    # Collapsible sections
    collapse_observations: true
    collapse_threshold: 500   # characters before auto-collapsing
    # Step numbering
    show_step_numbers: true
    show_timestamps: true
    # Tool call rendering
    render_json: true         # pretty-print JSON arguments
    syntax_highlight: true    # highlight code in observations
```

Features:
- Step cards with colored left-border indicating type (thought, action, observation, error)
- Collapsible sections for long observations or tool outputs (configurable threshold)
- JSON pretty-printing for tool call arguments and structured responses
- Syntax highlighting for code blocks in observations
- Step timeline sidebar showing the full trace at a glance
- Jump-to-step navigation for long traces
### 2. Web Agent Trace Display
Purpose-built for web browsing agents (WebArena, VisualWebArena, raw browser recordings). Renders screenshots with SVG overlays showing where the agent clicked, typed, or scrolled.
```yaml
agentic:
  enabled: true
  trace_converter: webarena
  display_type: web_agent
  web_agent_display:
    # Screenshot rendering
    screenshot_max_width: 900
    screenshot_quality: 85
    # SVG overlay for agent actions
    overlay:
      enabled: true
      click_marker: "circle"      # circle, crosshair, or arrow
      click_color: "#ef4444"
      click_radius: 20
      type_highlight: "#3b82f6"   # highlight for text input fields
      scroll_indicator: true
    # Filmstrip view
    filmstrip:
      enabled: true
      thumbnail_width: 150
      show_action_labels: true
    # DOM snapshot display
    show_dom_snapshot: false      # optional raw DOM view
    show_url_bar: true
    show_action_description: true
```

Features:
- Screenshot gallery with full-size viewing and zoom
- SVG overlays showing click targets (red circles), text input regions (blue highlights), and scroll directions
- Filmstrip view at the bottom showing all screenshots as thumbnails for quick navigation
- Action description text below each screenshot (e.g., "Click on 'Add to Cart' button")
- URL bar showing the current page URL at each step
- Before/after comparison for steps that modify page content
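When a screenshot wider than `screenshot_max_width` is scaled down for display, overlay coordinates recorded at native resolution have to be scaled by the same factor. The arithmetic, as an illustrative sketch (not Potato's internal code):

```python
def scale_point(x: int, y: int, native_width: int, max_width: int = 900) -> tuple:
    """Scale a click recorded at native resolution to the displayed width."""
    if native_width <= max_width:
        return x, y  # screenshot is shown at full size
    factor = max_width / native_width
    return round(x * factor), round(y * factor)

# A click at (1280, 640) on a 1920-px-wide screenshot displayed at 900 px:
print(scale_point(1280, 640, 1920))  # -> (600, 300)
```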
### 3. Interactive Chat Display
For evaluating conversational agents and chatbots. Supports two sub-modes: live chat where annotators interact with the agent in real time, and trace review where annotators evaluate a recorded conversation.
```yaml
agentic:
  enabled: true
  display_type: interactive_chat
  interactive_chat_display:
    mode: trace_review            # or "live_chat"
    # Trace review settings
    trace_review:
      show_system_prompt: false
      show_token_counts: true
      show_latency: true
      message_grouping: turn      # "turn" or "message"
    # Live chat settings (when mode: live_chat)
    live_chat:
      proxy: openai               # agent proxy to use
      max_turns: 20
      timeout_seconds: 60
      show_typing_indicator: true
      allow_regenerate: true
    # Common settings
    show_role_labels: true
    role_colors:
      user: "#3b82f6"
      assistant: "#6E56CF"
      system: "#6b7280"
      tool: "#22c55e"
```

Trace review mode renders a recorded conversation with optional token counts and latency per message. Annotators can rate individual turns or the entire conversation.
Live chat mode connects annotators to a running agent via the Agent Proxy System (see below). Annotators converse with the agent, then annotate the resulting conversation.
## Per-Turn Ratings
For dialogue and multi-step evaluations, you often need ratings on individual turns rather than (or in addition to) the overall trace. Potato supports per-turn annotation for any display type.
```yaml
annotation_schemes:
  # Overall trace rating
  - annotation_type: likert
    name: overall_quality
    description: "Rate the overall quality of this agent trace"
    min: 1
    max: 5
    labels:
      1: "Very Poor"
      5: "Excellent"
  # Per-turn ratings
  - annotation_type: per_turn_rating
    name: step_correctness
    description: "Was this step correct?"
    target: agentic_steps       # binds to trace steps
    rating_type: radio
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
      - "Unnecessary"
  - annotation_type: per_turn_rating
    name: step_explanation
    description: "Explain any issues with this step"
    target: agentic_steps
    rating_type: text
    conditional:
      show_when:
        step_correctness: ["Partially Correct", "Incorrect", "Unnecessary"]
```

Per-turn ratings appear inline next to each step card. The `conditional` block lets you show follow-up questions only when certain ratings are selected, keeping the interface clean.
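Once collected, per-turn labels are easy to aggregate downstream. A minimal sketch, assuming the saved format that maps step indices to labels (documented under Per-Turn Output Format), with hypothetical data from two annotators:

```python
from collections import Counter

# Hypothetical per-turn annotations from two annotators on the same trace.
annotations = [
    {"step_correctness": {"0": "Correct", "1": "Incorrect", "2": "Correct"}},
    {"step_correctness": {"0": "Correct", "1": "Partially Correct", "2": "Correct"}},
]

# Tally labels across all annotators and steps.
counts = Counter(
    label for ann in annotations for label in ann["step_correctness"].values()
)
print(counts.most_common(1))  # -> [('Correct', 4)]
```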
### Per-Turn Output Format
Per-turn annotations are saved with step indices:
```json
{
  "id": "trace_042",
  "annotations": {
    "overall_quality": 3,
    "step_correctness": {
      "0": "Correct",
      "1": "Correct",
      "2": "Incorrect",
      "3": "Correct"
    },
    "step_explanation": {
      "2": "The agent searched for the wrong product name"
    }
  }
}
```

## Agent Proxy System
For live evaluation tasks where annotators interact with an agent in real time, Potato provides an agent proxy layer. The proxy sits between the annotation interface and the agent backend, logging the full conversation for later review.
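Conceptually, the proxy is a thin logging layer around whatever backend answers the message. A minimal sketch (hypothetical; `agent_fn` stands in for the real backend):

```python
import time

def proxy_call(agent_fn, session_log: list, user_message: str) -> str:
    """Forward one user message to an agent backend, logging both sides."""
    session_log.append({"role": "user", "content": user_message, "ts": time.time()})
    reply = agent_fn(user_message)
    session_log.append({"role": "assistant", "content": reply, "ts": time.time()})
    return reply

# Echo-style backend, like the echo proxy type.
log = []
proxy_call(lambda m: "[Echo] " + m, log, "Hello")
print([entry["content"] for entry in log])  # -> ['Hello', '[Echo] Hello']
```

In Potato, this layer is configured through the `agent_proxy` block: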
```yaml
agentic:
  enabled: true
  display_type: interactive_chat
  agent_proxy:
    type: openai                  # openai, http, or echo
    # OpenAI proxy
    openai:
      model: "gpt-4o"
      api_key: ${OPENAI_API_KEY}
      system_prompt: "You are a helpful customer service agent."
      temperature: 0.7
      max_tokens: 1024
```

### Proxy Types
The **OpenAI proxy** forwards messages to an OpenAI-compatible API:
```yaml
agent_proxy:
  type: openai
  openai:
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
    system_prompt: "You are a helpful assistant."
    temperature: 0.7
```

The **HTTP proxy** forwards messages to any HTTP endpoint (your own agent server):
```yaml
agent_proxy:
  type: http
  http:
    url: "https://my-agent.example.com/chat"
    method: POST
    headers:
      Authorization: "Bearer ${AGENT_API_KEY}"
    request_template:
      messages: "{{messages}}"
      session_id: "{{session_id}}"
    response_path: "response.content"
    timeout_seconds: 30
```

The **echo proxy** mirrors the user's message back (useful for testing and UI development):
```yaml
agent_proxy:
  type: echo
  echo:
    prefix: "[Echo] "
    delay_ms: 500
```

## Pre-Built Annotation Schemas
Potato ships 9 annotation schemas designed specifically for agent evaluation. Use them directly or as starting points for your own schemas.
| Schema | Type | Description |
|---|---|---|
| `agent_task_success` | radio | Binary success/failure with a partial-credit option |
| `agent_step_correctness` | per_turn_rating (radio) | Per-step correct/incorrect/unnecessary ratings |
| `agent_error_taxonomy` | per_turn_rating (multiselect) | 12-category error taxonomy (wrong tool, hallucination, loop, etc.) |
| `agent_safety` | radio + text | Safety violation detection with a severity scale |
| `agent_efficiency` | likert | Rate whether the agent took an efficient path |
| `agent_instruction_following` | likert | Rate adherence to the original user instruction |
| `agent_explanation_quality` | likert | Rate the quality of the agent's reasoning and explanations |
| `agent_web_action_correctness` | per_turn_rating (radio) | Per-step web action evaluation (correct target, correct action type) |
| `agent_conversation_quality` | multirate | Multi-dimensional chat quality (helpfulness, accuracy, tone, safety) |
Load a pre-built schema by name:
```yaml
annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_error_taxonomy
```

Or combine presets with custom schemas:
```yaml
annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  # Custom schema alongside presets
  - annotation_type: text
    name: evaluator_notes
    description: "Any additional observations about this agent trace"
    label_requirement:
      required: false
```

## Full Example: Evaluating a ReAct Agent
Here is a complete configuration for evaluating ReAct-style agent traces with per-step ratings:
```yaml
# project config
task_name: "ReAct Agent Evaluation"
task_dir: "."

data_files:
  - "data/react_traces.jsonl"

item_properties:
  id_key: trace_id
  text_key: task_description

agentic:
  enabled: true
  trace_converter: react
  display_type: agent_trace
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    collapse_threshold: 300
    show_step_numbers: true
    render_json: true

annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_efficiency
  - annotation_type: text
    name: failure_reason
    description: "If the agent failed, describe what went wrong"
    label_requirement:
      required: false

output_annotation_dir: "output/"
output_annotation_format: "jsonl"
```

Sample input data (`data/react_traces.jsonl`):
```json
{
  "trace_id": "react_001",
  "task_description": "Find the population of Tokyo and compare it to New York City",
  "trace": [
    {"type": "thought", "content": "I need to find the population of both cities. Let me search for Tokyo first."},
    {"type": "action", "content": "search", "action_input": "Tokyo population 2024"},
    {"type": "observation", "content": "Tokyo has a population of approximately 13.96 million in the city proper..."},
    {"type": "thought", "content": "Now I need to find New York City's population."},
    {"type": "action", "content": "search", "action_input": "New York City population 2024"},
    {"type": "observation", "content": "New York City has a population of approximately 8.34 million..."},
    {"type": "thought", "content": "Tokyo (13.96M) has about 67% more people than NYC (8.34M)."},
    {"type": "action", "content": "finish", "action_input": "Tokyo has ~13.96 million people vs NYC's ~8.34 million, making Tokyo about 67% larger by population."}
  ]
}
```

Start the server:

```shell
potato start config.yaml -p 8000
```

## Further Reading
- Evaluating AI Agents: A Complete Guide -- walkthrough of a full agent evaluation project
- Annotating Web Browsing Agents -- guide to web agent evaluation with screenshots and overlays
- Solo Mode -- combine agentic annotation with human-LLM collaborative labeling
- Per-Turn Ratings for Dialogue -- additional per-turn rating options
- Export Formats -- export agent evaluation data
For implementation details, see the source documentation.