# Live Agent Evaluation

Watch AI agents work in real time and annotate their behavior mid-execution with pause, instruct, and takeover controls. Supports web and coding agents with Anthropic, Ollama, and Claude Agent SDK backends.

*New in v2.4.0*
Live Agent Evaluation lets annotators watch an AI agent browse the web in real time and annotate its behavior as it runs — not after the fact. The agent takes screenshots, sends them to a vision LLM, receives actions, and executes them in a headless browser. Every step streams live to the annotator's screen.
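The control flow is easiest to see as code. Below is a minimal, hypothetical sketch of that observe-think-act loop, with the browser, LLM client, and streaming hooks injected as callables; the function and parameter names are invented for illustration and are not Potato's actual implementation.

```python
def run_agent(take_screenshot, call_llm, execute_action, on_step, max_steps=30):
    """One live session: screenshot -> vision LLM -> action, step by step.

    All four callables are placeholders for the real browser driver,
    LLM client, and SSE streaming machinery.
    """
    for step in range(max_steps):
        screenshot = take_screenshot()              # capture the current page
        thought, action = call_llm(screenshot)      # vision LLM picks the next action
        on_step(step, screenshot, thought, action)  # stream to the annotator's viewer
        if action == "done":                        # agent declares the task complete
            return "success"
        execute_action(action)                      # perform the action in the browser
    return "max_steps"                              # step budget exhausted
```

Pause, instruct, and takeover hook in between iterations of this loop, which is why the controls take effect at step boundaries.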
## Requirements

```bash
pip install playwright anthropic
playwright install chromium
export ANTHROPIC_API_KEY=your_key_here
```

## Configuration
```yaml
live_agent:
  endpoint_type: anthropic_vision
  ai_config:
    model: claude-sonnet-4-20250514
    api_key: ${ANTHROPIC_API_KEY}
    max_tokens: 4096
    temperature: 0.3
  system_prompt: |
    You are a web browsing agent. Complete the given task efficiently.
    At each step, describe your thought, then output an action.
  max_steps: 30
  step_delay: 1.0
  viewport:
    width: 1280
    height: 720
  allow_takeover: true
  allow_instructions: true

instance_display:
  fields:
    - key: task_description
      type: text
      label: "Task"
    - key: agent_trace
      type: live_agent
      label: "Live Agent Session"
      display_options:
        show_overlays: true
        show_filmstrip: true
        show_thought: true
        show_controls: true
```

## Configuration Reference
| Option | Type | Default | Description |
|---|---|---|---|
| `endpoint_type` | string | `anthropic_vision` | LLM provider for the agent |
| `ai_config.model` | string | `claude-sonnet-4-20250514` | Model to use |
| `ai_config.api_key` | string | env var | API key (use `${VAR}` syntax) |
| `ai_config.max_tokens` | int | 4096 | Max tokens per LLM response |
| `ai_config.temperature` | float | 0.3 | Sampling temperature |
| `system_prompt` | string | built-in | System prompt for the agent |
| `max_steps` | int | 30 | Maximum steps before stopping |
| `step_delay` | float | 1.0 | Seconds between steps |
| `viewport.width` | int | 1280 | Browser viewport width |
| `viewport.height` | int | 720 | Browser viewport height |
| `allow_takeover` | bool | true | Let annotators take manual control |
| `allow_instructions` | bool | true | Let annotators send mid-run instructions |
| `history_window` | int | 5 | Number of recent steps included in LLM context |
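Values written as `${VAR}` (such as `api_key: ${ANTHROPIC_API_KEY}`) are resolved from the environment. If you need the same behavior in your own tooling, a minimal sketch of that substitution, assuming plain regex expansion with no escaping and no default values, looks like this:

```python
import os
import re

def expand_env(value: str) -> str:
    """Replace ${VAR} placeholders with environment values (missing vars become "")."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)
```

For example, `expand_env("${ANTHROPIC_API_KEY}")` returns whatever that environment variable holds; strings without placeholders pass through unchanged.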
## Data Format

Each instance provides the task and starting URL:

```json
{
  "id": "task_001",
  "task_description": "Search for climate change on Wikipedia and find the year it was first described",
  "start_url": "https://en.wikipedia.org"
}
```

## Annotator Workflow
- The annotator reads the task description and clicks **Start Agent**
- A headless Chromium browser launches and connects to the LLM
- Screenshots stream live to the viewer as the agent navigates; each step shows the screenshot, the agent's thought, and the action taken
- The annotator can interact using the control panel:
  - **Pause / Resume**: halt the agent between steps
  - **Send Instructions**: inject a message into the agent's context mid-run
  - **Take Over**: switch to manual browsing control
  - **Stop**: end the session early
- When the session finishes (success, failure, or `max_steps` reached), the trace is saved and the display switches to review mode
- The annotator fills in the annotation schemes to evaluate the agent's performance
## Keyboard Shortcuts

| Key | Action |
|---|---|
| `Space` | Pause / Resume |
| `Escape` | Stop session |
## Adding Annotation Schemes

Combine the live agent display with any Potato annotation scheme:
```yaml
annotation_schemes:
  - annotation_type: radio
    name: task_success
    question: "Did the agent complete the task?"
    labels:
      - name: "Yes, fully"
      - name: "Partially"
      - name: "No"
  - annotation_type: likert
    name: efficiency
    question: "How efficiently did the agent work?"
    min_label: "Very inefficient"
    max_label: "Very efficient"
    scale: 5
  - annotation_type: text
    name: errors_observed
    question: "Describe any errors or unnecessary steps"
  - annotation_type: span
    name: error_steps
    question: "Mark any steps where the agent made an error"
    labels:
      - name: hallucination
      - name: wrong_target
      - name: unnecessary_action
```

## Full Example
```yaml
task_name: "Live Agent Evaluation Study"
task_dir: "."

live_agent:
  endpoint_type: anthropic_vision
  ai_config:
    model: claude-sonnet-4-20250514
    api_key: ${ANTHROPIC_API_KEY}
    max_tokens: 4096
    temperature: 0.3
  max_steps: 25
  step_delay: 1.5
  viewport:
    width: 1280
    height: 720
  allow_takeover: true
  allow_instructions: true
  history_window: 5

data_files:
  - "tasks.jsonl"

instance_display:
  fields:
    - key: task_description
      type: text
      label: "Task"
    - key: agent_trace
      type: live_agent
      label: "Live Session"
      display_options:
        show_overlays: true
        show_filmstrip: true
        show_thought: true
        show_controls: true

annotation_schemes:
  - annotation_type: radio
    name: task_success
    question: "Did the agent complete the task?"
    labels:
      - name: "Yes"
      - name: "Partially"
      - name: "No"
  - annotation_type: likert
    name: efficiency
    question: "Rate the agent's efficiency"
    scale: 5
    min_label: "Very inefficient"
    max_label: "Very efficient"
  - annotation_type: text
    name: notes
    question: "Notes on agent behavior"

output_annotation_dir: "output/"
output_annotation_format: "jsonl"
```

## Architecture
The live agent runs as a background thread in Flask. Screenshots and state changes are streamed to the browser via Server-Sent Events (SSE). The annotator controls (pause, instruct, takeover, stop) call REST endpoints that synchronize with the background thread.
```
Annotator (browser) <---- SSE stream ----- Flask Server ---- Playwright ----> Headless Browser
Annotator (browser) ----- REST control --> Flask Server <--- LLM API -------> Claude Vision
```
Screenshots are saved to `{task_dir}/live_sessions/` and served via the API for the filmstrip view.
## Trace Export

When a session completes, Potato automatically exports the full trace as `web_agent_trace`-compatible JSON, including:
- All steps with screenshots, actions, thoughts, and observations
- Any mid-run instructions sent by the annotator
- Timestamps and agent configuration metadata
- Annotator takeover events
This means completed live sessions can be reviewed later using the standard Web Agent Annotation viewer.
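Because the export is plain JSON, downstream scripts can post-process it directly. The sketch below tallies steps, mid-run instructions, and takeover events; the key names (`steps`, `instructions`, `takeover_events`) are illustrative guesses, so check an actual exported trace for the exact schema.

```python
import json

def summarize_trace(path):
    """Tally a saved live-agent trace.

    Key names here are illustrative, not a documented schema;
    inspect a real export before relying on them.
    """
    with open(path) as f:
        trace = json.load(f)
    return {
        "n_steps": len(trace.get("steps", [])),
        "n_instructions": len(trace.get("instructions", [])),
        "n_takeovers": len(trace.get("takeover_events", [])),
    }
```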
## Troubleshooting

**"Playwright is not installed"**: run `pip install playwright && playwright install chromium`.

**"Anthropic API key required"**: set the `ANTHROPIC_API_KEY` environment variable or use `api_key: ${ANTHROPIC_API_KEY}` in your config.

**Agent seems slow**: each step requires an LLM API call (typically 3–10 seconds), and the thinking indicator appears while the LLM processes. Reduce `history_window` to speed up long sessions.

**Screenshots not loading**: check that `task_dir` is writable and the server has available disk space.
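The first two checks can be automated before launching a session. A small hypothetical preflight helper (not part of Potato) might look like this:

```python
import importlib.util
import os

def preflight():
    """Return a list of setup problems to fix before starting a live session."""
    problems = []
    if importlib.util.find_spec("playwright") is None:
        problems.append("Playwright is not installed")
    if not os.environ.get("ANTHROPIC_API_KEY"):
        problems.append("Anthropic API key required")
    return problems
```

An empty list means both prerequisites are satisfied.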
## Coding Agent Backends

In addition to web browsing agents, Potato supports live observation of coding agents. Three backends are available:
### Ollama (Local, No API Key)

Run coding agent evaluation with fully local models; no API key needed.

```yaml
live_agent:
  endpoint_type: coding_agent
  backend: ollama
  ai_config:
    model: qwen2.5-coder:7b
    host: "http://localhost:11434"
  max_steps: 50
  project_dir: "./workspace"
```

### Anthropic API
Use Claude with tool use for coding agent evaluation.
```yaml
live_agent:
  endpoint_type: coding_agent
  backend: anthropic
  ai_config:
    model: claude-sonnet-4-20250514
    api_key: ${ANTHROPIC_API_KEY}
    max_tokens: 8192
  max_steps: 50
  project_dir: "./workspace"
```

### Claude Agent SDK
Full Claude Code capabilities for advanced coding agent sessions.
```yaml
live_agent:
  endpoint_type: coding_agent
  backend: claude_agent_sdk
  ai_config:
    max_turns: 50
  project_dir: "./workspace"
```

See Live Coding Agent for the complete reference, including rollback, branching, and trajectory export.
## Rollback and Checkpoints
For coding agent sessions, Potato creates a git commit after every file change. This enables:
- One-click rollback to any previous checkpoint
- Branch and replay — try a different approach from any checkpoint
- Full history of every file state for review
Checkpoints are managed automatically via a dedicated git branch per session.
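Conceptually, the checkpoint store behaves like an append-only list of snapshots with random-access restore. The toy in-memory sketch below models that behavior for illustration only; the real mechanism commits to a dedicated git branch.

```python
class CheckpointStore:
    """Toy model of commit-per-change checkpoints (the real store uses git)."""

    def __init__(self):
        self.checkpoints = []  # one {filename: content} snapshot per change

    def commit(self, files):
        # Snapshot the full file state after every change, like a git commit.
        self.checkpoints.append(dict(files))
        return len(self.checkpoints) - 1  # checkpoint id

    def rollback(self, checkpoint_id):
        # One-click rollback: restore the snapshot at checkpoint_id.
        # Earlier and later checkpoints both remain available for review.
        return dict(self.checkpoints[checkpoint_id])
```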
## Branching Trajectories
When an annotator rolls back and tries a different approach, Potato creates a branching trajectory. Both branches are preserved in the output, creating rich training data for:
- Process Reward Models — per-step correctness labels across branches
- Preference Learning — which branch produced better results
- Code Review Datasets — compare code quality across approaches
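A branching trajectory is naturally a tree of steps, with a new child added wherever the annotator rolled back and tried a different approach. A minimal illustrative sketch (invented names, not Potato's actual data model):

```python
class TrajectoryNode:
    """One step in a trajectory tree; branches share their common prefix."""

    def __init__(self, step, parent=None):
        self.step = step        # e.g. {"action": ..., "thought": ...}
        self.parent = parent
        self.children = []

def branch_from(node, step):
    """Start a new branch at any checkpoint; existing branches are preserved."""
    child = TrajectoryNode(step, parent=node)
    node.children.append(child)
    return child

def path_to_root(node):
    """Linearize one branch for export, root step first."""
    path = []
    while node is not None:
        path.append(node.step)
        node = node.parent
    return list(reversed(path))
```

Exporting `path_to_root` for each leaf yields one complete trajectory per branch, which is the shape process reward and preference datasets typically expect.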
## Further Reading
- Live Coding Agent — coding agent observation with Ollama, Anthropic, and Claude SDK
- Web Agent Annotation — review pre-recorded agent traces
- Agentic Annotation — overview of agent trace formats and converters
- Process Reward Annotation — PRM training data collection
- AI Support — LLM integration for annotation assistance
For implementation details, see the source documentation.