Live Agent Evaluation
Watch AI agents browse the web in real time in Potato and annotate their behavior mid-execution, using pause, send-instructions, and manual-takeover controls.
New in v2.4.0
Live Agent Evaluation lets annotators watch an AI agent browse the web in real time and annotate its behavior as it runs — not after the fact. The agent takes screenshots, sends them to a vision LLM, receives actions, and executes them in a headless browser. Every step streams live to the annotator's screen.
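The observe → think → act loop described above can be sketched as follows. The LLM and browser are stubbed out here, and all names (run_agent, FakeModel, FakeBrowser) are illustrative rather than Potato's actual internals; the real loop uses Playwright and an Anthropic vision endpoint.

```python
# Minimal sketch of the live-agent loop: screenshot -> vision LLM -> action.
# In Potato the browser is Playwright Chromium, the model is Claude Vision,
# and each step is streamed to the annotator over SSE.

def run_agent(model, browser, task, max_steps=30):
    trace = []
    for step in range(max_steps):
        screenshot = browser.screenshot()                  # observe
        thought, action = model.decide(task, screenshot)   # think
        trace.append({"step": step, "thought": thought, "action": action})
        if action["type"] == "done":                       # agent declares completion
            break
        browser.execute(action)                            # act
    return trace


class FakeModel:
    """Stub LLM that clicks once, then finishes."""
    def __init__(self):
        self.calls = 0

    def decide(self, task, screenshot):
        self.calls += 1
        if self.calls == 1:
            return "Clicking the search box", {"type": "click", "x": 100, "y": 200}
        return "Task complete", {"type": "done"}


class FakeBrowser:
    def screenshot(self):
        return b"fake-png-bytes"   # placeholder for real screenshot data

    def execute(self, action):
        pass                        # a real browser would click/type/navigate


trace = run_agent(FakeModel(), FakeBrowser(), "find the Wikipedia article")
# trace holds two steps: one click, then a terminating "done" action
```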
Requirements
```bash
pip install playwright anthropic
playwright install chromium
export ANTHROPIC_API_KEY=your_key_here
```

Configuration
```yaml
live_agent:
  endpoint_type: anthropic_vision
  ai_config:
    model: claude-sonnet-4-20250514
    api_key: ${ANTHROPIC_API_KEY}
    max_tokens: 4096
    temperature: 0.3
  system_prompt: |
    You are a web browsing agent. Complete the given task efficiently.
    At each step, describe your thought, then output an action.
  max_steps: 30
  step_delay: 1.0
  viewport:
    width: 1280
    height: 720
  allow_takeover: true
  allow_instructions: true

instance_display:
  fields:
    - key: task_description
      type: text
      label: "Task"
    - key: agent_trace
      type: live_agent
      label: "Live Agent Session"
      display_options:
        show_overlays: true
        show_filmstrip: true
        show_thought: true
        show_controls: true
```

Configuration Reference
| Option | Type | Default | Description |
|---|---|---|---|
| endpoint_type | string | anthropic_vision | LLM provider for the agent |
| ai_config.model | string | claude-sonnet-4-20250514 | Model to use |
| ai_config.api_key | string | env var | API key (use ${VAR} syntax) |
| ai_config.max_tokens | int | 4096 | Max tokens per LLM response |
| ai_config.temperature | float | 0.3 | Sampling temperature |
| system_prompt | string | built-in | System prompt for the agent |
| max_steps | int | 30 | Maximum steps before stopping |
| step_delay | float | 1.0 | Seconds between steps |
| viewport.width | int | 1280 | Browser viewport width |
| viewport.height | int | 720 | Browser viewport height |
| allow_takeover | bool | true | Let annotators take manual control |
| allow_instructions | bool | true | Let annotators send mid-run instructions |
| history_window | int | 5 | Number of recent steps included in LLM context |
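The ${VAR} syntax for api_key resolves against environment variables at load time; that substitution can be sketched as below (the helper name expand_env_vars is illustrative, not Potato's actual function):

```python
import os
import re

def expand_env_vars(value: str) -> str:
    """Replace ${VAR} references with values from the environment."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)

os.environ["ANTHROPIC_API_KEY"] = "sk-demo"
print(expand_env_vars("${ANTHROPIC_API_KEY}"))  # → sk-demo
```

Keeping the key out of the YAML file means configs can be committed and shared without leaking credentials.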
Data Format
Each instance provides the task and starting URL:
```json
{
  "id": "task_001",
  "task_description": "Search for climate change on Wikipedia and find the year it was first described",
  "start_url": "https://en.wikipedia.org"
}
```

Annotator Workflow
- The annotator reads the task description and clicks Start Agent
- A headless Chromium browser launches and connects to the LLM
- Screenshots stream live to the viewer as the agent navigates — each step shows the screenshot, the agent's thought, and the action taken
- The annotator can interact using the control panel:
- Pause / Resume — halt the agent between steps
- Send Instructions — inject a message into the agent's context mid-run
- Take Over — switch to manual browsing control
- Stop — end the session early
- When the session finishes (success, failure, or max_steps reached), the trace is saved and the display switches to review mode
- The annotator fills in the annotation schemes to evaluate the agent's performance
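Pause / Resume works by halting the agent between steps rather than mid-action. One way to sketch that coordination is with a threading.Event gate the agent loop checks before each step; the class and method names here are illustrative, not Potato's internals:

```python
import threading

class AgentControls:
    """Pause/stop gate checked by the agent loop between steps."""

    def __init__(self):
        self._running = threading.Event()
        self._running.set()          # start unpaused
        self._stopped = False

    def pause(self):
        self._running.clear()

    def resume(self):
        self._running.set()

    def stop(self):
        self._stopped = True
        self._running.set()          # unblock a paused loop so it can exit

    def wait_if_paused(self) -> bool:
        """Block while paused; return False once the session is stopped."""
        self._running.wait()
        return not self._stopped


controls = AgentControls()
controls.pause()                      # annotator hits Pause
controls.resume()                     # annotator hits Resume
print(controls.wait_if_paused())      # → True (session still live)
```

The REST control endpoints would call pause(), resume(), and stop(), while the background agent thread calls wait_if_paused() at the top of every step.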
Keyboard Shortcuts
| Key | Action |
|---|---|
| Space | Pause / Resume |
| Escape | Stop session |
Adding Annotation Schemes
Combine live agent display with any Potato annotation schemes:
```yaml
annotation_schemes:
  - annotation_type: radio
    name: task_success
    question: "Did the agent complete the task?"
    labels:
      - name: "Yes, fully"
      - name: "Partially"
      - name: "No"
  - annotation_type: likert
    name: efficiency
    question: "How efficiently did the agent work?"
    min_label: "Very inefficient"
    max_label: "Very efficient"
    scale: 5
  - annotation_type: text
    name: errors_observed
    question: "Describe any errors or unnecessary steps"
  - annotation_type: span
    name: error_steps
    question: "Mark any steps where the agent made an error"
    labels:
      - name: hallucination
      - name: wrong_target
      - name: unnecessary_action
```

Full Example
```yaml
task_name: "Live Agent Evaluation Study"
task_dir: "."

live_agent:
  endpoint_type: anthropic_vision
  ai_config:
    model: claude-sonnet-4-20250514
    api_key: ${ANTHROPIC_API_KEY}
    max_tokens: 4096
    temperature: 0.3
  max_steps: 25
  step_delay: 1.5
  viewport:
    width: 1280
    height: 720
  allow_takeover: true
  allow_instructions: true
  history_window: 5

data_files:
  - "tasks.jsonl"

instance_display:
  fields:
    - key: task_description
      type: text
      label: "Task"
    - key: agent_trace
      type: live_agent
      label: "Live Session"
      display_options:
        show_overlays: true
        show_filmstrip: true
        show_thought: true
        show_controls: true

annotation_schemes:
  - annotation_type: radio
    name: task_success
    question: "Did the agent complete the task?"
    labels:
      - name: "Yes"
      - name: "Partially"
      - name: "No"
  - annotation_type: likert
    name: efficiency
    question: "Rate the agent's efficiency"
    scale: 5
    min_label: "Very inefficient"
    max_label: "Very efficient"
  - annotation_type: text
    name: notes
    question: "Notes on agent behavior"

output_annotation_dir: "output/"
output_annotation_format: "jsonl"
```

Architecture
The live agent runs as a background thread in Flask. Screenshots and state changes are streamed to the browser via Server-Sent Events (SSE). The annotator controls (pause, instruct, takeover, stop) call REST endpoints that synchronize with the background thread.
```
Annotator (browser) ◄── SSE stream ──── Flask server ── Playwright ──► Headless browser
Annotator (browser) ──► REST control ─►      │
                                             └── LLM API ◄──► Claude Vision
```
Screenshots are saved to {task_dir}/live_sessions/ and served via the API for the filmstrip view.
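Server-Sent Events frame each update as a data: line followed by a blank line. A dependency-free sketch of how step updates might be framed for the stream (the event shape and function names are assumptions, not Potato's wire format):

```python
import json

def sse_format(event: dict) -> str:
    """Serialize one agent update as a Server-Sent Events frame."""
    return f"data: {json.dumps(event)}\n\n"

def stream_steps(steps):
    """Generator that a Flask route could return with mimetype='text/event-stream'."""
    for step in steps:
        yield sse_format(step)

frames = list(stream_steps([{"step": 0, "action": "click"}]))
# each frame looks like: 'data: {"step": 0, "action": "click"}\n\n'
```

In a real Flask route, the generator would pull events from a queue fed by the background agent thread, so the HTTP response stays open for the whole session.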
Trace Export
When a session completes, Potato automatically exports the full trace as web_agent_trace-compatible JSON, including:
- All steps with screenshots, actions, thoughts, and observations
- Any mid-run instructions sent by the annotator
- Timestamps and agent configuration metadata
- Annotator takeover events
This means completed live sessions can be reviewed later using the standard Web Agent Annotation viewer.
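A trace exported this way is plain JSON, so it can be loaded and inspected directly. The sketch below is hedged: the field names (steps, thought, action, instructions) are assumptions based on the list above, not a documented schema, so check an actual export before relying on them.

```python
import json

# Hypothetical example trace; real exports follow the web_agent_trace format.
raw = json.dumps({
    "steps": [
        {"step": 0, "thought": "Open the search box", "action": {"type": "click"}},
        {"step": 1, "thought": "Done", "action": {"type": "done"}},
    ],
    "instructions": [{"step": 0, "text": "Use the search bar, not links"}],
})

trace = json.loads(raw)
summary = [s["action"]["type"] for s in trace["steps"]]
print(summary)  # → ['click', 'done']
```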
Troubleshooting
"Playwright is not installed" — Run pip install playwright && playwright install chromium.
"Anthropic API key required" — Set the ANTHROPIC_API_KEY environment variable or use api_key: ${ANTHROPIC_API_KEY} in your config.
Agent seems slow — Each step requires an LLM API call (typically 3–10 seconds). The thinking indicator appears while the LLM processes. Reduce history_window to speed up long sessions.
Screenshots not loading — Check that task_dir is writable and the server has available disk space.
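The history_window tip above works because the setting bounds how much of the trace is resent to the LLM on each step; conceptually it is just a tail slice over the step history (a sketch, with an illustrative helper name):

```python
def build_context(steps, history_window=5):
    """Keep only the most recent steps when assembling the LLM prompt."""
    return steps[-history_window:] if history_window else steps

steps = list(range(12))
print(build_context(steps, history_window=5))  # → [7, 8, 9, 10, 11]
```

A smaller window means fewer screenshots and thoughts in each API call, which directly cuts latency and token cost on long sessions.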
Further Reading
- Web Agent Annotation — review pre-recorded agent traces
- Agentic Annotation — overview of agent trace formats and converters
- AI Support — LLM integration for annotation assistance
For implementation details, see the source documentation.