# Live Agent Evaluation

Watch AI agents work in real time and annotate their behavior mid-execution with pause, instruct, and takeover controls. Supports web and coding agents with Anthropic, Ollama, and Claude Agent SDK backends.

*New in v2.4.0*
Live Agent Evaluation lets annotators watch an AI agent browse the web in real time and annotate its behavior as it runs — not after the fact. The agent takes screenshots, sends them to a vision LLM, receives actions, and executes them in a headless browser. Every step streams live to the annotator's screen.
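The control flow is easiest to see as code. Below is a minimal, hypothetical sketch of that observe-think-act loop, with the browser, LLM client, and streaming hooks injected as callables; the function and parameter names are invented for illustration and are not Potato's actual implementation.

```python
def run_agent(take_screenshot, call_llm, execute_action, on_step, max_steps=30):
    """One live session: screenshot -> vision LLM -> action, step by step.

    All four callables are placeholders for the real browser driver,
    LLM client, and SSE streaming machinery.
    """
    for step in range(max_steps):
        screenshot = take_screenshot()              # capture the current page
        thought, action = call_llm(screenshot)      # vision LLM picks the next action
        on_step(step, screenshot, thought, action)  # stream to the annotator's viewer
        if action == "done":                        # agent declares the task complete
            return "success"
        execute_action(action)                      # perform the action in the browser
    return "max_steps"                              # step budget exhausted
```

Pause, instruct, and takeover hook in between iterations of this loop, which is why the controls take effect at step boundaries.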
## Requirements

```bash
pip install playwright anthropic
playwright install chromium
export ANTHROPIC_API_KEY=your_key_here
```

## Configuration
```yaml
live_agent:
  endpoint_type: anthropic_vision
  ai_config:
    model: claude-sonnet-4-20250514
    api_key: ${ANTHROPIC_API_KEY}
    max_tokens: 4096
    temperature: 0.3
  system_prompt: |
    You are a web browsing agent. Complete the given task efficiently.
    At each step, describe your thought, then output an action.
  max_steps: 30
  step_delay: 1.0
  viewport:
    width: 1280
    height: 720
  allow_takeover: true
  allow_instructions: true

instance_display:
  fields:
    - key: task_description
      type: text
      label: "Task"
    - key: agent_trace
      type: live_agent
      label: "Live Agent Session"
      display_options:
        show_overlays: true
        show_filmstrip: true
        show_thought: true
        show_controls: true
```

## Configuration Reference
| Option | Type | Default | Description |
|---|---|---|---|
| `endpoint_type` | string | `anthropic_vision` | LLM provider for the agent |
| `ai_config.model` | string | `claude-sonnet-4-20250514` | Model to use |
| `ai_config.api_key` | string | env var | API key (use `${VAR}` syntax) |
| `ai_config.max_tokens` | int | 4096 | Max tokens per LLM response |
| `ai_config.temperature` | float | 0.3 | Sampling temperature |
| `system_prompt` | string | built-in | System prompt for the agent |
| `max_steps` | int | 30 | Maximum steps before stopping |
| `step_delay` | float | 1.0 | Seconds between steps |
| `viewport.width` | int | 1280 | Browser viewport width |
| `viewport.height` | int | 720 | Browser viewport height |
| `allow_takeover` | bool | true | Let annotators take manual control |
| `allow_instructions` | bool | true | Let annotators send mid-run instructions |
| `history_window` | int | 5 | Number of recent steps included in LLM context |
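Values written as `${VAR}` (such as `api_key: ${ANTHROPIC_API_KEY}`) are resolved from the environment. If you need the same behavior in your own tooling, a minimal sketch of that substitution, assuming plain regex expansion with no escaping and no default values, looks like this:

```python
import os
import re

def expand_env(value: str) -> str:
    """Replace ${VAR} placeholders with environment values (missing vars become "")."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)
```

For example, `expand_env("${ANTHROPIC_API_KEY}")` returns whatever that environment variable holds; strings without placeholders pass through unchanged.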
## Data Format

Each instance provides the task and starting URL:

```json
{
  "id": "task_001",
  "task_description": "Search for climate change on Wikipedia and find the year it was first described",
  "start_url": "https://en.wikipedia.org"
}
```

## Annotator Workflow
- The annotator reads the task description and clicks **Start Agent**
- A headless Chromium browser launches and connects to the LLM
- Screenshots stream live to the viewer as the agent navigates; each step shows the screenshot, the agent's thought, and the action taken
- The annotator can interact using the control panel:
  - **Pause / Resume**: halt the agent between steps
  - **Send Instructions**: inject a message into the agent's context mid-run
  - **Take Over**: switch to manual browsing control
  - **Stop**: end the session early
- When the session finishes (success, failure, or `max_steps` reached), the trace is saved and the display switches to review mode
- The annotator fills in the annotation schemes to evaluate the agent's performance
## Keyboard Shortcuts

| Key | Action |
|---|---|
| `Space` | Pause / Resume |
| `Escape` | Stop session |
## Adding Annotation Schemes

Combine the live agent display with any Potato annotation scheme:
```yaml
annotation_schemes:
  - annotation_type: radio
    name: task_success
    question: "Did the agent complete the task?"
    labels:
      - name: "Yes, fully"
      - name: "Partially"
      - name: "No"
  - annotation_type: likert
    name: efficiency
    question: "How efficiently did the agent work?"
    min_label: "Very inefficient"
    max_label: "Very efficient"
    scale: 5
  - annotation_type: text
    name: errors_observed
    question: "Describe any errors or unnecessary steps"
  - annotation_type: span
    name: error_steps
    question: "Mark any steps where the agent made an error"
    labels:
      - name: hallucination
      - name: wrong_target
      - name: unnecessary_action
```

## Full Example
```yaml
task_name: "Live Agent Evaluation Study"
task_dir: "."

live_agent:
  endpoint_type: anthropic_vision
  ai_config:
    model: claude-sonnet-4-20250514
    api_key: ${ANTHROPIC_API_KEY}
    max_tokens: 4096
    temperature: 0.3
  max_steps: 25
  step_delay: 1.5
  viewport:
    width: 1280
    height: 720
  allow_takeover: true
  allow_instructions: true
  history_window: 5

data_files:
  - "tasks.jsonl"

instance_display:
  fields:
    - key: task_description
      type: text
      label: "Task"
    - key: agent_trace
      type: live_agent
      label: "Live Session"
      display_options:
        show_overlays: true
        show_filmstrip: true
        show_thought: true
        show_controls: true

annotation_schemes:
  - annotation_type: radio
    name: task_success
    question: "Did the agent complete the task?"
    labels:
      - name: "Yes"
      - name: "Partially"
      - name: "No"
  - annotation_type: likert
    name: efficiency
    question: "Rate the agent's efficiency"
    scale: 5
    min_label: "Very inefficient"
    max_label: "Very efficient"
  - annotation_type: text
    name: notes
    question: "Notes on agent behavior"

output_annotation_dir: "output/"
output_annotation_format: "jsonl"
```

## Architecture
The live agent runs as a background thread in Flask. Screenshots and state changes are streamed to the browser via Server-Sent Events (SSE). The annotator controls (pause, instruct, takeover, stop) call REST endpoints that synchronize with the background thread.
```
Annotator (browser) <---- SSE stream ----- Flask Server ---- Playwright ----> Headless Browser
Annotator (browser) ----- REST control --> Flask Server <--- LLM API -------> Claude Vision
```
Screenshots are saved to `{task_dir}/live_sessions/` and served via the API for the filmstrip view.
## Trace Export

When a session completes, Potato automatically exports the full trace as `web_agent_trace`-compatible JSON, including:
- All steps with screenshots, actions, thoughts, and observations
- Any mid-run instructions sent by the annotator
- Timestamps and agent configuration metadata
- Annotator takeover events
This means completed live sessions can be reviewed later using the standard Web Agent Annotation viewer.
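Because the export is plain JSON, downstream scripts can post-process it directly. The sketch below tallies steps, mid-run instructions, and takeover events; the key names (`steps`, `instructions`, `takeover_events`) are illustrative guesses, so check an actual exported trace for the exact schema.

```python
import json

def summarize_trace(path):
    """Tally a saved live-agent trace.

    Key names here are illustrative, not a documented schema;
    inspect a real export before relying on them.
    """
    with open(path) as f:
        trace = json.load(f)
    return {
        "n_steps": len(trace.get("steps", [])),
        "n_instructions": len(trace.get("instructions", [])),
        "n_takeovers": len(trace.get("takeover_events", [])),
    }
```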
## Troubleshooting

**"Playwright is not installed"**: run `pip install playwright && playwright install chromium`.

**"Anthropic API key required"**: set the `ANTHROPIC_API_KEY` environment variable or use `api_key: ${ANTHROPIC_API_KEY}` in your config.

**Agent seems slow**: each step requires an LLM API call (typically 3–10 seconds), and the thinking indicator appears while the LLM processes. Reduce `history_window` to speed up long sessions.

**Screenshots not loading**: check that `task_dir` is writable and the server has available disk space.
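The first two checks can be automated before launching a session. A small hypothetical preflight helper (not part of Potato) might look like this:

```python
import importlib.util
import os

def preflight():
    """Return a list of setup problems to fix before starting a live session."""
    problems = []
    if importlib.util.find_spec("playwright") is None:
        problems.append("Playwright is not installed")
    if not os.environ.get("ANTHROPIC_API_KEY"):
        problems.append("Anthropic API key required")
    return problems
```

An empty list means both prerequisites are satisfied.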
## Coding Agent Backends

In addition to web browsing agents, Potato supports live observation of coding agents. Three backends are available:
### Ollama (Local, No API Key)

Run coding agent evaluation with fully local models; no API key needed.

```yaml
live_agent:
  endpoint_type: coding_agent
  backend: ollama
  ai_config:
    model: qwen2.5-coder:7b
    host: "http://localhost:11434"
  max_steps: 50
  project_dir: "./workspace"
```

### Anthropic API
Use Claude with tool use for coding agent evaluation.
```yaml
live_agent:
  endpoint_type: coding_agent
  backend: anthropic
  ai_config:
    model: claude-sonnet-4-20250514
    api_key: ${ANTHROPIC_API_KEY}
    max_tokens: 8192
  max_steps: 50
  project_dir: "./workspace"
```

### Claude Agent SDK
Full Claude Code capabilities for advanced coding agent sessions.
```yaml
live_agent:
  endpoint_type: coding_agent
  backend: claude_agent_sdk
  ai_config:
    max_turns: 50
  project_dir: "./workspace"
```

See Live Coding Agent for the complete reference, including rollback, branching, and trajectory export.
## Rollback and Checkpoints
For coding agent sessions, Potato creates a git commit after every file change. This enables:
- One-click rollback to any previous checkpoint
- Branch and replay — try a different approach from any checkpoint
- Full history of every file state for review
Checkpoints are managed automatically via a dedicated git branch per session.
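Conceptually, the checkpoint store behaves like an append-only list of snapshots with random-access restore. The toy in-memory sketch below models that behavior for illustration only; the real mechanism commits to a dedicated git branch.

```python
class CheckpointStore:
    """Toy model of commit-per-change checkpoints (the real store uses git)."""

    def __init__(self):
        self.checkpoints = []  # one {filename: content} snapshot per change

    def commit(self, files):
        # Snapshot the full file state after every change, like a git commit.
        self.checkpoints.append(dict(files))
        return len(self.checkpoints) - 1  # checkpoint id

    def rollback(self, checkpoint_id):
        # One-click rollback: restore the snapshot at checkpoint_id.
        # Earlier and later checkpoints both remain available for review.
        return dict(self.checkpoints[checkpoint_id])
```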
## Branching Trajectories
When an annotator rolls back and tries a different approach, Potato creates a branching trajectory. Both branches are preserved in the output, creating rich training data for:
- Process Reward Models — per-step correctness labels across branches
- Preference Learning — which branch produced better results
- Code Review Datasets — compare code quality across approaches
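A branching trajectory is naturally a tree of steps, with a new child added wherever the annotator rolled back and tried a different approach. A minimal illustrative sketch (invented names, not Potato's actual data model):

```python
class TrajectoryNode:
    """One step in a trajectory tree; branches share their common prefix."""

    def __init__(self, step, parent=None):
        self.step = step        # e.g. {"action": ..., "thought": ...}
        self.parent = parent
        self.children = []

def branch_from(node, step):
    """Start a new branch at any checkpoint; existing branches are preserved."""
    child = TrajectoryNode(step, parent=node)
    node.children.append(child)
    return child

def path_to_root(node):
    """Linearize one branch for export, root step first."""
    path = []
    while node is not None:
        path.append(node.step)
        node = node.parent
    return list(reversed(path))
```

Exporting `path_to_root` for each leaf yields one complete trajectory per branch, which is the shape process reward and preference datasets typically expect.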
## Further Reading
- Live Coding Agent — coding agent observation with Ollama, Anthropic, and Claude SDK
- Web Agent Annotation — review pre-recorded agent traces
- Agentic Annotation — overview of agent trace formats and converters
- Process Reward Annotation — PRM training data collection
- AI Support — LLM integration for annotation assistance
For implementation details, see the source documentation.