
Live Agent Evaluation

New in v2.4.0

Live Agent Evaluation lets annotators watch an AI agent browse the web in real time and annotate its behavior as it runs — not after the fact. The agent takes screenshots, sends them to a vision LLM, receives actions, and executes them in a headless browser. Every step streams live to the annotator's screen.

Requirements

bash
pip install playwright anthropic
playwright install chromium
export ANTHROPIC_API_KEY=your_key_here

Configuration

yaml
live_agent:
  endpoint_type: anthropic_vision
  ai_config:
    model: claude-sonnet-4-20250514
    api_key: ${ANTHROPIC_API_KEY}
    max_tokens: 4096
    temperature: 0.3
  system_prompt: |
    You are a web browsing agent. Complete the given task efficiently.
    At each step, describe your thought, then output an action.
  max_steps: 30
  step_delay: 1.0
  viewport:
    width: 1280
    height: 720
  allow_takeover: true
  allow_instructions: true
 
instance_display:
  fields:
    - key: task_description
      type: text
      label: "Task"
    - key: agent_trace
      type: live_agent
      label: "Live Agent Session"
      display_options:
        show_overlays: true
        show_filmstrip: true
        show_thought: true
        show_controls: true

Configuration Reference

| Option | Type | Default | Description |
|---|---|---|---|
| `endpoint_type` | string | `anthropic_vision` | LLM provider for the agent |
| `ai_config.model` | string | `claude-sonnet-4-20250514` | Model to use |
| `ai_config.api_key` | string | env var | API key (use `${VAR}` syntax) |
| `ai_config.max_tokens` | int | 4096 | Max tokens per LLM response |
| `ai_config.temperature` | float | 0.3 | Sampling temperature |
| `system_prompt` | string | built-in | System prompt for the agent |
| `max_steps` | int | 30 | Maximum steps before stopping |
| `step_delay` | float | 1.0 | Seconds between steps |
| `viewport.width` | int | 1280 | Browser viewport width |
| `viewport.height` | int | 720 | Browser viewport height |
| `allow_takeover` | bool | true | Let annotators take manual control |
| `allow_instructions` | bool | true | Let annotators send mid-run instructions |
| `history_window` | int | 5 | Number of recent steps included in LLM context |
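Values written as `${VAR}` are resolved from the environment when the config is loaded. A minimal sketch of that kind of substitution (the `resolve_env_refs` helper is illustrative, not Potato's actual loader):

```python
import os

def resolve_env_refs(value):
    """Recursively expand ${VAR} references in config values.
    Illustrative helper -- not Potato's actual config loader."""
    if isinstance(value, str):
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {k: resolve_env_refs(v) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_env_refs(v) for v in value]
    return value

# Hypothetical key value for demonstration only.
os.environ["ANTHROPIC_API_KEY"] = "sk-test"
cfg = {"ai_config": {"api_key": "${ANTHROPIC_API_KEY}", "max_tokens": 4096}}
print(resolve_env_refs(cfg)["ai_config"]["api_key"])  # sk-test
```

Keeping the raw `${VAR}` form in the YAML file means the key never needs to be committed alongside the config.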

Data Format

Each instance provides the task and starting URL:

json
{
  "id": "task_001",
  "task_description": "Search for climate change on Wikipedia and find the year it was first described",
  "start_url": "https://en.wikipedia.org"
}
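A short script can generate the data file, one JSON object per line. This writes to `tasks.jsonl` (the filename used in the full example's `data_files`); the second task is purely illustrative:

```python
import json

# One task instance per line, matching the instance format above.
tasks = [
    {
        "id": "task_001",
        "task_description": "Search for climate change on Wikipedia and find the year it was first described",
        "start_url": "https://en.wikipedia.org",
    },
    {
        # Hypothetical second task, for illustration.
        "id": "task_002",
        "task_description": "Find today's featured article on the Wikipedia main page",
        "start_url": "https://en.wikipedia.org/wiki/Main_Page",
    },
]

with open("tasks.jsonl", "w") as f:
    for task in tasks:
        f.write(json.dumps(task) + "\n")
```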

Annotator Workflow

  1. The annotator reads the task description and clicks Start Agent
  2. A headless Chromium browser launches and connects to the LLM
  3. Screenshots stream live to the viewer as the agent navigates — each step shows the screenshot, the agent's thought, and the action taken
  4. The annotator can interact using the control panel:
    • Pause / Resume — halt the agent between steps
    • Send Instructions — inject a message into the agent's context mid-run
    • Take Over — switch to manual browsing control
    • Stop — end the session early
  5. When the session finishes (success, failure, or max_steps reached), the trace is saved and the display switches to review mode
  6. The annotator fills in the annotation schemes to evaluate the agent's performance
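The observe/decide/act loop behind steps 2–5 can be sketched as follows. This is a simplified illustration, not Potato's implementation; `take_screenshot`, `ask_llm`, and `execute` stand in for the Playwright and vision-LLM calls:

```python
import time

def run_agent(take_screenshot, ask_llm, execute, max_steps=30, step_delay=1.0,
              pending_instructions=None):
    """Simplified observe/decide/act loop. The three callables stand in for
    the Playwright screenshot, the vision-LLM request, and action execution."""
    trace = []
    for step in range(max_steps):
        shot = take_screenshot()
        # Mid-run annotator instructions are injected into the LLM context.
        extra = pending_instructions.pop(0) if pending_instructions else None
        thought, action = ask_llm(shot, extra)
        trace.append({"step": step, "thought": thought,
                      "action": action, "instruction": extra})
        if action == "done":
            break
        execute(action)
        time.sleep(step_delay)
    return trace
```

In this sketch, a pause/stop flag checked at the top of each iteration would implement the Pause and Stop controls, and Take Over would suspend the loop while the annotator drives the browser directly.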

Keyboard Shortcuts

| Key | Action |
|---|---|
| Space | Pause / Resume |
| Escape | Stop session |

Adding Annotation Schemes

Combine the live agent display with any of Potato's annotation schemes:

yaml
annotation_schemes:
  - annotation_type: radio
    name: task_success
    question: "Did the agent complete the task?"
    labels:
      - name: "Yes, fully"
      - name: "Partially"
      - name: "No"
  - annotation_type: likert
    name: efficiency
    question: "How efficiently did the agent work?"
    min_label: "Very inefficient"
    max_label: "Very efficient"
    scale: 5
  - annotation_type: text
    name: errors_observed
    question: "Describe any errors or unnecessary steps"
  - annotation_type: span
    name: error_steps
    question: "Mark any steps where the agent made an error"
    labels:
      - name: hallucination
      - name: wrong_target
      - name: unnecessary_action

Full Example

yaml
task_name: "Live Agent Evaluation Study"
task_dir: "."
 
live_agent:
  endpoint_type: anthropic_vision
  ai_config:
    model: claude-sonnet-4-20250514
    api_key: ${ANTHROPIC_API_KEY}
    max_tokens: 4096
    temperature: 0.3
  max_steps: 25
  step_delay: 1.5
  viewport:
    width: 1280
    height: 720
  allow_takeover: true
  allow_instructions: true
  history_window: 5
 
data_files:
  - "tasks.jsonl"
 
instance_display:
  fields:
    - key: task_description
      type: text
      label: "Task"
    - key: agent_trace
      type: live_agent
      label: "Live Session"
      display_options:
        show_overlays: true
        show_filmstrip: true
        show_thought: true
        show_controls: true
 
annotation_schemes:
  - annotation_type: radio
    name: task_success
    question: "Did the agent complete the task?"
    labels:
      - name: "Yes"
      - name: "Partially"
      - name: "No"
  - annotation_type: likert
    name: efficiency
    question: "Rate the agent's efficiency"
    scale: 5
    min_label: "Very inefficient"
    max_label: "Very efficient"
  - annotation_type: text
    name: notes
    question: "Notes on agent behavior"
 
output_annotation_dir: "output/"
output_annotation_format: "jsonl"

Architecture

The live agent runs as a background thread in Flask. Screenshots and state changes are streamed to the browser via Server-Sent Events (SSE). The annotator controls (pause, instruct, takeover, stop) call REST endpoints that synchronize with the background thread.

text
Annotator (browser)  ◄── SSE stream ────  Flask Server  ── Playwright ──►  Headless Browser
Annotator (browser)  ──► REST control ──  Flask Server  ◄── LLM API ───►  Claude Vision

Screenshots are saved to {task_dir}/live_sessions/ and served via the API for the filmstrip view.
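The streaming half of this pipeline can be sketched without the Flask wiring: the agent thread pushes step updates onto a queue, and a generator turns them into SSE frames. In the real server a route would wrap `event_stream()` in a `Response` with mimetype `text/event-stream`; the function and payload names here are illustrative, not Potato's actual endpoints.

```python
import json
import queue

def format_sse(data, event=None):
    """Serialize one Server-Sent Event frame ("event:" line, then "data:")."""
    frame = f"data: {json.dumps(data)}\n\n"
    if event is not None:
        frame = f"event: {event}\n{frame}"
    return frame

def event_stream(events):
    """Yield SSE frames as the agent thread posts step updates to the queue."""
    while True:
        update = events.get()   # blocks until the next step arrives
        if update is None:      # sentinel: session ended
            return
        yield format_sse(update, event="step")

events = queue.Queue()
events.put({"step": 0, "action": "click"})
events.put(None)
for frame in event_stream(events):
    print(frame, end="")  # prints one "event: step" frame for the queued update
```

Because `events.get()` blocks, the generator produces frames only when the background thread actually posts them, which is what keeps the annotator's view in lockstep with the agent.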

Trace Export

When a session completes, Potato automatically exports the full trace as web_agent_trace-compatible JSON, including:

  • All steps with screenshots, actions, thoughts, and observations
  • Any mid-run instructions sent by the annotator
  • Timestamps and agent configuration metadata
  • Annotator takeover events

This means completed live sessions can be reviewed later using the standard Web Agent Annotation viewer.
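Since the export is plain JSON, post-hoc analysis is straightforward. A sketch of summarizing one trace (the field names below are assumptions for illustration; consult the `web_agent_trace` format for the actual schema):

```python
import json

def summarize_trace(trace):
    """Pull step counts, actions, and mid-run instructions out of a trace dict.
    Field names are assumed for illustration; see the web_agent_trace format."""
    steps = trace.get("steps", [])
    return {
        "n_steps": len(steps),
        "actions": [s.get("action") for s in steps],
        "instructions": trace.get("instructions", []),
    }

# Hypothetical exported trace, trimmed to the fields the summary uses.
trace = {
    "steps": [
        {"action": "click", "thought": "Open the search box"},
        {"action": "type", "thought": "Enter the query"},
    ],
    "instructions": ["Try the search bar at the top"],
}
print(summarize_trace(trace)["n_steps"])  # 2
```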

Troubleshooting

"Playwright is not installed" — Run pip install playwright && playwright install chromium.

"Anthropic API key required" — Set the ANTHROPIC_API_KEY environment variable or use api_key: ${ANTHROPIC_API_KEY} in your config.

Agent seems slow — Each step requires an LLM API call (typically 3–10 seconds). The thinking indicator appears while the LLM processes. Reduce history_window to speed up long sessions.

Screenshots not loading — Check that task_dir is writable and the server has available disk space.

Further Reading

For implementation details, see the source documentation.