# Live Agent Evaluation

Source: https://www.potatoannotator.com/docs/features/live-agent-evaluation

*New in v2.4.0*

Live Agent Evaluation lets annotators watch an AI agent browse the web in real time and annotate its behavior as it runs — not after the fact. The agent takes screenshots, sends them to a vision LLM, receives actions, and executes them in a headless browser. Every step streams live to the annotator's screen.
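The screenshot, model, action cycle described above can be sketched as a small loop. This is an illustrative sketch rather than Potato's actual implementation: `run_agent`, `Step`, and the reply shape (`thought` / `action` keys, with `"done"` as a stop signal) are hypothetical, and the browser and LLM are passed in as plain callables so the example runs without Playwright or an API key.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    index: int
    thought: str
    action: str


def run_agent(
    take_screenshot: Callable[[], bytes],
    ask_model: Callable[[bytes, List[Step]], Dict[str, str]],
    execute: Callable[[str], None],
    on_step: Callable[[Step], None],
    max_steps: int = 30,
) -> List[Step]:
    """Screenshot -> model -> action -> execute, until "done" or max_steps."""
    steps: List[Step] = []
    for i in range(max_steps):
        image = take_screenshot()
        # Hypothetical reply shape: {"thought": ..., "action": ...}
        reply = ask_model(image, steps)
        step = Step(i, reply["thought"], reply["action"])
        steps.append(step)
        on_step(step)  # e.g. push the step to the annotator's screen
        if step.action == "done":
            break
        execute(step.action)
    return steps
```

Passing the browser and model in as callables also makes it straightforward to replay a scripted sequence of replies when testing the surrounding annotation setup.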

## Requirements

```bash
pip install playwright anthropic
playwright install chromium
export ANTHROPIC_API_KEY=your_key_here
```

## Configuration

```yaml
live_agent:
  endpoint_type: anthropic_vision
  ai_config:
    model: claude-sonnet-4-20250514
    api_key: ${ANTHROPIC_API_KEY}
    max_tokens: 4096
    temperature: 0.3
  system_prompt: |
    You are a web browsing agent. Complete the given task efficiently.
    At each step, describe your thought, then output an action.
  max_steps: 30
  step_delay: 1.0
  viewport:
    width: 1280
    height: 720
  allow_takeover: true
  allow_instructions: true

instance_display:
  fields:
    - key: task_description
      type: text
      label: "Task"
    - key: agent_trace
      type: live_agent
      label: "Live Agent Session"
      display_options:
        show_overlays: true
        show_filmstrip: true
        show_thought: true
        show_controls: true
```

## Configuration Reference

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `endpoint_type` | string | `anthropic_vision` | LLM provider for the agent |
| `ai_config.model` | string | `claude-sonnet-4-20250514` | Model to use |
| `ai_config.api_key` | string | env var | API key (use `${VAR}` syntax) |
| `ai_config.max_tokens` | int | `4096` | Max tokens per LLM response |
| `ai_config.temperature` | float | `0.3` | Sampling temperature |
| `system_prompt` | string | built-in | System prompt for the agent |
| `max_steps` | int | `30` | Maximum steps before stopping |
| `step_delay` | float | `1.0` | Seconds between steps |
| `viewport.width` | int | `1280` | Browser viewport width |
| `viewport.height` | int | `720` | Browser viewport height |
| `allow_takeover` | bool | `true` | Let annotators take manual control |
| `allow_instructions` | bool | `true` | Let annotators send mid-run instructions |
| `history_window` | int | `5` | Number of recent steps included in LLM context |
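To make `history_window` concrete, here is a minimal sketch of trimming the step history before it is sent to the LLM. The function name `recent_history` and the step dictionaries are hypothetical; how Potato actually formats past steps into the prompt is not specified here.

```python
from typing import Dict, List


def recent_history(steps: List[Dict[str, str]], history_window: int = 5) -> List[Dict[str, str]]:
    """Return only the last `history_window` steps (all of them if fewer exist)."""
    if history_window <= 0:
        # Guard needed because steps[-0:] would return every step.
        return []
    return steps[-history_window:]
```

A smaller window means less context per request, which is why lowering it can shorten per-step latency in long sessions, at the cost of the agent remembering fewer of its earlier actions.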

## Data Format

Each instance provides an ID, the task description, and the starting URL:

```json
{
  "id": "task_001",
  "task_description": "Search for climate change on Wikipedia and find the year it was first described",
  "start_url": "https://en.wikipedia.org"
}
```
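A task file can be generated with a short script. The sketch below assumes one JSON object per line (as the `tasks.jsonl` name in the full example suggests) and treats the three keys shown above as required; both are assumptions, and `write_tasks` / `read_tasks` are hypothetical helper names.

```python
import json
from typing import Dict, Iterable, List

# Assumption: these are the keys every live-agent instance needs.
REQUIRED_KEYS = ("id", "task_description", "start_url")


def write_tasks(path: str, tasks: Iterable[Dict[str, str]]) -> int:
    """Write one JSON object per line; return the number of tasks written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for task in tasks:
            missing = [k for k in REQUIRED_KEYS if k not in task]
            if missing:
                raise ValueError(f"task {task.get('id', '?')} is missing {missing}")
            f.write(json.dumps(task, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_tasks(path: str) -> List[Dict[str, str]]:
    """Load tasks back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```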

## Annotator Workflow

1. The annotator reads the task description and clicks **Start Agent**
2. A headless Chromium browser launches and the agent starts querying the LLM
3. Screenshots stream live to the viewer as the agent navigates — each step shows the screenshot, the agent's thought, and the action taken
4. The annotator can interact using the control panel:
   - **Pause / Resume** — halt the agent between steps
   - **Send Instructions** — inject a message into the agent's context mid-run
   - **Take Over** — switch to manual browsing control
   - **Stop** — end the session early
5. When the session finishes (success, failure, or `max_steps` reached), the trace is saved and the display switches to review mode
6. The annotator fills in the annotation schemes to evaluate the agent's performance
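The pause, resume, and stop controls in step 4 can be modeled with two `threading.Event` flags that the agent loop checks between steps. This is a sketch of one possible approach under that assumption, not Potato's code; `SessionControls` and `run_steps` are hypothetical names.

```python
import threading
from typing import Callable


class SessionControls:
    """Flags a control panel could flip while the agent loop is running."""

    def __init__(self) -> None:
        self._running = threading.Event()
        self._running.set()  # set means "not paused"
        self._stopped = threading.Event()

    def pause(self) -> None:
        self._running.clear()

    def resume(self) -> None:
        self._running.set()

    def stop(self) -> None:
        self._stopped.set()
        self._running.set()  # wake a paused loop so it can exit

    def wait_until_allowed(self) -> bool:
        """Block while paused; return False once the session is stopped."""
        self._running.wait()
        return not self._stopped.is_set()


def run_steps(controls: SessionControls, max_steps: int, do_step: Callable[[int], None]) -> int:
    """Run up to max_steps, honoring pause and stop between steps."""
    done = 0
    for i in range(max_steps):
        if not controls.wait_until_allowed():
            break
        do_step(i)
        done += 1
    return done
```

Because the flags are only checked between steps, a pause takes effect after the current step finishes, which matches the "halt the agent between steps" behavior described above.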

### Keyboard Shortcuts

| Key | Action |
|-----|--------|
| `Space` | Pause / Resume |
| `Escape` | Stop session |

## Adding Annotation Schemes

Combine the live agent display with any of Potato's annotation schemes:

```yaml
annotation_schemes:
  - annotation_type: radio
    name: task_success
    question: "Did the agent complete the task?"
    labels:
      - name: "Yes, fully"
      - name: "Partially"
      - name: "No"
  - annotation_type: likert
    name: efficiency
    question: "How efficiently did the agent work?"
    min_label: "Very inefficient"
    max_label: "Very efficient"
    scale: 5
  - annotation_type: text
    name: errors_observed
    question: "Describe any errors or unnecessary steps"
  - annotation_type: span
    name: error_steps
    question: "Mark any steps where the agent made an error"
    labels:
      - name: hallucination
      - name: wrong_target
      - name: unnecessary_action
```

## Full Example

```yaml
task_name: "Live Agent Evaluation Study"
task_dir: "."

live_agent:
  endpoint_type: anthropic_vision
  ai_config:
    model: claude-sonnet-4-20250514
    api_key: ${ANTHROPIC_API_KEY}
    max_tokens: 4096
    temperature: 0.3
  max_steps: 25
  step_delay: 1.5
  viewport:
    width: 1280
    height: 720
  allow_takeover: true
  allow_instructions: true
  history_window: 5

data_files:
  - "tasks.jsonl"

instance_display:
  fields:
    - key: task_description
      type: text
      label: "Task"
    - key: agent_trace
      type: live_agent
      label: "Live Session"
      display_options:
        show_overlays: true
        show_filmstrip: true
        show_thought: true
        show_controls: true

annotation_schemes:
  - annotation_type: radio
    name: task_success
    question: "Did the agent complete the task?"
    labels:
      - name: "Yes"
      - name: "Partially"
      - name: "No"
  - annotation_type: likert
    name: efficiency
    question: "Rate the agent's efficiency"
    scale: 5
    min_label: "Very inefficient"
    max_label: "Very efficient"
  - annotation_type: text
    name: notes
    question: "Notes on agent behavior"

output_annotation_dir: "output/"
output_annotation_format: "jsonl"
```

## Architecture

The live agent runs in a background thread inside the Flask server process. Screenshots and state changes stream to the annotator's browser over Server-Sent Events (SSE). The annotator controls (pause, instruct, take over, stop) call REST endpoints that synchronize with that background thread.

```
Annotator (browser)  <── SSE stream ──  Flask Server  ── Playwright ──► Headless Browser
                     ──► REST control ─►              ◄── LLM API ────► Claude Vision
```

Screenshots are saved to `{task_dir}/live_sessions/` and served via the API for the filmstrip view.
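As an illustration of the SSE side of this design, the sketch below formats events in the standard `event:` / `data:` wire format and drains them from a queue that a background thread would fill. The event names, payload fields, and the `None` end-of-stream sentinel are assumptions made for the example.

```python
import json
import queue
from typing import Iterator


def format_sse(event: str, data: dict) -> str:
    """Serialize one Server-Sent Event; a blank line terminates the message."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_events(q: "queue.Queue") -> Iterator[str]:
    """Yield SSE messages from the queue until a None sentinel arrives."""
    while True:
        item = q.get()
        if item is None:  # assumption: the agent thread enqueues None when done
            break
        event, data = item
        yield format_sse(event, data)
```

In a Flask view, a generator like `stream_events` could be returned inside a `Response` with `mimetype="text/event-stream"`, while the agent thread puts `(event, data)` tuples on the queue after each step.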

## Trace Export

When a session completes, Potato automatically exports the full trace as `web_agent_trace`-compatible JSON, including:

- All steps with screenshots, actions, thoughts, and observations
- Any mid-run instructions sent by the annotator
- Timestamps and agent configuration metadata
- Annotator takeover events

This means completed live sessions can be reviewed later using the standard [Web Agent Annotation](/docs/features/web-agent-annotation) viewer.

## Troubleshooting

**"Playwright is not installed"** — Run `pip install playwright && playwright install chromium`.

**"Anthropic API key required"** — Set the `ANTHROPIC_API_KEY` environment variable or use `api_key: ${ANTHROPIC_API_KEY}` in your config.

**Agent seems slow** — Each step requires an LLM API call (typically 3–10 seconds). The thinking indicator appears while the LLM processes. Reduce `history_window` to speed up long sessions.

**Screenshots not loading** — Check that `task_dir` is writable and the server has available disk space.
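Several of the problems above can be checked before starting the server. The following pre-flight sketch uses only the standard library; `preflight` is a hypothetical helper, and the 100 MB free-space threshold is an arbitrary assumption. It does not verify that the Chromium browser itself has been downloaded by `playwright install chromium`.

```python
import importlib.util
import os
import shutil
from typing import Dict, Mapping, Optional


def preflight(
    task_dir: str,
    env: Optional[Mapping[str, str]] = None,
    min_free_mb: int = 100,  # arbitrary threshold chosen for this sketch
) -> Dict[str, bool]:
    """Return a pass/fail flag for each prerequisite."""
    env = os.environ if env is None else env
    dir_ok = os.path.isdir(task_dir)
    free_mb = shutil.disk_usage(task_dir).free / 2**20 if dir_ok else 0.0
    return {
        "playwright_installed": importlib.util.find_spec("playwright") is not None,
        "api_key_set": bool(env.get("ANTHROPIC_API_KEY")),
        "task_dir_writable": dir_ok and os.access(task_dir, os.W_OK),
        "disk_space_ok": dir_ok and free_mb >= min_free_mb,
    }
```

Calling `preflight(".")` from the project directory and printing any keys whose value is `False` gives a quick summary of what still needs fixing.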

## Coding Agent Backends

In addition to web browsing agents, Potato supports live observation of coding agents. Three backends are available:

### Ollama (Local, No API Key)

Run coding agent evaluation with fully local models — no API key needed.

```yaml
live_agent:
  endpoint_type: coding_agent
  backend: ollama
  ai_config:
    model: qwen2.5-coder:7b
    host: "http://localhost:11434"
  max_steps: 50
  project_dir: "./workspace"
```

### Anthropic API

Use Claude with tool use for coding agent evaluation.

```yaml
live_agent:
  endpoint_type: coding_agent
  backend: anthropic
  ai_config:
    model: claude-sonnet-4-20250514
    api_key: ${ANTHROPIC_API_KEY}
    max_tokens: 8192
  max_steps: 50
  project_dir: "./workspace"
```

### Claude Agent SDK

Use the Claude Agent SDK backend when a session needs full Claude Code capabilities.

```yaml
live_agent:
  endpoint_type: coding_agent
  backend: claude_agent_sdk
  ai_config:
    max_turns: 50
  project_dir: "./workspace"
```

See [Live Coding Agent](/docs/features/live-coding-agent) for the complete reference including rollback, branching, and trajectory export.

## Rollback and Checkpoints

For coding agent sessions, Potato creates a git commit after every file change. This enables:

- **One-click rollback** to any previous checkpoint
- **Branch and replay** — try a different approach from any checkpoint
- **Full history** of every file state for review

Checkpoints are managed automatically via a dedicated git branch per session.
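The checkpoint idea can be sketched with plain git commands driven from Python. This is an assumption-laden illustration, not Potato's implementation: the helper names, the `session-<id>` branch naming, and the commit messages are all hypothetical. It requires the `git` CLI to be installed.

```python
import subprocess
from pathlib import Path

# Identity and signing are set per command so the sketch does not depend on global git config.
_GIT = [
    "git",
    "-c", "user.name=checkpoint",
    "-c", "user.email=checkpoint@example.invalid",
    "-c", "commit.gpgsign=false",
]


def git(repo: Path, *args: str) -> str:
    """Run a git command inside the repo and return its trimmed stdout."""
    out = subprocess.run([*_GIT, *args], cwd=repo, check=True, capture_output=True, text=True)
    return out.stdout.strip()


def start_session(repo: Path, session_id: str) -> str:
    """Create the repo if needed and switch to a dedicated session branch."""
    git(repo, "init")
    git(repo, "commit", "--allow-empty", "-m", "session start")
    git(repo, "checkout", "-b", f"session-{session_id}")
    return git(repo, "rev-parse", "HEAD")


def checkpoint(repo: Path, message: str) -> str:
    """Commit the current state of every file and return the commit hash."""
    git(repo, "add", "-A")
    git(repo, "commit", "--allow-empty", "-m", message)
    return git(repo, "rev-parse", "HEAD")


def branch_from(repo: Path, commit: str, name: str) -> None:
    """Restore files to a checkpoint on a new branch, leaving the old branch intact."""
    git(repo, "checkout", "-b", name, commit)
```

Branching from the checkpoint instead of resetting the session branch keeps the later commits reachable, so both the original attempt and the retry remain available for review.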

## Branching Trajectories

When an annotator rolls back and tries a different approach, Potato creates a branching trajectory. Both branches are preserved in the output, producing training data for:

- **Process Reward Models** — per-step correctness labels across branches
- **Preference Learning** — which branch produced better results
- **Code Review Datasets** — compare code quality across approaches
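One way to reason about branching trajectories is as a tree of steps in which each step points to its parent. The sketch below uses a hypothetical in-memory representation (Potato's actual output format is not shown here) to list the branches and find the step where two of them diverge, which is the natural anchor for a preference pair.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    step_id: str
    parent: Optional[str]  # None for the first step of the session
    action: str
    label: Optional[str] = None  # per-step judgment, e.g. "correct" or "error"


def path_to_root(nodes: Dict[str, Node], leaf: str) -> List[str]:
    """Step ids from the first step down to the given leaf."""
    path: List[str] = []
    cur: Optional[str] = leaf
    while cur is not None:
        path.append(cur)
        cur = nodes[cur].parent
    return path[::-1]


def leaves(nodes: Dict[str, Node]) -> List[str]:
    """Steps with no children; each one is the end of a branch."""
    parents = {n.parent for n in nodes.values()}
    return [sid for sid in nodes if sid not in parents]


def branch_point(nodes: Dict[str, Node], leaf_a: str, leaf_b: str) -> Optional[str]:
    """Last step shared by two branches, or None if they share nothing."""
    shared: Optional[str] = None
    for a, b in zip(path_to_root(nodes, leaf_a), path_to_root(nodes, leaf_b)):
        if a != b:
            break
        shared = a
    return shared
```

Given the branch point, the two diverging suffixes can be compared as a chosen/rejected pair, while the `label` field carries the per-step judgments a process reward model would train on.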

## Further Reading

- [Live Coding Agent](/docs/features/live-coding-agent) — coding agent observation with Ollama, the Anthropic API, and the Claude Agent SDK
- [Web Agent Annotation](/docs/features/web-agent-annotation) — review pre-recorded agent traces
- [Agentic Annotation](/docs/features/agentic-annotation) — overview of agent trace formats and converters
- [Process Reward Annotation](/docs/features/process-reward-annotation) — PRM training data collection
- [AI Support](/docs/features/ai-support) — LLM integration for annotation assistance

For implementation details, see the [source documentation](https://github.com/davidjurgens/potato/blob/master/docs/live_agent.md).
