# Live Agent Evaluation

Source: https://www.potatoannotator.com/docs/guides/live-agent-evaluation

**Most agent evaluation reviews a recorded trace. Live evaluation watches an agent run in real time and lets a human intervene by pausing it, sending instructions, taking control, or rolling back to try a different path.** It captures what a recording can't: where a person *would* have stepped in, and what better guidance looks like.

For the feature reference, see [Live Agent Evaluation](/docs/features/live-agent-evaluation) and [Live Coding Agent](/docs/features/live-coding-agent).

## What live evaluation adds

- **Pause and resume**: stop the agent mid-task to inspect its state.
- **Send instructions**: nudge it with guidance and observe how it adapts.
- **Take over**: drive manually, then hand control back. The handoff points are valuable labels.
- **Rollback and branch**: return to an earlier step and try an alternative, comparing paths from the same state.

This produces *interventional* data: counterfactuals about which interventions help, rather than purely observational labels.
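
One way to picture this data is as a chronological log of human actions keyed by agent step, plus branches that share a prefix with the original run. The sketch below is illustrative only: `Intervention`, `InterventionKind`, `handoff_points`, and `branch_from` are hypothetical names chosen for this example, not Potato's export schema or API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class InterventionKind(Enum):
    PAUSE = "pause"
    INSTRUCTION = "instruction"
    TAKEOVER = "takeover"
    HANDBACK = "handback"
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class Intervention:
    step: int                          # agent step at which the human acted
    kind: InterventionKind
    note: str = ""                     # instruction text or annotator comment
    target_step: Optional[int] = None  # step returned to (rollback only)


def handoff_points(log: List[Intervention]) -> List[Tuple[int, int]]:
    """Pair each takeover with the next handback, reading the log in order."""
    spans, start = [], None
    for event in log:
        if event.kind is InterventionKind.TAKEOVER and start is None:
            start = event.step
        elif event.kind is InterventionKind.HANDBACK and start is not None:
            spans.append((start, event.step))
            start = None
    return spans


def branch_from(path: List[str], rollback_step: int, alternative: List[str]) -> List[str]:
    """Keep the steps before rollback_step, then follow the alternative path."""
    return path[:rollback_step] + alternative


log = [
    Intervention(3, InterventionKind.PAUSE),
    Intervention(3, InterventionKind.INSTRUCTION, note="Use the search box instead of scrolling."),
    Intervention(7, InterventionKind.TAKEOVER),
    Intervention(9, InterventionKind.HANDBACK),
    Intervention(12, InterventionKind.ROLLBACK, target_step=10),
]
spans = handoff_points(log)

original = ["open page", "scroll", "click wrong link", "give up"]
branch = branch_from(original, 2, ["use search", "click result"])
```

Here `spans` holds the one takeover interval, steps 7 to 9, and `branch` shares its first two steps with `original`, so the two paths can be compared from the same state.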

## Setting it up

Live mode connects Potato to a running agent through an endpoint, for example an OpenAI-compatible proxy, a custom HTTP endpoint, or a coding-agent backend. The annotator interacts with the agent through the live agent display.

```yaml
live_agent:
  endpoint_type: anthropic_vision   # or coding_agent, openai_proxy, ...
  ai_config:
    model: claude-sonnet-4-20250514
    api_key: ${ANTHROPIC_API_KEY}
  max_steps: 30
  allow_takeover: true
  allow_instructions: true
```
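
The `${ANTHROPIC_API_KEY}` placeholder suggests the key is read from the environment rather than stored in the file. How Potato resolves it is not shown here, so the sketch below only illustrates the general pattern using the standard library, with a dict that mirrors the YAML above. `expand_env` and `check_live_agent` are hypothetical helpers, not part of Potato, and the checks are assumptions about what a sensible config looks like.

```python
import os
from string import Template


def expand_env(value, env=None):
    """Recursively replace ${VAR} placeholders in strings (default source: os.environ)."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        # substitute() raises KeyError when a referenced variable is missing
        return Template(value).substitute(env)
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    return value


def check_live_agent(cfg):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    steps = cfg.get("max_steps")
    if type(steps) is not int or steps <= 0:
        problems.append("max_steps must be a positive integer")
    for flag in ("allow_takeover", "allow_instructions"):
        if not isinstance(cfg.get(flag), bool):
            problems.append(f"{flag} must be true or false")
    if not cfg.get("ai_config", {}).get("api_key"):
        problems.append("ai_config.api_key is empty")
    return problems


raw = {
    "endpoint_type": "anthropic_vision",
    "ai_config": {
        "model": "claude-sonnet-4-20250514",
        "api_key": "${ANTHROPIC_API_KEY}",
    },
    "max_steps": 30,
    "allow_takeover": True,
    "allow_instructions": True,
}

# A placeholder value stands in for the real environment in this example.
cfg = expand_env(raw, env={"ANTHROPIC_API_KEY": "sk-example"})
problems = check_live_agent(cfg)
```

Failing on a missing variable, as `substitute()` does here, surfaces a forgotten `export` before any live session starts.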

## When to use it

- **Building guidelines**: watching an agent live reveals the failure modes worth encoding into a taxonomy for later batch labeling.
- **Interactive tasks**: chat assistants and tool-using agents where the *interaction*, not just the transcript, is what you're judging.
- **Stress testing**: probing how an agent recovers from a nudge or a forced detour.

Live evaluation is higher-touch and lower-throughput than reviewing recorded traces, so it's best for a focused sample or for designing the batch task. For volume, switch to [trajectory annotation](/docs/guides/agent-trajectory-annotation) over recorded runs.

## Further reading

- [Live Agent Evaluation feature reference](/docs/features/live-agent-evaluation)
- [Web-Agent Evaluation](/docs/guides/web-agent-evaluation)
- [How to Evaluate AI Agents](/docs/guides/evaluating-ai-agents)
