Live Agent Evaluation

How to evaluate an AI agent in real time with Potato's live agent display: pause it, send instructions, take over, roll back, and branch.

Most agent evaluation reviews a recorded trace. Live evaluation watches an agent run in real time and lets a human intervene by pausing it, sending instructions, taking control, or rolling back to try a different path. It captures what a recording cannot: where a person would have stepped in, and what better guidance looks like.

For the feature reference, see Live Agent Evaluation and Live Coding Agent.

What live evaluation adds

  • Pause and resume: stop the agent mid-task to inspect its state.
  • Send instructions: nudge it with guidance and observe how it adapts.
  • Take over: drive manually, then hand control back. The handoff points are valuable labels.
  • Rollback and branch: return to an earlier step and try an alternative, comparing paths from the same state.

This produces interventional data: counterfactual evidence about what helps, rather than observational labels alone.
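One way to picture the data these interventions produce is a session log in which each rollback forks a new branch that shares a prefix with its parent. The sketch below is illustrative only: `Session`, `Branch`, `Step`, and every field name are hypothetical and do not reflect Potato's storage format or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Step:
    """One event in a live session: an agent action or a human intervention."""
    index: int
    actor: str          # "agent" or "human"
    kind: str           # "action", "pause", "instruction", "takeover", "handback"
    content: str = ""


@dataclass
class Branch:
    """A path through the task; a child branch shares a prefix with its parent."""
    name: str
    parent: Optional[str] = None
    fork_index: int = 0  # number of parent steps inherited as the shared prefix
    steps: List[Step] = field(default_factory=list)


class Session:
    """Hypothetical in-memory log of a live evaluation session."""

    def __init__(self) -> None:
        self.branches: Dict[str, Branch] = {"main": Branch(name="main")}

    def path(self, name: str) -> List[Step]:
        """Full step list for a branch, including the inherited prefix."""
        branch = self.branches[name]
        if branch.parent is None:
            return list(branch.steps)
        return self.path(branch.parent)[: branch.fork_index] + branch.steps

    def record(self, name: str, actor: str, kind: str, content: str = "") -> Step:
        """Append a step to a branch, numbering it by its position in the path."""
        step = Step(index=len(self.path(name)), actor=actor, kind=kind, content=content)
        self.branches[name].steps.append(step)
        return step

    def rollback_and_branch(self, parent: str, to_index: int, name: str) -> Branch:
        """Fork a new branch that keeps the parent's first `to_index` steps."""
        if not 0 <= to_index <= len(self.path(parent)):
            raise ValueError("to_index is outside the parent path")
        branch = Branch(name=name, parent=parent, fork_index=to_index)
        self.branches[name] = branch
        return branch

    def handoff_points(self, name: str) -> List[int]:
        """Indices where control changed hands (takeover or handback)."""
        return [s.index for s in self.path(name) if s.kind in ("takeover", "handback")]


session = Session()
session.record("main", "agent", "action", "open settings page")
session.record("main", "agent", "action", "click the wrong menu")
# Roll back to just after the first step and try a guided alternative.
session.rollback_and_branch("main", to_index=1, name="retry")
session.record("retry", "human", "instruction", "use the search box instead")
session.record("retry", "agent", "action", "search for the setting")
```

Because both branches start from the same state, comparing `session.path("main")` with `session.path("retry")` isolates the effect of the instruction, which is what makes the data interventional.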

Setting it up

Live mode connects Potato to a running agent through an endpoint (an OpenAI-compatible proxy, a custom HTTP endpoint, or a coding-agent backend). The annotator interacts through the live agent display.

```yaml
live_agent:
  endpoint_type: anthropic_vision   # or coding_agent, openai_proxy, ...
  ai_config:
    model: claude-sonnet-4-20250514
    api_key: ${ANTHROPIC_API_KEY}
  max_steps: 30
  allow_takeover: true
  allow_instructions: true
```
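The `${ANTHROPIC_API_KEY}` placeholder keeps the secret out of the config file. The source does not say how or whether Potato expands it, so the following is only a sketch of the general pattern, using a hypothetical `resolve_env` helper on a dictionary that mirrors the YAML above. It is not Potato's config loader.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(value, env=None):
    """Recursively replace ${VAR} placeholders in strings with values from env.

    Hypothetical helper for illustration; raises KeyError if a variable is unset
    so that a missing API key fails loudly instead of being sent as a literal.
    """
    env = os.environ if env is None else env
    if isinstance(value, dict):
        return {key: resolve_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_env(item, env) for item in value]
    if isinstance(value, str):
        def _substitute(match):
            name = match.group(1)
            if name not in env:
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return _PLACEHOLDER.sub(_substitute, value)
    return value  # numbers, booleans, None pass through unchanged


# Dictionary equivalent of the YAML config shown above.
config = {
    "live_agent": {
        "endpoint_type": "anthropic_vision",
        "ai_config": {
            "model": "claude-sonnet-4-20250514",
            "api_key": "${ANTHROPIC_API_KEY}",
        },
        "max_steps": 30,
        "allow_takeover": True,
        "allow_instructions": True,
    }
}

# A stand-in environment is passed explicitly so the example runs without a real key.
resolved = resolve_env(config, env={"ANTHROPIC_API_KEY": "example-key"})
```

Passing `env` explicitly makes the substitution easy to test; omitting it falls back to `os.environ`.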

When to use it

  • Building guidelines: watching live reveals the failure modes worth encoding into a taxonomy for later batch labeling.
  • Interactive tasks: chat assistants and tool-using agents where the interaction, not just the transcript, is what you're judging.
  • Stress testing: probing how an agent recovers from a nudge or a forced detour.

Live evaluation is higher-touch and lower-throughput than reviewing recorded traces, so it's best for a focused sample or for designing the batch task. For volume, switch to trajectory annotation over recorded runs.
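Choosing the focused sample can be as simple as a seeded random split of the tasks you plan to run. The sketch below assumes tasks are identified by unique string IDs; `split_for_live_review` and the ID format are hypothetical and not part of Potato.

```python
import random
from typing import List, Sequence, Tuple


def split_for_live_review(
    task_ids: Sequence[str], live_count: int, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Pick `live_count` tasks for live evaluation; the rest go to batch review.

    Assumes task_ids are unique. A fixed seed makes the split reproducible.
    """
    if not 0 <= live_count <= len(task_ids):
        raise ValueError("live_count must be between 0 and len(task_ids)")
    rng = random.Random(seed)
    chosen = set(rng.sample(list(task_ids), live_count))
    # Preserve the original ordering in both output lists.
    live = [task for task in task_ids if task in chosen]
    batch = [task for task in task_ids if task not in chosen]
    return live, batch


task_ids = [f"task-{i:03d}" for i in range(200)]
live, batch = split_for_live_review(task_ids, live_count=15, seed=42)
```

The small `live` set goes through live evaluation to surface failure modes, and the larger `batch` set is annotated later from recorded runs.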

Further reading