Question 1

Can I evaluate traces from coding agents like Claude Code, Cursor, or SWE-Agent?

Accepted Answer

Yes. Potato has native trace converters for Claude Code, OpenCode, Cursor, Aider, and SWE-Agent. Tool calls render with purpose-built UI: red/green unified diff view for Edit/Write, dark monospace terminal blocks for Bash, line-numbered code for Read/Grep, and a file tree sidebar that groups all touched files by operation. Long outputs auto-collapse.

Question 2

Can I evaluate web-browsing agents?

Accepted Answer

Yes. Potato includes a Web Agent display with SVG overlays for click markers, bounding boxes, mouse paths, and scroll indicators. Two modes: Review Mode for filmstrip navigation through pre-recorded screenshots, and Creation Mode for iframe-based live web browsing with automatic interaction recording. Trace converters ship for WebArena, Mind2Web, and Anthropic Computer Use formats.

Question 3

Can I evaluate multi-agent systems with several cooperating agents?

Accepted Answer

Yes. Potato renders a multi-agent run as a clickable interaction graph of agents and handoffs, and adds schemas to attribute a failure to the responsible agent and step, review every handoff for inter-agent misalignment, score each agent and the team, and tag tool contention and emergent behavior across agents. See the multi-agent team evaluation docs.

Question 4

Can I evaluate computer-use, voice, or video agents?

Accepted Answer

Yes. Potato has purpose-built schemas for multimodal agents: GUI/computer-use trajectories with per-step screenshots and click grounding, full-duplex voice timelines with barge-in detection, video temporal grounding with a live IoU against the model's prediction, aligned speech-transcript error tagging, interleaved multimodal reasoning, and document table-grid structure. See the multimodal-agent evaluation docs.

Question 5

Can annotators watch an AI agent browse the web in real time?

Accepted Answer

Yes. Live Agent mode connects an LLM vision model (Anthropic Claude via Playwright) to a headless browser. The agent takes screenshots, the LLM plans actions, and Potato streams the session to the annotator via Server-Sent Events. Annotators can pause, send instructions, or take over manual control mid-session. Configure via the `live_agent` display type.

Question 6

Can I rewind, branch, or replay an agent session during evaluation?

Accepted Answer

Yes. Coding agent mode supports checkpoint/rollback at any step and branching/replay for exploring alternate trajectories. Useful for counterfactual evaluation, A/B comparison between agent decisions, and capturing high-quality training data where annotators iteratively refine an agent run.

Question 7

Can I annotate errors at the individual step level of an agent trajectory?

Accepted Answer

Yes. The trajectory_eval schema (based on TRAIL and AgentRewardBench) displays each step as a card. Annotators mark correctness, classify error types from a configurable taxonomy with subtypes (reasoning, execution, safety, etc.), assign severity with weighted scores, and write per-step rationales. An auto-computed quality score aggregates severity penalties across the trajectory.

Question 8

Can I collect process reward model (PRM) and code review training data?

Accepted Answer

Yes. Potato ships process reward and code review schemas for step-level evaluation of coding agents. Both annotation types export directly to PRM and DPO formats for downstream RLHF training. See the coding-agent-evaluation example project.

Question 9

Can annotators ask an LLM for help while evaluating an agent?

Accepted Answer

Yes. The LLM Chat Sidebar is a collapsible AI assistant panel with multi-turn conversation. It receives the task description, label set, and current instance text as context. Native multi-turn support for OpenAI, Anthropic, and Ollama. All conversations are logged as behavioral data for later analysis of annotator-LLM collaboration.

Question 10

Can I use Potato with agents built on LangChain?

Accepted Answer

Yes. Potato converts LangChain/LangSmith traces automatically. You can also set up real-time trace ingestion via webhook — new traces appear in the annotator queue as they're generated.

Question 11

Can I capture agent traces automatically from my LangChain app?

Accepted Answer

Yes. Install `pip install potato-annotation[langchain]` and attach `PotatoCallbackHandler` to your chain. It tracks parent-child chain/LLM/tool runs and sends LangSmith-compatible payloads to Potato on root completion. Combined with the webhook receiver, you can ingest live agent traces into annotation queues without manual export.

Question 12

Which agent trace formats does Potato support out of the box?

Accepted Answer

Thirteen formats across three categories. **Frameworks**: LangChain, LangFuse, OpenAI, Anthropic, MCP (Model Context Protocol), OpenTelemetry, ATIF. **Web agents**: WebArena, raw web traces. **Coding agents**: Claude Code, Aider, SWE-Agent. Plus a generic JSONL ingestion path with `structured_turns` schema for any custom format. See /integrations for the full list.

Question 13

Can I combine multiple evaluation schemas in a single agent annotation task?

Accepted Answer

Yes. A coding-agent project can layer trajectory_eval (per-step errors), span annotation (highlight hallucinations in agent reasoning), pairwise comparison (which agent did better), and likert ratings (overall quality) on the same trace. Potato's multi-schema architecture means annotators see all schemas in one interface for the same trace.

Question 14

Do I need a GPU or API key for live agent evaluation?

Accepted Answer

No. The live agent supports Ollama for fully local inference with no API key. Use any Ollama-compatible model with vision support. For coding agents, any Ollama model works.

Question 15

Can I evaluate multi-agent systems?

Accepted Answer

Yes. Potato supports CrewAI, AutoGen, and LangGraph trace formats. The multi-agent evaluation example shows how to assess agent coordination, redundant work, and communication quality.

Question 16

What if my agent framework isn't listed?

Accepted Answer

Use the generic ReAct converter (thought/action/observation format) or the webhook API to send traces in any JSON format. Potato auto-detects common structures. You can also write a custom converter in Python.

Question 17

Can annotators interact with agents during evaluation?

Accepted Answer

Yes. Live agent mode lets annotators pause the agent, send text instructions, or take over manual control. For coding agents, annotators can rollback to any checkpoint and branch with different instructions.

Question 18

How do I export agent annotations for training?

Accepted Answer

Use the agent_eval exporter: `python -m potato.export -f agent_eval -o results/`. For PRM data, use `-f prm`. For DPO/RLHF preference pairs, use `-f dpo`. Export produces JSON/CSV format.

Agent Evaluation

Agent Evaluation

Still Have Questions?