Agent Evaluation
Find answers to common questions about Potato. Can't find what you're looking for? Join our Discord or check the documentation.
Agent Evaluation
Yes. Potato has native trace converters for Claude Code, OpenCode, Cursor, Aider, and SWE-Agent. Tool calls render with purpose-built UI: red/green unified diff view for Edit/Write, dark monospace terminal blocks for Bash, line-numbered code for Read/Grep, and a file tree sidebar that groups all touched files by operation. Long outputs auto-collapse.
Yes. Potato includes a Web Agent display with SVG overlays for click markers, bounding boxes, mouse paths, and scroll indicators. Two modes: Review Mode for filmstrip navigation through pre-recorded screenshots, and Creation Mode for iframe-based live web browsing with automatic interaction recording. Trace converters ship for WebArena, Mind2Web, and Anthropic Computer Use formats.
Yes. Live Agent mode connects an LLM vision model (Anthropic Claude via Playwright) to a headless browser. The agent takes screenshots, the LLM plans actions, and Potato streams the session to the annotator via Server-Sent Events. Annotators can pause, send instructions, or take over manual control mid-session. Configure via the `live_agent` display type.
Yes. Coding agent mode supports checkpoint/rollback at any step and branching/replay for exploring alternate trajectories. Useful for counterfactual evaluation, A/B comparison between agent decisions, and capturing high-quality training data where annotators iteratively refine an agent run.
Yes. The trajectory_eval schema (based on TRAIL and AgentRewardBench) displays each step as a card. Annotators mark correctness, classify error types from a configurable taxonomy with subtypes (reasoning, execution, safety, etc.), assign severity with weighted scores, and write per-step rationales. An auto-computed quality score aggregates severity penalties across the trajectory.
Yes. Potato ships process reward and code review schemas for step-level evaluation of coding agents. Both annotation types export directly to PRM and DPO formats for downstream RLHF training. See the coding-agent-evaluation example project.
Yes. The LLM Chat Sidebar is a collapsible AI assistant panel with multi-turn conversation. It receives the task description, label set, and current instance text as context. Native multi-turn support for OpenAI, Anthropic, and Ollama. All conversations are logged as behavioral data for later analysis of annotator-LLM collaboration.
Yes. Potato converts LangChain/LangSmith traces automatically. You can also set up real-time trace ingestion via webhook — new traces appear in the annotator queue as they're generated.
Yes. Install `pip install potato-annotation[langchain]` and attach `PotatoCallbackHandler` to your chain. It tracks parent-child chain/LLM/tool runs and sends LangSmith-compatible payloads to Potato on root completion. Combined with the webhook receiver, you can ingest live agent traces into annotation queues without manual export.
Thirteen formats across three categories. **Frameworks**: LangChain, LangFuse, OpenAI, Anthropic, MCP (Model Context Protocol), OpenTelemetry, ATIF. **Web agents**: WebArena, raw web traces. **Coding agents**: Claude Code, Aider, SWE-Agent. Plus a generic JSONL ingestion path with `structured_turns` schema for any custom format. See /integrations for the full list.
Yes. A coding-agent project can layer trajectory_eval (per-step errors), span annotation (highlight hallucinations in agent reasoning), pairwise comparison (which agent did better), and likert ratings (overall quality) on the same trace. Potato's multi-schema architecture means annotators see all schemas in one interface for the same trace.
No. The live agent supports Ollama for fully local inference with no API key. Use any Ollama-compatible model with vision support. For coding agents, any Ollama model works.
Yes. Potato supports CrewAI, AutoGen, and LangGraph trace formats. The multi-agent evaluation example shows how to assess agent coordination, redundant work, and communication quality.
Use the generic ReAct converter (thought/action/observation format) or the webhook API to send traces in any JSON format. Potato auto-detects common structures. You can also write a custom converter in Python.
Yes. Live agent mode lets annotators pause the agent, send text instructions, or take over manual control. For coding agents, annotators can rollback to any checkpoint and branch with different instructions.
Use the agent_eval exporter: `python -m potato.export -f agent_eval -o results/`. For PRM data, use `-f prm`. For DPO/RLHF preference pairs, use `-f dpo`. Export produces JSON/CSV format.
Still Have Questions?
Our community is here to help. Join Discord for real-time support or browse the documentation for detailed guides.