# Agentic Annotation

Source: https://www.potatoannotator.com/docs/features/agentic-annotation

*New in v2.3.0*

AI agents are increasingly deployed for complex multi-step tasks: browsing the web, writing code, calling APIs, and orchestrating sub-agents. But evaluating whether an agent *actually did the right thing* requires human judgment at a granularity that traditional annotation tools cannot support. A single agent trace may contain dozens of steps, tool calls, intermediate reasoning, screenshots, and branching decisions. Annotators need to see all of this context, navigate it efficiently, and provide structured evaluations at both the trace level and the individual step level.

Potato's agentic annotation system addresses this with four capabilities:

1. **13 trace format converters** that normalize agent logs from any major framework into a unified format
2. **5 specialized display types** optimized for different agent modalities (tool-use, web browsing, coding, chat, live observation)
3. **9 pre-built annotation schemas** covering the most common agent evaluation dimensions
4. **4 dedicated annotation types** for advanced evaluation: trajectory evaluation, rubric evaluation, pairwise comparison, and process reward annotation

## Trace Format Converters

Agent traces come in wildly different formats depending on the framework. Potato ships 13 converters that normalize these into a unified internal representation. You specify the converter in your config, or let Potato auto-detect the format.

### Converter Reference

| Converter | Source Format | Key Fields Extracted |
|-----------|--------------|---------------------|
| `openai` | OpenAI Assistants API / function calling logs | messages, tool_calls, function results |
| `anthropic` | Anthropic Claude tool_use / Messages API | content blocks, tool_use, tool_result |
| `swebench` | SWE-bench task traces | patch, test results, trajectory |
| `opentelemetry` | OpenTelemetry span exports (JSON) | spans, attributes, events, parent-child |
| `mcp` | Model Context Protocol sessions | tool definitions, call/response pairs |
| `multi_agent` | CrewAI / AutoGen / LangGraph multi-agent logs | agent roles, delegation, message passing |
| `langchain` | LangChain callback traces | chain runs, LLM calls, tool invocations |
| `langfuse` | LangFuse observation exports | generations, spans, scores |
| `react` | ReAct-style Thought/Action/Observation logs | thought, action, action_input, observation |
| `webarena` | WebArena / VisualWebArena trace JSON | actions, screenshots, DOM snapshots, URLs |
| `atif` | Agent Trace Interchange Format (ATIF) | steps, observations, metadata |
| `raw_web` | Raw browser recordings (HAR + screenshots) | requests, responses, screenshots, timings |
| `claude_code` | Claude Code / Aider / coding agents | tool_use blocks, diffs, terminal output |

### Configuration

Specify the converter in your project config:

```yaml
agentic:
  enabled: true
  trace_converter: react
  trace_file: "data/agent_traces.jsonl"
```

Each line in the trace file should be a JSON object containing the raw agent trace. The converter handles the rest.

For **multi-agent** traces where different agents use different frameworks, you can specify per-agent converters:

```yaml
agentic:
  enabled: true
  trace_converter: multi_agent
  trace_file: "data/multi_agent_traces.jsonl"
  multi_agent:
    agent_converters:
      planner: react
      coder: anthropic
      reviewer: openai
```

### Auto-Detection

If you are unsure which converter to use, set `trace_converter: auto`:

```yaml
agentic:
  enabled: true
  trace_converter: auto
  trace_file: "data/traces.jsonl"
```

Potato inspects the first 10 traces and selects the best-matching converter based on field signatures. A warning is logged if confidence is below 80%, in which case you should specify the converter explicitly.

### Custom Converters

If your agent framework is not listed, you can write a Python converter:

```python
# converters/my_converter.py
from potato.agentic.base_converter import BaseTraceConverter

class MyConverter(BaseTraceConverter):
    name = "my_framework"

    def convert(self, raw_trace: dict) -> dict:
        steps = []
        for entry in raw_trace["log"]:
            steps.append({
                "type": entry.get("kind", "action"),
                "content": entry["text"],
                "timestamp": entry.get("ts"),
                "metadata": entry.get("extra", {}),
            })
        return {"steps": steps}
```

Register it in config:

```yaml
agentic:
  trace_converter: custom
  custom_converter: "converters/my_converter.py:MyConverter"
```

---

## Display Types

Once traces are converted, Potato renders them using one of five specialized display types. Each is optimized for a different agent modality.

### 1. Agent Trace Display

The default display for tool-using agents (OpenAI function calling, Anthropic tool_use, ReAct, LangChain, etc.). It renders each step as a card with color-coding by step type.

```yaml
agentic:
  enabled: true
  trace_converter: openai
  display_type: agent_trace

  agent_trace_display:
    # Color coding for step types
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
      system: "#6b7280"

    # Collapsible sections
    collapse_observations: true
    collapse_threshold: 500    # characters before auto-collapsing

    # Step numbering
    show_step_numbers: true
    show_timestamps: true

    # Tool call rendering
    render_json: true          # pretty-print JSON arguments
    syntax_highlight: true     # highlight code in observations
```

**Features:**
- **Step cards** with colored left-border indicating type (thought, action, observation, error)
- **Collapsible sections** for long observations or tool outputs (configurable threshold)
- **JSON pretty-printing** for tool call arguments and structured responses
- **Syntax highlighting** for code blocks in observations
- **Step timeline** sidebar showing the full trace at a glance
- **Jump-to-step** navigation for long traces

### 2. Web Agent Trace Display

For web browsing agents (WebArena, VisualWebArena, raw browser recordings). Renders screenshots with SVG overlays showing where the agent clicked, typed, or scrolled.

```yaml
agentic:
  enabled: true
  trace_converter: webarena
  display_type: web_agent

  web_agent_display:
    # Screenshot rendering
    screenshot_max_width: 900
    screenshot_quality: 85

    # SVG overlay for agent actions
    overlay:
      enabled: true
      click_marker: "circle"       # circle, crosshair, or arrow
      click_color: "#ef4444"
      click_radius: 20
      type_highlight: "#3b82f6"    # highlight for text input fields
      scroll_indicator: true

    # Filmstrip view
    filmstrip:
      enabled: true
      thumbnail_width: 150
      show_action_labels: true

    # DOM snapshot display
    show_dom_snapshot: false        # optional raw DOM view
    show_url_bar: true
    show_action_description: true
```

**Features:**
- **Screenshot gallery** with full-size viewing and zoom
- **SVG overlays** showing click targets (red circles), text input regions (blue highlights), and scroll directions
- **Filmstrip view** at the bottom showing all screenshots as thumbnails for quick navigation
- **Action description** text below each screenshot (e.g., "Click on 'Add to Cart' button")
- **URL bar** showing the current page URL at each step
- **Before/after comparison** for steps that modify page content

### 3. Interactive Chat Display

For evaluating conversational agents and chatbots. Supports two sub-modes: **live chat** where annotators interact with the agent in real time, and **trace review** where annotators evaluate a recorded conversation.

```yaml
agentic:
  enabled: true
  display_type: interactive_chat

  interactive_chat_display:
    mode: trace_review         # or "live_chat"

    # Trace review settings
    trace_review:
      show_system_prompt: false
      show_token_counts: true
      show_latency: true
      message_grouping: turn    # "turn" or "message"

    # Live chat settings (when mode: live_chat)
    live_chat:
      proxy: openai             # agent proxy to use
      max_turns: 20
      timeout_seconds: 60
      show_typing_indicator: true
      allow_regenerate: true

    # Common settings
    show_role_labels: true
    role_colors:
      user: "#3b82f6"
      assistant: "#6E56CF"
      system: "#6b7280"
      tool: "#22c55e"
```

**Trace review mode** renders a recorded conversation with optional token counts and latency per message. Annotators can rate individual turns or the entire conversation.

**Live chat mode** connects annotators to a running agent via the Agent Proxy System (see below). Annotators converse with the agent, then annotate the resulting conversation.

### 4. Coding Trace Display

For coding agent sessions (Claude Code, Aider, SWE-Agent). Renders code diffs with syntax highlighting, terminal output in dark blocks, and file reads with line numbers.

```yaml
agentic:
  enabled: true
  trace_converter: claude_code
  display_type: coding_trace

  coding_trace_display:
    diff_style: unified           # unified or split
    terminal_theme: dark
    show_file_tree: true
    collapse_long_output: true
    collapse_threshold: 50        # lines
    show_line_numbers: true
    syntax_highlight: true
```

**Features:**
- **Unified diff view** with red/green highlighting for edit operations
- **Dark terminal blocks** for bash/shell command output
- **Line-numbered code blocks** for file read operations
- **File tree sidebar** showing all files touched during the session
- **Collapsible long outputs** for verbose terminal or file content

See [Coding Agent Annotation](/docs/features/coding-agent-annotation) for the full reference.

### 5. Live Agent Display

Real-time observation of AI agents with controls for human intervention. Supports web browsing agents and coding agents.

```yaml
agentic:
  enabled: true
  display_type: live_agent
```

**Features:**
- **Real-time streaming** of agent actions via Server-Sent Events
- **Pause/Resume** the agent between steps
- **Send instructions** to redirect the agent mid-task
- **Take over** manual control
- **Rollback** to any previous checkpoint (coding agents use git-based checkpoints)
- **Branch and replay** from any checkpoint with different instructions

See [Live Agent Evaluation](/docs/features/live-agent-evaluation) and [Live Coding Agent](/docs/features/live-coding-agent) for configuration details.

---

## Advanced Annotation Types

Beyond per-turn ratings and pre-built schemas, Potato includes four dedicated annotation types for structured agent evaluation.

### Trajectory Evaluation (`trajectory_eval`)

Per-step error localization with hierarchical error taxonomies and severity scoring. Each step gets a correctness rating, error type, severity level, and optional rationale. A running score tracker decrements based on severity.

```yaml
annotation_schemes:
  - annotation_type: trajectory_eval
    name: step_eval
    error_taxonomy:
      reasoning:
        - logical_error
        - incorrect_assumption
      action:
        - wrong_tool
        - wrong_arguments
        - premature_termination
    severity_weights:
      minor: -1
      major: -5
      critical: -10
```

See [Trajectory Evaluation blog post](/blog/trajectory-evaluation-error-taxonomy) for a complete guide.

### Rubric Evaluation (`rubric_eval`)

MT-Bench-style multi-criteria grid evaluation. Define custom criteria and a rating scale. Annotators rate each criterion independently.

```yaml
annotation_schemes:
  - annotation_type: rubric_eval
    name: agent_rubric
    criteria:
      - name: correctness
        description: "Did the agent produce the correct result?"
      - name: efficiency
        description: "Did the agent take an efficient path?"
      - name: safety
        description: "Did the agent avoid unsafe actions?"
    scale: 5
    scale_labels:
      1: "Very Poor"
      3: "Acceptable"
      5: "Excellent"
```

See [Rubric Evaluation tutorial](/blog/rubric-evaluation-mt-bench-style) for setup instructions.

### Pairwise Comparison

Compare two agent traces side by side with three modes:

- **Binary**: Click to select A or B (with optional tie)
- **Scale**: Slider from "A much better" to "B much better"
- **Multi-dimension**: Independent A/B/tie per dimension with required justification

```yaml
annotation_schemes:
  - annotation_type: pairwise
    name: agent_comparison
    mode: multi_dimension
    dimensions:
      - correctness
      - efficiency
      - safety
    require_justification: true
    allow_tie: true
```

See [Pairwise Comparison guide](/blog/pairwise-agent-comparison-guide) for all three modes.

### Process Reward Annotation

Per-step binary correctness annotation optimized for training process reward models. Two modes: first-error (click first wrong step, rest auto-marked) and per-step (rate each independently).

```yaml
annotation_schemes:
  - annotation_type: process_reward
    name: prm
    mode: first_error    # or per_step
```

See [Process Reward Annotation](/docs/features/process-reward-annotation) for the full reference.

---

## Per-Turn Ratings

For dialogue and multi-step evaluations, you often need ratings on individual turns rather than (or in addition to) the overall trace. Potato supports per-turn annotation for any display type.

```yaml
annotation_schemes:
  # Overall trace rating
  - annotation_type: likert
    name: overall_quality
    description: "Rate the overall quality of this agent trace"
    min: 1
    max: 5
    labels:
      1: "Very Poor"
      5: "Excellent"

  # Per-turn ratings
  - annotation_type: per_turn_rating
    name: step_correctness
    description: "Was this step correct?"
    target: agentic_steps        # binds to trace steps
    rating_type: radio
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
      - "Unnecessary"

  - annotation_type: per_turn_rating
    name: step_explanation
    description: "Explain any issues with this step"
    target: agentic_steps
    rating_type: text
    conditional:
      show_when:
        step_correctness: ["Partially Correct", "Incorrect", "Unnecessary"]
```

Per-turn ratings appear inline next to each step card. The `conditional` block lets you show follow-up questions only when certain ratings are selected, keeping the interface clean.

### Per-Turn Output Format

Per-turn annotations are saved with step indices:

```json
{
  "id": "trace_042",
  "annotations": {
    "overall_quality": 3,
    "step_correctness": {
      "0": "Correct",
      "1": "Correct",
      "2": "Incorrect",
      "3": "Correct"
    },
    "step_explanation": {
      "2": "The agent searched for the wrong product name"
    }
  }
}
```

---

## Agent Proxy System

For live evaluation tasks where annotators interact with an agent in real time, Potato provides an agent proxy layer. The proxy sits between the annotation interface and the agent backend, logging the full conversation for later review.

```yaml
agentic:
  enabled: true
  display_type: interactive_chat

  agent_proxy:
    type: openai                 # openai, http, or echo

    # OpenAI proxy
    openai:
      model: "gpt-4o"
      api_key: ${OPENAI_API_KEY}
      system_prompt: "You are a helpful customer service agent."
      temperature: 0.7
      max_tokens: 1024
```

### Proxy Types

**OpenAI proxy** forwards messages to an OpenAI-compatible API:

```yaml
agent_proxy:
  type: openai
  openai:
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
    system_prompt: "You are a helpful assistant."
    temperature: 0.7
```

**HTTP proxy** forwards messages to any HTTP endpoint (your own agent server):

```yaml
agent_proxy:
  type: http
  http:
    url: "https://my-agent.example.com/chat"
    method: POST
    headers:
      Authorization: "Bearer ${AGENT_API_KEY}"
    request_template:
      messages: "{{messages}}"
      session_id: "{{session_id}}"
    response_path: "response.content"
    timeout_seconds: 30
```

**Echo proxy** mirrors the user's message back (useful for testing and UI development):

```yaml
agent_proxy:
  type: echo
  echo:
    prefix: "[Echo] "
    delay_ms: 500
```

---

## Pre-Built Annotation Schemas

Potato ships 9 annotation schemas designed specifically for agent evaluation. Use them directly or as starting points for your own schemas.

| Schema | Type | Description |
|--------|------|-------------|
| `agent_task_success` | radio | Binary success/failure with partial credit option |
| `agent_step_correctness` | per_turn_rating (radio) | Per-step correct/incorrect/unnecessary ratings |
| `agent_error_taxonomy` | per_turn_rating (multiselect) | 12-category error taxonomy (wrong tool, hallucination, loop, etc.) |
| `agent_safety` | radio + text | Safety violation detection with severity scale |
| `agent_efficiency` | likert | Rate whether the agent used an efficient path |
| `agent_instruction_following` | likert | Rate adherence to the original user instruction |
| `agent_explanation_quality` | likert | Rate quality of agent's reasoning/explanations |
| `agent_web_action_correctness` | per_turn_rating (radio) | Per-step web action evaluation (correct target, correct action type) |
| `agent_conversation_quality` | multirate | Multi-dimensional chat quality (helpfulness, accuracy, tone, safety) |

Load a pre-built schema by name:

```yaml
annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_error_taxonomy
```

Or combine presets with custom schemas:

```yaml
annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness

  # Custom schema alongside presets
  - annotation_type: text
    name: evaluator_notes
    description: "Any additional observations about this agent trace"
    label_requirement:
      required: false
```

---

## Full Example: Evaluating a ReAct Agent

Here is a complete configuration for evaluating ReAct-style agent traces with per-step ratings:

```yaml
# project config
task_name: "ReAct Agent Evaluation"
task_dir: "."

data_files:
  - "data/react_traces.jsonl"

item_properties:
  id_key: trace_id
  text_key: task_description

agentic:
  enabled: true
  trace_converter: react
  display_type: agent_trace

  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    collapse_threshold: 300
    show_step_numbers: true
    render_json: true

annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_efficiency

  - annotation_type: text
    name: failure_reason
    description: "If the agent failed, describe what went wrong"
    label_requirement:
      required: false

output_annotation_dir: "output/"
output_annotation_format: "jsonl"
```

Sample input data (`data/react_traces.jsonl`):

```json
{
  "trace_id": "react_001",
  "task_description": "Find the population of Tokyo and compare it to New York City",
  "trace": [
    {"type": "thought", "content": "I need to find the population of both cities. Let me search for Tokyo first."},
    {"type": "action", "content": "search", "action_input": "Tokyo population 2024"},
    {"type": "observation", "content": "Tokyo has a population of approximately 13.96 million in the city proper..."},
    {"type": "thought", "content": "Now I need to find New York City's population."},
    {"type": "action", "content": "search", "action_input": "New York City population 2024"},
    {"type": "observation", "content": "New York City has a population of approximately 8.34 million..."},
    {"type": "thought", "content": "Tokyo (13.96M) has about 67% more people than NYC (8.34M)."},
    {"type": "action", "content": "finish", "action_input": "Tokyo has ~13.96 million people vs NYC's ~8.34 million, making Tokyo about 67% larger by population."}
  ]
}
```

Start the server:

```bash
potato start config.yaml -p 8000
```

---

## Further Reading

- [Coding Agent Annotation](/docs/features/coding-agent-annotation) — diff rendering, terminal output, file tree for coding agents
- [Process Reward Annotation](/docs/features/process-reward-annotation) — PRM training data with first-error and per-step modes
- [Code Review Annotation](/docs/features/code-review-annotation) — GitHub PR-style inline comments and verdicts
- [Live Coding Agent](/docs/features/live-coding-agent) — real-time coding agent observation with rollback and branching
- [Live Agent Evaluation](/docs/features/live-agent-evaluation) — real-time web agent observation
- [Web Agent Annotation](/docs/features/web-agent-annotation) — review pre-recorded web agent traces
- [Evaluating AI Agents: A Complete Guide](/blog/evaluating-ai-agents-with-potato) — walkthrough of a full agent evaluation project
- [Export Formats](/docs/features/export-formats) — export agent evaluation data

For implementation details, see the [source documentation](https://github.com/davidjurgens/potato/blob/main/docs/agentic_annotation.md).
