# Three-Pane Trace Evaluation (eval_trace)

Source: https://www.potatoannotator.com/docs/agent-evaluation/eval-trace

**The `eval_trace` display splits a single agent trace into three synchronized panes: Reasoning, Function Calls, and Final Answer.** An evaluator sees what the agent thought, what it did, and what it produced side by side, which suits continuous evaluation where new traces arrive and must be judged quickly.

Unlike a vertical agent-trace display that stacks an interleaved trace in one column, `eval_trace` decomposes one trace into its three semantic components, so the structure of the agent's behavior is visible without scrolling.

![Three-pane eval_trace display](/images/docs/eval-trace.png "An agent trace split into Reasoning, Function Calls, and Final Answer panes")

## Quick start

Run the included example from the repository root:

```bash
python potato/flask_server.py start examples/agent-traces/continuous-eval/config.yaml -p 8000
```

The same example directory also includes a directory-watch variant (`config-watch.yaml`) that ingests trace files dropped into a watched directory.

## Configuration

```yaml
instance_display:
  layout:
    direction: vertical      # task header above the (internally horizontal) panes
    gap: 12px
  fields:
    - key: task_description
      type: text
      label: "Task"

    - key: trace             # the field holding the agent trace
      type: eval_trace
      label: "Agent Trace"
      display_options:
        pane_labels: ["Reasoning", "Function Calls", "Final Answer"]
        show_step_numbers: true
        collapse_long_outputs: true
        max_output_lines: 12   # overrides the default of 20
        link_steps: true
```

### Options

| Option | Default | Description |
|--------|---------|-------------|
| `pane_labels` | `["Reasoning", "Function Calls", "Final Answer"]` | Headers for the three panes. |
| `show_step_numbers` | `true` | Show `#N` step numbers on reasoning and call cards. |
| `collapse_long_outputs` | `true` | Collapse tool results longer than `max_output_lines` into an expandable block. |
| `max_output_lines` | `20` | Line threshold for collapsing results. |
| `link_steps` | `true` | Cross-pane highlighting: clicking a card highlights the linked cards in the other panes. |
| `compact` | `false` | Tighter padding and spacing. |
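
The interaction between `collapse_long_outputs` and `max_output_lines` can be sketched as a small predicate. This is an illustration only, not Potato's implementation: the function name `should_collapse` is hypothetical, and it assumes "longer than" means strictly more lines than the threshold.

```python
def should_collapse(output: str,
                    collapse_long_outputs: bool = True,
                    max_output_lines: int = 20) -> bool:
    """Return True if a tool result would be shown as an expandable block.

    Assumption: a result collapses only when it has strictly more lines
    than max_output_lines and collapsing is enabled.
    """
    if not collapse_long_outputs:
        return False
    return len(output.splitlines()) > max_output_lines


short_result = "\n".join(f"row {i}" for i in range(20))  # exactly 20 lines
long_result = "\n".join(f"row {i}" for i in range(21))   # 21 lines
```

Under that assumption, `short_result` stays expanded with the default threshold, while `long_result` collapses unless `collapse_long_outputs` is `false`.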

## Data format

`eval_trace` accepts the same trace formats as the agent-trace display. The most common is a list of `{speaker, text}` steps:

```json
{
  "id": "eval_001",
  "task_description": "Find a vegan lasagna recipe.",
  "trace": [
    {"speaker": "Agent (Thought)",      "text": "I'll search for a highly-rated recipe."},
    {"speaker": "Agent (Action)",       "text": "web_search(query='vegan lasagna')"},
    {"speaker": "Environment",          "text": "10 results found..."},
    {"speaker": "Agent (Final Answer)", "text": "Here's a great recipe: ..."}
  ]
}
```

The `thought/action/observation` and `step_type/content` formats are also supported.
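
If traces are produced by your own agent code, a record in the `{speaker, text}` format can be assembled programmatically. The sketch below uses only the field names from the JSON example above; the helper names `make_record` and `append_jsonl` are hypothetical and not part of Potato.

```python
import json
from pathlib import Path


def make_record(instance_id, task, steps):
    """Build one instance; `steps` is a list of (speaker, text) pairs."""
    return {
        "id": instance_id,
        "task_description": task,
        "trace": [{"speaker": speaker, "text": text} for speaker, text in steps],
    }


def append_jsonl(path, record):
    """Append the record as a single JSON line to `path`."""
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


record = make_record(
    "eval_001",
    "Find a vegan lasagna recipe.",
    [
        ("Agent (Thought)", "I'll search for a highly-rated recipe."),
        ("Agent (Action)", "web_search(query='vegan lasagna')"),
        ("Environment", "10 results found..."),
        ("Agent (Final Answer)", "Here's a great recipe: ..."),
    ],
)
```

Calling `append_jsonl("traces.jsonl", record)` would then add the instance to a data file, one JSON object per line.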

### How steps map to panes

| Step (type inferred from speaker or label) | Pane |
|---|---|
| `Thought`, reasoning, planning, `system` | **Reasoning** |
| `Action`, tool, function, call | **Function Calls** (the adjacent `Environment` or result step is nested under its call) |
| `Final Answer`, `send_message`, `respond`, `finish`; if no step matches, the last action is used | **Final Answer** |

To set an explicit final answer, end the trace with either a step whose speaker matches an answer pattern (such as `"Agent (Final Answer)"`) or an action step that calls `send_message(...)`.
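
The mapping in the table can be approximated in a few lines of Python. This is a rough sketch for `{speaker, text}` traces, not Potato's actual classifier: the regular expressions, the function name `assign_panes`, and the choice to send unmatched steps (such as `Environment`) to the Function Calls pane are all assumptions made for illustration.

```python
import re

# Hypothetical patterns derived from the table above.
ANSWER_SPEAKER = re.compile(r"final answer", re.I)
CALL_SPEAKER = re.compile(r"action|tool|function|call", re.I)
REASONING_SPEAKER = re.compile(r"thought|reason|plan|system", re.I)
ANSWER_CALL = re.compile(r"\s*(send_message|respond|finish)\b")


def assign_panes(trace):
    """Return one pane name per step of a {speaker, text} trace."""
    panes = []
    last_action = None  # index of the most recent ordinary call
    for i, step in enumerate(trace):
        speaker = step.get("speaker", "")
        text = step.get("text", "")
        if ANSWER_SPEAKER.search(speaker):
            panes.append("Final Answer")
        elif CALL_SPEAKER.search(speaker):
            if ANSWER_CALL.match(text):
                panes.append("Final Answer")  # e.g. send_message(...)
            else:
                panes.append("Function Calls")
                last_action = i
        elif REASONING_SPEAKER.search(speaker):
            panes.append("Reasoning")
        else:
            # Results such as "Environment" nest under the preceding call.
            panes.append("Function Calls")
    # Fallback: with no explicit answer, the last action is the final answer.
    if "Final Answer" not in panes and last_action is not None:
        panes[last_action] = "Final Answer"
    return panes


example = [
    {"speaker": "Agent (Thought)", "text": "I'll search for a highly-rated recipe."},
    {"speaker": "Agent (Action)", "text": "web_search(query='vegan lasagna')"},
    {"speaker": "Environment", "text": "10 results found..."},
    {"speaker": "Agent (Final Answer)", "text": "Here's a great recipe: ..."},
]
print(assign_panes(example))
```

Running a check like this over your own traces before loading them is one way to confirm that the last step will be picked up as the final answer rather than falling back to the last action.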

### Step linking

Steps are grouped into logical cycles: a thought plus the calls it triggers share a step index. With `link_steps: true`, clicking any card highlights every card sharing that index across the panes, so you can trace a thought to the action it produced.
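
One way to picture the grouping is to number the steps so that each thought opens a new cycle. The sketch below is an assumption about how cycles could be formed, not a description of Potato's internals; the step kinds (`"thought"`, `"call"`, `"result"`, `"answer"`) and the function names are hypothetical.

```python
def step_indices(kinds):
    """Assign a 1-based cycle index to each step.

    Assumption: every "thought" starts a new cycle, and the steps that
    follow it share its index until the next thought.
    """
    indices = []
    current = 0
    for kind in kinds:
        if kind == "thought" or current == 0:
            current += 1
        indices.append(current)
    return indices


def linked_steps(indices, clicked):
    """Positions highlighted when the card at position `clicked` is clicked."""
    return [i for i, idx in enumerate(indices) if idx == indices[clicked]]


kinds = ["thought", "call", "result", "thought", "call", "call", "answer"]
indices = step_indices(kinds)
```

Here the first thought, its call, and the call's result form cycle 1, and everything from the second thought onward forms cycle 2, so clicking either call in cycle 2 would highlight the second thought as well.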

## Continuous evaluation

Pair `eval_trace` with any of Potato's runtime ingestion transports so traces are evaluated as they arrive:

- **Webhook and SSE** — `trace_ingestion: {enabled: true}` exposes a webhook endpoint and streams new traces to annotators.
- **Langfuse polling** — add a `langfuse` source under `trace_ingestion.sources`.
- **Directory watch** — `data_directory` plus `watch_data_directory: true` ingests dropped `.json` and `.jsonl` files.

Traces added at runtime can be assigned to annotators immediately. Combine this with the [Triage Queue](/docs/agent-evaluation/triage-queue) to push errored or low-scoring traces to the front.
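
For the directory-watch transport, a producer only needs to write trace files into the watched directory. The sketch below writes each record to a temporary file and then renames it, so a watcher never sees a half-written `.jsonl` file. The helper name `drop_trace` and the `.tmp` suffix are hypothetical, and the sketch assumes the watcher ignores files that do not end in `.json` or `.jsonl`.

```python
import json
import os
import tempfile
from pathlib import Path


def drop_trace(watch_dir, record):
    """Write `record` as a one-line .jsonl file inside `watch_dir`."""
    watch_dir = Path(watch_dir)
    watch_dir.mkdir(parents=True, exist_ok=True)
    final_path = watch_dir / f"{record['id']}.jsonl"

    # Write to a temp file in the same directory, then rename atomically.
    fd, tmp_path = tempfile.mkstemp(dir=watch_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    os.replace(tmp_path, final_path)
    return final_path
```

An agent runner could call `drop_trace(data_directory, record)` after each finished run, where `data_directory` is the same path configured for the watcher.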

## Notes and limitations

- `eval_trace` is display-only; it collects no annotations itself. Pair it with annotation schemes such as `reasoning_quality`, `tool_use_correctness`, or `answer_helpfulness`, as the included example does.
- Span annotation is not supported on `eval_trace`. Use an agent-trace or `code` display if you need span highlighting on trace text.

## Related

- [Signal-Based Triage Queue](/docs/agent-evaluation/triage-queue) — prioritize which incoming traces get reviewed
- [Trajectory Editing for SFT/DPO](/docs/agent-evaluation/trajectory-correction) — turn corrections into training data
- [Agentic Annotation](/docs/features/agentic-annotation) — display-field configuration and ingestion

For implementation details, see the [source documentation](https://github.com/davidjurgens/potato/blob/main/docs/agent-evaluation/eval_trace.md).
