Three-Pane Trace Evaluation (eval_trace)

The eval_trace display splits a single agent trace into three synchronized panes: Reasoning, Function Calls, and Final Answer. An evaluator sees what the agent thought, what it did, and what it produced side by side, which suits continuous evaluation where new traces arrive and must be judged quickly.

Unlike a vertical agent-trace display that stacks an interleaved trace in one column, eval_trace decomposes one trace into its three semantic components, so the structure of the agent's behavior is visible without scrolling.

Figure: Three-pane eval_trace display. An agent trace split into Reasoning, Function Calls, and Final Answer panes.

Quick start

Run the included example from the repository root:

bash
python potato/flask_server.py start examples/agent-traces/continuous-eval/config.yaml -p 8000

The example also ships a directory-watch variant (config-watch.yaml) for ingesting dropped trace files.

Configuration

yaml
instance_display:
  layout:
    direction: vertical      # task header above the (internally horizontal) panes
    gap: 12px
  fields:
    - key: task_description
      type: text
      label: "Task"
 
    - key: trace             # the field holding the agent trace
      type: eval_trace
      label: "Agent Trace"
      display_options:
        pane_labels: ["Reasoning", "Function Calls", "Final Answer"]
        show_step_numbers: true
        collapse_long_outputs: true
        max_output_lines: 12
        link_steps: true

Options

| Option | Default | Description |
| --- | --- | --- |
| pane_labels | ["Reasoning", "Function Calls", "Final Answer"] | Headers for the three panes. |
| show_step_numbers | true | Show #N step numbers on reasoning and call cards. |
| collapse_long_outputs | true | Collapse tool results longer than max_output_lines into an expandable block. |
| max_output_lines | 20 | Line threshold for collapsing results. |
| link_steps | true | Cross-pane highlighting: clicking a card highlights the linked cards in the other panes. |
| compact | false | Tighter padding and spacing. |
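
The interaction between collapse_long_outputs and max_output_lines can be illustrated with a short sketch. This is not Potato's rendering code: the function name, the returned dictionary, and the choice to keep the first max_output_lines lines visible are all assumptions made for illustration.

```python
def collapse_output(text, max_output_lines=20, collapse_long_outputs=True):
    """Split a tool result into a visible part and a hidden (expandable) part.

    Hypothetical helper: results with more than `max_output_lines` lines are
    collapsed; shorter results, or any result when collapsing is off, are
    returned whole.
    """
    lines = text.splitlines()
    if not collapse_long_outputs or len(lines) <= max_output_lines:
        return {"visible": text, "hidden": "", "collapsed": False}
    return {
        "visible": "\n".join(lines[:max_output_lines]),
        "hidden": "\n".join(lines[max_output_lines:]),
        "collapsed": True,
    }


long_result = "\n".join(f"row {i}" for i in range(25))
block = collapse_output(long_result, max_output_lines=12)
```

With the example configuration's threshold of 12, a 25-line result would keep 12 lines visible and fold the remaining 13 into the expandable part.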

Data format

eval_trace accepts the same trace formats as the agent-trace display. The most common is a list of {speaker, text} steps:

json
{
  "id": "eval_001",
  "task_description": "Find a vegan lasagna recipe.",
  "trace": [
    {"speaker": "Agent (Thought)",      "text": "I'll search for a highly-rated recipe."},
    {"speaker": "Agent (Action)",       "text": "web_search(query='vegan lasagna')"},
    {"speaker": "Environment",          "text": "10 results found..."},
    {"speaker": "Agent (Final Answer)", "text": "Here's a great recipe: ..."}
  ]
}

The thought/action/observation and step_type/content formats are also supported.
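
If traces come from your own agent code, a small helper can produce instances in the {speaker, text} shape shown above. The sketch below is illustrative only: make_instance and to_jsonl are hypothetical names, and the validation rules are assumptions, not checks that Potato is known to perform.

```python
import json


def make_instance(instance_id, task, steps):
    """Build one instance dict from an iterable of (speaker, text) pairs."""
    trace = []
    for speaker, text in steps:
        # Assumed sanity checks: every step needs a speaker and string text.
        if not speaker or not isinstance(text, str):
            raise ValueError(f"bad step: {(speaker, text)!r}")
        trace.append({"speaker": speaker, "text": text})
    if not trace:
        raise ValueError("a trace needs at least one step")
    return {"id": instance_id, "task_description": task, "trace": trace}


def to_jsonl(instances):
    """Serialize instances as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(inst, ensure_ascii=False) + "\n" for inst in instances)


example = make_instance(
    "eval_001",
    "Find a vegan lasagna recipe.",
    [
        ("Agent (Thought)", "I'll search for a highly-rated recipe."),
        ("Agent (Action)", "web_search(query='vegan lasagna')"),
        ("Environment", "10 results found..."),
        ("Agent (Final Answer)", "Here's a great recipe: ..."),
    ],
)
```

Calling to_jsonl([example]) yields a single line that can be appended to a .jsonl data file.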

How steps map to panes

| Step (type inferred from speaker or label) | Pane |
| --- | --- |
| Thought, reasoning, planning, system | Reasoning |
| Action, tool, function, call | Function Calls (the adjacent Environment/result nests under the call) |
| Final Answer, send_message, respond, finish, or the last action if none match | Final Answer |

To set an explicit final answer, end the trace with a step whose speaker matches an answer pattern (such as "Agent (Final Answer)") or a send_message(...) action.
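
The mapping rules above can be approximated in a few lines. The sketch below is a rough model and not Potato's implementation: the regular expressions, the fallback of unknown speakers to Reasoning, and the decision to move (not copy) the last call into the Final Answer pane are assumptions.

```python
import re

# Keyword patterns loosely based on the mapping table (assumed, not exhaustive).
ANSWER_SPEAKER = re.compile(r"final answer|respond|finish", re.I)
ANSWER_CALL = re.compile(r"(send_message|respond|finish)\s*\(", re.I)
RESULT = re.compile(r"environment|observation|result", re.I)
CALL = re.compile(r"action|tool|function|call", re.I)


def classify(step):
    """Return 'final_answer', 'result', 'call', or 'reasoning' for one step."""
    speaker, text = step["speaker"], step["text"]
    if ANSWER_SPEAKER.search(speaker) or ANSWER_CALL.match(text.strip()):
        return "final_answer"
    if RESULT.search(speaker):
        return "result"
    if CALL.search(speaker):
        return "call"
    return "reasoning"  # assumption: anything else is shown as reasoning


def split_panes(trace):
    """Distribute steps across the three panes."""
    panes = {"reasoning": [], "calls": [], "final_answer": None}
    for step in trace:
        kind = classify(step)
        if kind == "final_answer":
            panes["final_answer"] = step["text"]
        elif kind == "call":
            panes["calls"].append({"call": step["text"], "result": None})
        elif kind == "result":
            if panes["calls"]:  # nest the result under the preceding call
                panes["calls"][-1]["result"] = step["text"]
        else:
            panes["reasoning"].append(step["text"])
    # Fallback from the table: with no explicit answer, use the last action.
    if panes["final_answer"] is None and panes["calls"]:
        panes["final_answer"] = panes["calls"].pop()["call"]
    return panes


demo = split_panes([
    {"speaker": "Agent (Thought)", "text": "I'll search for a highly-rated recipe."},
    {"speaker": "Agent (Action)", "text": "web_search(query='vegan lasagna')"},
    {"speaker": "Environment", "text": "10 results found..."},
    {"speaker": "Agent (Final Answer)", "text": "Here's a great recipe: ..."},
])
```

For the example trace, demo holds one reasoning entry, one call with its nested result, and the explicit final answer.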

Step linking

Steps are grouped into logical cycles: a thought plus the calls it triggers share a step index. With link_steps: true, clicking any card highlights every card sharing that index across the panes, so you can trace a thought to the action it produced.
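
One way to picture the grouping is to assign each step a cycle index and then look up every step that shares the clicked card's index. The sketch below assumes a new cycle begins at each reasoning step that follows a non-reasoning step; the real grouping rule may differ, and both function names are hypothetical.

```python
def assign_step_indices(kinds):
    """Map a list of step kinds to 1-based cycle indices.

    `kinds` holds strings such as 'reasoning', 'call', 'result'.
    Assumed rule: a reasoning step that follows a non-reasoning step
    (or opens the trace) starts a new cycle.
    """
    indices, current, prev = [], 0, None
    for kind in kinds:
        starts_cycle = kind == "reasoning" and prev != "reasoning"
        if starts_cycle or current == 0:
            current += 1
        indices.append(current)
        prev = kind
    return indices


def linked_cards(indices, clicked):
    """Return positions of all steps sharing the clicked step's index."""
    return [i for i, idx in enumerate(indices) if idx == indices[clicked]]


kinds = ["reasoning", "call", "result", "reasoning", "call", "call"]
indices = assign_step_indices(kinds)
highlighted = linked_cards(indices, clicked=4)
```

Here the second thought and the two calls it triggers share one index, so clicking either call highlights all three cards.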

Continuous evaluation

Pair eval_trace with any of Potato's runtime ingestion transports so traces are evaluated as they arrive:

  • Webhook and SSE: set trace_ingestion: {enabled: true} to expose a webhook endpoint and stream new traces to annotators.
  • Langfuse polling: add a langfuse source under trace_ingestion.sources.
  • Directory watch: data_directory plus watch_data_directory: true ingests dropped .json and .jsonl files.

Runtime-added traces are immediately assignable to annotators. Combine this with the Triage Queue to push errored or low-scoring traces to the front.
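
For the directory-watch transport, a producer only needs to place a file in the watched directory. The sketch below writes to a temporary file first and renames it into place, so a watcher never sees a half-written file. It assumes the watcher ignores files that do not end in .json or .jsonl; the drop_trace name and the one-file-per-instance naming scheme are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def drop_trace(instance, watch_dir):
    """Atomically write one instance as a single-line .jsonl file."""
    watch_dir = Path(watch_dir)
    watch_dir.mkdir(parents=True, exist_ok=True)
    final_path = watch_dir / f"{instance['id']}.jsonl"

    # Write under a .tmp name first (assumed to be ignored by the watcher).
    fd, tmp_name = tempfile.mkstemp(dir=watch_dir, prefix=".", suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(json.dumps(instance, ensure_ascii=False) + "\n")

    # os.replace is atomic when source and target are on the same filesystem.
    os.replace(tmp_name, final_path)
    return final_path
```

A call such as drop_trace(instance, "path/to/watched/dir") would use whatever directory your config's data_directory points at; the path shown here is a placeholder.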

Notes and limitations

  • eval_trace is display-only; it collects no annotations itself. Pair it with annotation schemes such as reasoning_quality, tool_use_correctness, or answer_helpfulness, as in the example.
  • Span annotation is not supported on eval_trace. Use an agent-trace or code display if you need span highlighting on trace text.

For implementation details, see the source documentation.