
Debugging Multi-Agent Failures: A Walkthrough

How to find why a multi-agent LLM system failed using Potato: the interaction graph, failure attribution, handoff review, per-agent scorecards, tool-contention timeline, and emergent-behavior tagging.

Potato Team

When a team of agents fails, the hard part is not noticing the failure. It is finding which agent caused it, at which step, and whether the real problem was a bad handoff between two agents that were each fine on their own. This walkthrough covers the six Potato surfaces built for that, in the order you would actually use them on a broken run. Everything here is configured in YAML and runs on your own server; the full schema reference is Multi-Agent Team Evaluation.

A multi-agent system is several LLM agents with distinct roles — a planner, a coder, a reviewer — passing messages and handing off control. Research on why these systems break, the MAST taxonomy (Why Do Multi-Agent LLM Systems Fail?), found that most failures are inter-agent: a constraint dropped at a handoff, a team that never verifies its own work, agents talking past each other. A flat chat transcript hides exactly those, because the thing that went wrong lives in the space between two messages, not inside either one.

Figure: Where a multi-agent run actually fails: the interaction graph and the attribution triple. The failure is between agents, at a handoff, not inside one transcript.

How do I see the structure of a multi-agent run?

Start with the shape of the run, not the text. The agent_interaction_graph schema renders the whole run as a directed graph: nodes are agents, edges are the handoffs between them, thicker edges meaning more traffic. You click a node to mark it on the critical path and click an edge to cycle it from normal to critical to problematic.

Figure: A clickable agent-interaction graph with the critical path and a flagged handoff. Mark the critical path and flag problematic handoffs.

```yaml
annotation_schemes:
  - annotation_type: agent_interaction_graph
    name: graph
    description: "Mark the critical path and flag any problematic handoffs."
    steps_key: steps
    agent_key: agent
```

The graph is laid out automatically from the trace, so you do not draw anything. Every node and edge is keyboard-focusable and a text summary lists the critical nodes and flagged edges, so the meaning never rests on color alone. This view is the fastest way to answer "what talked to what, and where did the path go sideways."
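To make "laid out automatically from the trace" concrete, here is a minimal Python sketch of how nodes and weighted edges could be derived. It assumes a trace is a dictionary with a `steps` list whose entries carry an `agent` field, mirroring `steps_key` and `agent_key` above, and that edge weight is a plain count of handoffs. The function name and the sample trace are hypothetical; Potato's own layout code may work differently.

```python
from collections import Counter


def build_interaction_graph(steps, agent_key="agent"):
    """Return (nodes, edges): agents in first-seen order and handoff counts."""
    nodes = []
    edges = Counter()
    previous = None
    for step in steps:
        agent = step[agent_key]
        if agent not in nodes:
            nodes.append(agent)
        # Record an edge only when control moves to a different agent.
        if previous is not None and agent != previous:
            edges[(previous, agent)] += 1
        previous = agent
    return nodes, dict(edges)


# Hypothetical trace; only the "steps" and "agent" names come from the config.
trace = {
    "steps": [
        {"agent": "planner", "text": "Plan: add input validation."},
        {"agent": "coder", "text": "Implemented the handler."},
        {"agent": "reviewer", "text": "Validation is missing."},
        {"agent": "coder", "text": "Added validation."},
        {"agent": "reviewer", "text": "Approved."},
    ]
}

nodes, edges = build_interaction_graph(trace["steps"])
print(nodes)
print(edges)
```

In this sample the coder-to-reviewer edge carries a count of 2, which is the kind of edge a renderer would draw thicker.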

How do I attribute a multi-agent failure to one agent?

Once you can see the run, pin the failure down. The failure_attribution schema asks for the triple from the failure-attribution literature (Zhang et al., Which Agent Causes Task Failures and When?, ICML 2025, the Who&When dataset): the responsible agent, the decisive step, and the reason. The agent dropdown and step picker are populated from the trace's own turns, so you can only attribute the failure to an agent and a step that actually happened.

Figure: Attributing a multi-agent failure to an agent, a step, and a reason. Attribute the failure to the responsible agent, the decisive step, and why.

```yaml
annotation_schemes:
  - annotation_type: radio
    name: outcome
    description: "Did the system succeed?"
    labels: [success, failure]
  - annotation_type: failure_attribution
    name: attribution
    description: "If it failed: which agent, which step, and why?"
    steps_key: steps
    agent_key: agent
```

Pairing attribution with a success/failure radio means the triple is only collected on runs that failed, which keeps the annotator's time on the cases that carry signal.
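If you post-process exported annotations, the same two rules (attribution only on failed runs, and only to an agent and step that exist in the trace) can be checked in a few lines. This is a sketch under assumptions: the `Attribution` record, the zero-based step index, and the `validate_attribution` helper are hypothetical names, not part of Potato's export format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    agent: str   # responsible agent
    step: int    # zero-based index of the decisive step (assumed convention)
    reason: str  # free-text explanation


def validate_attribution(steps, outcome, attribution, agent_key="agent"):
    """Return a list of problems; an empty list means the record is consistent."""
    if outcome == "success":
        if attribution is None:
            return []
        return ["attribution given for a successful run"]
    if attribution is None:
        return ["failed run has no attribution"]
    problems = []
    if attribution.agent not in {step[agent_key] for step in steps}:
        problems.append("agent is not in the trace")
    if not 0 <= attribution.step < len(steps):
        problems.append("step index is outside the trace")
    if not attribution.reason.strip():
        problems.append("reason is empty")
    return problems


steps = [{"agent": "planner"}, {"agent": "coder"}, {"agent": "reviewer"}]
ok = validate_attribution(
    steps, "failure", Attribution("coder", 1, "Ignored the validation requirement.")
)
bad = validate_attribution(steps, "failure", Attribution("tester", 7, " "))
print(ok)
print(bad)
```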

What about the handoffs themselves?

Attribution names one decisive step. Handoff review looks at every control transfer. Wherever the acting agent changes between consecutive turns, Potato emits a handoff card A → B, and you flag what went wrong in the pass — information loss, a dropped constraint, garbling, goal drift — and rate the quality. The failure modes come from MAST's inter-agent category and the "echoing" phenomenon (Zhang et al., 2025).

Figure: Handoff cards with misalignment flags and a quality rating. Flag inter-agent misalignment on every handoff and rate its quality.

```yaml
annotation_schemes:
  - annotation_type: handoff_review
    name: handoffs
    description: "For each handoff: flag any misalignment and rate the quality."
    steps_key: steps
    agent_key: agent
    flags: [info_loss, dropped_constraint, garbling, goal_drift]
    quality_scale: 5
```

Handoffs are derived at render time, so there is no manual setup. This is usually where the "each agent looked fine, the team still failed" cases resolve: the constraint was alive in agent A and gone by agent B.
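The derivation rule is simple enough to sketch: one card wherever the acting agent changes between consecutive turns. The Python below is an illustration, not Potato's implementation. The card fields and the `review_handoff` helper are hypothetical, and it assumes `quality_scale: 5` means ratings from 1 to 5.

```python
ALLOWED_FLAGS = {"info_loss", "dropped_constraint", "garbling", "goal_drift"}


def derive_handoffs(steps, agent_key="agent"):
    """Return one card per change of acting agent between consecutive turns."""
    cards = []
    for index in range(1, len(steps)):
        sender = steps[index - 1][agent_key]
        receiver = steps[index][agent_key]
        if sender != receiver:
            cards.append({
                "from_agent": sender,
                "to_agent": receiver,
                "from_step": index - 1,
                "to_step": index,
                "label": f"{sender} → {receiver}",
            })
    return cards


def review_handoff(card, flags, quality, scale=5):
    """Attach validated flags and a quality rating to a handoff card."""
    unknown = set(flags) - ALLOWED_FLAGS
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    if not 1 <= quality <= scale:
        raise ValueError(f"quality must be between 1 and {scale}")
    return {**card, "flags": sorted(flags), "quality": quality}


steps = [
    {"agent": "planner"}, {"agent": "planner"}, {"agent": "coder"},
    {"agent": "reviewer"}, {"agent": "reviewer"}, {"agent": "coder"},
]
cards = derive_handoffs(steps)
review = review_handoff(cards[0], {"dropped_constraint"}, quality=2)
print([card["label"] for card in cards])
print(review)
```

Consecutive turns by the same agent produce no card, so six turns here yield three handoffs.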

How do I score the agents and the team?

A failure tells you what broke once. A scorecard tells you whether a design is good across many runs. The agent_scorecard schema scores two levels at once (MultiAgentBench, Zhu et al., ACL 2025): each agent on role fidelity, contribution, and coordination, and the team on its own shared dimensions, with optional milestones. Agent rows come from the trace, so the matrix matches who actually participated.

Figure: Per-agent and per-team scorecard with milestones. Score each agent on role fidelity, contribution, and coordination, plus the team.

```yaml
annotation_schemes:
  - annotation_type: agent_scorecard
    name: scorecard
    description: "Score each agent, the team, and which milestones were reached."
    steps_key: steps
    agent_key: agent
    scale: 5
    agent_dimensions: [role fidelity, contribution, coordination]
    team_dimensions: [coordination, communication, efficiency]
    milestones: [plan produced, task delegated correctly, result verified]
```

A strong agent stuck inside a poorly coordinated team shows up here as a high agent row next to low team dimensions. That is the pattern to look for when you compare sequential, hierarchical, and group-chat orchestration on the same tasks.
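Across many runs, that comparison comes down to averaging the two levels separately. The sketch below assumes each annotated run is a dictionary with per-agent and per-team scores on the 1 to 5 scale. The record layout and the scores are made up for illustration and are not Potato's export format.

```python
from statistics import mean

# Hypothetical scorecard annotations for two runs of the same design.
runs = [
    {
        "agents": {
            "planner": {"role fidelity": 5, "contribution": 4, "coordination": 4},
            "coder": {"role fidelity": 3, "contribution": 3, "coordination": 2},
        },
        "team": {"coordination": 2, "communication": 2, "efficiency": 3},
    },
    {
        "agents": {
            "planner": {"role fidelity": 5, "contribution": 5, "coordination": 4},
            "coder": {"role fidelity": 4, "contribution": 3, "coordination": 3},
        },
        "team": {"coordination": 3, "communication": 2, "efficiency": 3},
    },
]


def agent_means(runs):
    """Average each agent's per-run mean score across all runs."""
    per_agent = {}
    for run in runs:
        for agent, scores in run["agents"].items():
            per_agent.setdefault(agent, []).append(mean(list(scores.values())))
    return {agent: mean(values) for agent, values in per_agent.items()}


def team_means(runs):
    """Average each team dimension across all runs."""
    per_dimension = {}
    for run in runs:
        for dimension, score in run["team"].items():
            per_dimension.setdefault(dimension, []).append(score)
    return {dim: mean(values) for dim, values in per_dimension.items()}


print(agent_means(runs))
print(team_means(runs))
```

Here the planner averages 4.5 while team coordination averages 2.5, the strong-agent, weak-team pattern described above.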

What about concurrency and collective failures?

Two more surfaces catch failures a turn-by-turn read cannot. The tool_contention timeline puts each agent on its own lane and highlights regions where two calls touch the same resource at overlapping times, which you classify as deadlock, circular wait, race condition, or benign (DPBench, 2026).

Figure: Per-agent tool-call timeline with a highlighted contention region. Spot deadlocks and race conditions on a per-agent tool-call timeline.
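The overlap check behind a contention region is ordinary interval intersection grouped by resource. A rough sketch, assuming each call record has `agent` and `resource` fields (matching `agent_key` and `resource_key` in the config) plus hypothetical `start` and `end` timestamps, and assuming only calls from different agents count as contention:

```python
from itertools import combinations


def find_contention(calls, agent_key="agent", resource_key="resource"):
    """Return overlap regions for calls by different agents on one resource."""
    regions = []
    for a, b in combinations(calls, 2):
        if a[agent_key] == b[agent_key] or a[resource_key] != b[resource_key]:
            continue
        start = max(a["start"], b["start"])
        end = min(a["end"], b["end"])
        if start < end:  # strict comparison: back-to-back calls do not overlap
            regions.append({
                "resource": a[resource_key],
                "agents": (a[agent_key], b[agent_key]),
                "start": start,
                "end": end,
            })
    return regions


# Hypothetical tool calls; "start" and "end" are assumed field names.
calls = [
    {"agent": "coder", "resource": "repo/main.py", "start": 0.0, "end": 4.0},
    {"agent": "planner", "resource": "db", "start": 1.0, "end": 2.0},
    {"agent": "coder", "resource": "db", "start": 2.0, "end": 3.0},
    {"agent": "reviewer", "resource": "repo/main.py", "start": 3.0, "end": 5.0},
]

regions = find_contention(calls)
print(regions)
```

The check only finds candidate regions. Deciding whether one is a deadlock, a race condition, or benign is the human judgment the schema collects.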

The second surface, emergent_behavior, handles failures that are collective rather than located at one step: collusion, groupthink, cascading errors, role drift. An emergent behavior is not a contiguous span; it is a set of participating turns, possibly from different agents, so you check the turns that take part and add a note.

Figure: Tagging a set of turns across agents as a cascading error. Tag collusion, groupthink, and cascading errors across agents and turns.

```yaml
annotation_schemes:
  - annotation_type: tool_contention
    name: contention
    description: "Classify each shared-resource contention region."
    calls_key: calls
    agent_key: agent
    resource_key: resource
    contention_labels: [deadlock, circular_wait, race_condition, benign]
  - annotation_type: emergent_behavior
    name: emergent
    description: "For each collective behavior, check the turns that participate."
    steps_key: steps
    agent_key: agent
    behaviors: [collusion, groupthink, cascading_error, role_drift]
    allow_note: true
```
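Because an emergent behavior is a set of turns rather than a span, a start/end pair is the wrong data structure for it. One way to model a tag in Python, using the behavior names from the config above. The `EmergentTag` class and both helpers are hypothetical and only illustrate the set-based shape.

```python
from dataclasses import dataclass

BEHAVIORS = {"collusion", "groupthink", "cascading_error", "role_drift"}


@dataclass(frozen=True)
class EmergentTag:
    behavior: str
    turns: frozenset  # indices of participating turns, not necessarily adjacent
    note: str = ""

    def __post_init__(self):
        if self.behavior not in BEHAVIORS:
            raise ValueError(f"unknown behavior: {self.behavior}")
        if not self.turns:
            raise ValueError("a tag needs at least one participating turn")


def participating_agents(tag, steps, agent_key="agent"):
    """Return the sorted names of agents that acted in the tagged turns."""
    return sorted({steps[index][agent_key] for index in tag.turns})


def is_contiguous(tag):
    """True only if the tagged turns form one unbroken run of indices."""
    ordered = sorted(tag.turns)
    return ordered[-1] - ordered[0] + 1 == len(ordered)


steps = [
    {"agent": "planner"}, {"agent": "coder"}, {"agent": "reviewer"},
    {"agent": "coder"}, {"agent": "reviewer"},
]
tag = EmergentTag(
    "cascading_error",
    frozenset({0, 1, 3}),
    note="Planner's wrong assumption propagated through both coder turns.",
)
print(participating_agents(tag, steps))
print(is_contiguous(tag))
```

The tag skips turn 2, which a span-based annotation could not express without also sweeping in the reviewer.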

Putting it in order

On a real broken run the sequence is usually: read the interaction graph to see the shape, use failure attribution to name the decisive step, open handoff review if the decisive step was a transfer, and reach for the contention timeline or emergent-behavior tagging when the failure is about timing or the group rather than one agent. Score with the scorecard once you are comparing designs rather than debugging one run. Measure agreement on attribution the way you would any subjective label; see Inter-Annotator Agreement.
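For the responsible-agent part of the triple, which is a nominal label, one common agreement statistic is Cohen's kappa between two annotators. The sketch below computes it by hand. The choice of kappa and the sample labels are illustrative assumptions, and the Inter-Annotator Agreement guide may recommend a different measure.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical responsible-agent labels for six failed runs.
annotator_1 = ["coder", "coder", "planner", "reviewer", "coder", "planner"]
annotator_2 = ["coder", "planner", "planner", "reviewer", "coder", "coder"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

The two annotators agree on four of six runs, but because chance agreement is 14/36, kappa comes out at 10/22, roughly 0.455.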

Further reading