How to Evaluate Multi-Agent Systems

A practical guide to evaluating multi-agent LLM systems: attributing failures to the responsible agent and handoff, reviewing the interaction graph, and scoring each agent and the team.

A multi-agent system is several LLM agents (a planner, a coder, a reviewer, and so on) cooperating on one task. Evaluating it means more than scoring the final answer, because the failures that matter happen between agents: a dropped constraint at a handoff, the wrong agent taking over, a team that never verifies its work. The useful unit of judgment is which agent, which step, and which handoff. Potato is an open-source tool for human evaluation of multi-agent runs, with a purpose-built set of annotation surfaces for team structure.

A multi-agent system here means an LLM-driven workflow where distinct agents, each with a role, exchange messages and hand off control. Research on why these systems fail (the MAST taxonomy, from "Why Do Multi-Agent LLM Systems Fail?") finds that a large share of failures are inter-agent: specification problems, misalignment between agents, and missing verification. A flat transcript hides exactly those failures.

Why isn't single-agent evaluation enough?

When you evaluate one agent, you judge a single sequence of thoughts, tool calls, and observations. A team adds failure modes that only exist between agents:

Handoff loss: agent A knows a constraint that agent B never receives.
Misattribution: the run fails, but the responsible agent is upstream of where the error surfaced.
Coordination failure: each agent is individually competent, yet the team loops, stalls, or never verifies.
Resource contention: two agents touch the same tool or file at once and deadlock.

Scoring only the final output tells you that the team failed, not where. Attribution is what makes the data useful for debugging or training.

How do I attribute a multi-agent failure?

The failure-attribution literature (Zhang et al., "Which Agent Causes Task Failures and When?", ICML 2025) frames the label as a triple: the responsible agent, the decisive step, and a reason. In Potato the failure_attribution schema populates the agent and step pickers from the trace itself, so the annotator chooses from agents and steps that actually occurred:

```yaml
annotation_schemes:
  - annotation_type: radio
    name: outcome
    description: "Did the system succeed?"
    labels: [success, failure]
  - annotation_type: failure_attribution
    name: attribution
    description: "If it failed: which agent, which step, and why?"
    steps_key: steps
    agent_key: agent
```
Pairing the outcome scheme with attribution means the triple is only collected on runs that actually failed.
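To make the triple concrete, here is a minimal Python sketch of an attribution label and the picker options derived from a trace. It is an illustration, not Potato's implementation: the trace shape follows only the steps and agent keys named in the config, while the content field, the Attribution class, and the picker_options and validate helpers are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical trace: the "steps" list and per-step "agent" key mirror the
# steps_key / agent_key settings; the "content" field is an assumption.
trace = {
    "steps": [
        {"agent": "planner", "content": "Plan: parse the CSV, skip the header row."},
        {"agent": "coder", "content": "Wrote parse() that reads every row."},
        {"agent": "reviewer", "content": "Looks fine, approved."},
    ]
}


@dataclass(frozen=True)
class Attribution:
    agent: str   # responsible agent
    step: int    # index of the decisive step in trace["steps"]
    reason: str  # free-text explanation


def picker_options(trace):
    """Return agents (in order of first appearance) and step indices found in the trace."""
    agents = list(dict.fromkeys(step["agent"] for step in trace["steps"]))
    steps = list(range(len(trace["steps"])))
    return agents, steps


def validate(attribution, trace):
    """Check that the label names an existing step that was taken by the named agent."""
    _, steps = picker_options(trace)
    if attribution.step not in steps:
        return False
    return trace["steps"][attribution.step]["agent"] == attribution.agent


label = Attribution(agent="coder", step=1, reason="Ignored the skip-header constraint.")
print(picker_options(trace), validate(label, trace))
```

Deriving the options from the trace is what rules out labels that point at an agent or step that never occurred.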

How do I review the team structure, not just the transcript?

Two surfaces make the structure visible. The interaction graph renders agents as nodes and handoffs as edges, and the annotator marks the critical path and flags problematic edges. Handoff review turns every control transfer into a card to flag misalignment and rate quality:

```yaml
annotation_schemes:
  - annotation_type: handoff_review
    name: handoffs
    description: "For each handoff: flag any misalignment and rate the quality."
    steps_key: steps
    agent_key: agent
    flags: [info_loss, dropped_constraint, garbling, goal_drift]
    quality_scale: 5
```
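One way to picture what counts as a handoff is the sketch below, which treats any pair of consecutive steps with different agents as a control transfer and counts the resulting edges for a simple interaction graph. This definition is an assumption made for illustration; Potato may derive handoffs differently, and extract_handoffs and edge_counts are hypothetical helpers.

```python
from collections import Counter


def extract_handoffs(steps, agent_key="agent"):
    """Return one record per control transfer: consecutive steps with different agents."""
    handoffs = []
    for index in range(1, len(steps)):
        sender = steps[index - 1][agent_key]
        receiver = steps[index][agent_key]
        if sender != receiver:
            handoffs.append({"from": sender, "to": receiver, "at_step": index})
    return handoffs


def edge_counts(handoffs):
    """Count how often each directed (from, to) edge occurs in the run."""
    return Counter((h["from"], h["to"]) for h in handoffs)


# Hypothetical run: the coder takes two steps in a row, then the reviewer
# sends the work back to the coder.
steps = [
    {"agent": "planner"},
    {"agent": "coder"},
    {"agent": "coder"},
    {"agent": "reviewer"},
    {"agent": "coder"},
]

handoffs = extract_handoffs(steps)
for handoff in handoffs:
    print(handoff)
print(edge_counts(handoffs))
```

Each record corresponds to one review card, and the edge counts give the edges of the graph the annotator would inspect.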

For scoring, the agent_scorecard rates each agent on role fidelity, contribution, and coordination, and scores the team on its own dimensions, so a strong individual agent inside a poorly coordinated team is visible in the numbers.
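As a rough sketch of how scorecard ratings might be summarized once exported, the Python below averages each agent's ratings per dimension and flags agents that score well inside a team that scores poorly. The 1-5 scale, the data layout, the single overall team score, and both thresholds are assumptions for illustration, not Potato's export format.

```python
from statistics import mean

DIMENSIONS = ("role_fidelity", "contribution", "coordination")

# Hypothetical 1-5 ratings for one run, one dict per annotator.
agent_ratings = {
    "planner": [
        {"role_fidelity": 5, "contribution": 4, "coordination": 2},
        {"role_fidelity": 4, "contribution": 4, "coordination": 3},
    ],
    "coder": [
        {"role_fidelity": 5, "contribution": 5, "coordination": 4},
        {"role_fidelity": 5, "contribution": 4, "coordination": 4},
    ],
}
team_ratings = [2, 3]  # hypothetical overall team score from each annotator


def summarize(agent_ratings, team_ratings):
    """Average each agent's ratings per dimension and average the team score."""
    per_agent = {
        agent: {dim: mean(row[dim] for row in rows) for dim in DIMENSIONS}
        for agent, rows in agent_ratings.items()
    }
    return per_agent, mean(team_ratings)


def strong_agent_weak_team(per_agent, team_score, agent_threshold=4.0, team_threshold=3.0):
    """List agents whose mean across dimensions is high while the team score is low."""
    if team_score >= team_threshold:
        return []
    return [
        agent for agent, dims in per_agent.items()
        if mean(dims.values()) >= agent_threshold
    ]


per_agent, team_score = summarize(agent_ratings, team_ratings)
print(per_agent, team_score, strong_agent_weak_team(per_agent, team_score))
```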

Which method should I use?

Debugging a pipeline: start with the interaction graph and failure attribution to localize where runs break.
Comparing orchestration patterns: add the scorecard to score sequential vs. hierarchical vs. group-chat designs on the same tasks.
Building training or reward data: tag failures at step granularity with the MAST modes (via trajectory_eval) so the labels attach to the acting agent and step.
Concurrency bugs: use the tool-contention timeline to catch deadlocks and races a transcript cannot show.
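To illustrate the kind of signal a contention timeline exposes, the sketch below flags pairs of tool calls where different agents hold the same resource over overlapping time intervals. The record layout (agent, resource, start, end) and the find_contention helper are hypothetical, and an overlap only marks a candidate race or deadlock for a human to inspect.

```python
from itertools import combinations

# Hypothetical tool-call records: (agent, resource, start, end), times in seconds.
calls = [
    ("coder", "repo/main.py", 0.0, 4.0),
    ("reviewer", "repo/main.py", 3.0, 5.0),
    ("coder", "shell", 5.0, 6.0),
    ("planner", "repo/README.md", 1.0, 2.0),
]


def find_contention(calls):
    """Return (resource, agent_a, agent_b) for overlapping calls by different agents."""
    conflicts = []
    for first, second in combinations(calls, 2):
        same_resource = first[1] == second[1]
        different_agents = first[0] != second[0]
        # Two intervals overlap when each one starts before the other ends.
        overlap = first[2] < second[3] and second[2] < first[3]
        if same_resource and different_agents and overlap:
            conflicts.append((first[1], first[0], second[0]))
    return conflicts


print(find_contention(calls))
```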

Measure agreement on attribution the same way you would for any subjective label; see Inter-Annotator Agreement.
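For two annotators labeling the responsible agent on the same failed runs, Cohen's kappa is one common chance-corrected agreement measure. The sketch below implements it directly; the labels are made up for illustration, and other statistics (for example Krippendorff's alpha for more than two annotators) may suit a given study better.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    total = len(labels_a)
    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / total
    # Expected agreement if each annotator labeled independently at their own rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (total * total)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical "responsible agent" labels on six failed runs.
annotator_a = ["coder", "coder", "planner", "reviewer", "coder", "planner"]
annotator_b = ["coder", "planner", "planner", "reviewer", "coder", "coder"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

The same function can be applied to the step component of the triple, though exact-match agreement on steps is usually stricter than agreement on agents.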
