# How to Evaluate Multi-Agent Systems

Source: https://www.potatoannotator.com/docs/guides/evaluating-multi-agent-systems

**A multi-agent system is several LLM agents (a planner, a coder, a reviewer, and so on) cooperating on one task. Evaluating it means more than scoring the final answer, because the failures that matter happen between agents: a dropped constraint at a handoff, the wrong agent taking over, a team that never verifies its work. The useful unit of judgment is which agent, which step, and which handoff.** Potato is an open-source tool for human evaluation of multi-agent runs, with a [purpose-built set of annotation surfaces](/docs/agent-evaluation/multi-agent-evaluation) for team structure.

A [multi-agent system](https://en.wikipedia.org/wiki/Multi-agent_system) here means an [LLM](https://en.wikipedia.org/wiki/Large_language_model)-driven workflow where distinct agents, each with a role, exchange messages and hand off control. Research on why these systems fail (the [MAST taxonomy](https://arxiv.org/abs/2503.13657), *Why Do Multi-Agent LLM Systems Fail?*) finds that many failures come from how the system is specified and how its agents interact, not from one agent's output alone: specification problems, misalignment between agents, and missing verification. A flat transcript hides exactly those.

## Why isn't single-agent evaluation enough?

When you evaluate one agent, you judge a single sequence of thoughts, tool calls, and observations. A team adds failure modes that only exist between agents:

- **Handoff loss**: agent A knows a constraint that agent B never receives.
- **Misattribution**: the run fails, but the responsible agent is upstream of where the error surfaced.
- **Coordination failure**: each agent is individually competent, yet the team loops, stalls, or never verifies.
- **Resource contention**: two agents touch the same tool or file at once, causing a deadlock or a race condition.

Scoring only the final output tells you *that* the team failed, not *where*. Attribution is what makes the data useful for debugging or training.
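To make handoff loss concrete, here is a minimal Python sketch that walks a trace and reports the first handoff where a known constraint is no longer mentioned. The trace shape (`agent` and `content` fields), the function name `first_lossy_handoff`, and the keyword match are all assumptions for illustration, not part of Potato; a substring check is a crude stand-in for the judgment a human annotator makes.

```python
from typing import Dict, List, Optional

# Hypothetical trace: each step records the acting agent and the message it emitted.
trace: List[Dict[str, str]] = [
    {"agent": "planner", "content": "Write parse_date(); no external libraries allowed."},
    {"agent": "coder", "content": "Implemented parse_date(). Please review."},
    {"agent": "reviewer", "content": "Approved."},
]

def first_lossy_handoff(steps: List[Dict[str, str]], constraint: str) -> Optional[dict]:
    """Return the first handoff whose outgoing message omits the constraint."""
    for i in range(1, len(steps)):
        sender, receiver = steps[i - 1], steps[i]
        if sender["agent"] == receiver["agent"]:
            continue  # same agent kept control, so this is not a handoff
        if constraint.lower() not in sender["content"].lower():
            return {"from": sender["agent"], "to": receiver["agent"], "step": i}
    return None

loss = first_lossy_handoff(trace, "no external libraries")
print(loss)
```

In this example the planner passes the constraint to the coder, but the coder's message to the reviewer drops it, so the reviewer approves without ever having seen it. A final-output score would record only the failure.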

## How do I attribute a multi-agent failure?

The failure-attribution literature (Zhang et al., *Which Agent Causes Task Failures and When?*, ICML 2025) frames the label as a triple: the **responsible agent**, the **decisive step**, and a **reason**. In Potato the `failure_attribution` schema populates the agent and step pickers from the trace itself, so the annotator chooses from agents and steps that actually occurred:

```yaml
annotation_schemes:
  - annotation_type: radio
    name: outcome
    description: "Did the system succeed?"
    labels: [success, failure]
  - annotation_type: failure_attribution
    name: attribution
    description: "If it failed: which agent, which step, and why?"
    steps_key: steps
    agent_key: agent
```

Pairing the outcome scheme with attribution means the triple is collected only on runs that actually failed.
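For downstream analysis it can help to treat the collected label as a typed record. The sketch below is one possible representation, not Potato's export format: the `Attribution` class, the `make_label` helper, and the trace layout (a `steps` list whose items carry an `agent` field, mirroring `steps_key` and `agent_key` above) are assumptions. It also assumes the decisive step must have been taken by the responsible agent.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Attribution:
    agent: str   # responsible agent
    step: int    # index of the decisive step in trace["steps"]
    reason: str  # free-text explanation

def make_label(trace: dict, outcome: str,
               attribution: Optional[Attribution] = None) -> dict:
    """Combine outcome and attribution, rejecting triples that do not match the trace."""
    if outcome not in ("success", "failure"):
        raise ValueError(f"unknown outcome: {outcome!r}")
    if outcome == "success":
        if attribution is not None:
            raise ValueError("attribution is only collected for failed runs")
        return {"outcome": outcome, "attribution": None}
    if attribution is None:
        raise ValueError("a failed run needs an attribution")
    steps = trace["steps"]
    if not 0 <= attribution.step < len(steps):
        raise ValueError("decisive step is not in the trace")
    # Assumption: the decisive step must have been taken by the responsible agent.
    if steps[attribution.step]["agent"] != attribution.agent:
        raise ValueError("agent did not act at the decisive step")
    return {"outcome": outcome, "attribution": attribution}

trace = {"steps": [{"agent": "planner"}, {"agent": "coder"}, {"agent": "reviewer"}]}
label = make_label(trace, "failure",
                   Attribution("coder", 1, "ignored a stated constraint"))
```

Validating against the trace mirrors the idea of populating the pickers from the run itself: a label can only name an agent and step that actually occurred.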

## How do I review the team structure, not just the transcript?

Two surfaces make the structure visible. The [interaction graph](/docs/agent-evaluation/multi-agent-evaluation#interaction-graph-agent_interaction_graph) renders agents as nodes and handoffs as edges, and the annotator marks the critical path and flags problematic edges. [Handoff review](/docs/agent-evaluation/multi-agent-evaluation#handoff-review-handoff_review) turns every control transfer into a card on which the annotator flags misalignment and rates quality:

```yaml
annotation_schemes:
  - annotation_type: handoff_review
    name: handoffs
    description: "For each handoff: flag any misalignment and rate the quality."
    steps_key: steps
    agent_key: agent
    flags: [info_loss, dropped_constraint, garbling, goal_drift]
    quality_scale: 5
```
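To see what a graph of agents and handoffs contains, the following Python sketch derives weighted edges from a flat list of steps. It assumes a handoff occurs whenever two consecutive steps have different values under the agent key; how Potato itself derives handoffs may differ, and `interaction_edges` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List, Tuple

def interaction_edges(steps: List[Dict[str, str]],
                      agent_key: str = "agent") -> Counter:
    """Count directed handoffs (sender, receiver) between consecutive steps."""
    edges: Counter = Counter()
    for prev, curr in zip(steps, steps[1:]):
        if prev[agent_key] != curr[agent_key]:
            edges[(prev[agent_key], curr[agent_key])] += 1
    return edges

# Hypothetical run: the coder and reviewer go back and forth once.
steps = [{"agent": a} for a in
         ["planner", "coder", "coder", "reviewer", "coder", "reviewer"]]
edges = interaction_edges(steps)
for (src, dst), count in sorted(edges.items()):
    print(f"{src} -> {dst}: {count}")
```

Edge counts like these make loops easy to spot: a pair of agents that hands control back and forth many times shows up as two heavy edges in opposite directions.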

For scoring, the [`agent_scorecard`](/docs/agent-evaluation/multi-agent-evaluation#per-agent-and-per-team-scorecard-agent_scorecard) rates each agent on role fidelity, contribution, and coordination, and scores the team on its own dimensions, so a strong individual agent inside a poorly coordinated team is visible in the numbers.
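As an illustration of how such scores can be read together, the Python sketch below averages each agent's ratings and compares them with the team average. The three per-agent dimensions come from the description above; the team dimensions (`verification`, `efficiency`), the 1 to 5 scale, the `gap` threshold, and the `outperformers` helper are hypothetical.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical ratings on a 1-5 scale.
agent_scores: Dict[str, Dict[str, int]] = {
    "planner": {"role_fidelity": 4, "contribution": 4, "coordination": 2},
    "coder": {"role_fidelity": 5, "contribution": 5, "coordination": 4},
    "reviewer": {"role_fidelity": 3, "contribution": 2, "coordination": 2},
}
team_scores: Dict[str, int] = {"verification": 2, "efficiency": 2}  # assumed dimensions

agent_means = {agent: mean(dims.values()) for agent, dims in agent_scores.items()}
team_mean = mean(team_scores.values())

def outperformers(means: Dict[str, float], team: float, gap: float = 1.5) -> List[str]:
    """Agents whose average rating exceeds the team average by at least `gap`."""
    return sorted(a for a, m in means.items() if m - team >= gap)

print(outperformers(agent_means, team_mean))
```

Here the coder scores well on every dimension while the team as a whole scores poorly, which is the pattern a final-output score alone cannot reveal.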

## Which method should I use?

- **Debugging a pipeline**: start with the interaction graph and failure attribution to localize where runs break.
- **Comparing orchestration patterns**: add the scorecard to rate sequential, hierarchical, and group-chat designs on the same tasks.
- **Building training or reward data**: tag failures at step granularity with the MAST modes (via [`trajectory_eval`](/docs/guides/agent-trajectory-annotation)) so the labels attach to the acting agent and step.
- **Concurrency bugs**: use the tool-contention timeline to catch deadlocks and races a transcript cannot show.

Measure agreement on attribution the same way you would for any subjective label; see [Inter-Annotator Agreement](/docs/guides/inter-annotator-agreement).
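As one concrete option, Cohen's kappa compares two annotators' labels while correcting for chance agreement. The sketch below is a plain-Python implementation applied to hypothetical responsible-agent labels; it is not a Potato API, and the choice of kappa over other agreement statistics is an assumption.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical responsible-agent labels for four failed runs.
annotator_1 = ["coder", "coder", "planner", "reviewer"]
annotator_2 = ["coder", "planner", "planner", "reviewer"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Because labels only need to be hashable, the same function accepts `(agent, step)` tuples, which gives a stricter agreement figure than agent-level labels alone.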

## Further reading

- [Multi-Agent Team Evaluation](/docs/agent-evaluation/multi-agent-evaluation) — the full schema reference
- [How to Evaluate AI Agents](/docs/guides/evaluating-ai-agents) — the levels of agent evaluation
- [Annotating Agent Trajectories](/docs/guides/agent-trajectory-annotation) — per-step error taxonomies
- [Evaluating Computer-Use and Multimodal Agents](/docs/guides/evaluating-computer-use-agents)
