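Multi-Agent Team Evaluation

Annotate multi-agent systems by team structure rather than as a flat transcript. Potato provides a clickable agent-interaction graph, cross-agent failure attribution, handoff review, per-agent and team scorecards, a tool-contention timeline, and emergent-behavior tagging.

Multi-agent systems fail differently from single agents: faults arise between agents, at a handoff, or from how the team is organized. Evaluating them means attributing an outcome to a specific agent, step, and handoff, not just scoring a flat transcript. Potato provides a set of purpose-built annotation interfaces for this: a clickable interaction graph, failure attribution, handoff review, per-agent and team scorecards, a tool-contention timeline, and cross-lane emergent-behavior tagging.

These all build on the agent trace view and the MAST failure taxonomy. At render time, each schema derives the agents, steps, and handoffs from the trace itself, so annotators can only choose from what actually happened in the run.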

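Interaction graph (`agent_interaction_graph`)

The whole run renders as a directed graph: nodes are agents, edges are the messages and handoffs between them (thicker edges mean more frequent traffic), and the layout is generated automatically from the trace. Annotators click nodes to mark the critical path and click edges to cycle through normal → critical → problematic. It is the most direct answer to "how do I see the structure of a multi-agent run?", and an interface that general-purpose annotation tools do not offer.

Figure: Mark the critical path and flag problematic handoffs on a clickable agent-interaction graph.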

```yaml
annotation_schemes:
  - annotation_type: agent_interaction_graph
    name: graph
    description: "Mark the critical path and flag any problematic handoffs."
    steps_key: steps
    agent_key: agent
```

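Stored as `{"critical_nodes": [...], "edges": {"A->B": "problematic", ...}}`. Every node and edge is keyboard-focusable, and a live text summary lists the critical nodes and flagged edges, so meaning is never conveyed by color alone.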
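As a rough illustration of the stored shape and the text summary, the sketch below cycles an edge through its three states and renders a summary string. The function names `next_edge_state` and `summarize`, the wrap-around from problematic back to normal, and the summary wording are all assumptions made for illustration, not Potato's actual implementation.

```python
from typing import Dict, List

# Edge states in the order a click cycles through them (taken from the text).
EDGE_STATES = ["normal", "critical", "problematic"]


def next_edge_state(current: str) -> str:
    """Return the state an edge moves to after one click (wrap-around assumed)."""
    i = EDGE_STATES.index(current)
    return EDGE_STATES[(i + 1) % len(EDGE_STATES)]


def summarize(annotation: Dict) -> str:
    """Build a plain-text summary of a stored graph annotation (hypothetical wording)."""
    nodes: List[str] = annotation.get("critical_nodes", [])
    edges: Dict[str, str] = annotation.get("edges", {})
    flagged = [f"{edge} ({state})" for edge, state in sorted(edges.items())
               if state != "normal"]
    node_part = ", ".join(nodes) if nodes else "none"
    edge_part = ", ".join(flagged) if flagged else "none"
    return f"Critical nodes: {node_part}. Flagged edges: {edge_part}."


example = {
    "critical_nodes": ["Planner", "Coder"],
    "edges": {"Planner->Coder": "critical", "Coder->Reviewer": "problematic"},
}
print(summarize(example))
```

A summary like this carries the same information as the colored graph, which is the point of not relying on color alone.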

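Cross-agent failure attribution (`failure_attribution`)

When a team fails, the useful label is the **(responsible agent, decisive step, reason)** triple from the failure-attribution literature (Zhang et al., Which Agent Causes Task Failures and When?, ICML 2025, Who&When dataset). The agent dropdown and step picker are populated from the trace's own turns, so annotators attribute the failure to a real agent and a real step.

Figure: Attribute a multi-agent failure to the responsible agent, the decisive step, and why.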

```yaml
annotation_schemes:
  - annotation_type: failure_attribution
    name: attribution
    description: "If it failed: which agent, which step, and why?"
    steps_key: steps
    agent_key: agent
    # agents: [Planner, Coder, Reviewer]   # optional static list instead of deriving from the trace
```

Stored as `{"responsible_agent", "decisive_step", "reason"}`. Pair it with a `radio` outcome schema (success/failure) so that attribution is triggered only on failed runs.
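A minimal sketch of how a stored attribution could be sanity-checked against its trace and gated on the outcome label. It assumes that `decisive_step` is a 0-based index into the `steps` list and that the paired radio schema emits the strings `success` and `failure`; `check_attribution` is a hypothetical helper, not part of Potato.

```python
from typing import Dict, List, Optional


def check_attribution(steps: List[Dict], outcome: str,
                      attribution: Optional[Dict]) -> List[str]:
    """Return a list of problems; an empty list means the label is consistent."""
    if outcome != "failure":
        # Attribution only applies to failed runs.
        return ["attribution given for a non-failed run"] if attribution else []
    if not attribution:
        return ["failed run has no attribution"]
    problems = []
    agents = {step["agent"] for step in steps}
    if attribution.get("responsible_agent") not in agents:
        problems.append("responsible_agent is not in the trace")
    step_idx = attribution.get("decisive_step")
    # Assumption: decisive_step is a 0-based index into steps.
    if not isinstance(step_idx, int) or not 0 <= step_idx < len(steps):
        problems.append("decisive_step is not a valid step index")
    if not attribution.get("reason"):
        problems.append("reason is empty")
    return problems


steps = [
    {"agent": "Planner", "content": "Split the task into two subtasks."},
    {"agent": "Coder", "content": "Implemented subtask one only."},
    {"agent": "Reviewer", "content": "Approved without running tests."},
]
good = {"responsible_agent": "Coder", "decisive_step": 1,
        "reason": "Dropped the second subtask."}
bad = {"responsible_agent": "Tester", "decisive_step": 7, "reason": ""}

print(check_attribution(steps, "failure", good))
print(check_attribution(steps, "failure", bad))
```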

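Handoff review (`handoff_review`)

Every handoff, where one agent passes control to another, becomes a first-class annotatable object. Whenever the acting agent changes between adjacent turns, Potato generates a handoff card A → B; annotators flag inter-agent misalignment and rate the handoff quality. The failure modes are grounded in MAST's inter-agent categories and the "echoing" phenomenon (Zhang et al., 2025).

Figure: Flag inter-agent misalignment on every handoff and rate its quality.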

```yaml
annotation_schemes:
  - annotation_type: handoff_review
    name: handoffs
    description: "For each handoff: flag any misalignment and rate the quality."
    steps_key: steps
    agent_key: agent
    flags: [info_loss, dropped_constraint, garbling, goal_drift]
    quality_scale: 5
```

Handoffs are derived from the trace at render time, so no manual setup is needed. Stored as a list of `{index, step, from, to, flags, quality}`.
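The derivation rule described above (a handoff wherever the acting agent changes between adjacent turns) can be sketched in a few lines. `derive_handoffs` is a hypothetical helper, and recording the receiving turn's index as `step` is an assumption about what that field holds.

```python
from typing import Dict, List


def derive_handoffs(steps: List[Dict], agent_key: str = "agent") -> List[Dict]:
    """Emit one handoff per adjacent pair of steps whose acting agent differs."""
    handoffs = []
    for i in range(1, len(steps)):
        prev_agent = steps[i - 1].get(agent_key)
        curr_agent = steps[i].get(agent_key)
        if prev_agent != curr_agent:
            handoffs.append({
                "index": len(handoffs),  # position among the derived handoffs
                "step": i,               # assumed: the step where the new agent takes over
                "from": prev_agent,
                "to": curr_agent,
                "flags": [],             # filled in by the annotator
                "quality": None,         # filled in by the annotator
            })
    return handoffs


trace = [
    {"agent": "Planner", "content": "Draft the plan."},
    {"agent": "Planner", "content": "Refine the plan."},
    {"agent": "Coder", "content": "Implement the plan."},
    {"agent": "Reviewer", "content": "Request a fix."},
    {"agent": "Coder", "content": "Apply the fix."},
]
handoffs = derive_handoffs(trace)
for h in handoffs:
    print(f'{h["index"]}: step {h["step"]} {h["from"]} -> {h["to"]}')
```

Consecutive turns by the same agent (the two Planner turns here) produce no card, so the number of cards tracks changes of control rather than the number of turns.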

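Per-agent and team scorecard (`agent_scorecard`)

Score a run at two levels at once (MultiAgentBench, Zhu et al., ACL 2025): each agent gets per-dimension scores (role fidelity, contribution, coordination), the team gets scores on shared dimensions, and milestones can optionally be checked off. Agent rows come from the trace's own turns, so the matrix matches the actual participants.

Figure: Score every agent on role fidelity, contribution, and coordination, plus the team and milestones.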

```yaml
annotation_schemes:
  - annotation_type: agent_scorecard
    name: scorecard
    description: "Score each agent, the team, and which milestones were reached."
    steps_key: steps
    agent_key: agent
    scale: 5
    agent_dimensions: [role fidelity, contribution, coordination]
    team_dimensions: [coordination, communication, efficiency]
    milestones: [plan produced, task delegated correctly, result verified]   # optional
```

Stored as `{"agents": {name: {dim: score}}, "team": {dim: score}, "milestones": {name: bool}}`.
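To show how the stored shape might be consumed downstream, here is a small sketch that averages each agent's dimensions, averages the team dimensions, and counts reached milestones. `summarize_scorecard` and its output keys are hypothetical, and the sketch assumes every score has been filled in with a number.

```python
from statistics import mean
from typing import Dict


def summarize_scorecard(card: Dict) -> Dict:
    """Collapse a stored scorecard into per-agent means, a team mean, and milestone counts."""
    agent_means = {name: mean(dims.values())
                   for name, dims in card["agents"].items() if dims}
    team_mean = mean(card["team"].values()) if card["team"] else None
    milestones = card.get("milestones", {})
    reached = sum(1 for done in milestones.values() if done)
    return {
        "agent_means": agent_means,
        "team_mean": team_mean,
        "milestones_reached": reached,
        "milestones_total": len(milestones),
    }


card = {
    "agents": {
        "Planner": {"role fidelity": 4, "contribution": 5, "coordination": 3},
        "Coder": {"role fidelity": 2, "contribution": 3, "coordination": 4},
    },
    "team": {"coordination": 3, "communication": 4, "efficiency": 5},
    "milestones": {
        "plan produced": True,
        "task delegated correctly": True,
        "result verified": False,
    },
}
summary = summarize_scorecard(card)
print(summary)
```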

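Tool/resource contention timeline (`tool_contention`)

Concurrent tool and resource use across agents renders on a multi-lane timeline, one lane per agent. Regions where two calls touch the same resource during overlapping times are highlighted across lanes and listed for classification: deadlock, circular wait, race condition, or benign (DPBench, 2026). This is how you catch the concurrency faults that a turn-by-turn transcript hides.

Figure: Spot deadlocks and race conditions on a per-agent tool-call timeline.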

```yaml
annotation_schemes:
  - annotation_type: tool_contention
    name: contention
    description: "Classify each shared-resource contention region."
    calls_key: calls          # list of {agent, tool, start, end, resource}
    agent_key: agent
    resource_key: resource
    contention_labels: [deadlock, circular_wait, race_condition, benign]
```

Contention regions are computed at render time (same `resource`, overlapping intervals). Stored as `{"contentions": {idx: label}}`.
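The "same resource, overlapping intervals" rule can be sketched as a pairwise check over the `calls` list. This is an illustration under stated assumptions, not Potato's code: `find_contentions` is a hypothetical name, overlaps within a single agent's own lane are skipped, and intervals that only touch at an endpoint are not treated as overlapping.

```python
from itertools import combinations
from typing import Dict, List


def find_contentions(calls: List[Dict]) -> List[Dict]:
    """Return one region per pair of calls on the same resource with overlapping intervals."""
    regions = []
    for a, b in combinations(calls, 2):
        # Assumption: only calls from different agents (different lanes) contend.
        if a["agent"] == b["agent"] or a["resource"] != b["resource"]:
            continue
        start = max(a["start"], b["start"])
        end = min(a["end"], b["end"])
        if start < end:  # strict comparison: touching endpoints do not overlap
            regions.append({
                "resource": a["resource"],
                "agents": [a["agent"], b["agent"]],
                "start": start,
                "end": end,
            })
    return regions


calls = [
    {"agent": "Coder", "tool": "write_file", "start": 0, "end": 5, "resource": "repo.lock"},
    {"agent": "Reviewer", "tool": "read_file", "start": 3, "end": 8, "resource": "repo.lock"},
    {"agent": "Planner", "tool": "search", "start": 1, "end": 4, "resource": "web"},
    {"agent": "Reviewer", "tool": "write_file", "start": 8, "end": 10, "resource": "repo.lock"},
]
regions = find_contentions(calls)
print(regions)

# An annotator's classification, keyed by region index as in the stored format.
annotation = {"contentions": {0: "race_condition"}}
```

The pairwise scan is quadratic in the number of calls, which is fine for a single trace; a sweep over sorted start times would scale better on very long runs.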

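Cross-lane emergent behavior (`emergent_behavior`)

Some failures are collective: collusion, groupthink, cascading errors, role drift. An emergent behavior is not a contiguous text span; it is a set of participating turns, possibly from different agents. For each behavior, annotators check the turns that take part and add a note, producing a cross-lane span expressed as a set of turns.

Figure: Tag collusion, groupthink, and cascading errors across agents and turns.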

```yaml
annotation_schemes:
  - annotation_type: emergent_behavior
    name: emergent
    description: "For each collective behavior, check the turns (across agents) that participate."
    steps_key: steps
    agent_key: agent
    behaviors: [collusion, groupthink, cascading_error, role_drift]
    allow_note: true
```

Stored as `{behavior: {turns: [idx...], note}}`, keeping only non-empty behaviors.
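A short sketch of the "keep only non-empty behaviors" step, plus a lookup of which agents a turn set spans. Both helpers are hypothetical, and treating a behavior as non-empty when it has at least one checked turn or a non-blank note is an assumption about the rule.

```python
from typing import Dict, List


def compact_emergent(raw: Dict[str, Dict]) -> Dict[str, Dict]:
    """Drop behaviors with no checked turns and no note; dedupe and sort turn indices."""
    stored = {}
    for behavior, entry in raw.items():
        turns = sorted(set(entry.get("turns", [])))
        note = entry.get("note", "").strip()
        if turns or note:  # assumed definition of "non-empty"
            stored[behavior] = {"turns": turns, "note": note}
    return stored


def agents_in(steps: List[Dict], turns: List[int], agent_key: str = "agent") -> List[str]:
    """List the distinct agents whose turns participate in a behavior."""
    return sorted({steps[i][agent_key] for i in turns})


trace = [{"agent": name} for name in
         ["Planner", "Coder", "Reviewer", "Coder", "Reviewer"]]
raw = {
    "collusion": {"turns": [], "note": ""},
    "cascading_error": {"turns": [3, 1, 3, 4], "note": " Bad parse propagated downstream. "},
    "groupthink": {"turns": [], "note": "  "},
}
stored = compact_emergent(raw)
print(stored)
print(agents_in(trace, stored["cascading_error"]["turns"]))
```

Because the span is a set of turn indices rather than a character range, one behavior can cover turns from several lanes, as the two-agent result here shows.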

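Tool call review (`tool_call_review`)

Judge each tool or function call individually: was the right tool chosen, are the arguments correct, and is the order right (in the style of BFCL v4 / MCPMark)? Tool calls are extracted from the trace steps at render time; each step's `tool_calls`, `tool_call`, or `action` becomes a card showing the tool name and pretty-printed arguments.

Figure: Judge every tool call: right tool, correct arguments, right order.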

```yaml
annotation_schemes:
  - annotation_type: tool_call_review
    name: tool_review
    description: "Judge each tool call: right tool? correct arguments?"
    steps_key: steps
    # verdict_options: [correct, wrong_tool, wrong_args, wrong_order]   # customizable
```

Stored as a list of `{index, step, tool, verdict, notes}`.
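The extraction step could look roughly like the sketch below. It assumes each call is a dict with `name` and `arguments` fields and that `tool_calls` takes precedence over `tool_call`, which takes precedence over `action`; neither assumption is confirmed by the text, and `extract_tool_calls` is a hypothetical helper. The `arguments` field on each card is display-only and is not part of the stored record.

```python
import json
from typing import Dict, List


def extract_tool_calls(steps: List[Dict]) -> List[Dict]:
    """Turn each step's tool_calls, tool_call, or action into one review card per call."""
    cards = []
    for step_idx, step in enumerate(steps):
        if "tool_calls" in step:
            raw_calls = step["tool_calls"]
        elif "tool_call" in step:
            raw_calls = [step["tool_call"]]
        elif "action" in step:
            raw_calls = [step["action"]]
        else:
            raw_calls = []
        for call in raw_calls:
            cards.append({
                "index": len(cards),
                "step": step_idx,
                "tool": call.get("name", "unknown"),  # assumed field name
                "arguments": json.dumps(call.get("arguments", {}),
                                        indent=2, sort_keys=True),  # pretty-printed for display
                "verdict": None,  # filled in by the annotator
                "notes": "",
            })
    return cards


trace = [
    {"agent": "Planner", "content": "Plan the fix."},
    {"agent": "Coder", "tool_calls": [
        {"name": "read_file", "arguments": {"path": "a.py"}},
        {"name": "run_tests", "arguments": {}},
    ]},
    {"agent": "Reviewer", "action": {"name": "search", "arguments": {"q": "bug"}}},
]
cards = extract_tool_calls(trace)
for card in cards:
    print(card["index"], card["step"], card["tool"])
```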

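Step-level MAST tagging

You can bind the 14-mode MAST failure taxonomy (Cemri et al., Why Do Multi-Agent LLM Systems Fail?, 2025) to the exact step where a failure occurred (and therefore to the acting agent) without a new schema. Configure the existing per-step `trajectory_eval` schema with the MAST modes as its `error_types`, grouped under the three MAST categories. Pair it with `failure_attribution` and `handoff_review` for full coverage.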

```yaml
annotation_schemes:
  - annotation_type: trajectory_eval
    name: mast_steps
    description: "Tag each step with the MAST failure mode(s) it exhibits."
    steps_key: steps
    step_text_key: content
    error_types:
      - name: "Specification & System Design"
        subtypes: ["Disobey task specification", "Disobey role specification", "Step repetition", "Loss of conversation history", "Unaware of termination conditions"]
      - name: "Inter-Agent Misalignment"
        subtypes: ["Conversation reset", "Fail to ask for clarification", "Task derailment", "Information withholding", "Ignored other agent's input", "Reasoning-action mismatch"]
      - name: "Task Verification & Termination"
        subtypes: ["Premature termination", "No or incomplete verification", "Incorrect verification"]
```

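Choosing an orchestration lens

The orchestration architecture often dominates a run's outcome, so it is worth capturing as a first-class label. No new schema is needed: a `radio` confirms or corrects the run's pattern, which in turn guides the evaluation lens and how the trace is laid out (sequential → swimlanes, hierarchical → tree, group chat → board).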

```yaml
annotation_schemes:
  - annotation_type: radio
    name: orchestration_pattern
    description: "Which orchestration pattern does this run actually follow?"
    labels: [single_agent, sequential_pipeline, hierarchical_manager, group_chat, blackboard, debate, hub_and_spoke]
    has_free_response: true
```
