
Multi-agent team evaluation

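Annotate multi-agent systems by team structure rather than as a flat transcript. Potato provides a clickable agent-interaction graph, cross-agent failure attribution, handoff review, per-agent and team scorecards, a tool-contention timeline, and emergent-behavior tagging.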

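Multi-agent systems fail differently from single agents: faults arise between agents, at a handoff, or from the way the team is organized. Evaluating them means attributing an outcome to a specific agent, step, and handoff, not just scoring a flat transcript. Potato provides a set of purpose-built annotation interfaces for this: a clickable interaction graph, failure attribution, handoff review, per-agent and team scorecards, a tool-contention timeline, and cross-lane emergent-behavior tagging.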

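All of these build on the agent trajectory view and the MAST failure taxonomy. At render time, each schema derives the agents, steps, and handoffs from the trace itself, so annotators can only choose from what actually happened in the run.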

Interaction graph (agent_interaction_graph)

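The whole run renders as a directed graph: nodes are agents, edges are the messages and handoff transitions between them (a thicker edge means a more frequent transition), and the layout is generated automatically from the trace. Annotators click a node to mark it as part of the critical path and click an edge to cycle it through normal → critical → problematic. For the question "how do I see the structure of a multi-agent run?", this is the clearest answer, and it is an interface that general-purpose annotation tools do not offer.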

Mark the critical path and flag problematic handoffs on a clickable agent-interaction graph

yaml
annotation_schemes:
  - annotation_type: agent_interaction_graph
    name: graph
    description: "Mark the critical path and flag any problematic handoffs."
    steps_key: steps
    agent_key: agent

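Stored as {"critical_nodes": [...], "edges": {"A->B": "problematic", ...}}. Every node and edge is keyboard-focusable, and a live text summary lists the critical nodes and the flagged edges, so meaning is never conveyed by color alone.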
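
To make the edge-weight idea concrete, here is a minimal Python sketch that counts agent-to-agent transitions in a trace and checks that an annotation only references edges that occur. It is not Potato's implementation: it assumes an edge is a change of agent between adjacent steps (the real graph also includes messages), and `transition_counts` and the sample trace are hypothetical.

```python
from collections import Counter


def transition_counts(steps, agent_key="agent"):
    """Count directed "A->B" transitions between adjacent steps.

    Assumption: an edge exists whenever the acting agent changes from one
    step to the next. Same-agent neighbors produce no edge.
    """
    counts = Counter()
    for prev, curr in zip(steps, steps[1:]):
        src, dst = prev.get(agent_key), curr.get(agent_key)
        if src is not None and dst is not None and src != dst:
            counts[f"{src}->{dst}"] += 1
    return counts


# Hypothetical trace using the steps_key / agent_key layout from the config.
steps = [
    {"agent": "Planner", "content": "Draft a plan."},
    {"agent": "Coder", "content": "Implement step 1."},
    {"agent": "Reviewer", "content": "Needs a fix."},
    {"agent": "Coder", "content": "Fixed."},
    {"agent": "Reviewer", "content": "Approved."},
]
edge_weights = transition_counts(steps)

# An annotation in the stored shape described above.
annotation = {
    "critical_nodes": ["Planner", "Coder"],
    "edges": {"Coder->Reviewer": "problematic"},
}

# Edges named in the annotation that never occur in the trace.
unknown_edges = [edge for edge in annotation["edges"] if edge not in edge_weights]
```

Here `edge_weights` would drive the edge thickness, and `unknown_edges` stays empty as long as the annotation only refers to transitions that happened.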

Cross-agent failure attribution (failure_attribution)

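When a team fails, the useful label is the **(responsible agent, decisive step, reason)** triple from the failure-attribution literature (Zhang et al., Which Agent Causes Task Failures and When?, ICML 2025, the Who&When dataset). The agent dropdown and the step picker are populated from the trace's own turns, so annotators attribute the failure to a real agent and a real step.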

Attribute a multi-agent failure to the responsible agent, the decisive step, and why

yaml
annotation_schemes:
  - annotation_type: failure_attribution
    name: attribution
    description: "If it failed: which agent, which step, and why?"
    steps_key: steps
    agent_key: agent
    # agents: [Planner, Coder, Reviewer]   # optional static list instead of deriving from the trace

Stored as {"responsible_agent", "decisive_step", "reason"}. Pair it with a radio outcome schema (success/failure) so that attribution is triggered only on failed runs.
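
When post-processing exported annotations, the pairing with an outcome label can be checked with a few lines of Python. The sketch below is illustrative only and makes several assumptions that the documentation does not state: the outcome labels are the strings "success" and "failure", `decisive_step` is a 0-based index into the steps, and the decisive step is expected to belong to the responsible agent. `check_attribution` is a hypothetical helper, not part of Potato.

```python
def check_attribution(outcome, attribution, steps, agent_key="agent"):
    """Return a list of consistency problems (an empty list means OK)."""
    if outcome != "failure":
        # Attribution should only exist for failed runs.
        return ["attribution present on a non-failed run"] if attribution else []
    if not attribution:
        return ["failed run has no attribution"]

    problems = []
    agents = {step[agent_key] for step in steps}
    agent = attribution.get("responsible_agent")
    step_idx = attribution.get("decisive_step")

    if agent not in agents:
        problems.append(f"unknown agent: {agent!r}")
    if not isinstance(step_idx, int) or not 0 <= step_idx < len(steps):
        problems.append(f"unknown step: {step_idx!r}")
    elif steps[step_idx][agent_key] != agent:
        # Assumption: the decisive step was taken by the blamed agent.
        problems.append("decisive step was not taken by the responsible agent")
    return problems


steps = [{"agent": "Planner"}, {"agent": "Coder"}, {"agent": "Reviewer"}]

good = {"responsible_agent": "Coder", "decisive_step": 1, "reason": "Wrong API used."}
bad = {"responsible_agent": "Tester", "decisive_step": 7, "reason": "?"}

good_problems = check_attribution("failure", good, steps)
bad_problems = check_attribution("failure", bad, steps)
success_problems = check_attribution("success", None, steps)
```

Because the UI fills the dropdown and the step picker from the trace, a check like this mostly matters for annotations that were edited or merged outside the tool.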

Handoff review (handoff_review)

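Every handoff, where one agent passes control to another, becomes a first-class annotatable object. Whenever the acting agent changes between adjacent turns, Potato generates a handoff card A → B; annotators flag any inter-agent misalignment and rate the quality of the handoff. These failure modes are grounded in MAST's inter-agent category and the "echoing" phenomenon (Zhang et al., 2025).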

Flag inter-agent misalignment on every handoff and rate its quality

yaml
annotation_schemes:
  - annotation_type: handoff_review
    name: handoffs
    description: "For each handoff: flag any misalignment and rate the quality."
    steps_key: steps
    agent_key: agent
    flags: [info_loss, dropped_constraint, garbling, goal_drift]
    quality_scale: 5

Handoffs are derived from the trace at render time, so no manual setup is needed. Stored as a list of {index, step, from, to, flags, quality}.
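
The derivation rule (a handoff wherever the acting agent changes between adjacent turns) can be sketched in Python as follows. This is an approximation for illustration, not Potato's code; in particular, recording `step` as the index of the receiving turn is an assumption, and `derive_handoffs` is a hypothetical name.

```python
def derive_handoffs(steps, agent_key="agent"):
    """Build one blank handoff record per change of acting agent."""
    handoffs = []
    for i in range(1, len(steps)):
        src, dst = steps[i - 1][agent_key], steps[i][agent_key]
        if src != dst:
            handoffs.append({
                "index": len(handoffs),  # position among handoffs
                "step": i,               # assumption: index of the receiving turn
                "from": src,
                "to": dst,
                "flags": [],             # filled in by the annotator
                "quality": None,         # 1..quality_scale once rated
            })
    return handoffs


steps = [
    {"agent": "Planner", "content": "Plan: parse, then summarize."},
    {"agent": "Planner", "content": "Constraint: keep it under 100 words."},
    {"agent": "Coder", "content": "Parsing done."},
    {"agent": "Reviewer", "content": "Summary is 300 words."},
]
handoffs = derive_handoffs(steps)

# An annotator might then flag the second handoff:
handoffs[1]["flags"] = ["dropped_constraint"]
handoffs[1]["quality"] = 2
```

Consecutive turns by the same agent (the two Planner turns here) do not produce a card, which is why the trace above yields two handoffs rather than three.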

Per-agent and team scorecard (agent_scorecard)

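Score a run at two levels at once (MultiAgentBench, Zhou et al., ACL 2025): each agent is scored on per-agent dimensions (role fidelity, contribution, coordination), the team is scored on shared dimensions, and milestones can optionally be checked off. The agent rows come from the trace's own turns, so the matrix matches the actual participants.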

Score every agent on role fidelity, contribution, and coordination, plus the team and milestones

yaml
annotation_schemes:
  - annotation_type: agent_scorecard
    name: scorecard
    description: "Score each agent, the team, and which milestones were reached."
    steps_key: steps
    agent_key: agent
    scale: 5
    agent_dimensions: [role fidelity, contribution, coordination]
    team_dimensions: [coordination, communication, efficiency]
    milestones: [plan produced, task delegated correctly, result verified]   # optional

Stored as {"agents": {name: {dim: score}}, "team": {dim: score}, "milestones": {name: bool}}.
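
Once exported, a record in this shape is straightforward to summarize. The Python sketch below averages each agent's dimensions, averages the team dimensions, and computes the fraction of milestones reached; the aggregation choices (plain means) and the helper name `summarize_scorecard` are assumptions for illustration, not something Potato prescribes.

```python
from statistics import mean


def summarize_scorecard(card):
    """Collapse a stored scorecard into per-agent and team averages."""
    per_agent = {
        name: mean(dims.values())
        for name, dims in card.get("agents", {}).items()
        if dims  # skip agents with no scores yet
    }
    team_scores = card.get("team", {})
    milestones = card.get("milestones", {})
    return {
        "per_agent": per_agent,
        "team": mean(team_scores.values()) if team_scores else None,
        "milestones_reached": (
            sum(milestones.values()) / len(milestones) if milestones else None
        ),
    }


# Hypothetical annotation in the stored shape, on a 1..5 scale.
card = {
    "agents": {
        "Planner": {"role fidelity": 5, "contribution": 4, "coordination": 3},
        "Coder": {"role fidelity": 4, "contribution": 5, "coordination": 3},
    },
    "team": {"coordination": 3, "communication": 4, "efficiency": 2},
    "milestones": {
        "plan produced": True,
        "task delegated correctly": True,
        "result verified": False,
    },
}
summary = summarize_scorecard(card)
```

A plain mean treats all dimensions as equally important; a weighted scheme may fit better if, for example, role fidelity matters more than coordination in your study.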

Tool/resource contention timeline (tool_contention)

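Concurrent tool and resource use across agents renders on a multi-lane timeline, with one swimlane per agent. Regions where two calls touch the same resource during overlapping time intervals are highlighted across lanes and listed for classification: deadlock, circular wait, race condition, or benign (DPBench, 2026). This is how you catch concurrency faults that a turn-by-turn transcript hides.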

Spot deadlocks and race conditions on a per-agent tool-call timeline

yaml
annotation_schemes:
  - annotation_type: tool_contention
    name: contention
    description: "Classify each shared-resource contention region."
    calls_key: calls          # list of {agent, tool, start, end, resource}
    agent_key: agent
    resource_key: resource
    contention_labels: [deadlock, circular_wait, race_condition, benign]

Contention regions are computed at render time (same resource, overlapping intervals). Stored as {"contentions": {idx: label}}.
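
The "same resource, overlapping intervals" rule can be illustrated with a short Python sketch. It is not Potato's implementation and rests on assumptions that the documentation leaves open: `start` and `end` are numeric timestamps, intervals that merely touch do not count as overlapping, and only calls from different agents contend. `find_contentions` is a hypothetical name.

```python
from itertools import combinations


def find_contentions(calls, agent_key="agent", resource_key="resource"):
    """Return the regions where two agents use the same resource at once."""
    regions = []
    for a, b in combinations(calls, 2):
        if a[agent_key] == b[agent_key]:
            continue  # assumption: an agent does not contend with itself
        if a[resource_key] != b[resource_key]:
            continue
        start = max(a["start"], b["start"])
        end = min(a["end"], b["end"])
        if start < end:  # strict: touching endpoints are not an overlap
            regions.append({
                "idx": len(regions),
                "resource": a[resource_key],
                "agents": (a[agent_key], b[agent_key]),
                "start": start,
                "end": end,
            })
    return regions


# Hypothetical calls list in the {agent, tool, start, end, resource} shape.
calls = [
    {"agent": "Coder", "tool": "write_file", "start": 0, "end": 5, "resource": "repo.lock"},
    {"agent": "Reviewer", "tool": "read_file", "start": 3, "end": 8, "resource": "repo.lock"},
    {"agent": "Planner", "tool": "search", "start": 4, "end": 6, "resource": "web"},
    {"agent": "Reviewer", "tool": "read_file", "start": 9, "end": 10, "resource": "repo.lock"},
]
regions = find_contentions(calls)

# Annotator labels are then keyed by region index, as in the stored shape.
stored = {"contentions": {regions[0]["idx"]: "race_condition"}}
```

The pairwise comparison is quadratic in the number of calls, which is fine for a single trace; a sweep over sorted start times would scale better for very long runs.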

Cross-lane emergent behavior (emergent_behavior)

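Some failures are collective: collusion, groupthink, cascading errors, role drift. An emergent behavior is not a contiguous text span; it is a set of participating turns, possibly from different agents. For each behavior, annotators check the turns that take part and add a note, which yields a cross-lane span expressed as a set of turns.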

Tag collusion, groupthink, and cascading errors across agents and turns

yaml
annotation_schemes:
  - annotation_type: emergent_behavior
    name: emergent
    description: "For each collective behavior, check the turns (across agents) that participate."
    steps_key: steps
    agent_key: agent
    behaviors: [collusion, groupthink, cascading_error, role_drift]
    allow_note: true

Stored as {behavior: {turns: [idx...], note}}; only non-empty behaviors are kept.
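
As an illustration of the stored shape, the Python sketch below drops behaviors with no checked turns and reports which agents (lanes) a behavior spans. It assumes that "non-empty" means at least one turn is checked and that turns are 0-based step indices; both helpers are hypothetical and not part of Potato.

```python
def compact_behaviors(raw, num_steps):
    """Keep only behaviors with at least one valid checked turn."""
    compact = {}
    for behavior, entry in raw.items():
        # De-duplicate, discard out-of-range indices, and sort the turns.
        turns = sorted({t for t in entry.get("turns", []) if 0 <= t < num_steps})
        if turns:
            compact[behavior] = {"turns": turns, "note": entry.get("note", "")}
    return compact


def lanes_touched(turns, steps, agent_key="agent"):
    """List the agents whose turns participate in a behavior."""
    return sorted({steps[t][agent_key] for t in turns})


steps = [
    {"agent": "Planner"},
    {"agent": "Coder"},
    {"agent": "Reviewer"},
    {"agent": "Coder"},
]
raw = {
    "cascading_error": {"turns": [2, 1, 1], "note": "Reviewer accepts Coder's bad assumption."},
    "collusion": {"turns": [], "note": ""},
}
stored = compact_behaviors(raw, len(steps))
lanes = lanes_touched(stored["cascading_error"]["turns"], steps)
```

Representing the span as a set of turn indices, rather than character offsets, is what lets a single label cover turns in different swimlanes.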

Tool-call review (tool_call_review)

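Judge each tool or function call individually: was the right tool chosen, are the arguments correct, and is the order right (in line with BFCL v4 / MCPMark)? Tool calls are extracted from the trace steps at render time; each step's tool_calls, tool_call, or action becomes a card showing the tool name and pretty-printed arguments.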

Judge every tool call: right tool, correct arguments, right order

yaml
annotation_schemes:
  - annotation_type: tool_call_review
    name: tool_review
    description: "Judge each tool call: right tool? correct arguments?"
    steps_key: steps
    # verdict_options: [correct, wrong_tool, wrong_args, wrong_order]   # customizable

Stored as a list of {index, step, tool, verdict, notes}.
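
The extraction step might look roughly like the Python sketch below. The shape of a tool call is an assumption made purely for illustration: each call is taken to be a dict with `name` and `arguments` keys, stored either in a `tool_calls` list or in a single `tool_call` or `action` dict. The trace format Potato actually accepts may differ, and `extract_tool_calls` is a hypothetical helper.

```python
import json


def extract_tool_calls(steps):
    """Flatten every tool call in the trace into one blank review card each."""
    cards = []
    for step_idx, step in enumerate(steps):
        found = list(step.get("tool_calls") or [])
        for key in ("tool_call", "action"):
            if isinstance(step.get(key), dict):
                found.append(step[key])
        for call in found:
            cards.append({
                "index": len(cards),
                "step": step_idx,
                "tool": call.get("name", "unknown"),
                # Pretty-printed arguments for display on the card.
                "arguments": json.dumps(call.get("arguments", {}), indent=2, sort_keys=True),
                "verdict": None,  # e.g. correct / wrong_tool / wrong_args / wrong_order
                "notes": "",
            })
    return cards


steps = [
    {"content": "Look things up.", "tool_calls": [
        {"name": "search", "arguments": {"query": "potato annotation"}},
        {"name": "fetch", "arguments": {"url": "https://example.com"}},
    ]},
    {"content": "Thinking."},
    {"content": "Write the file.", "action": {"name": "write_file", "arguments": {"path": "out.txt"}}},
]
cards = extract_tool_calls(steps)
```

Steps without any call (the second one here) simply contribute no cards, so card indices and step indices diverge, which is why each card records both.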

Step-level MAST tagging

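You do not need a new schema to bind the 14-mode MAST failure taxonomy (Cemri et al., Why Do Multi-Agent LLM Systems Fail?, 2025) to the exact step where a failure occurred (and therefore to the acting agent). Configure the existing per-step trajectory_eval schema with the MAST modes as its error_types, grouped by the three MAST categories. Pair it with failure_attribution and handoff_review for full coverage.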

yaml
annotation_schemes:
  - annotation_type: trajectory_eval
    name: mast_steps
    description: "Tag each step with the MAST failure mode(s) it exhibits."
    steps_key: steps
    step_text_key: content
    error_types:
      - name: "Specification & System Design"
        subtypes: ["Disobey task specification", "Disobey role specification", "Step repetition", "Loss of conversation history", "Unaware of termination conditions"]
      - name: "Inter-Agent Misalignment"
        subtypes: ["Conversation reset", "Fail to ask for clarification", "Task derailment", "Information withholding", "Ignored other agent's input", "Reasoning-action mismatch"]
      - name: "Task Verification & Termination"
        subtypes: ["Premature termination", "No or incomplete verification", "Incorrect verification"]

Choosing the orchestration lens

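The orchestration architecture often dominates a run's outcome, so it is worth capturing as a first-class label. No new schema is needed: a radio confirms or corrects the run's pattern, which in turn guides the evaluation lens and the way the trace is laid out (sequential → swimlanes, hierarchical → tree, group chat → board).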

yaml
annotation_schemes:
  - annotation_type: radio
    name: orchestration_pattern
    description: "Which orchestration pattern does this run actually follow?"
    labels: [single_agent, sequential_pipeline, hierarchical_manager, group_chat, blackboard, debate, hub_and_spoke]
    has_free_response: true

Related

For implementation details, see the source documentation.