Skip to content

智能体标注

使用专用 trace 显示、12 种格式转换器和专门构建的标注方案来评估 AI 智能体。

智能体标注

v2.3.0 新增

AI 智能体越来越多地被部署用于复杂的多步骤任务:浏览网页、编写代码、调用 API 和编排子智能体。但评估智能体是否真正做对了需要人类在传统标注工具无法支持的细粒度上进行判断。单个智能体 trace 可能包含数十个步骤、工具调用、中间推理、截图和分支决策。标注人员需要看到所有这些上下文,高效地浏览它们,并在 trace 级别和单个步骤级别提供结构化评估。

Potato 的智能体标注系统通过三个能力来解决这个问题:

  1. 12 种 trace 格式转换器,将来自任何主流框架的智能体日志标准化为统一格式
  2. 3 种专用显示类型,分别为不同的智能体模态(工具使用、网页浏览、聊天)优化
  3. 9 个预置标注方案,覆盖最常见的智能体评估维度

Trace 格式转换器

智能体 trace 的格式因框架不同而差异很大。Potato 提供 12 种转换器,将这些格式标准化为统一的内部表示。你在配置中指定转换器,或让 Potato 自动检测格式。

转换器参考

转换器源格式提取的关键字段
openaiOpenAI Assistants API / 函数调用日志messages、tool_calls、function results
anthropicAnthropic Claude tool_use / Messages APIcontent blocks、tool_use、tool_result
swebenchSWE-bench 任务 tracepatch、test results、trajectory
opentelemetryOpenTelemetry span 导出 (JSON)spans、attributes、events、parent-child
mcpModel Context Protocol 会话tool definitions、call/response pairs
multi_agentCrewAI / AutoGen / LangGraph 多智能体日志agent roles、delegation、message passing
langchainLangChain 回调 tracechain runs、LLM calls、tool invocations
langfuseLangFuse 观测导出generations、spans、scores
reactReAct 风格 Thought/Action/Observation 日志thought、action、action_input、observation
webarenaWebArena / VisualWebArena trace JSONactions、screenshots、DOM snapshots、URLs
atifAgent Trace Interchange Format (ATIF)steps、observations、metadata
raw_web原始浏览器录制 (HAR + 截图)requests、responses、screenshots、timings

配置

在项目配置中指定转换器:

yaml
agentic:
  enabled: true
  trace_converter: react
  trace_file: "data/agent_traces.jsonl"

trace 文件中的每一行应该是一个包含原始智能体 trace 的 JSON 对象。转换器会处理其余部分。

对于不同智能体使用不同框架的多智能体 trace,你可以指定每个智能体的转换器:

yaml
agentic:
  enabled: true
  trace_converter: multi_agent
  trace_file: "data/multi_agent_traces.jsonl"
  multi_agent:
    agent_converters:
      planner: react
      coder: anthropic
      reviewer: openai

自动检测

如果不确定使用哪个转换器,设置 trace_converter: auto

yaml
agentic:
  enabled: true
  trace_converter: auto
  trace_file: "data/traces.jsonl"

Potato 会检查前 10 个 trace,根据字段签名选择最匹配的转换器。如果置信度低于 80%,会记录警告日志,此时你应该显式指定转换器。

自定义转换器

如果你的智能体框架不在列表中,可以编写 Python 转换器:

python
# converters/my_converter.py
from potato.agentic.base_converter import BaseTraceConverter
 
class MyConverter(BaseTraceConverter):
    name = "my_framework"
 
    def convert(self, raw_trace: dict) -> dict:
        steps = []
        for entry in raw_trace["log"]:
            steps.append({
                "type": entry.get("kind", "action"),
                "content": entry["text"],
                "timestamp": entry.get("ts"),
                "metadata": entry.get("extra", {}),
            })
        return {"steps": steps}

在配置中注册:

yaml
agentic:
  trace_converter: custom
  custom_converter: "converters/my_converter.py:MyConverter"

显示类型

trace 转换完成后,Potato 使用三种专用显示类型之一进行渲染。每种都针对不同的智能体模态进行了优化。

1. Agent Trace 显示

使用工具的智能体(OpenAI 函数调用、Anthropic tool_use、ReAct、LangChain 等)的默认显示。它将每个步骤渲染为按步骤类型颜色编码的卡片。

yaml
agentic:
  enabled: true
  trace_converter: openai
  display_type: agent_trace
 
  agent_trace_display:
    # Color coding for step types
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
      system: "#6b7280"
 
    # Collapsible sections
    collapse_observations: true
    collapse_threshold: 500    # characters before auto-collapsing
 
    # Step numbering
    show_step_numbers: true
    show_timestamps: true
 
    # Tool call rendering
    render_json: true          # pretty-print JSON arguments
    syntax_highlight: true     # highlight code in observations

功能特性:

  • 步骤卡片 带有颜色左边框指示类型(thought、action、observation、error)
  • 可折叠部分 用于长观测结果或工具输出(可配置阈值)
  • JSON 格式化显示 用于工具调用参数和结构化响应
  • 语法高亮 用于观测结果中的代码块
  • 步骤时间线 侧边栏,一览显示完整 trace
  • 跳转到步骤 导航功能,适用于长 trace

2. Web Agent Trace 显示

专为网页浏览智能体(WebArena、VisualWebArena、原始浏览器录制)构建。渲染截图并配有 SVG 覆盖层,显示智能体点击、输入或滚动的位置。

yaml
agentic:
  enabled: true
  trace_converter: webarena
  display_type: web_agent
 
  web_agent_display:
    # Screenshot rendering
    screenshot_max_width: 900
    screenshot_quality: 85
 
    # SVG overlay for agent actions
    overlay:
      enabled: true
      click_marker: "circle"       # circle, crosshair, or arrow
      click_color: "#ef4444"
      click_radius: 20
      type_highlight: "#3b82f6"    # highlight for text input fields
      scroll_indicator: true
 
    # Filmstrip view
    filmstrip:
      enabled: true
      thumbnail_width: 150
      show_action_labels: true
 
    # DOM snapshot display
    show_dom_snapshot: false        # optional raw DOM view
    show_url_bar: true
    show_action_description: true

功能特性:

  • 截图画廊 支持全尺寸查看和缩放
  • SVG 覆盖层 显示点击目标(红色圆圈)、文本输入区域(蓝色高亮)和滚动方向
  • 胶片条视图 在底部显示所有截图缩略图,用于快速导航
  • 操作描述 文本显示在每个截图下方(例如,"点击 'Add to Cart' 按钮")
  • URL 栏 显示每个步骤的当前页面 URL
  • 前后对比 用于修改页面内容的步骤

3. 交互式聊天显示

用于评估对话智能体和聊天机器人。支持两种子模式:实时聊天 模式下标注人员与智能体实时交互,以及 trace 回顾 模式下标注人员评估已录制的对话。

yaml
agentic:
  enabled: true
  display_type: interactive_chat
 
  interactive_chat_display:
    mode: trace_review         # or "live_chat"
 
    # Trace review settings
    trace_review:
      show_system_prompt: false
      show_token_counts: true
      show_latency: true
      message_grouping: turn    # "turn" or "message"
 
    # Live chat settings (when mode: live_chat)
    live_chat:
      proxy: openai             # agent proxy to use
      max_turns: 20
      timeout_seconds: 60
      show_typing_indicator: true
      allow_regenerate: true
 
    # Common settings
    show_role_labels: true
    role_colors:
      user: "#3b82f6"
      assistant: "#6E56CF"
      system: "#6b7280"
      tool: "#22c55e"

Trace 回顾模式 渲染已录制的对话,可选显示每条消息的 token 计数和延迟。标注人员可以评价单个回合或整个对话。

实时聊天模式 通过智能体代理系统(见下文)将标注人员连接到运行中的智能体。标注人员与智能体对话,然后标注产生的对话。


逐步评分

对于对话和多步骤评估,你通常需要对单个回合进行评分,而不仅仅是(或除了)对整体 trace 评分。Potato 支持任何显示类型的逐步标注。

yaml
annotation_schemes:
  # Overall trace rating
  - annotation_type: likert
    name: overall_quality
    description: "Rate the overall quality of this agent trace"
    min: 1
    max: 5
    labels:
      1: "Very Poor"
      5: "Excellent"
 
  # Per-turn ratings
  - annotation_type: per_turn_rating
    name: step_correctness
    description: "Was this step correct?"
    target: agentic_steps        # binds to trace steps
    rating_type: radio
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
      - "Unnecessary"
 
  - annotation_type: per_turn_rating
    name: step_explanation
    description: "Explain any issues with this step"
    target: agentic_steps
    rating_type: text
    conditional:
      show_when:
        step_correctness: ["Partially Correct", "Incorrect", "Unnecessary"]

逐步评分内联显示在每个步骤卡片旁边。conditional 块让你仅在选择了特定评分时显示后续问题,保持界面整洁。

逐步输出格式

逐步标注以步骤索引保存:

json
{
  "id": "trace_042",
  "annotations": {
    "overall_quality": 3,
    "step_correctness": {
      "0": "Correct",
      "1": "Correct",
      "2": "Incorrect",
      "3": "Correct"
    },
    "step_explanation": {
      "2": "The agent searched for the wrong product name"
    }
  }
}

智能体代理系统

对于标注人员与智能体实时交互的在线评估任务,Potato 提供智能体代理层。代理位于标注界面和智能体后端之间,记录完整对话以供后续审查。

yaml
agentic:
  enabled: true
  display_type: interactive_chat
 
  agent_proxy:
    type: openai                 # openai, http, or echo
 
    # OpenAI proxy
    openai:
      model: "gpt-4o"
      api_key: ${OPENAI_API_KEY}
      system_prompt: "You are a helpful customer service agent."
      temperature: 0.7
      max_tokens: 1024

代理类型

OpenAI 代理 将消息转发到 OpenAI 兼容的 API:

yaml
agent_proxy:
  type: openai
  openai:
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
    system_prompt: "You are a helpful assistant."
    temperature: 0.7

HTTP 代理 将消息转发到任何 HTTP 端点(你自己的智能体服务器):

yaml
agent_proxy:
  type: http
  http:
    url: "https://my-agent.example.com/chat"
    method: POST
    headers:
      Authorization: "Bearer ${AGENT_API_KEY}"
    request_template:
      messages: "{{messages}}"
      session_id: "{{session_id}}"
    response_path: "response.content"
    timeout_seconds: 30

Echo 代理 将用户的消息回显(用于测试和 UI 开发):

yaml
agent_proxy:
  type: echo
  echo:
    prefix: "[Echo] "
    delay_ms: 500

预置标注方案

Potato 提供 9 个专为智能体评估设计的标注方案。可直接使用或作为自定义方案的起点。

方案类型描述
agent_task_successradio二元成功/失败,带部分成功选项
agent_step_correctnessper_turn_rating (radio)逐步正确/不正确/不必要评分
agent_error_taxonomyper_turn_rating (multiselect)12 类错误分类(错误工具、幻觉、循环等)
agent_safetyradio + text安全违规检测,带严重程度等级
agent_efficiencylikert评估智能体是否使用了高效路径
agent_instruction_followinglikert评估对原始用户指令的遵循程度
agent_explanation_qualitylikert评估智能体推理/解释的质量
agent_web_action_correctnessper_turn_rating (radio)逐步网页操作评估(正确目标、正确操作类型)
agent_conversation_qualitymultirate多维度聊天质量(有用性、准确性、语气、安全性)

按名称加载预置方案:

yaml
annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_error_taxonomy

或将预置方案与自定义方案组合:

yaml
annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
 
  # Custom schema alongside presets
  - annotation_type: text
    name: evaluator_notes
    description: "Any additional observations about this agent trace"
    label_requirement:
      required: false

完整示例:评估 ReAct 智能体

以下是评估 ReAct 风格智能体 trace 并带逐步评分的完整配置:

yaml
# project config
task_name: "ReAct Agent Evaluation"
task_dir: "."
 
data_files:
  - "data/react_traces.jsonl"
 
item_properties:
  id_key: trace_id
  text_key: task_description
 
agentic:
  enabled: true
  trace_converter: react
  display_type: agent_trace
 
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    collapse_threshold: 300
    show_step_numbers: true
    render_json: true
 
annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_efficiency
 
  - annotation_type: text
    name: failure_reason
    description: "If the agent failed, describe what went wrong"
    label_requirement:
      required: false
 
output_annotation_dir: "output/"
output_annotation_format: "jsonl"

示例输入数据 (data/react_traces.jsonl):

json
{
  "trace_id": "react_001",
  "task_description": "Find the population of Tokyo and compare it to New York City",
  "trace": [
    {"type": "thought", "content": "I need to find the population of both cities. Let me search for Tokyo first."},
    {"type": "action", "content": "search", "action_input": "Tokyo population 2024"},
    {"type": "observation", "content": "Tokyo has a population of approximately 13.96 million in the city proper..."},
    {"type": "thought", "content": "Now I need to find New York City's population."},
    {"type": "action", "content": "search", "action_input": "New York City population 2024"},
    {"type": "observation", "content": "New York City has a population of approximately 8.34 million..."},
    {"type": "thought", "content": "Tokyo (13.96M) has about 67% more people than NYC (8.34M)."},
    {"type": "action", "content": "finish", "action_input": "Tokyo has ~13.96 million people vs NYC's ~8.34 million, making Tokyo about 67% larger by population."}
  ]
}

启动服务器:

bash
potato start config.yaml -p 8000

延伸阅读

有关实现详情,请参阅源文档