智能体标注

在 Potato 中评估 AI 智能体，配备 13 种 trace 格式转换器、5 种显示类型，以及面向工具使用、网页浏览、编码和聊天智能体的预置方案。包含 PRM 与评分量表评估。

v2.3.0 新增

AI 智能体越来越多地被部署用于复杂的多步骤任务：浏览网页、编写代码、调用 API 和编排子智能体。但评估智能体是否真正做对了需要人类在传统标注工具无法支持的细粒度上进行判断。单个智能体 trace 可能包含数十个步骤、工具调用、中间推理、截图和分支决策。标注人员需要看到所有这些上下文，高效地浏览它们，并在 trace 级别和单个步骤级别提供结构化评估。

Potato 的智能体标注系统通过四项能力来解决这个问题：

13 种 trace 格式转换器，将来自任何主流框架的智能体日志标准化为统一格式
5 种专用显示类型，分别为不同的智能体模态（工具使用、网页浏览、编码、聊天、实时观察）优化
9 个预置标注方案，覆盖最常见的智能体评估维度
4 种专门构建的标注类型，用于高级评估：轨迹评估、评分量表评估、成对比较和过程奖励标注

Trace 格式转换器

智能体 trace 的格式因框架不同而差异很大。Potato 提供 13 种转换器，将这些格式标准化为统一的内部表示。你在配置中指定转换器，或让 Potato 自动检测格式。

转换器参考

转换器	源格式	提取的关键字段
`openai`	OpenAI Assistants API / function calling logs	messages、tool_calls、function results
`anthropic`	Anthropic Claude tool_use / Messages API	content blocks、tool_use、tool_result
`swebench`	SWE-bench task traces	patch、test results、trajectory
`opentelemetry`	OpenTelemetry span exports (JSON)	spans、attributes、events、parent-child
`mcp`	Model Context Protocol sessions	tool definitions、call/response pairs
`multi_agent`	CrewAI / AutoGen / LangGraph multi-agent logs	agent roles、delegation、message passing
`langchain`	LangChain callback traces	chain runs、LLM calls、tool invocations
`langfuse`	LangFuse observation exports	generations、spans、scores
`react`	ReAct-style Thought/Action/Observation logs	thought、action、action_input、observation
`webarena`	WebArena / VisualWebArena trace JSON	actions、screenshots、DOM snapshots、URLs
`atif`	Agent Trace Interchange Format (ATIF)	steps、observations、metadata
`raw_web`	Raw browser recordings (HAR + screenshots)	requests、responses、screenshots、timings
`claude_code`	Claude Code / Aider / coding agents	tool_use blocks、diffs、terminal output

配置

在项目配置中指定转换器：

yaml

agentic:
  enabled: true
  trace_converter: react
  trace_file: "data/agent_traces.jsonl"

trace 文件中的每一行应该是一个包含原始智能体 trace 的 JSON 对象。转换器会处理其余部分。

对于不同智能体使用不同框架的多智能体 trace，你可以指定每个智能体的转换器：

yaml

agentic:
  enabled: true
  trace_converter: multi_agent
  trace_file: "data/multi_agent_traces.jsonl"
  multi_agent:
    agent_converters:
      planner: react
      coder: anthropic
      reviewer: openai

自动检测

如果不确定使用哪个转换器，设置 trace_converter: auto：

yaml

agentic:
  enabled: true
  trace_converter: auto
  trace_file: "data/traces.jsonl"

Potato 会检查前 10 个 trace，根据字段签名选择最匹配的转换器。如果置信度低于 80%，会记录警告日志，此时你应该显式指定转换器。

自定义转换器

如果你的智能体框架不在列表中，可以编写 Python 转换器：

python

# converters/my_converter.py
from potato.agentic.base_converter import BaseTraceConverter
 
class MyConverter(BaseTraceConverter):
    name = "my_framework"
 
    def convert(self, raw_trace: dict) -> dict:
        steps = []
        for entry in raw_trace["log"]:
            steps.append({
                "type": entry.get("kind", "action"),
                "content": entry["text"],
                "timestamp": entry.get("ts"),
                "metadata": entry.get("extra", {}),
            })
        return {"steps": steps}

在配置中注册：

yaml

agentic:
  trace_converter: custom
  custom_converter: "converters/my_converter.py:MyConverter"

显示类型

trace 转换完成后，Potato 使用五种专用显示类型之一进行渲染。每种都针对不同的智能体模态进行了优化。

1. Agent Trace 显示

使用工具的智能体（OpenAI 函数调用、Anthropic tool_use、ReAct、LangChain 等）的默认显示。它将每个步骤渲染为按步骤类型颜色编码的卡片。

yaml

agentic:
  enabled: true
  trace_converter: openai
  display_type: agent_trace
 
  agent_trace_display:
    # Color coding for step types
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
      system: "#6b7280"
 
    # Collapsible sections
    collapse_observations: true
    collapse_threshold: 500    # characters before auto-collapsing
 
    # Step numbering
    show_step_numbers: true
    show_timestamps: true
 
    # Tool call rendering
    render_json: true          # pretty-print JSON arguments
    syntax_highlight: true     # highlight code in observations

功能特性：

步骤卡片 带有颜色左边框指示类型（thought、action、observation、error）
可折叠部分 用于长观测结果或工具输出（可配置阈值）
JSON 格式化显示 用于工具调用参数和结构化响应
语法高亮 用于观测结果中的代码块
步骤时间线 侧边栏，一览显示完整 trace
跳转到步骤 导航功能，适用于长 trace

2. Web Agent Trace 显示

专为网页浏览智能体（WebArena、VisualWebArena、原始浏览器录制）构建。渲染截图并配有 SVG 覆盖层，显示智能体点击、输入或滚动的位置。

yaml

agentic:
  enabled: true
  trace_converter: webarena
  display_type: web_agent
 
  web_agent_display:
    # Screenshot rendering
    screenshot_max_width: 900
    screenshot_quality: 85
 
    # SVG overlay for agent actions
    overlay:
      enabled: true
      click_marker: "circle"       # circle, crosshair, or arrow
      click_color: "#ef4444"
      click_radius: 20
      type_highlight: "#3b82f6"    # highlight for text input fields
      scroll_indicator: true
 
    # Filmstrip view
    filmstrip:
      enabled: true
      thumbnail_width: 150
      show_action_labels: true
 
    # DOM snapshot display
    show_dom_snapshot: false        # optional raw DOM view
    show_url_bar: true
    show_action_description: true

功能特性：

截图画廊 支持全尺寸查看和缩放
SVG 覆盖层 显示点击目标（红色圆圈）、文本输入区域（蓝色高亮）和滚动方向
胶片条视图 在底部显示所有截图缩略图，用于快速导航
操作描述 文本显示在每个截图下方（例如，"点击 'Add to Cart' 按钮"）
URL 栏 显示每个步骤的当前页面 URL
前后对比 用于修改页面内容的步骤

3. 交互式聊天显示

用于评估对话智能体和聊天机器人。支持两种子模式：实时聊天 模式下标注人员与智能体实时交互，以及 trace 回顾 模式下标注人员评估已录制的对话。

yaml

agentic:
  enabled: true
  display_type: interactive_chat
 
  interactive_chat_display:
    mode: trace_review         # or "live_chat"
 
    # Trace review settings
    trace_review:
      show_system_prompt: false
      show_token_counts: true
      show_latency: true
      message_grouping: turn    # "turn" or "message"
 
    # Live chat settings (when mode: live_chat)
    live_chat:
      proxy: openai             # agent proxy to use
      max_turns: 20
      timeout_seconds: 60
      show_typing_indicator: true
      allow_regenerate: true
 
    # Common settings
    show_role_labels: true
    role_colors:
      user: "#3b82f6"
      assistant: "#6E56CF"
      system: "#6b7280"
      tool: "#22c55e"

Trace 回顾模式 渲染已录制的对话，可选显示每条消息的 token 计数和延迟。标注人员可以评价单个回合或整个对话。

实时聊天模式 通过智能体代理系统（见下文）将标注人员连接到运行中的智能体。标注人员与智能体对话，然后标注产生的对话。

4. 编码 Trace 显示

专为编码智能体会话（Claude Code、Aider、SWE-Agent）构建。渲染带语法高亮的代码 diff、深色块中的终端输出，以及带行号的文件读取。

yaml

agentic:
  enabled: true
  trace_converter: claude_code
  display_type: coding_trace
 
  coding_trace_display:
    diff_style: unified           # unified or split
    terminal_theme: dark
    show_file_tree: true
    collapse_long_output: true
    collapse_threshold: 50        # lines
    show_line_numbers: true
    syntax_highlight: true

功能特性：

统一 diff 视图 用红/绿高亮显示编辑操作
深色终端块 用于 bash/shell 命令输出
带行号的代码块 用于文件读取操作
文件树侧边栏 显示会话期间触及的所有文件
可折叠长输出 用于冗长的终端或文件内容

完整参考请见编码智能体标注。

5. 实时智能体显示

对 AI 智能体进行实时观察，并提供人工干预的控件。支持网页浏览智能体和编码智能体。

yaml

agentic:
  enabled: true
  display_type: live_agent

功能特性：

实时流式传输 通过 Server-Sent Events 传输智能体动作
暂停/恢复 在步骤之间暂停或恢复智能体
发送指令 在任务进行中重定向智能体
接管手动控制
回滚到任何先前的检查点（编码智能体使用基于 git 的检查点）
分支与重放 从任何检查点用不同指令重新执行

配置详情请见实时智能体评估和实时编码智能体。

高级标注类型

除了逐回合评分和预置方案外，Potato 还包含四种专门构建的标注类型，用于结构化的智能体评估。

轨迹评估 (`trajectory_eval`)

带分层错误分类体系和严重程度评分的逐步错误定位。每个步骤都会得到一个正确性评分、错误类型、严重程度等级和可选理由。一个连续评分计数器会根据严重程度递减。

yaml

annotation_schemes:
  - annotation_type: trajectory_eval
    name: step_eval
    error_taxonomy:
      reasoning:
        - logical_error
        - incorrect_assumption
      action:
        - wrong_tool
        - wrong_arguments
        - premature_termination
    severity_weights:
      minor: -1
      major: -5
      critical: -10

完整指南请见轨迹评估博客文章。

评分量表评估 (`rubric_eval`)

MT-Bench 风格的多准则网格评估。定义自定义准则和评分量表。标注人员独立地对每个准则评分。

yaml

annotation_schemes:
  - annotation_type: rubric_eval
    name: agent_rubric
    criteria:
      - name: correctness
        description: "Did the agent produce the correct result?"
      - name: efficiency
        description: "Did the agent take an efficient path?"
      - name: safety
        description: "Did the agent avoid unsafe actions?"
    scale: 5
    scale_labels:
      1: "Very Poor"
      3: "Acceptable"
      5: "Excellent"

设置说明请见评分量表评估教程。

成对比较

并排比较两个智能体 trace，提供三种模式：

二元：点击选择 A 或 B（可选平局）
量表：从"A 好得多"到"B 好得多"的滑块
多维度：每个维度独立的 A/B/平局，并要求填写理由

yaml

annotation_schemes:
  - annotation_type: pairwise
    name: agent_comparison
    mode: multi_dimension
    dimensions:
      - correctness
      - efficiency
      - safety
    require_justification: true
    allow_tie: true

三种模式的说明请见成对比较指南。

过程奖励标注

逐步二元正确性标注，针对训练过程奖励模型进行了优化。两种模式：首个错误（点击第一个错误步骤，其余自动标记）和逐步（独立评估每个步骤）。

yaml

annotation_schemes:
  - annotation_type: process_reward
    name: prm
    mode: first_error    # or per_step

完整参考请见过程奖励标注。

逐回合评分

对于对话和多步骤评估，你通常需要对单个回合进行评分，而不仅仅是（或除了）对整体 trace 评分。Potato 支持任何显示类型的逐回合标注。

yaml

annotation_schemes:
  # Overall trace rating
  - annotation_type: likert
    name: overall_quality
    description: "Rate the overall quality of this agent trace"
    min: 1
    max: 5
    labels:
      1: "Very Poor"
      5: "Excellent"
 
  # Per-turn ratings
  - annotation_type: per_turn_rating
    name: step_correctness
    description: "Was this step correct?"
    target: agentic_steps        # binds to trace steps
    rating_type: radio
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
      - "Unnecessary"
 
  - annotation_type: per_turn_rating
    name: step_explanation
    description: "Explain any issues with this step"
    target: agentic_steps
    rating_type: text
    conditional:
      show_when:
        step_correctness: ["Partially Correct", "Incorrect", "Unnecessary"]

逐回合评分内联显示在每个步骤卡片旁边。conditional 块让你仅在选择了特定评分时显示后续问题，保持界面整洁。

逐回合输出格式

逐回合标注以步骤索引保存：

json

{
  "id": "trace_042",
  "annotations": {
    "overall_quality": 3,
    "step_correctness": {
      "0": "Correct",
      "1": "Correct",
      "2": "Incorrect",
      "3": "Correct"
    },
    "step_explanation": {
      "2": "The agent searched for the wrong product name"
    }
  }
}

智能体代理系统

对于标注人员与智能体实时交互的在线评估任务，Potato 提供智能体代理层。代理位于标注界面和智能体后端之间，记录完整对话以供后续审查。

yaml

agentic:
  enabled: true
  display_type: interactive_chat
 
  agent_proxy:
    type: openai                 # openai, http, or echo
 
    # OpenAI proxy
    openai:
      model: "gpt-4o"
      api_key: ${OPENAI_API_KEY}
      system_prompt: "You are a helpful customer service agent."
      temperature: 0.7
      max_tokens: 1024

代理类型

OpenAI 代理 将消息转发到 OpenAI 兼容的 API：

yaml

agent_proxy:
  type: openai
  openai:
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
    system_prompt: "You are a helpful assistant."
    temperature: 0.7

HTTP 代理 将消息转发到任何 HTTP 端点（你自己的智能体服务器）：

yaml

agent_proxy:
  type: http
  http:
    url: "https://my-agent.example.com/chat"
    method: POST
    headers:
      Authorization: "Bearer ${AGENT_API_KEY}"
    request_template:
      messages: "{{messages}}"
      session_id: "{{session_id}}"
    response_path: "response.content"
    timeout_seconds: 30

Echo 代理 将用户的消息回显（用于测试和 UI 开发）：

yaml

agent_proxy:
  type: echo
  echo:
    prefix: "[Echo] "
    delay_ms: 500

预置标注方案

Potato 提供 9 个专为智能体评估设计的标注方案。可直接使用或作为自定义方案的起点。

方案	类型	描述
`agent_task_success`	radio	二元成功/失败，带部分成功选项
`agent_step_correctness`	per_turn_rating (radio)	逐步正确/不正确/不必要评分
`agent_error_taxonomy`	per_turn_rating (multiselect)	12 类错误分类（错误工具、幻觉、循环等）
`agent_safety`	radio + text	安全违规检测，带严重程度等级
`agent_efficiency`	likert	评估智能体是否使用了高效路径
`agent_instruction_following`	likert	评估对原始用户指令的遵循程度
`agent_explanation_quality`	likert	评估智能体推理/解释的质量
`agent_web_action_correctness`	per_turn_rating (radio)	逐步网页操作评估（正确目标、正确操作类型）
`agent_conversation_quality`	multirate	多维度聊天质量（有用性、准确性、语气、安全性）

按名称加载预置方案：

yaml

annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_error_taxonomy

或将预置方案与自定义方案组合：

yaml

annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
 
  # Custom schema alongside presets
  - annotation_type: text
    name: evaluator_notes
    description: "Any additional observations about this agent trace"
    label_requirement:
      required: false

完整示例：评估 ReAct 智能体

以下是评估 ReAct 风格智能体 trace 并带逐步评分的完整配置：

yaml

# project config
task_name: "ReAct Agent Evaluation"
task_dir: "."
 
data_files:
  - "data/react_traces.jsonl"
 
item_properties:
  id_key: trace_id
  text_key: task_description
 
agentic:
  enabled: true
  trace_converter: react
  display_type: agent_trace
 
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    collapse_threshold: 300
    show_step_numbers: true
    render_json: true
 
annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_efficiency
 
  - annotation_type: text
    name: failure_reason
    description: "If the agent failed, describe what went wrong"
    label_requirement:
      required: false
 
output_annotation_dir: "output/"
output_annotation_format: "jsonl"

示例输入数据 (data/react_traces.jsonl)：

json

{
  "trace_id": "react_001",
  "task_description": "Find the population of Tokyo and compare it to New York City",
  "trace": [
    {"type": "thought", "content": "I need to find the population of both cities. Let me search for Tokyo first."},
    {"type": "action", "content": "search", "action_input": "Tokyo population 2024"},
    {"type": "observation", "content": "Tokyo has a population of approximately 13.96 million in the city proper..."},
    {"type": "thought", "content": "Now I need to find New York City's population."},
    {"type": "action", "content": "search", "action_input": "New York City population 2024"},
    {"type": "observation", "content": "New York City has a population of approximately 8.34 million..."},
    {"type": "thought", "content": "Tokyo (13.96M) has about 67% more people than NYC (8.34M)."},
    {"type": "action", "content": "finish", "action_input": "Tokyo has ~13.96 million people vs NYC's ~8.34 million, making Tokyo about 67% larger by population."}
  ]
}

启动服务器：

bash

potato start config.yaml -p 8000

智能体标注

Trace 格式转换器

转换器参考

配置

自动检测

自定义转换器

显示类型

1. Agent Trace 显示

2. Web Agent Trace 显示

3. 交互式聊天显示

4. 编码 Trace 显示

5. 实时智能体显示

高级标注类型

轨迹评估 (trajectory_eval)

评分量表评估 (rubric_eval)

成对比较

过程奖励标注

逐回合评分

逐回合输出格式

智能体代理系统

代理类型

预置标注方案

完整示例：评估 ReAct 智能体

延伸阅读

轨迹评估 (`trajectory_eval`)

评分量表评估 (`rubric_eval`)