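Evaluating AI Agents: A Complete Guide to Human Annotation of Agent Traces

Automated metrics tell you how often an agent succeeds. Human evaluation tells you why it fails, where it goes wrong, and how to fix it. This guide walks you through setting up a complete human evaluation workflow for AI agents using Potato's agentic annotation features.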

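We will evaluate a ReAct-style agent that answers questions by searching the web. By the end of this guide, you will be able to:

Import agent traces into Potato
Configure the agent trace display
Set up trace-level and step-level annotation schemes
Run an annotation task
Export structured evaluation results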

Prerequisites

Install Potato 2.3.0 or later:

bash

pip install --upgrade potato-annotation

You also need agent traces to evaluate. This guide uses ReAct traces, but the same approach works with any of the 12 supported formats.

Step 1: Prepare Agent Traces

A ReAct trace is a sequence of Thought, Action, and Observation steps. Below is a minimal example. Create the file data/traces.jsonl:

json

{
  "trace_id": "trace_001",
  "task": "What is the capital of Australia and what is its population?",
  "trace": [
    {"type": "thought", "content": "I need to find the capital of Australia and its population. Let me search for this."},
    {"type": "action", "content": "search", "action_input": "capital of Australia"},
    {"type": "observation", "content": "Canberra is the capital city of Australia. It was chosen as the capital in 1908 as a compromise between Sydney and Melbourne."},
    {"type": "thought", "content": "The capital is Canberra. Now I need to find its population."},
    {"type": "action", "content": "search", "action_input": "Canberra population 2025"},
    {"type": "observation", "content": "Canberra has a population of approximately 467,000 as of 2025, making it the eighth-largest city in Australia."},
    {"type": "thought", "content": "I now have both pieces of information. The capital of Australia is Canberra with a population of approximately 467,000."},
    {"type": "action", "content": "finish", "action_input": "The capital of Australia is Canberra, with a population of approximately 467,000 as of 2025."}
  ],
  "ground_truth": "Canberra, approximately 467,000"
}

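Each line of the JSONL file is one complete agent trace, so the record above, shown pretty-printed for readability, must occupy a single line in the actual file. The trace field holds the step-by-step log. The task field is the task the agent was asked to complete.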
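Because a pretty-printed record is not valid JSONL, it can help to generate the file programmatically. The following is a minimal sketch using only the standard library. The helper names validate_trace and to_jsonl are hypothetical, and the "error" step type is an assumption based on the display colors used later in this guide; it is not a description of Potato's own validation.

```python
import json

REQUIRED_KEYS = {"trace_id", "task", "trace"}
# Step types seen in this guide; "error" is assumed from the display colors.
STEP_TYPES = {"thought", "action", "observation", "error"}


def validate_trace(record: dict) -> list:
    """Return a list of problems found in one trace record (empty if none)."""
    problems = []
    missing = REQUIRED_KEYS - set(record)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for i, step in enumerate(record.get("trace", [])):
        if step.get("type") not in STEP_TYPES:
            problems.append(f"step {i}: unexpected type {step.get('type')!r}")
        if "content" not in step:
            problems.append(f"step {i}: missing content")
    return problems


def to_jsonl(records: list) -> str:
    """Serialize records as JSONL: one compact JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


example = {
    "trace_id": "trace_001",
    "task": "What is the capital of Australia and what is its population?",
    "trace": [
        {"type": "thought", "content": "I need to find the capital of Australia."},
        {"type": "action", "content": "search", "action_input": "capital of Australia"},
        {"type": "observation", "content": "Canberra is the capital city of Australia."},
    ],
}

print(validate_trace(example))   # an empty list means no problems were found
jsonl_text = to_jsonl([example])  # write this string to data/traces.jsonl
```

Writing jsonl_text to data/traces.jsonl yields one trace per line, which is the layout described above.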

Notes on Trace Formats

OpenAI function-calling traces use a different format:

json

{
  "trace_id": "oai_001",
  "task": "Find cheap flights from NYC to London",
  "messages": [
    {"role": "user", "content": "Find cheap flights from NYC to London"},
    {"role": "assistant", "content": null, "tool_calls": [{"function": {"name": "search_flights", "arguments": "{\"from\": \"NYC\", \"to\": \"LHR\"}"}}]},
    {"role": "tool", "name": "search_flights", "content": "{\"flights\": [{\"airline\": \"BA\", \"price\": 450}, {\"airline\": \"AA\", \"price\": 520}]}"},
    {"role": "assistant", "content": "I found flights from NYC to London. The cheapest is British Airways at $450."}
  ]
}

Potato's converters handle these differences. You only need to specify the correct converter name.
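To make the idea of a converter concrete, here is a conceptual sketch of how OpenAI-style messages could be normalized into the thought/action/observation steps used by the ReAct example. This is an illustration only, not Potato's actual converter: the function name openai_messages_to_steps is hypothetical, and mapping assistant text to a "thought" step is an assumption made for this sketch.

```python
import json


def openai_messages_to_steps(messages: list) -> list:
    """Map OpenAI-style chat messages to a flat list of typed steps."""
    steps = []
    for msg in messages:
        role = msg.get("role")
        if role == "assistant":
            # Each tool call becomes an action step with parsed arguments.
            for call in msg.get("tool_calls") or []:
                fn = call["function"]
                steps.append({
                    "type": "action",
                    "content": fn["name"],
                    "action_input": json.loads(fn["arguments"]),
                })
            # Plain assistant text is treated as a thought (an assumption).
            if msg.get("content"):
                steps.append({"type": "thought", "content": msg["content"]})
        elif role == "tool":
            # Tool output becomes an observation step.
            steps.append({"type": "observation", "content": msg["content"]})
        # User messages carry the task itself and are skipped here.
    return steps


messages = [
    {"role": "user", "content": "Find cheap flights from NYC to London"},
    {"role": "assistant", "content": None, "tool_calls": [
        {"function": {"name": "search_flights",
                      "arguments": '{"from": "NYC", "to": "LHR"}'}}]},
    {"role": "tool", "name": "search_flights",
     "content": '{"flights": [{"airline": "BA", "price": 450}]}'},
    {"role": "assistant", "content": "The cheapest is British Airways at $450."},
]

for step in openai_messages_to_steps(messages):
    print(step["type"], "|", step["content"])
```

In practice you would not write this yourself; setting the converter name in the configuration is enough.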

Step 2: Create the Project Configuration

Create config.yaml:

yaml

annotation_task_name: "ReAct Agent Evaluation"
task_dir: "."
 
data_files:
  - "data/traces.jsonl"
 
item_properties:
  id_key: trace_id
  text_key: task
 
# --- Agentic annotation settings ---
agentic:
  enabled: true
  trace_converter: react
  display_type: agent_trace
 
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    collapse_threshold: 400
    show_step_numbers: true
    show_timestamps: false
    render_json: true
    syntax_highlight: true

This tells Potato to:

Load traces from data/traces.jsonl
Parse the trace field with the ReAct converter
Render each trace with the agent trace display, using color-coded step cards
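Before starting the server, a quick pre-flight check of the data file against item_properties can catch missing fields early. The sketch below assumes that every record should carry the id_key and text_key fields and that ids should be unique; the function name check_item_properties is hypothetical and is not part of Potato.

```python
import json
from collections import Counter


def check_item_properties(lines, id_key="trace_id", text_key="task"):
    """Check JSONL lines against the configured keys; return a list of problems."""
    problems = []
    ids = Counter()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        for key in (id_key, text_key):
            if key not in record:
                problems.append(f"line {lineno}: missing {key!r}")
        if id_key in record:
            ids[record[id_key]] += 1
    # Report every id that appears more than once.
    problems += [f"duplicate id {i!r}" for i, n in ids.items() if n > 1]
    return problems


sample_lines = [
    '{"trace_id": "trace_001", "task": "First task", "trace": []}',
    '{"trace_id": "trace_001", "trace": []}',
]
for problem in check_item_properties(sample_lines):
    print(problem)
```

To check the real file, pass an open file handle, for example check_item_properties(open("data/traces.jsonl")).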

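Step 3: Design the Annotation Schemes

Agent evaluation usually calls for both trace-level judgments (did the agent succeed?) and step-level judgments (was each step correct?). Let's add both.

Add the following to config.yaml: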

yaml

annotation_schemes:
  # --- Trace-level schemas ---
 
  # 1. Task success (the most important metric)
  - annotation_type: radio
    name: task_success
    description: "Did the agent successfully complete the task?"
    labels:
      - "Success"
      - "Partial Success"
      - "Failure"
    label_requirement:
      required: true
    sequential_key_binding: true
 
  # 2. Answer correctness (if the task has a ground truth)
  - annotation_type: radio
    name: answer_correctness
    description: "Is the agent's final answer factually correct?"
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
      - "Cannot Determine"
    label_requirement:
      required: true
 
  # 3. Efficiency rating
  - annotation_type: likert
    name: efficiency
    description: "Did the agent use an efficient path to the answer?"
    min: 1
    max: 5
    labels:
      1: "Very Inefficient (many unnecessary steps)"
      3: "Average"
      5: "Optimal (no wasted steps)"
 
  # 4. Free-text notes
  - annotation_type: text
    name: evaluator_notes
    description: "Any additional observations"
    label_requirement:
      required: false
 
  # --- Step-level schemas ---
 
  # 5. Per-step correctness
  - annotation_type: per_turn_rating
    name: step_correctness
    target: agentic_steps
    description: "Was this step correct and useful?"
    rating_type: radio
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
      - "Unnecessary"
 
  # 6. Per-step error type (only shown when step is not correct)
  - annotation_type: per_turn_rating
    name: error_type
    target: agentic_steps
    description: "What type of error occurred?"
    rating_type: multiselect
    labels:
      - "Wrong tool/action"
      - "Wrong arguments"
      - "Hallucinated information"
      - "Reasoning error"
      - "Redundant step"
      - "Premature termination"
      - "Other"
    conditional:
      show_when:
        step_correctness: ["Partially Correct", "Incorrect", "Unnecessary"]

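This scheme design gives you:

A three-way task-success label (Success, Partial Success, Failure) for high-level analysis
A correctness rating for the final answer
An efficiency score for comparing agent strategies
Per-step ratings that pinpoint where the agent goes wrong
A conditional error classification that appears only when a step has a problem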
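The conditional rule can be expressed as a small predicate, which is also handy for checking collected annotations for consistency. This sketch assumes step-level answers are stored as dictionaries keyed by step index (the same layout the error analysis later in this guide relies on); the function names are hypothetical.

```python
# Labels that trigger the error_type question, copied from show_when.
NEEDS_ERROR_TYPE = {"Partially Correct", "Incorrect", "Unnecessary"}


def should_show_error_type(step_label: str) -> bool:
    """Mirror the show_when rule: ask for an error type only on flagged steps."""
    return step_label in NEEDS_ERROR_TYPE


def find_inconsistent_steps(step_correctness: dict, error_type: dict) -> list:
    """Return (step index, reason) pairs where the two answers disagree."""
    issues = []
    for idx, label in step_correctness.items():
        has_errors = bool(error_type.get(idx))
        if should_show_error_type(label) and not has_errors:
            issues.append((idx, "flagged but no error type chosen"))
        elif not should_show_error_type(label) and has_errors:
            issues.append((idx, "error type set on a step rated Correct"))
    return issues


step_correctness = {"0": "Correct", "1": "Incorrect", "2": "Unnecessary"}
error_type = {"1": ["Reasoning error"]}
print(find_inconsistent_steps(step_correctness, error_type))
```

Since error_type is not marked as required in the config above, a flagged step without an error type is something to review rather than necessarily a mistake.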

Step 4: Configure Output and Start the Server

Add the output settings to config.yaml:

yaml

output_annotation_dir: "output/"
export_annotation_format: "jsonl"
 
# Optional: also export to Parquet for analysis
parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  compression: zstd

The complete config.yaml, for reference (some descriptions are abbreviated):

yaml

annotation_task_name: "ReAct Agent Evaluation"
task_dir: "."
 
data_files:
  - "data/traces.jsonl"
 
item_properties:
  id_key: trace_id
  text_key: task
 
agentic:
  enabled: true
  trace_converter: react
  display_type: agent_trace
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    collapse_threshold: 400
    show_step_numbers: true
    render_json: true
    syntax_highlight: true
 
annotation_schemes:
  - annotation_type: radio
    name: task_success
    description: "Did the agent successfully complete the task?"
    labels: ["Success", "Partial Success", "Failure"]
    label_requirement:
      required: true
    sequential_key_binding: true
 
  - annotation_type: radio
    name: answer_correctness
    description: "Is the agent's final answer factually correct?"
    labels: ["Correct", "Partially Correct", "Incorrect", "Cannot Determine"]
    label_requirement:
      required: true
 
  - annotation_type: likert
    name: efficiency
    description: "Did the agent use an efficient path?"
    min: 1
    max: 5
    labels:
      1: "Very Inefficient"
      3: "Average"
      5: "Optimal"
 
  - annotation_type: text
    name: evaluator_notes
    description: "Any additional observations"
    label_requirement:
      required: false
 
  - annotation_type: per_turn_rating
    name: step_correctness
    target: agentic_steps
    description: "Was this step correct?"
    rating_type: radio
    labels: ["Correct", "Partially Correct", "Incorrect", "Unnecessary"]
 
  - annotation_type: per_turn_rating
    name: error_type
    target: agentic_steps
    description: "Error type"
    rating_type: multiselect
    labels:
      - "Wrong tool/action"
      - "Wrong arguments"
      - "Hallucinated information"
      - "Reasoning error"
      - "Redundant step"
      - "Premature termination"
      - "Other"
    conditional:
      show_when:
        step_correctness: ["Partially Correct", "Incorrect", "Unnecessary"]
 
output_annotation_dir: "output/"
export_annotation_format: "jsonl"
 
parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  compression: zstd

Start the server:

bash

potato start config.yaml -p 8000

Open http://localhost:8000 in your browser.

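Step 5: The Annotation Workflow

When annotators open a trace, they see:

The task description (the original user query) at the top
Step cards showing the full agent trace, color-coded by type:
- Purple cards for thoughts/reasoning
- Blue cards for actions/tool calls
- Green cards for observations/results
- Red cards for errors
Per-step rating controls next to each step card
The trace-level annotation schemes below the trace display

A typical workflow:

Read the task description to understand what the agent was supposed to do
Walk through the trace step by step, rating each step
For any step rated "Partially Correct", "Incorrect", or "Unnecessary", select the error type(s)
Rate the trace as a whole (success, correctness, efficiency)
Add notes if needed
Submit and move on to the next trace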

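Tips for Annotators

Expand collapsed observations to verify that the agent processed the information correctly
Compare the final answer against the ground truth (if available) before rating task success
Rate "Unnecessary" steps separately from "Incorrect" ones: an unnecessary step wastes effort but does not introduce an error
Use the step timeline sidebar to jump to specific steps in long traces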

Step 6: Analyze the Results

Once annotation is complete, you can analyze the results programmatically.

Basic Analysis with pandas

python

import json

import pandas as pd

# Load annotations: one JSON object per line
with open("output/annotations.jsonl") as f:
    annotations = [json.loads(line) for line in f if line.strip()]

df = pd.DataFrame(annotations)

# Task success distribution. The "annotations" column holds dicts, which are
# unhashable and cannot serve as a groupby key, so extract the label first.
success_counts = df["annotations"].apply(lambda a: a.get("task_success")).value_counts()
print("Task Success Distribution:")
print(success_counts)

# Average efficiency rating (cast to float in case ratings are stored as strings)
efficiency_scores = [
    float(a["annotations"]["efficiency"])
    for a in annotations
    if "efficiency" in a["annotations"]
]
if efficiency_scores:
    print(f"\nAverage Efficiency: {sum(efficiency_scores) / len(efficiency_scores):.2f}")

Step-Level Error Analysis

python

# Collect all step-level errors
error_counts = {}
for ann in annotations:
    step_errors = ann["annotations"].get("error_type", {})
    for step_idx, errors in step_errors.items():
        for error in errors:
            error_counts[error] = error_counts.get(error, 0) + 1
 
print("Error Type Distribution:")
for error, count in sorted(error_counts.items(), key=lambda x: -x[1]):
    print(f"  {error}: {count}")
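Beyond counting error types, it is often useful to know where a trace first goes wrong. The sketch below assumes that step_correctness is stored as a dictionary keyed by step index, mirroring the error_type layout used above; the function names and the sample records are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Labels treated as a real problem; "Unnecessary" steps are left out on purpose
# because they waste effort without introducing an error.
PROBLEM_LABELS = {"Partially Correct", "Incorrect"}


def first_problem_step(step_correctness: dict):
    """Return the index of the earliest flagged step, or None if none is flagged."""
    flagged = [int(idx) for idx, label in step_correctness.items()
               if label in PROBLEM_LABELS]
    return min(flagged) if flagged else None


def first_problem_histogram(annotations: list) -> Counter:
    """Count, across traces, the step index at which the first problem appears."""
    hist = Counter()
    for ann in annotations:
        steps = ann["annotations"].get("step_correctness", {})
        hist[first_problem_step(steps)] += 1
    return hist


sample = [
    {"annotations": {"step_correctness": {"0": "Correct", "1": "Incorrect", "2": "Incorrect"}}},
    {"annotations": {"step_correctness": {"0": "Correct", "1": "Correct"}}},
    {"annotations": {"step_correctness": {"0": "Unnecessary", "1": "Partially Correct"}}},
]
print(first_problem_histogram(sample))
```

A histogram concentrated on early steps suggests planning errors, while late failures point more toward answer synthesis.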

Analysis with DuckDB (via Parquet)

python

import duckdb
 
# Overall success rate
result = duckdb.sql("""
    SELECT value, COUNT(*) as count
    FROM 'output/parquet/annotations.parquet'
    WHERE schema_name = 'task_success'
    GROUP BY value
    ORDER BY count DESC
""")
print(result)

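Step 7: Scale Up

For larger evaluation projects (hundreds or thousands of traces), consider the following configurations:

Multiple Annotators

Assign several annotators to each trace so you can compute inter-annotator agreement: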

yaml

annotation_task_config:
  total_annotations_per_instance: 3
  assignment_strategy: random
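With three annotators per trace, one common agreement statistic is Fleiss' kappa. The following is a minimal pure-Python sketch, assuming you have already gathered each trace's task_success labels into a list of equal length; the function name fleiss_kappa and the sample ratings are illustrative and not part of Potato.

```python
from collections import Counter


def fleiss_kappa(ratings: list) -> float:
    """Fleiss' kappa for items each rated by the same number of annotators.

    ratings: one list of labels per item, e.g. [["Success", "Success", "Failure"], ...]
    """
    n = len(ratings[0])  # annotators per item
    if n < 2 or any(len(item) != n for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    num_items = len(ratings)
    counts = [Counter(item) for item in ratings]
    categories = sorted({label for item in ratings for label in item})

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n) / (n * (n - 1)) for c in counts
    ]
    p_observed = sum(p_items) / num_items

    # Expected agreement from the overall label proportions.
    proportions = [sum(c[k] for c in counts) / (num_items * n) for k in categories]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        return 1.0  # only one label was ever used; kappa is undefined, treat as agreement
    return (p_observed - p_expected) / (1 - p_expected)


ratings = [
    ["Success", "Success", "Success"],
    ["Failure", "Failure", "Partial Success"],
    ["Success", "Partial Success", "Success"],
]
print(f"Fleiss' kappa: {fleiss_kappa(ratings):.3f}")
```

Values near 1 indicate strong agreement, and values near 0 indicate agreement no better than chance.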

Using Preset Schemes

For a quick setup, use Potato's preset agent evaluation schemes:

yaml

annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_error_taxonomy
  - preset: agent_efficiency

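Quality Control

Enable gold-standard instances for quality monitoring. The settings below require at least 4 correct answers out of 5 questions: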

yaml

phases:
  training:
    enabled: true
    data_file: "data/training_traces.jsonl"
    passing_criteria:
      min_correct: 4
      total_questions: 5
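The passing criteria amount to a simple threshold on how many gold-standard items an annotator labels correctly. The sketch below illustrates that logic under the assumption that an answer is correct when it exactly matches the gold label; how Potato scores answers internally is not covered in this guide, and the function name passes_training is hypothetical.

```python
def passes_training(answers: dict, gold: dict, min_correct: int = 4) -> bool:
    """Return True if enough answers match the gold labels exactly."""
    correct = sum(
        1 for item_id, label in gold.items() if answers.get(item_id) == label
    )
    return correct >= min_correct


# Five gold-standard traces with known task_success labels (illustrative data).
gold = {"t1": "Success", "t2": "Failure", "t3": "Success",
        "t4": "Partial Success", "t5": "Failure"}
answers = {"t1": "Success", "t2": "Failure", "t3": "Failure",
           "t4": "Partial Success", "t5": "Failure"}

print(passes_training(answers, gold))  # 4 of 5 match, so this annotator passes
```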

Adapting to Other Agent Types

OpenAI Function Calling

yaml

agentic:
  enabled: true
  trace_converter: openai
  display_type: agent_trace

Anthropic Tool Use

yaml

agentic:
  enabled: true
  trace_converter: anthropic
  display_type: agent_trace

Multi-Agent Systems (CrewAI/AutoGen)

yaml

agentic:
  enabled: true
  trace_converter: multi_agent
  display_type: agent_trace
  multi_agent:
    agent_converters:
      researcher: react
      writer: anthropic
      reviewer: openai

Web-Browsing Agents

For web agents, switch to the web agent display:

yaml

agentic:
  enabled: true
  trace_converter: webarena
  display_type: web_agent
  web_agent_display:
    screenshot_max_width: 900
    overlay:
      enabled: true
    filmstrip:
      enabled: true

See the dedicated guide on annotating web-browsing agents for details.

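Summary

Human evaluation of AI agents requires specialized tooling. Potato's agentic annotation system provides:

12 converters that normalize traces from different frameworks
3 display types, optimized for tool-use, web-browsing, and conversational agents respectively
Per-step ratings for step-level evaluation
9 preset schemes covering common evaluation dimensions
Parquet export for efficient downstream analysis

The key insight is that agent evaluation is not only about "did the agent get the right answer?" but also about "did the agent reason correctly at each step?" Step-level annotation reveals error patterns that aggregate metrics miss.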
