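Evaluating AI Agents: A Complete Guide to Human Annotation of Agent Traces
Learn how to set up human evaluation of AI agent outputs with Potato's agentic annotation features, from trace import to annotation scheme design.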
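Automated metrics tell you how often an agent succeeds. Human evaluation tells you why it fails, where it goes wrong, and how to fix it. This guide walks you through setting up a complete human evaluation pipeline for AI agents using Potato's agentic annotation features.
We will evaluate a ReAct-style agent that answers questions by searching the web. By the end of this guide, you will be able to:
- Import agent traces into Potato
- Configure the agent trace display
- Set up trace-level and step-level annotation schemes
- Run the annotation task
- Export structured evaluation results
Prerequisites
Install Potato 2.3.0 or later: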
pip install --upgrade potato-annotation
You also need agent traces to evaluate. This guide uses ReAct traces, but the same approach applies to any of the 12 supported formats.
Step 1: Prepare Agent Traces
A ReAct trace is a sequence of Thought, Action, and Observation steps. Below is a minimal example. Create the file data/traces.jsonl:
{
  "trace_id": "trace_001",
  "task": "What is the capital of Australia and what is its population?",
  "trace": [
    {"type": "thought", "content": "I need to find the capital of Australia and its population. Let me search for this."},
    {"type": "action", "content": "search", "action_input": "capital of Australia"},
    {"type": "observation", "content": "Canberra is the capital city of Australia. It was chosen as the capital in 1908 as a compromise between Sydney and Melbourne."},
    {"type": "thought", "content": "The capital is Canberra. Now I need to find its population."},
    {"type": "action", "content": "search", "action_input": "Canberra population 2025"},
    {"type": "observation", "content": "Canberra has a population of approximately 467,000 as of 2025, making it the eighth-largest city in Australia."},
    {"type": "thought", "content": "I now have both pieces of information. The capital of Australia is Canberra with a population of approximately 467,000."},
    {"type": "action", "content": "finish", "action_input": "The capital of Australia is Canberra, with a population of approximately 467,000 as of 2025."}
  ],
  "ground_truth": "Canberra, approximately 467,000"
}
Each line of the JSONL file is one complete agent trace. The record is spread over several lines here for readability; in the file itself it must occupy a single line. The trace field contains the step-by-step log, and the task field is the task the agent was asked to complete.
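Since the example record is displayed across several lines, it can help to generate the file programmatically so that every record lands on exactly one line. The following is a minimal sketch using only the Python standard library; `write_traces` and `read_traces` are hypothetical helper names for illustration, not part of Potato, and the example trace is shortened.

```python
import json
from pathlib import Path


def write_traces(traces, path):
    """Write one compact JSON object per line (the JSONL convention)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for trace in traces:
            # json.dumps without `indent` never emits a raw newline,
            # so each record occupies exactly one line of the file.
            f.write(json.dumps(trace, ensure_ascii=False) + "\n")


def read_traces(path):
    """Read the file back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Shortened version of the example trace, for illustration only.
example = {
    "trace_id": "trace_001",
    "task": "What is the capital of Australia and what is its population?",
    "trace": [
        {"type": "thought", "content": "I need to find the capital of Australia."},
        {"type": "action", "content": "search", "action_input": "capital of Australia"},
        {"type": "observation", "content": "Canberra is the capital city of Australia."},
    ],
    "ground_truth": "Canberra, approximately 467,000",
}
```

Calling `write_traces([example], "data/traces.jsonl")` would produce a file with the layout this step describes, and `read_traces` gives a quick round-trip check before the file is handed to Potato.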
Trace Format Notes
For OpenAI function-calling traces, the format differs:
{
  "trace_id": "oai_001",
  "task": "Find cheap flights from NYC to London",
  "messages": [
    {"role": "user", "content": "Find cheap flights from NYC to London"},
    {"role": "assistant", "content": null, "tool_calls": [{"function": {"name": "search_flights", "arguments": "{\"from\": \"NYC\", \"to\": \"LHR\"}"}}]},
    {"role": "tool", "name": "search_flights", "content": "{\"flights\": [{\"airline\": \"BA\", \"price\": 450}, {\"airline\": \"AA\", \"price\": 520}]}"},
    {"role": "assistant", "content": "I found flights from NYC to London. The cheapest is British Airways at $450."}
  ]
}
Potato's converters handle these differences. You only need to specify the correct converter name.
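To make the difference between the two formats concrete, here is a rough sketch of the kind of mapping a converter has to perform: chat messages in, thought/action/observation steps out. This is an illustration only and is not Potato's actual converter; `openai_messages_to_steps` is a hypothetical name, and treating plain assistant text as a "thought" step is an assumption made for the example.

```python
import json


def openai_messages_to_steps(messages):
    """Flatten chat messages into thought/action/observation-style steps."""
    steps = []
    for msg in messages:
        role = msg.get("role")
        if role == "assistant":
            # Each tool call becomes an action step.
            for call in msg.get("tool_calls") or []:
                fn = call["function"]
                steps.append({
                    "type": "action",
                    "content": fn["name"],
                    # Arguments arrive as a JSON-encoded string.
                    "action_input": json.loads(fn["arguments"]),
                })
            # Plain assistant text is treated as reasoning or the final answer
            # (an assumption of this sketch).
            if msg.get("content"):
                steps.append({"type": "thought", "content": msg["content"]})
        elif role == "tool":
            steps.append({"type": "observation", "content": msg.get("content", "")})
        # user/system messages describe the task rather than agent steps.
    return steps


messages = [
    {"role": "user", "content": "Find cheap flights from NYC to London"},
    {"role": "assistant", "content": None, "tool_calls": [
        {"function": {"name": "search_flights",
                      "arguments": "{\"from\": \"NYC\", \"to\": \"LHR\"}"}}
    ]},
    {"role": "tool", "name": "search_flights",
     "content": "{\"flights\": [{\"airline\": \"BA\", \"price\": 450}]}"},
    {"role": "assistant", "content": "The cheapest is British Airways at $450."},
]

for step in openai_messages_to_steps(messages):
    print(step["type"], "->", step["content"])
```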
Step 2: Create the Project Configuration
Create config.yaml:
task_name: "ReAct Agent Evaluation"
task_dir: "."
data_files:
  - "data/traces.jsonl"
item_properties:
  id_key: trace_id
  text_key: task
# --- Agentic annotation settings ---
agentic:
  enabled: true
  trace_converter: react
  display_type: agent_trace
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    collapse_threshold: 400
    show_step_numbers: true
    show_timestamps: false
    render_json: true
    syntax_highlight: true
This tells Potato to:
- Load traces from data/traces.jsonl
- Parse the trace field using the ReAct converter
- Render each trace in the agent trace display, with color-coded step cards
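Before starting the server, it can be worth checking that every record in the data file actually carries the keys the configuration points at. The sketch below is a hypothetical pre-flight check, not a Potato feature: `check_records` is an invented helper, and its default key names simply mirror the `id_key` and `text_key` values from the configuration plus the `trace` field of the ReAct example.

```python
import json


def check_records(lines, id_key="trace_id", text_key="task", trace_key="trace"):
    """Return a list of human-readable problems found in JSONL lines."""
    problems = []
    seen_ids = set()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        for key in (id_key, text_key, trace_key):
            if key not in record:
                problems.append(f"line {lineno}: missing '{key}'")
        rec_id = record.get(id_key)
        if rec_id is not None:
            if rec_id in seen_ids:
                problems.append(f"line {lineno}: duplicate id '{rec_id}'")
            seen_ids.add(rec_id)
    return problems


sample_lines = [
    '{"trace_id": "a", "task": "t", "trace": []}',
    '{"trace_id": "a", "task": "t"}',
    '{not json',
]
for problem in check_records(sample_lines):
    print(problem)
```

To run it against the real file, pass an open file handle, for example `check_records(open("data/traces.jsonl", encoding="utf-8"))`.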
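Step 3: Design the Annotation Scheme
Agent evaluation typically calls for both trace-level judgments (did the agent succeed?) and step-level judgments (was each step correct?). Let's add both.
Add the following to config.yaml: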
annotation_schemes:
  # --- Trace-level schemas ---
  # 1. Task success (the most important metric)
  - annotation_type: radio
    name: task_success
    description: "Did the agent successfully complete the task?"
    labels:
      - "Success"
      - "Partial Success"
      - "Failure"
    label_requirement:
      required: true
    sequential_key_binding: true
  # 2. Answer correctness (if the task has a ground truth)
  - annotation_type: radio
    name: answer_correctness
    description: "Is the agent's final answer factually correct?"
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
      - "Cannot Determine"
    label_requirement:
      required: true
  # 3. Efficiency rating
  - annotation_type: likert
    name: efficiency
    description: "Did the agent use an efficient path to the answer?"
    min: 1
    max: 5
    labels:
      1: "Very Inefficient (many unnecessary steps)"
      3: "Average"
      5: "Optimal (no wasted steps)"
  # 4. Free-text notes
  - annotation_type: text
    name: evaluator_notes
    description: "Any additional observations"
    label_requirement:
      required: false
  # --- Step-level schemas ---
  # 5. Per-step correctness
  - annotation_type: per_turn_rating
    name: step_correctness
    target: agentic_steps
    description: "Was this step correct and useful?"
    rating_type: radio
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
      - "Unnecessary"
  # 6. Per-step error type (only shown when step is not correct)
  - annotation_type: per_turn_rating
    name: error_type
    target: agentic_steps
    description: "What type of error occurred?"
    rating_type: multiselect
    labels:
      - "Wrong tool/action"
      - "Wrong arguments"
      - "Hallucinated information"
      - "Reasoning error"
      - "Redundant step"
      - "Premature termination"
      - "Other"
    conditional:
      show_when:
        step_correctness: ["Partially Correct", "Incorrect", "Unnecessary"]
This scheme design gives you:
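- A three-way task success label (Success, Partial Success, Failure) for high-level analysis
- A correctness rating for the final answer
- An efficiency score for comparing agent strategies
- Per-step ratings that pinpoint exactly where the agent went wrong
- A conditional error classification that appears only when a step has a problem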
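The conditional rule in scheme 6 can be read as a small predicate: the error-type question applies only to steps whose correctness label is in the `show_when` list. The sketch below restates that rule in Python and uses it to flag inconsistent step annotations after export. It assumes per-step answers are stored as a mapping from step index to value, which is the layout the analysis code in Step 6 relies on for `error_type`; the helper names are hypothetical.

```python
# Labels that trigger the error_type question, mirroring show_when in the config.
FLAGGED = {"Partially Correct", "Incorrect", "Unnecessary"}


def shows_error_type(step_label):
    """True if the error_type question applies to a step with this label."""
    return step_label in FLAGGED


def find_inconsistencies(step_correctness, error_type):
    """Compare two {step_index: value} mappings and list mismatches."""
    issues = []
    for step, errors in error_type.items():
        if errors and not shows_error_type(step_correctness.get(step)):
            issues.append((step, "error type recorded for a step not flagged as problematic"))
    for step, label in step_correctness.items():
        if shows_error_type(label) and not error_type.get(step):
            issues.append((step, "flagged step has no error type"))
    return issues


step_correctness = {"0": "Correct", "1": "Incorrect", "2": "Unnecessary"}
error_type = {"1": ["Reasoning error"], "0": ["Other"]}
for step, message in find_inconsistencies(step_correctness, error_type):
    print(f"step {step}: {message}")
```

Whether a flagged step without an error type should count as a problem depends on whether the error-type question is treated as mandatory in your project; the configuration shown in this step does not mark it as required.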
Step 4: Configure Output and Start the Server
Add the output settings to config.yaml:
output_annotation_dir: "output/"
output_annotation_format: "jsonl"
# Optional: also export to Parquet for analysis
parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  compression: zstd
The complete config.yaml for reference:
task_name: "ReAct Agent Evaluation"
task_dir: "."
data_files:
  - "data/traces.jsonl"
item_properties:
  id_key: trace_id
  text_key: task
agentic:
  enabled: true
  trace_converter: react
  display_type: agent_trace
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    collapse_threshold: 400
    show_step_numbers: true
    show_timestamps: false
    render_json: true
    syntax_highlight: true
annotation_schemes:
  - annotation_type: radio
    name: task_success
    description: "Did the agent successfully complete the task?"
    labels: ["Success", "Partial Success", "Failure"]
    label_requirement:
      required: true
    sequential_key_binding: true
  - annotation_type: radio
    name: answer_correctness
    description: "Is the agent's final answer factually correct?"
    labels: ["Correct", "Partially Correct", "Incorrect", "Cannot Determine"]
    label_requirement:
      required: true
  - annotation_type: likert
    name: efficiency
    description: "Did the agent use an efficient path to the answer?"
    min: 1
    max: 5
    labels:
      1: "Very Inefficient (many unnecessary steps)"
      3: "Average"
      5: "Optimal (no wasted steps)"
  - annotation_type: text
    name: evaluator_notes
    description: "Any additional observations"
    label_requirement:
      required: false
  - annotation_type: per_turn_rating
    name: step_correctness
    target: agentic_steps
    description: "Was this step correct and useful?"
    rating_type: radio
    labels: ["Correct", "Partially Correct", "Incorrect", "Unnecessary"]
  - annotation_type: per_turn_rating
    name: error_type
    target: agentic_steps
    description: "What type of error occurred?"
    rating_type: multiselect
    labels:
      - "Wrong tool/action"
      - "Wrong arguments"
      - "Hallucinated information"
      - "Reasoning error"
      - "Redundant step"
      - "Premature termination"
      - "Other"
    conditional:
      show_when:
        step_correctness: ["Partially Correct", "Incorrect", "Unnecessary"]
output_annotation_dir: "output/"
output_annotation_format: "jsonl"
parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  compression: zstd
Start the server:
potato start config.yaml -p 8000
Open http://localhost:8000 in your browser.
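Step 5: The Annotation Workflow
When annotators open a trace, they see:
- The task description at the top (the original user query)
- Step cards showing the full agent trace, color-coded by type:
  - Purple cards for thoughts/reasoning
  - Blue cards for actions/tool calls
  - Green cards for observations/results
  - Red cards for errors
- Per-step rating controls next to each step card
- The trace-level annotation schemes below the trace display
A typical workflow:
- Read the task description to understand what the agent was supposed to do
- Walk through the trace step by step, rating each step
- For steps rated "Partially Correct", "Incorrect", or "Unnecessary", select the applicable error types
- Rate the trace as a whole (success, correctness, efficiency)
- Add notes if needed
- Submit and move on to the next trace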
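Tips for Annotators
- Expand collapsed observations to verify that the agent processed the information correctly
- Compare the final answer against the ground truth (if available) before rating task success
- Rate "Unnecessary" steps separately from "Incorrect" ones: an unnecessary step wastes effort but does not introduce an error
- Use the step timeline sidebar to jump to specific steps in long traces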
Step 6: Analyze the Results
Once annotation is complete, you can analyze the results programmatically.
Basic Analysis with pandas
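import json

import pandas as pd

# Load annotations: one JSON object per line
annotations = []
with open("output/annotations.jsonl") as f:
    for line in f:
        if line.strip():  # skip blank lines
            annotations.append(json.loads(line))

# Task success distribution.
# Each record's "annotations" value is a dict, which is unhashable and cannot
# serve as a groupby key, so pull the label out of each record directly.
success_counts = pd.Series(
    [a["annotations"]["task_success"] for a in annotations]
).value_counts()
print("Task Success Distribution:")
print(success_counts)

# Average efficiency rating (cast to float in case ratings are stored as strings)
efficiency_scores = [
    float(a["annotations"]["efficiency"])
    for a in annotations
    if "efficiency" in a["annotations"]
]
if efficiency_scores:  # avoid dividing by zero when nothing was rated
    print(f"\nAverage Efficiency: {sum(efficiency_scores) / len(efficiency_scores):.2f}")
Step-Level Error Analysis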
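Before tallying error types, it can help to see how the step labels are distributed and where each trace first goes wrong. The following is a minimal sketch, assuming `step_correctness` is exported as a mapping from step index to label (the same layout this section assumes for `error_type`) and that step indices are integers or numeric strings. `first_problem_step` and `summarize_steps` are hypothetical helpers, not part of Potato.

```python
from collections import Counter


def first_problem_step(step_labels):
    """Return the lowest step index whose label is not 'Correct', or None."""
    flagged = [int(idx) for idx, label in step_labels.items() if label != "Correct"]
    return min(flagged) if flagged else None


def summarize_steps(records):
    """Count step labels overall and the position of each trace's first problem."""
    label_counts = Counter()
    first_problems = Counter()
    for rec in records:
        labels = rec["annotations"].get("step_correctness", {})
        label_counts.update(labels.values())
        first = first_problem_step(labels)
        if first is not None:
            first_problems[first] += 1
    return label_counts, first_problems


# Small in-memory sample standing in for the loaded annotation records.
sample = [
    {"annotations": {"step_correctness": {"0": "Correct", "1": "Incorrect", "2": "Unnecessary"}}},
    {"annotations": {"step_correctness": {"0": "Correct", "1": "Correct"}}},
    {"annotations": {}},
]
label_counts, first_problems = summarize_steps(sample)
print("Step label distribution:", dict(label_counts))
print("First problematic step (index -> traces):", dict(first_problems))
```

A cluster of first problems at early step indices would suggest the agent misreads the task, whereas late ones point toward issues in synthesis or termination.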
# Collect all step-level errors.
# error_type is read as a mapping from step index to a list of selected labels.
error_counts = {}
for ann in annotations:
    step_errors = ann["annotations"].get("error_type", {})
    for errors in step_errors.values():
        for error in errors:
            error_counts[error] = error_counts.get(error, 0) + 1

print("Error Type Distribution:")
for error, count in sorted(error_counts.items(), key=lambda x: -x[1]):
    print(f"  {error}: {count}")
Analysis with DuckDB (via Parquet)
import duckdb

# Overall task success distribution
result = duckdb.sql("""
    SELECT value, COUNT(*) AS count
    FROM 'output/parquet/annotations.parquet'
    WHERE schema_name = 'task_success'
    GROUP BY value
    ORDER BY count DESC
""")
print(result)
Step 7: Scaling Up
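For larger evaluation projects (hundreds or thousands of traces), consider the following configurations.
Multiple Annotators
Assign several annotators to each trace so that you can compute inter-annotator agreement: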
annotation_task_config:
  total_annotations_per_instance: 3
  assignment_strategy: random
Using Preset Schemes
For a quick setup, use Potato's preset agent evaluation schemes:
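annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_error_taxonomy
  - preset: agent_efficiency
Quality Control
Enable gold-standard instances for quality monitoring. The example below requires annotators to answer at least 4 of 5 training questions correctly: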
phases:
  training:
    enabled: true
    data_file: "data/training_traces.jsonl"
    passing_criteria:
      min_correct: 4
      total_questions: 5
Adapting to Other Agent Types
OpenAI Function Calling
agentic:
  enabled: true
  trace_converter: openai
  display_type: agent_trace
Anthropic Tool Use
agentic:
  enabled: true
  trace_converter: anthropic
  display_type: agent_trace
Multi-Agent Systems (CrewAI/AutoGen)
agentic:
  enabled: true
  trace_converter: multi_agent
  display_type: agent_trace
  multi_agent:
    agent_converters:
      researcher: react
      writer: anthropic
      reviewer: openai
Web Browsing Agents
For web agents, switch to the web agent display:
agentic:
  enabled: true
  trace_converter: webarena
  display_type: web_agent
  web_agent_display:
    screenshot_max_width: 900
    overlay:
      enabled: true
    filmstrip:
      enabled: true
See the dedicated guide on annotating web browsing agents for details.
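When a dataset mixes traces from several frameworks, a quick shape check can suggest which converter name to try for each record. The sketch below is a heuristic written for this guide, not something Potato provides: `guess_converter` is a hypothetical helper, it only distinguishes the `react`, `openai`, and `anthropic` names used in this guide, and it assumes Anthropic traces keep the typed content blocks (`tool_use`, `tool_result`) of the Anthropic Messages format. Treat its output as a hint to verify.

```python
def guess_converter(record):
    """Suggest a trace_converter name from a record's shape, or None."""
    steps = record.get("trace")
    if isinstance(steps, list) and any(
        isinstance(s, dict) and s.get("type") in {"thought", "action", "observation"}
        for s in steps
    ):
        return "react"
    for msg in record.get("messages") or []:
        if not isinstance(msg, dict):
            continue
        # OpenAI chat format: assistant messages carry a tool_calls list.
        if msg.get("tool_calls"):
            return "openai"
        # Anthropic Messages format: content is a list of typed blocks.
        content = msg.get("content")
        if isinstance(content, list) and any(
            isinstance(b, dict) and b.get("type") in {"tool_use", "tool_result"}
            for b in content
        ):
            return "anthropic"
    return None  # unknown shape: pick the converter by hand


react_record = {"trace": [{"type": "thought", "content": "..."}]}
openai_record = {"messages": [
    {"role": "assistant", "content": None,
     "tool_calls": [{"function": {"name": "search_flights", "arguments": "{}"}}]},
]}
anthropic_record = {"messages": [
    {"role": "assistant",
     "content": [{"type": "tool_use", "name": "search", "input": {}}]},
]}
for rec in (react_record, openai_record, anthropic_record, {}):
    print(guess_converter(rec))
```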
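Summary
Human evaluation of AI agents requires specialized tooling. Potato's agentic annotation system provides:
- 12 converters that normalize traces across the supported agent frameworks
- 3 display types, optimized for tool-use, web browsing, and conversational agents respectively
- Per-step ratings for step-level evaluation
- 9 preset schemes covering common evaluation dimensions
- Parquet export for efficient downstream analysis
The key insight is that agent evaluation is not just about "did the agent get the right answer?" but about "did the agent reason correctly at each step?" Step-level annotation reveals error patterns that aggregate metrics miss.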
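Further Reading
- Agentic annotation documentation
- Annotating web browsing agents
- Solo Mode: combining agentic annotation with human-in-the-loop evaluation
- Best-Worst Scaling: comparative ranking of agent outputs
- Parquet export: efficient export for analysis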