
Evaluating AI Agents: A Complete Guide to Human Annotation of Agent Traces

Learn how to set up human evaluation of AI agent outputs with Potato's agentic annotation features, from trace import to annotation scheme design.

Potato Team


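Automated metrics tell you how often an agent succeeds. Human evaluation tells you why it fails, where it goes wrong, and how to fix it. This guide walks you through setting up a complete human evaluation pipeline for AI agents using Potato's agentic annotation features.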

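We will evaluate a ReAct-style agent that answers questions by searching the web. By the end of this guide, you will be able to:

  • Import agent traces into Potato
  • Configure the agent trace display
  • Set up trace-level and per-step annotation schemes
  • Run the annotation task
  • Export structured evaluation results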

Prerequisites

Install Potato 2.3.0 or later:

bash
pip install --upgrade potato-annotation

You also need agent traces to evaluate. This guide uses ReAct traces, but the same approach works for any of the 12 supported formats.


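Step 1: Prepare the Agent Traces

A ReAct trace is a sequence of Thought, Action, and Observation steps. Below is a minimal example. Create the file data/traces.jsonl (the record is pretty-printed here for readability; in the actual file it must sit on a single line):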

json
{
  "trace_id": "trace_001",
  "task": "What is the capital of Australia and what is its population?",
  "trace": [
    {"type": "thought", "content": "I need to find the capital of Australia and its population. Let me search for this."},
    {"type": "action", "content": "search", "action_input": "capital of Australia"},
    {"type": "observation", "content": "Canberra is the capital city of Australia. It was chosen as the capital in 1908 as a compromise between Sydney and Melbourne."},
    {"type": "thought", "content": "The capital is Canberra. Now I need to find its population."},
    {"type": "action", "content": "search", "action_input": "Canberra population 2025"},
    {"type": "observation", "content": "Canberra has a population of approximately 467,000 as of 2025, making it the eighth-largest city in Australia."},
    {"type": "thought", "content": "I now have both pieces of information. The capital of Australia is Canberra with a population of approximately 467,000."},
    {"type": "action", "content": "finish", "action_input": "The capital of Australia is Canberra, with a population of approximately 467,000 as of 2025."}
  ],
  "ground_truth": "Canberra, approximately 467,000"
}

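Each line of the JSONL file is one complete agent trace. The trace field holds the step-by-step log, and the task field is the task the agent was asked to complete.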
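
Since JSONL requires one record per line, it can help to serialize traces programmatically instead of writing them by hand. The following is a minimal sketch using only the Python standard library; the helper names write_traces and read_traces are illustrative and are not part of Potato:

```python
import json
from pathlib import Path


def write_traces(traces, path):
    """Write one compact JSON object per line (JSONL)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for trace in traces:
            # json.dumps without indent emits a single line; newlines inside
            # string values are escaped, so each record stays on one line.
            f.write(json.dumps(trace, ensure_ascii=False) + "\n")


def read_traces(path):
    """Read a JSONL file back, failing loudly if a record lacks expected keys."""
    traces = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = {"trace_id", "task", "trace"} - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            traces.append(record)
    return traces
```

Running read_traces on data/traces.jsonl before starting the server is a cheap way to catch a pretty-printed or truncated record early.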

Notes on Trace Formats

Traces from OpenAI function calling use a different format:

json
{
  "trace_id": "oai_001",
  "task": "Find cheap flights from NYC to London",
  "messages": [
    {"role": "user", "content": "Find cheap flights from NYC to London"},
    {"role": "assistant", "content": null, "tool_calls": [{"function": {"name": "search_flights", "arguments": "{\"from\": \"NYC\", \"to\": \"LHR\"}"}}]},
    {"role": "tool", "name": "search_flights", "content": "{\"flights\": [{\"airline\": \"BA\", \"price\": 450}, {\"airline\": \"AA\", \"price\": 520}]}"},
    {"role": "assistant", "content": "I found flights from NYC to London. The cheapest is British Airways at $450."}
  ]
}

Potato's converters handle these differences. You only need to specify the correct converter name.
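
To make the idea of a converter concrete, here is a hypothetical sketch of how an OpenAI-style messages list could be flattened into the thought/action/observation steps used by the ReAct example. It is an illustration only: the function name and the mapping choices (for instance, treating a text-only assistant message as a "finish" action) are assumptions, and Potato's built-in converter may behave differently.

```python
import json


def openai_messages_to_steps(messages):
    """Flatten OpenAI-style chat messages into ReAct-like steps (illustrative)."""
    steps = []
    for msg in messages:
        role = msg.get("role")
        if role == "assistant":
            text = msg.get("content")
            calls = msg.get("tool_calls") or []
            if text and calls:
                # Text that accompanies tool calls is treated as reasoning
                steps.append({"type": "thought", "content": text})
            for call in calls:
                fn = call["function"]
                steps.append({
                    "type": "action",
                    "content": fn["name"],
                    # Arguments arrive as a JSON-encoded string
                    "action_input": json.loads(fn["arguments"]),
                })
            if text and not calls:
                # A text-only assistant message is treated as the final answer
                steps.append({"type": "action", "content": "finish", "action_input": text})
        elif role == "tool":
            steps.append({"type": "observation", "content": msg["content"]})
        # User messages carry the task itself and are skipped here
    return steps
```

Applied to the flight-search example above, this yields three steps: a search_flights action, an observation with the tool output, and a finish action with the final answer.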


Step 2: Create the Project Configuration

Create config.yaml:

yaml
task_name: "ReAct Agent Evaluation"
task_dir: "."
 
data_files:
  - "data/traces.jsonl"
 
item_properties:
  id_key: trace_id
  text_key: task
 
# --- Agentic annotation settings ---
agentic:
  enabled: true
  trace_converter: react
  display_type: agent_trace
 
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    collapse_threshold: 400
    show_step_numbers: true
    show_timestamps: false
    render_json: true
    syntax_highlight: true

This tells Potato to:

  1. Load traces from data/traces.jsonl
  2. Parse the trace field with the ReAct converter
  3. Render each trace in the agent trace display, with color-coded step cards

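Step 3: Design the Annotation Schemes

Agent evaluation usually needs both trace-level judgments (did the agent succeed?) and step-level judgments (was each step correct?). Let's add both.

Add the following to config.yaml: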

yaml
annotation_schemes:
  # --- Trace-level schemas ---
 
  # 1. Task success (the most important metric)
  - annotation_type: radio
    name: task_success
    description: "Did the agent successfully complete the task?"
    labels:
      - "Success"
      - "Partial Success"
      - "Failure"
    label_requirement:
      required: true
    sequential_key_binding: true
 
  # 2. Answer correctness (if the task has a ground truth)
  - annotation_type: radio
    name: answer_correctness
    description: "Is the agent's final answer factually correct?"
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
      - "Cannot Determine"
    label_requirement:
      required: true
 
  # 3. Efficiency rating
  - annotation_type: likert
    name: efficiency
    description: "Did the agent use an efficient path to the answer?"
    min: 1
    max: 5
    labels:
      1: "Very Inefficient (many unnecessary steps)"
      3: "Average"
      5: "Optimal (no wasted steps)"
 
  # 4. Free-text notes
  - annotation_type: text
    name: evaluator_notes
    description: "Any additional observations"
    label_requirement:
      required: false
 
  # --- Step-level schemas ---
 
  # 5. Per-step correctness
  - annotation_type: per_turn_rating
    name: step_correctness
    target: agentic_steps
    description: "Was this step correct and useful?"
    rating_type: radio
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
      - "Unnecessary"
 
  # 6. Per-step error type (only shown when step is not correct)
  - annotation_type: per_turn_rating
    name: error_type
    target: agentic_steps
    description: "What type of error occurred?"
    rating_type: multiselect
    labels:
      - "Wrong tool/action"
      - "Wrong arguments"
      - "Hallucinated information"
      - "Reasoning error"
      - "Redundant step"
      - "Premature termination"
      - "Other"
    conditional:
      show_when:
        step_correctness: ["Partially Correct", "Incorrect", "Unnecessary"]

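This scheme design gives you:

  • A three-level task success label (Success, Partial Success, Failure) for high-level analysis
  • A correctness rating for the final answer
  • An efficiency score for comparing agent strategies
  • Per-step ratings that pinpoint exactly where the agent went wrong
  • A conditional error taxonomy that appears only when a step has a problem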
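
The conditional block on error_type can be read as a simple membership test. The sketch below assumes that a conditional scheme is shown only when every key under show_when currently holds one of its listed labels; the function name should_show is hypothetical, and Potato's actual evaluation logic may differ:

```python
# Mirrors the show_when block of the error_type scheme in config.yaml
SHOW_WHEN = {"step_correctness": ["Partially Correct", "Incorrect", "Unnecessary"]}


def should_show(show_when, step_ratings):
    """Return True if each referenced scheme has one of its trigger labels selected.

    step_ratings maps scheme name -> the label currently chosen for this step.
    An unanswered scheme (missing key) never triggers the conditional.
    """
    return all(
        step_ratings.get(scheme) in trigger_labels
        for scheme, trigger_labels in show_when.items()
    )


print(should_show(SHOW_WHEN, {"step_correctness": "Incorrect"}))  # True
print(should_show(SHOW_WHEN, {"step_correctness": "Correct"}))    # False
```

Keeping the error taxonomy hidden for correct steps reduces clutter, so annotators make the extra classification only where it carries information.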

Step 4: Configure Output and Start the Server

Add the output settings to config.yaml:

yaml
output_annotation_dir: "output/"
output_annotation_format: "jsonl"
 
# Optional: also export to Parquet for analysis
parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  compression: zstd

The complete config.yaml, for reference:

yaml
task_name: "ReAct Agent Evaluation"
task_dir: "."
 
data_files:
  - "data/traces.jsonl"
 
item_properties:
  id_key: trace_id
  text_key: task
 
agentic:
  enabled: true
  trace_converter: react
  display_type: agent_trace
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    collapse_threshold: 400
    show_step_numbers: true
    render_json: true
    syntax_highlight: true
 
annotation_schemes:
  - annotation_type: radio
    name: task_success
    description: "Did the agent successfully complete the task?"
    labels: ["Success", "Partial Success", "Failure"]
    label_requirement:
      required: true
    sequential_key_binding: true
 
  - annotation_type: radio
    name: answer_correctness
    description: "Is the agent's final answer factually correct?"
    labels: ["Correct", "Partially Correct", "Incorrect", "Cannot Determine"]
    label_requirement:
      required: true
 
  - annotation_type: likert
    name: efficiency
    description: "Did the agent use an efficient path?"
    min: 1
    max: 5
    labels:
      1: "Very Inefficient"
      3: "Average"
      5: "Optimal"
 
  - annotation_type: text
    name: evaluator_notes
    description: "Any additional observations"
    label_requirement:
      required: false
 
  - annotation_type: per_turn_rating
    name: step_correctness
    target: agentic_steps
    description: "Was this step correct?"
    rating_type: radio
    labels: ["Correct", "Partially Correct", "Incorrect", "Unnecessary"]
 
  - annotation_type: per_turn_rating
    name: error_type
    target: agentic_steps
    description: "Error type"
    rating_type: multiselect
    labels:
      - "Wrong tool/action"
      - "Wrong arguments"
      - "Hallucinated information"
      - "Reasoning error"
      - "Redundant step"
      - "Premature termination"
      - "Other"
    conditional:
      show_when:
        step_correctness: ["Partially Correct", "Incorrect", "Unnecessary"]
 
output_annotation_dir: "output/"
output_annotation_format: "jsonl"
 
parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  compression: zstd

Start the server:

bash
potato start config.yaml -p 8000

Open http://localhost:8000 in your browser.


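Step 5: The Annotation Workflow

When annotators open a trace, they see:

  1. The task description (the original user query) at the top
  2. Step cards showing the full agent trace, color-coded by type:
    • Purple cards for thoughts/reasoning
    • Blue cards for actions/tool calls
    • Green cards for observations/results
    • Red cards for errors
  3. Per-step rating controls next to each step card
  4. The trace-level annotation schemes below the trace display

A typical workflow:

  1. Read the task description to understand what the agent was supposed to do
  2. Walk through the trace step by step, rating each step
  3. For steps rated "Partially Correct", "Incorrect", or "Unnecessary", select the error type(s)
  4. Rate the overall trace (success, correctness, efficiency)
  5. Add notes if needed
  6. Submit and move on to the next trace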

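Tips for Annotators

  • Expand collapsed observations to verify that the agent handled the information correctly
  • Compare the final answer against the ground truth (if available) before rating task success
  • Rate "Unnecessary" steps separately from "Incorrect" ones: an unnecessary step wastes effort but does not introduce an error
  • Use the step timeline sidebar to jump to specific steps in long traces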

Step 6: Analyze the Results

Once annotation is complete, you can analyze the results programmatically.

Basic Analysis with pandas

python
import json

import pandas as pd

# Load annotations (one JSON object per line)
annotations = []
with open("output/annotations.jsonl") as f:
    for line in f:
        if line.strip():
            annotations.append(json.loads(line))

df = pd.DataFrame(annotations)

# Task success distribution: pull the label out of each record's "annotations" dict
# (grouping on the dict column itself fails because dicts are unhashable)
success_counts = df["annotations"].apply(lambda a: a.get("task_success")).value_counts()
print("Task Success Distribution:")
print(success_counts)

# Average efficiency rating (cast to float in case ratings are stored as strings)
efficiency_scores = [
    float(a["annotations"]["efficiency"])
    for a in annotations
    if "efficiency" in a["annotations"]
]
if efficiency_scores:  # avoid dividing by zero when nothing was rated
    print(f"\nAverage Efficiency: {sum(efficiency_scores) / len(efficiency_scores):.2f}")

Step-Level Error Analysis

python
# Collect all step-level errors
error_counts = {}
for ann in annotations:
    step_errors = ann["annotations"].get("error_type", {})
    for step_idx, errors in step_errors.items():
        for error in errors:
            error_counts[error] = error_counts.get(error, 0) + 1
 
print("Error Type Distribution:")
for error, count in sorted(error_counts.items(), key=lambda x: -x[1]):
    print(f"  {error}: {count}")
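
Beyond counting error types, it is often useful to know where in a trace problems first appear. The sketch below assumes that step_correctness is exported in the same shape as error_type above, that is, a mapping from step index to label; that shape and both function names are assumptions for illustration, not a documented Potato output format.

```python
from collections import Counter


def first_problem_step(step_correctness):
    """Return the lowest step index not rated "Correct", or None if all are fine."""
    flagged = [int(idx) for idx, label in step_correctness.items() if label != "Correct"]
    return min(flagged) if flagged else None


def first_problem_histogram(annotations):
    """Count, across traces, the step index at which the first problem appears."""
    histogram = Counter()
    for ann in annotations:
        # Assumed shape: {"0": "Correct", "1": "Incorrect", ...}
        ratings = ann["annotations"].get("step_correctness", {})
        step = first_problem_step(ratings)
        if step is not None:
            histogram[step] += 1
    return histogram
```

A histogram concentrated on early steps suggests planning or tool-selection problems, while one concentrated on late steps points toward how the agent synthesizes its final answer.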

Analysis with DuckDB (via Parquet)

python
import duckdb
 
# Overall success rate
result = duckdb.sql("""
    SELECT value, COUNT(*) as count
    FROM 'output/parquet/annotations.parquet'
    WHERE schema_name = 'task_success'
    GROUP BY value
    ORDER BY count DESC
""")
print(result)

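Step 7: Scaling Up

For larger evaluation projects (hundreds or thousands of traces), consider the following configurations:

Multiple Annotators

Assign several annotators to each trace so you can compute inter-annotator agreement: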

yaml
annotation_task_config:
  total_annotations_per_instance: 3
  assignment_strategy: random
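
With three labels per trace, one common agreement statistic is Fleiss' kappa. The following is a minimal standard-library sketch, not a Potato feature; it assumes you have already grouped the task_success labels by trace into equally sized lists.

```python
from collections import Counter


def fleiss_kappa(ratings_per_item):
    """Fleiss' kappa for items that each received the same number of labels.

    ratings_per_item: list of lists, e.g. [["Success", "Success", "Failure"], ...]
    """
    if not ratings_per_item:
        raise ValueError("need at least one item")
    n_items = len(ratings_per_item)
    n_raters = len(ratings_per_item[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings_per_item):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    label_totals = Counter()
    agreement_sum = 0.0
    for ratings in ratings_per_item:
        counts = Counter(ratings)
        label_totals.update(counts)
        # Fraction of rater pairs that agree on this item
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    observed = agreement_sum / n_items
    total = n_items * n_raters
    # Chance agreement from the overall label proportions
    expected = sum((c / total) ** 2 for c in label_totals.values())
    if expected == 1.0:
        return 1.0  # only one label was ever used; kappa is undefined, report 1.0
    return (observed - expected) / (1 - expected)
```

Values near 1 indicate strong agreement, and values near 0 indicate agreement no better than chance.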

Using Preset Schemes

For a quick setup, use Potato's preset agent evaluation schemes:

yaml
annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_error_taxonomy
  - preset: agent_efficiency

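Quality Control

Enable gold-standard instances for quality monitoring (here via a training phase with passing criteria):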

yaml
phases:
  training:
    enabled: true
    data_file: "data/training_traces.jsonl"
    passing_criteria:
      min_correct: 4
      total_questions: 5
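
One plausible reading of these passing criteria is that an annotator must answer at least min_correct of the total_questions training items in line with the gold labels. The sketch below illustrates that reading only; the function name and data shapes are hypothetical, and Potato's own check may differ.

```python
def passes_training(answers, gold, min_correct=4):
    """Return True if enough training answers match the gold labels.

    answers: mapping of training item id -> label chosen by the annotator
    gold:    mapping of training item id -> expected label
    """
    correct = sum(1 for item_id, label in gold.items() if answers.get(item_id) == label)
    return correct >= min_correct


gold = {"t1": "Success", "t2": "Failure", "t3": "Success", "t4": "Failure", "t5": "Success"}
answers = {"t1": "Success", "t2": "Failure", "t3": "Success", "t4": "Success", "t5": "Success"}
print(passes_training(answers, gold))  # True: 4 of 5 answers match
```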

Adapting to Other Agent Types

OpenAI Function Calling

yaml
agentic:
  enabled: true
  trace_converter: openai
  display_type: agent_trace

Anthropic Tool Use

yaml
agentic:
  enabled: true
  trace_converter: anthropic
  display_type: agent_trace

Multi-Agent Systems (CrewAI/AutoGen)

yaml
agentic:
  enabled: true
  trace_converter: multi_agent
  display_type: agent_trace
  multi_agent:
    agent_converters:
      researcher: react
      writer: anthropic
      reviewer: openai

Web Browsing Agents

For web agents, switch to the web agent display:

yaml
agentic:
  enabled: true
  trace_converter: webarena
  display_type: web_agent
  web_agent_display:
    screenshot_max_width: 900
    overlay:
      enabled: true
    filmstrip:
      enabled: true

See the dedicated guide on annotating web browsing agents for details.


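Summary

Human evaluation of AI agents requires specialized tooling. Potato's agentic annotation system provides:

  • 12 converters that normalize traces from any supported framework
  • 3 display types, optimized for tool-use, web browsing, and conversational agents respectively
  • Per-step ratings for step-level evaluation
  • 9 preset schemes covering common evaluation dimensions
  • Parquet export for efficient downstream analysis

The key insight is that agent evaluation is not only "did the agent get the right answer?" but also "did the agent reason correctly at every step?" Per-step annotation reveals error patterns that aggregate metrics miss.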


Further Reading