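Web browsing agents operate in a fundamentally different modality from text-based agents. They navigate real web pages, click buttons, fill in forms, and scroll through content. Evaluating them requires seeing both what the agent saw (the page state) and what the agent did (the actions it took), ideally with a visual overlay showing exactly where the agent clicked.

Potato's web agent trace display is built for this task. It renders full-page screenshots with SVG action overlays, provides a filmstrip view for quick navigation, and supports per-step annotation of action correctness.

This guide uses WebArena traces as the running example, but the same approach applies to VisualWebArena, raw browser recordings, and any other web agent format.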

Prerequisites

bash

pip install potato-annotation

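You need WebArena trace files, which typically consist of screenshots and JSON action logs. If you use VisualWebArena, the format is similar but may include additional visual grounding information.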

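Step 1: Understand the WebArena Trace Format

A WebArena trace consists of one JSON file per episode, containing the task description, the action sequence, and the screenshot paths. A simplified example follows.

Create data/web_traces.jsonl. The record is pretty-printed here for readability; in a .jsonl file each trace occupies a single line: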

json

{
  "trace_id": "wa_001",
  "task": "Find the cheapest laptop on the electronics store and add it to the cart",
  "website": "shopping",
  "steps": [
    {
      "step": 0,
      "url": "http://shop.example.com/",
      "action_type": "click",
      "action_target": "Electronics category link",
      "element_id": "nav-electronics",
      "coordinates": [245, 82],
      "screenshot": "screenshots/wa_001_step_00.png",
      "dom_snapshot": "dom/wa_001_step_00.html"
    },
    {
      "step": 1,
      "url": "http://shop.example.com/electronics",
      "action_type": "click",
      "action_target": "Laptops subcategory",
      "element_id": "cat-laptops",
      "coordinates": [180, 310],
      "screenshot": "screenshots/wa_001_step_01.png"
    },
    {
      "step": 2,
      "url": "http://shop.example.com/electronics/laptops",
      "action_type": "click",
      "action_target": "Sort by: Price Low to High",
      "element_id": "sort-price-asc",
      "coordinates": [720, 155],
      "screenshot": "screenshots/wa_001_step_02.png"
    },
    {
      "step": 3,
      "url": "http://shop.example.com/electronics/laptops?sort=price_asc",
      "action_type": "click",
      "action_target": "First laptop: 'Budget Pro 14' - $349",
      "element_id": "product-101",
      "coordinates": [400, 380],
      "screenshot": "screenshots/wa_001_step_03.png"
    },
    {
      "step": 4,
      "url": "http://shop.example.com/product/101",
      "action_type": "click",
      "action_target": "Add to Cart button",
      "element_id": "add-to-cart-btn",
      "coordinates": [650, 520],
      "screenshot": "screenshots/wa_001_step_04.png"
    }
  ],
  "success": true,
  "final_screenshot": "screenshots/wa_001_final.png"
}

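Each step includes a screenshot, the action taken, the target element, and the click coordinates. Potato uses this information to render the visual overlays.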
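
Before loading traces, it can help to sanity-check each record. The sketch below is not part of Potato: `REQUIRED_STEP_KEYS`, `validate_trace`, and `to_jsonl_line` are hypothetical names, and the required fields are simply the ones that appear in the example above. It also serializes a trace onto one line, which is what a `.jsonl` file expects.

```python
import json

# Fields assumed to be present on every step, based on the sample trace above
REQUIRED_STEP_KEYS = {"step", "url", "action_type", "screenshot"}


def validate_trace(trace):
    """Return a list of human-readable problems; an empty list means the trace looks fine."""
    problems = []
    for key in ("trace_id", "task", "steps"):
        if key not in trace:
            problems.append(f"missing top-level key: {key}")
    for i, step in enumerate(trace.get("steps", [])):
        missing = REQUIRED_STEP_KEYS - step.keys()
        if missing:
            problems.append(f"step {i}: missing {sorted(missing)}")
        # Coordinates are optional here, but when present they must be an [x, y] pair
        coords = step.get("coordinates")
        if coords is not None and len(coords) != 2:
            problems.append(f"step {i}: coordinates must be [x, y]")
    return problems


def to_jsonl_line(trace):
    """Serialize one trace as a single line, as the JSONL format requires."""
    return json.dumps(trace, ensure_ascii=False)
```

Running `validate_trace` over every record before starting the annotation server surfaces missing screenshots or malformed coordinates early, instead of in the middle of an annotation session.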

Step 2: Configure the Project

Create config.yaml:

yaml

annotation_task_name: "WebArena Agent Evaluation"
task_dir: "."
 
data_files:
  - "data/web_traces.jsonl"
 
item_properties:
  id_key: trace_id
  text_key: task
 
# --- Agentic annotation with web display ---
agentic:
  enabled: true
  trace_converter: webarena
  display_type: web_agent
 
  web_agent_display:
    # Screenshot rendering
    screenshot_max_width: 900
    screenshot_quality: 85
 
    # SVG overlays
    overlay:
      enabled: true
      click_marker: "circle"
      click_color: "#ef4444"
      click_radius: 20
      type_highlight: "#3b82f6"
      scroll_indicator: true
 
    # Filmstrip navigation
    filmstrip:
      enabled: true
      thumbnail_width: 150
      show_action_labels: true
 
    # Additional display options
    show_url_bar: true
    show_action_description: true
    show_dom_snapshot: false
 
# --- Annotation Schemas ---
annotation_schemes:
  # Overall task evaluation
  - annotation_type: radio
    name: task_success
    description: "Did the agent complete the task successfully?"
    labels:
      - "Success"
      - "Partial Success"
      - "Failure"
    label_requirement:
      required: true
 
  - annotation_type: radio
    name: task_efficiency
    description: "Was the agent's navigation path efficient?"
    labels:
      - "Optimal path"
      - "Reasonable but not optimal"
      - "Inefficient (unnecessary steps)"
      - "Completely wrong direction"
    label_requirement:
      required: true
 
  # Per-step evaluation
  - annotation_type: per_turn_rating
    name: action_correctness
    target: agentic_steps
    description: "Was this action correct?"
    rating_type: radio
    labels:
      - "Correct"
      - "Acceptable (not optimal but progresses toward goal)"
      - "Incorrect"
      - "Unnecessary"
 
  - annotation_type: per_turn_rating
    name: action_error_type
    target: agentic_steps
    description: "What went wrong?"
    rating_type: multiselect
    labels:
      - "Wrong element clicked"
      - "Wrong page navigated to"
      - "Missed a closer/better option"
      - "Incorrect form input"
      - "Premature task completion"
      - "Unnecessary navigation"
      - "Failed to scroll to target"
      - "Interaction with wrong page section"
      - "Other"
    conditional:
      show_when:
        action_correctness: ["Incorrect", "Unnecessary"]
 
  - annotation_type: per_turn_rating
    name: action_notes
    target: agentic_steps
    description: "Notes on this step"
    rating_type: text
    label_requirement:
      required: false
 
output_annotation_dir: "output/"
export_annotation_format: "jsonl"
 
parquet_export:
  enabled: true
  output_dir: "output/parquet/"

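Step 3: Understand the Web Agent Display

When you open a trace, the web agent display contains the following parts.

Main Screenshot View

The screenshot for the current step is shown at full width (up to 900px), with SVG overlays drawn on top of it:

- Red circle at the click coordinates, showing exactly where the agent clicked
- Blue highlight around the field where the agent typed text
- Arrow indicator for scroll actions, showing direction and magnitude

Below the screenshot, the display shows:

- URL bar with the page URL for this step
- Action description (for example, "click 'Electronics category link' at [245, 82]")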
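
To make the overlay idea concrete, here is a minimal sketch of how a click marker could be drawn as SVG on top of a screenshot. It is not Potato's rendering code: `click_overlay_svg` and `describe_action` are hypothetical helpers, their defaults are borrowed from the `click_color` and `click_radius` values in the config above, and coordinates are assumed to be in screenshot pixels.

```python
def click_overlay_svg(width, height, x, y, radius=20, color="#ef4444"):
    """Build an SVG layer, sized like the screenshot, with a circle at the click point."""
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}" '
        f'viewBox="0 0 {width} {height}">'
        f'<circle cx="{x}" cy="{y}" r="{radius}" fill="none" '
        f'stroke="{color}" stroke-width="3"/>'
        "</svg>"
    )


def describe_action(step):
    """Format a one-line description like the one shown below the screenshot."""
    x, y = step["coordinates"]
    return f"{step['action_type']} '{step['action_target']}' at [{x}, {y}]"
```

Because the `viewBox` is expressed in screenshot pixels, a layer like this can be resized together with the image without recomputing the click coordinates.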

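Filmstrip

A horizontal filmstrip at the bottom of the display shows thumbnails of all screenshots. Each thumbnail carries a small label indicating the action type (click, type, scroll). Click any thumbnail to jump to that step.

The filmstrip is especially valuable for long traces (10+ steps), where scrolling through the main view becomes tedious.

Per-Step Annotation

Per-step annotation controls appear next to each screenshot. Rate the action and, if it is incorrect or unnecessary, select the error types.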

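Step 4: Annotation Workflow

A typical workflow for evaluating a web agent trace:

1. Read the task description. Understand what the agent is supposed to accomplish.
2. Use the filmstrip for an overview. Before rating individual steps, scan all screenshots quickly to understand the agent's trajectory.
3. Go through the steps one by one:
   - Look at the screenshot to understand the page state
   - Check the SVG overlay to see what the agent clicked
   - Read the action description
   - Rate the action as "Correct", "Acceptable", "Incorrect", or "Unnecessary"
   - If the action is incorrect or unnecessary, select the error types
4. Evaluate the trace as a whole. After reviewing all steps, rate task success and efficiency.
5. Submit and move on to the next trace.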

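Rating Notes

Correct actions move the agent closer to the goal in a reasonable way: the agent clicked the right element, navigated to the right page, or entered the right information.

Acceptable actions are not the best choice but still make progress. For example, the agent browses category pages instead of using the search bar, which is slower but still works.

Incorrect actions are mistakes: clicking the wrong element, navigating to an irrelevant page, or entering wrong information into a form.

Unnecessary actions contribute nothing to the goal: clicking something and immediately going back, scrolling past the target, or navigating to an irrelevant page.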

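Step 5: Error Taxonomy

Potato includes an error taxonomy built specifically for web agent actions. Each category below corresponds to a label of the action_error_type scheme configured in Step 2. Here is how each one applies:

Error type	Description	Example
Wrong element clicked	The agent clicked an incorrect UI element	Clicked "Tablets" instead of "Laptops"
Wrong page navigated to	The agent ended up on an irrelevant page	Navigated to "About Us" instead of the product listing
Missed a closer/better option	A better action was available	Browsed categories instead of using the search bar
Incorrect form input	The agent typed the wrong text into a form	Searched for "labtop" instead of "laptop"
Premature task completion	The agent declared success too early	Added the wrong item to the cart and stopped
Unnecessary navigation	The step contributed nothing to the goal	Visited the home page between category pages
Failed to scroll to target	The target was below the viewport	The element was not visible; the agent should have scrolled
Interaction with wrong page section	Right page, wrong region	Clicked the page header instead of the main content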

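Step 6: Handle Complex Traces

Long Traces (15+ Steps)

For long traces, use the filmstrip first to spot suspicious steps. Look for:

- Steps where the URL changes unexpectedly (wrong navigation)
- Steps where the agent appears to backtrack
- Repeated, near-identical screenshots (the agent is stuck in a loop)

Then concentrate detailed annotation on those steps.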
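
Part of this triage can be pre-computed from the trace itself. The sketch below is a heuristic, not a Potato feature: `flag_suspicious_steps` is a hypothetical helper that reads the `url`, `action_type`, and `element_id` fields from the Step 1 example and flags steps that return to an earlier page or repeat the previous action verbatim. Treat the flags as hints for where to look, not as verdicts.

```python
def flag_suspicious_steps(steps):
    """Return {step_index: reason} for steps that look like backtracking or a loop."""
    first_seen = {}   # url -> index of the first step that showed it
    flags = {}
    previous = None   # (url, action_type, element_id) of the previous step
    for i, step in enumerate(steps):
        url = step["url"]
        signature = (url, step.get("action_type"), step.get("element_id"))
        if previous is not None and signature == previous:
            flags[i] = "repeats the previous action on the same page (possible loop)"
        elif url in first_seen and steps[i - 1]["url"] != url:
            flags[i] = (
                f"returns to a page first seen at step {first_seen[url]} "
                "(possible backtracking)"
            )
        first_seen.setdefault(url, i)
        previous = signature
    return flags
```

Several legitimate actions on one page (typing and then clicking, for instance) are not flagged, because only an exact repeat of the previous action counts as a possible loop.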

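Failed Traces

For traces where the agent failed, identify the first incorrect step; this is the most valuable annotation for improving the agent. Mark it clearly and describe what the agent should have done instead.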
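
Once per-step ratings exist, the first incorrect step can also be extracted programmatically. This is a small sketch under one assumption: the per-step labels are available as a mapping from step index to label, the same shape the analysis code in Step 8 assumes. `first_incorrect_step` is a hypothetical helper name.

```python
def first_incorrect_step(step_labels):
    """Return the earliest step index rated "Incorrect", or None if there is none.

    step_labels maps a step index (str or int) to its correctness label.
    """
    incorrect = [int(idx) for idx, label in step_labels.items() if label == "Incorrect"]
    return min(incorrect) if incorrect else None
```

Indices are converted to integers before comparison so that step "10" does not sort before step "2".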

Ambiguous Actions

Some actions are hard to judge without knowing the full page content. If DOM snapshots are available, enable them:

yaml

agentic:
  web_agent_display:        # nested under agentic, as in the Step 2 config
    show_dom_snapshot: true

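This adds a collapsible panel showing the raw HTML, which helps when the screenshot alone is not enough to judge (for example, when the agent clicked a region with several overlapping elements).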
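
When reading raw HTML by eye gets tedious, the snapshot can be queried for the element the agent targeted. The sketch below uses only the standard library's `html.parser`; `ElementFinder` and `lookup_element` are hypothetical helpers, and the approach assumes the snapshot is reasonably well-formed HTML and that the step's `element_id` matches an `id` attribute in it.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ElementFinder(HTMLParser):
    """Record the tag name and text content of the first element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.tag = None
        self.text_parts = []
        self._depth = 0  # greater than 0 while inside the target element

    def handle_starttag(self, tag, attrs):
        is_target = self.tag is None and dict(attrs).get("id") == self.target_id
        if is_target:
            self.tag = tag
        if tag in VOID_TAGS:
            return  # no closing tag will follow, so do not track nesting
        if self._depth:
            self._depth += 1
        elif is_target:
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth and tag not in VOID_TAGS:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.text_parts.append(data.strip())


def lookup_element(html, element_id):
    """Return (tag, text) for the element with this id, or None if it is absent."""
    finder = ElementFinder(element_id)
    finder.feed(html)
    if finder.tag is None:
        return None
    return finder.tag, " ".join(part for part in finder.text_parts if part)
```

For the last step of the example trace, calling `lookup_element` on the snapshot with `"add-to-cart-btn"` would tell you whether the clicked element really is a button and what text it carries.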

Step 7: Configure VisualWebArena

VisualWebArena traces include additional visual grounding information. The configuration is similar but uses grounding-oriented overlays:

yaml

agentic:
  enabled: true
  trace_converter: webarena         # same converter handles both
  display_type: web_agent
 
  web_agent_display:
    screenshot_max_width: 1000
    overlay:
      enabled: true
      click_marker: "crosshair"     # crosshair is better for precise grounding
      click_color: "#ef4444"
      click_radius: 15
      bounding_box: true            # show element bounding box if available
      bounding_box_color: "#f59e0b"
    filmstrip:
      enabled: true
      thumbnail_width: 180

Step 8: Analyze the Results

Action Correctness by Step Position

Web agent errors often cluster at particular positions in a trace. To see where errors occur:

python

import pandas as pd
import json
 
annotations = []
with open("output/annotations.jsonl") as f:
    for line in f:
        annotations.append(json.loads(line))
 
# Collect per-step correctness by position
step_errors = {}
for ann in annotations:
    correctness = ann["annotations"].get("action_correctness", {})
    for step_idx, label in correctness.items():
        pos = int(step_idx)
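        if pos not in step_errors:
            step_errors[pos] = {"Correct": 0, "Acceptable": 0, "Incorrect": 0, "Unnecessary": 0}
        # The configured label is "Acceptable (not optimal but ...)"; keep the part
        # before " (" so it matches the short keys above instead of raising KeyError
        step_errors[pos][label.split(" (")[0]] += 1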
 
# Print error rate by step position
print("Error rate by step position:")
for pos in sorted(step_errors.keys()):
    counts = step_errors[pos]
    total = sum(counts.values())
    error_rate = (counts["Incorrect"] + counts["Unnecessary"]) / total
    print(f"  Step {pos}: {error_rate:.1%} error rate ({total} observations)")

Error Type Distribution

python

error_counts = {}
for ann in annotations:
    errors = ann["annotations"].get("action_error_type", {})
    for step_idx, error_list in errors.items():
        for error in error_list:
            error_counts[error] = error_counts.get(error, 0) + 1
 
print("\nError Type Distribution:")
for error, count in sorted(error_counts.items(), key=lambda x: -x[1]):
    print(f"  {error}: {count}")

Success Rate by Website

python

# If traces span multiple websites
website_success = {}
for ann in annotations:
    # Assuming website info is in the original trace data
    success = ann["annotations"]["task_success"]
    website = ann.get("metadata", {}).get("website", "unknown")
    if website not in website_success:
        website_success[website] = {"Success": 0, "Partial Success": 0, "Failure": 0}
    website_success[website][success] += 1
 
for website, counts in website_success.items():
    total = sum(counts.values())
    rate = counts["Success"] / total
    print(f"{website}: {rate:.1%} success rate")
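
If the exported annotations do not carry a `metadata.website` field, the website can be recovered from the original trace file. The sketch below shows one way to do that join; `load_website_by_trace` and `attach_website` are hypothetical helpers, and the `id_key="id"` default is an assumption about how the annotation export names the instance identifier, so adjust it to match your output.

```python
import json


def load_website_by_trace(lines):
    """Map trace_id -> website from an iterable of JSONL lines (one trace per line)."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        trace = json.loads(line)
        mapping[trace["trace_id"]] = trace.get("website", "unknown")
    return mapping


def attach_website(annotations, website_by_trace, id_key="id"):
    """Fill in metadata.website on each annotation record that lacks it."""
    for ann in annotations:
        metadata = ann.setdefault("metadata", {})
        metadata.setdefault("website", website_by_trace.get(ann.get(id_key), "unknown"))
    return annotations
```

An open file handle works as the `lines` argument, for example `load_website_by_trace(open("data/web_traces.jsonl"))`, after which the per-website loop above can run unchanged on the enriched records.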

Step 9: Scale Up the Evaluation

Multiple Annotators and Agreement

For research papers, assign several annotators to each trace:

yaml

annotation_task_config:
  total_annotations_per_instance: 3
  assignment_strategy: random

Compute inter-annotator agreement on the task success labels:

python

from sklearn.metrics import cohen_kappa_score
import pandas as pd
 
df = pd.read_parquet("output/parquet/annotations.parquet")
success = df[df["schema_name"] == "task_success"]
pivot = success.pivot(index="instance_id", columns="annotator", values="value")
 
# Pairwise kappa
annotators = pivot.columns.tolist()
for i in range(len(annotators)):
    for j in range(i + 1, len(annotators)):
        mask = pivot[[annotators[i], annotators[j]]].dropna()
        kappa = cohen_kappa_score(mask[annotators[i]], mask[annotators[j]])
        print(f"Kappa ({annotators[i]} vs {annotators[j]}): {kappa:.3f}")
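
Pairwise Cohen's kappa yields one number per annotator pair. With three annotators per trace, a single summary statistic such as Fleiss' kappa is a common alternative. The sketch below implements the standard formula in plain Python; it is an addition to this guide rather than a Potato feature, `fleiss_kappa` is a hypothetical helper, and it assumes every item has the same number of ratings.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item label lists with equal rater counts."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings (at least 2)")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    observed = agreement_sum / n_items
    # Chance agreement from the overall proportion of each category
    expected = sum(
        (total / (n_items * n_raters)) ** 2 for total in category_totals.values()
    )
    if expected == 1.0:
        return 1.0  # kappa is undefined when only one category is used; this sketch returns 1.0
    return (observed - expected) / (1 - expected)
```

With the `pivot` table from the pairwise example, something like `fleiss_kappa(pivot.dropna().values.tolist())` should work, since dropping incomplete rows leaves one label per annotator for every trace.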

Combining with Solo Mode

For large-scale evaluations (500+ traces), use Solo Mode to let an LLM handle the easy traces:

yaml

solo_mode:
  enabled: true
  llm:
    endpoint_type: openai
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
  accuracy_threshold: 0.90
 
agentic:
  enabled: true
  trace_converter: webarena
  display_type: web_agent

Humans evaluate the difficult traces; the LLM handles the clear successes and obvious failures.
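
Before trusting LLM labels at scale, it is worth measuring how often they match human labels on a sample that both have rated. How Potato applies `accuracy_threshold` internally is not covered here; the sketch below only illustrates the general idea, and `llm_agreement` and `meets_threshold` are hypothetical helpers operating on plain dicts of trace ID to label.

```python
def llm_agreement(human_labels, llm_labels):
    """Fraction of jointly labeled traces where the LLM label equals the human label."""
    shared = [trace_id for trace_id in human_labels if trace_id in llm_labels]
    if not shared:
        raise ValueError("no traces were labeled by both the human and the LLM")
    matches = sum(human_labels[t] == llm_labels[t] for t in shared)
    return matches / len(shared)


def meets_threshold(human_labels, llm_labels, threshold=0.90):
    """True when LLM-human agreement on the shared sample reaches the threshold."""
    return llm_agreement(human_labels, llm_labels) >= threshold
```

If agreement on the shared sample falls below the threshold, routing more traces to human annotators is the safer choice.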

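Summary

Evaluating web browsing agents requires seeing exactly what the agent saw and did. Potato's web agent display provides:

- Full screenshots with SVG overlays marking click targets, input fields, and scroll actions
- Filmstrip navigation for quick overviews and random access to steps
- URL bar for tracking the agent's navigation path
- Per-step annotation with a web-specific error taxonomy
- Flexible configuration supporting WebArena, VisualWebArena, and raw browser recordings

The key to effective web agent evaluation is the visual overlay: without seeing exactly where the agent clicked, evaluators cannot reliably judge action correctness.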
