Guides · 10 min read

Annotating Web Browsing Agents: From WebArena Traces to Human Evaluation

How to use Potato's web agent trace display to evaluate autonomous web browsing agents, with step-by-step screenshots, SVG overlays, and per-step annotation schemas.

By Potato Team


Web browsing agents operate in a fundamentally different modality from text-based agents. They navigate real web pages, click buttons, fill forms, and scroll through content. Evaluating them requires seeing what the agent saw (the page state) and what the agent did (the action taken), ideally with visual overlays showing exactly where the agent clicked.

Potato's web agent trace display is purpose-built for this task. It renders full-page screenshots with SVG action overlays, provides a filmstrip view for quick navigation, and supports per-step annotation of action correctness.

This guide walks through evaluating WebArena traces, but the same approach works for VisualWebArena, raw browser recordings, and any other web agent format.


Prerequisites

bash
pip install potato-annotation

You will need WebArena trace files, which typically include screenshots and a JSON action log. If you are working with VisualWebArena, the format is similar but may include additional visual grounding information.


Step 1: Understanding WebArena Trace Format

A WebArena trace consists of a JSON file per episode containing the task description, action sequence, and screenshot paths. Here is a simplified example.

Create data/web_traces.jsonl:

json
{
  "trace_id": "wa_001",
  "task": "Find the cheapest laptop on the electronics store and add it to the cart",
  "website": "shopping",
  "steps": [
    {
      "step": 0,
      "url": "http://shop.example.com/",
      "action_type": "click",
      "action_target": "Electronics category link",
      "element_id": "nav-electronics",
      "coordinates": [245, 82],
      "screenshot": "screenshots/wa_001_step_00.png",
      "dom_snapshot": "dom/wa_001_step_00.html"
    },
    {
      "step": 1,
      "url": "http://shop.example.com/electronics",
      "action_type": "click",
      "action_target": "Laptops subcategory",
      "element_id": "cat-laptops",
      "coordinates": [180, 310],
      "screenshot": "screenshots/wa_001_step_01.png"
    },
    {
      "step": 2,
      "url": "http://shop.example.com/electronics/laptops",
      "action_type": "click",
      "action_target": "Sort by: Price Low to High",
      "element_id": "sort-price-asc",
      "coordinates": [720, 155],
      "screenshot": "screenshots/wa_001_step_02.png"
    },
    {
      "step": 3,
      "url": "http://shop.example.com/electronics/laptops?sort=price_asc",
      "action_type": "click",
      "action_target": "First laptop: 'Budget Pro 14' - $349",
      "element_id": "product-101",
      "coordinates": [400, 380],
      "screenshot": "screenshots/wa_001_step_03.png"
    },
    {
      "step": 4,
      "url": "http://shop.example.com/product/101",
      "action_type": "click",
      "action_target": "Add to Cart button",
      "element_id": "add-to-cart-btn",
      "coordinates": [650, 520],
      "screenshot": "screenshots/wa_001_step_04.png"
    }
  ],
  "success": true,
  "final_screenshot": "screenshots/wa_001_final.png"
}

Each step has a screenshot, the action taken, the target element, and click coordinates. Potato uses this information to render visual overlays. (The trace is pretty-printed above for readability; in the actual .jsonl file, each trace is a single JSON object on one line.)
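Before launching annotation, it can save time to verify that every screenshot a trace references actually exists on disk; otherwise the filmstrip will have holes. A minimal pre-flight sketch (the `missing_screenshots` helper is our own, not part of Potato):

```python
from pathlib import Path

def missing_screenshots(trace, root="."):
    """Return screenshot paths referenced by a trace that are absent on disk."""
    paths = [step["screenshot"] for step in trace.get("steps", []) if "screenshot" in step]
    if "final_screenshot" in trace:
        paths.append(trace["final_screenshot"])
    return [p for p in paths if not (Path(root) / p).is_file()]
```

Run it over each line of data/web_traces.jsonl and flag any trace with a non-empty result.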


Step 2: Configure the Project

Create config.yaml:

yaml
task_name: "WebArena Agent Evaluation"
task_dir: "."
 
data_files:
  - "data/web_traces.jsonl"
 
item_properties:
  id_key: trace_id
  text_key: task
 
# --- Agentic annotation with web display ---
agentic:
  enabled: true
  trace_converter: webarena
  display_type: web_agent
 
  web_agent_display:
    # Screenshot rendering
    screenshot_max_width: 900
    screenshot_quality: 85
 
    # SVG overlays
    overlay:
      enabled: true
      click_marker: "circle"
      click_color: "#ef4444"
      click_radius: 20
      type_highlight: "#3b82f6"
      scroll_indicator: true
 
    # Filmstrip navigation
    filmstrip:
      enabled: true
      thumbnail_width: 150
      show_action_labels: true
 
    # Additional display options
    show_url_bar: true
    show_action_description: true
    show_dom_snapshot: false
 
# --- Annotation Schemas ---
annotation_schemes:
  # Overall task evaluation
  - annotation_type: radio
    name: task_success
    description: "Did the agent complete the task successfully?"
    labels:
      - "Success"
      - "Partial Success"
      - "Failure"
    label_requirement:
      required: true
 
  - annotation_type: radio
    name: task_efficiency
    description: "Was the agent's navigation path efficient?"
    labels:
      - "Optimal path"
      - "Reasonable but not optimal"
      - "Inefficient (unnecessary steps)"
      - "Completely wrong direction"
    label_requirement:
      required: true
 
  # Per-step evaluation
  - annotation_type: per_turn_rating
    name: action_correctness
    target: agentic_steps
    description: "Was this action correct?"
    rating_type: radio
    labels:
      - "Correct"
      - "Acceptable (not optimal but progresses toward goal)"
      - "Incorrect"
      - "Unnecessary"
 
  - annotation_type: per_turn_rating
    name: action_error_type
    target: agentic_steps
    description: "What went wrong?"
    rating_type: multiselect
    labels:
      - "Wrong element clicked"
      - "Wrong page navigated to"
      - "Missed a closer/better option"
      - "Incorrect form input"
      - "Premature task completion"
      - "Unnecessary navigation"
      - "Failed to scroll to target"
      - "Interaction with wrong page section"
      - "Other"
    conditional:
      show_when:
        action_correctness: ["Incorrect", "Unnecessary"]
 
  - annotation_type: per_turn_rating
    name: action_notes
    target: agentic_steps
    description: "Notes on this step"
    rating_type: text
    label_requirement:
      required: false
 
output_annotation_dir: "output/"
output_annotation_format: "jsonl"
 
parquet_export:
  enabled: true
  output_dir: "output/parquet/"

Step 3: Understanding the Web Agent Display

When you open a trace, the web agent display shows:

The Main Screenshot View

The current step's screenshot is displayed at full width (up to 900px). An SVG overlay is drawn on top:

  • Red circle at the click coordinates, showing exactly where the agent clicked
  • Blue highlight around text input fields where the agent typed
  • Arrow indicator for scroll actions showing direction and magnitude

Below the screenshot, you see:

  • URL bar showing the page URL at this step
  • Action description (e.g., "Click on 'Electronics category link' at coordinates [245, 82]")
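Conceptually, the overlay is just an SVG layer sized to the screenshot, with shapes placed at the logged coordinates. An illustrative sketch of how a click marker could be generated (Potato renders this internally; the function below is our own approximation):

```python
def click_overlay_svg(x, y, width, height, color="#ef4444", radius=20):
    """Build an SVG layer with a click-marker circle at (x, y)."""
    return (
        f'<svg width="{width}" height="{height}" '
        f'xmlns="http://www.w3.org/2000/svg">'
        f'<circle cx="{x}" cy="{y}" r="{radius}" '
        f'fill="none" stroke="{color}" stroke-width="3"/>'
        f'</svg>'
    )

# Step 0 of the example trace logged a click at [245, 82]
svg = click_overlay_svg(245, 82, 900, 600)
```

Positioned absolutely over the screenshot image, the circle lands exactly on the logged click point.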

The Filmstrip

At the bottom of the display, a horizontal filmstrip shows thumbnails of all screenshots. Each thumbnail has a small label indicating the action type (click, type, scroll). Click any thumbnail to jump to that step.

The filmstrip is especially valuable for long traces (10+ steps) where scrolling through the main view would be tedious.

Per-Step Annotation

Next to each screenshot, the per-step annotation controls appear. Rate the action, and if it is incorrect, select the error type.


Step 4: The Annotation Workflow

A typical workflow for evaluating a web agent trace:

  1. Read the task description. Understand what the agent was supposed to accomplish.

  2. Use the filmstrip for an overview. Quickly scan all screenshots to get a sense of the agent's trajectory before rating individual steps.

  3. Walk through each step:

    • Look at the screenshot to understand the page state
    • Check the SVG overlay to see what the agent clicked
    • Read the action description
    • Rate the action as Correct, Acceptable, Incorrect, or Unnecessary
    • If incorrect, select the error type(s)

  4. Rate the overall trace. After reviewing all steps, rate task success and efficiency.

  5. Submit and move to the next trace.

What to Look For

Correct actions move the agent closer to the goal in a reasonable way. The agent clicked the right element, navigated to the right page, or entered the correct information.

Acceptable actions are not the optimal choice but still make progress. For example, the agent browses to a category page instead of using the search bar -- slower, but still viable.

Incorrect actions are mistakes: clicking the wrong element, navigating to an irrelevant page, or entering wrong information in a form.

Unnecessary actions do not contribute to the goal: clicking something and then immediately going back, scrolling past the target, or navigating to pages that are not relevant.


Step 5: Error Taxonomy

Potato includes a purpose-built error taxonomy for web agent actions. Here is how to apply each category:

| Error Type | Description | Example |
| --- | --- | --- |
| Wrong element clicked | Agent clicked an incorrect UI element | Clicked "Tablets" instead of "Laptops" |
| Wrong page navigated to | Agent ended up on an irrelevant page | Navigated to "About Us" instead of product listing |
| Missed a closer/better option | A better action was available | Used category browsing instead of search bar |
| Incorrect form input | Agent entered wrong text in a form | Searched for "labtop" instead of "laptop" |
| Premature task completion | Agent declared success too early | Added wrong item to cart and stopped |
| Unnecessary navigation | Step does not contribute to the goal | Visited homepage between category pages |
| Failed to scroll to target | Target was below the viewport | Element was not visible; agent should have scrolled |
| Interaction with wrong page section | Correct page but wrong area | Clicked the header instead of the main content |

Step 6: Handling Complex Traces

Long Traces (15+ Steps)

For long traces, use the filmstrip to identify suspicious steps first. Look for:

  • Steps where the URL changes unexpectedly (wrong navigation)
  • Steps where the agent appears to go backward
  • Repeated similar screenshots (agent stuck in a loop)

Then focus your detailed annotation on those steps.
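These checks can be partly mechanized as a rough pre-filter before detailed annotation. A sketch that flags revisited URLs, which usually indicate backtracking or a loop (this helper is our own assumption, not a Potato feature):

```python
def suspicious_steps(steps):
    """Flag indices of steps that return to a URL the agent had already left."""
    last_seen = {}
    flagged = []
    for i, step in enumerate(steps):
        url = step.get("url", "")
        # A URL seen before, but not on the immediately preceding step,
        # means the agent navigated away and came back.
        if url in last_seen and last_seen[url] != i - 1:
            flagged.append(i)
        last_seen[url] = i
    return flagged
```

A flagged index is a hint to inspect that step and its neighbors first, not a verdict on its own.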

Failed Traces

For traces where the agent fails, identify the first incorrect step -- this is the most valuable annotation for improving the agent. Mark it clearly and describe what the agent should have done instead.
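Given the per-step ratings, the first incorrect step can be recovered mechanically from the action_correctness annotations (using the same field shape as the analysis code in Step 8; the helper itself is ours):

```python
def first_incorrect_step(correctness):
    """Earliest step index rated 'Incorrect', or None if no step was."""
    bad = [int(idx) for idx, label in correctness.items() if label == "Incorrect"]
    return min(bad) if bad else None
```

This makes it easy to aggregate "how far into the trace did the agent first go wrong?" across an entire evaluation set.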

Ambiguous Actions

Some actions are hard to judge without knowing the full page content. If the DOM snapshot is available, enable it:

yaml
web_agent_display:
  show_dom_snapshot: true

This adds a collapsible panel showing the raw HTML, which helps when the screenshot alone is ambiguous (e.g., the agent clicked in a region with multiple overlapping elements).


Step 7: Configuring for VisualWebArena

VisualWebArena traces include additional visual grounding information. The configuration is similar but uses the visual grounding overlay:

yaml
agentic:
  enabled: true
  trace_converter: webarena         # same converter handles both
  display_type: web_agent
 
  web_agent_display:
    screenshot_max_width: 1000
    overlay:
      enabled: true
      click_marker: "crosshair"     # crosshair is better for precise grounding
      click_color: "#ef4444"
      click_radius: 15
      bounding_box: true            # show element bounding box if available
      bounding_box_color: "#f59e0b"
    filmstrip:
      enabled: true
      thumbnail_width: 180

Step 8: Analyzing Results

Action Correctness by Step Position

Web agent errors often cluster at specific points in the trace. Analyze where errors occur:

python
from collections import Counter, defaultdict
import json
 
annotations = []
with open("output/annotations.jsonl") as f:
    for line in f:
        annotations.append(json.loads(line))
 
# Tally per-step correctness labels by step position. Counter handles the
# long "Acceptable (not optimal but progresses toward goal)" label without
# needing to hard-code every label string as a dict key.
step_labels = defaultdict(Counter)
for ann in annotations:
    correctness = ann["annotations"].get("action_correctness", {})
    for step_idx, label in correctness.items():
        step_labels[int(step_idx)][label] += 1
 
# Print error rate by step position
print("Error rate by step position:")
for pos in sorted(step_labels):
    counts = step_labels[pos]
    total = sum(counts.values())
    error_rate = (counts["Incorrect"] + counts["Unnecessary"]) / total
    print(f"  Step {pos}: {error_rate:.1%} error rate ({total} observations)")

Error Type Distribution

python
error_counts = {}
for ann in annotations:
    errors = ann["annotations"].get("action_error_type", {})
    for step_idx, error_list in errors.items():
        for error in error_list:
            error_counts[error] = error_counts.get(error, 0) + 1
 
print("\nError Type Distribution:")
for error, count in sorted(error_counts.items(), key=lambda x: -x[1]):
    print(f"  {error}: {count}")

Success Rate by Website

python
# If traces span multiple websites
website_success = {}
for ann in annotations:
    # Assuming website info is in the original trace data
    success = ann["annotations"]["task_success"]
    website = ann.get("metadata", {}).get("website", "unknown")
    if website not in website_success:
        website_success[website] = {"Success": 0, "Partial Success": 0, "Failure": 0}
    website_success[website][success] += 1
 
for website, counts in website_success.items():
    total = sum(counts.values())
    rate = counts["Success"] / total
    print(f"{website}: {rate:.1%} success rate")

Step 9: Scaling the Evaluation

Multiple Annotators with Agreement

For research papers, assign multiple annotators per trace:

yaml
annotation_task_config:
  total_annotations_per_instance: 3
  assignment_strategy: random

Compute inter-annotator agreement on the task success label:

python
from sklearn.metrics import cohen_kappa_score
import pandas as pd
 
df = pd.read_parquet("output/parquet/annotations.parquet")
success = df[df["schema_name"] == "task_success"]
pivot = success.pivot(index="instance_id", columns="annotator", values="value")
 
# Pairwise kappa
annotators = pivot.columns.tolist()
for i in range(len(annotators)):
    for j in range(i + 1, len(annotators)):
        mask = pivot[[annotators[i], annotators[j]]].dropna()
        kappa = cohen_kappa_score(mask[annotators[i]], mask[annotators[j]])
        print(f"Kappa ({annotators[i]} vs {annotators[j]}): {kappa:.3f}")

Combining with Solo Mode

For large-scale evaluations (500+ traces), use Solo Mode to let an LLM handle the easy traces:

yaml
solo_mode:
  enabled: true
  llm:
    endpoint_type: openai
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
  accuracy_threshold: 0.90
 
agentic:
  enabled: true
  trace_converter: webarena
  display_type: web_agent

The human evaluates the hard traces; the LLM handles straightforward successes and obvious failures.


Summary

Evaluating web browsing agents requires seeing exactly what the agent saw and did. Potato's web agent display provides:

  • Full screenshots with SVG overlays marking click targets, input fields, and scroll actions
  • Filmstrip navigation for quick overview and random access to steps
  • URL bar tracking the agent's navigation path
  • Per-step annotation with a web-specific error taxonomy
  • Flexible configuration for WebArena, VisualWebArena, and raw browser recordings

The key to effective web agent evaluation is the visual overlay: without seeing exactly where the agent clicked, evaluators cannot reliably judge action correctness.

