Annotating Web Browsing Agents: From WebArena Traces to Human Evaluation

Web browsing agents operate in a fundamentally different modality from text-based agents. They navigate real web pages, click buttons, fill in forms, and scroll through content. Evaluating them requires seeing both what the agent saw (the page state) and what it did (the action taken), ideally with visual overlays that show exactly where the agent clicked.

Potato's web agent trace display is built specifically for this task. It renders full-page screenshots with SVG action overlays, provides a filmstrip view for quick navigation, and supports per-step annotation of action correctness.

This guide walks through evaluating WebArena traces, but the same approach works for VisualWebArena, raw browser recordings, and any other web agent format.

Prerequisites

bash

pip install potato-annotation

You will need WebArena trace files, which typically include screenshots and a JSON action log. If you are working with VisualWebArena, the format is similar but may include additional visual grounding information.

Step 1: Understand the WebArena Trace Format

A WebArena trace has one JSON file per episode containing the task description, the action sequence, and the screenshot paths. Here is a simplified example.

Create data/web_traces.jsonl. The record is pretty-printed below for readability; in the actual .jsonl file each trace must occupy a single line:

json

{
  "trace_id": "wa_001",
  "task": "Find the cheapest laptop on the electronics store and add it to the cart",
  "website": "shopping",
  "steps": [
    {
      "step": 0,
      "url": "http://shop.example.com/",
      "action_type": "click",
      "action_target": "Electronics category link",
      "element_id": "nav-electronics",
      "coordinates": [245, 82],
      "screenshot": "screenshots/wa_001_step_00.png",
      "dom_snapshot": "dom/wa_001_step_00.html"
    },
    {
      "step": 1,
      "url": "http://shop.example.com/electronics",
      "action_type": "click",
      "action_target": "Laptops subcategory",
      "element_id": "cat-laptops",
      "coordinates": [180, 310],
      "screenshot": "screenshots/wa_001_step_01.png"
    },
    {
      "step": 2,
      "url": "http://shop.example.com/electronics/laptops",
      "action_type": "click",
      "action_target": "Sort by: Price Low to High",
      "element_id": "sort-price-asc",
      "coordinates": [720, 155],
      "screenshot": "screenshots/wa_001_step_02.png"
    },
    {
      "step": 3,
      "url": "http://shop.example.com/electronics/laptops?sort=price_asc",
      "action_type": "click",
      "action_target": "First laptop: 'Budget Pro 14' - $349",
      "element_id": "product-101",
      "coordinates": [400, 380],
      "screenshot": "screenshots/wa_001_step_03.png"
    },
    {
      "step": 4,
      "url": "http://shop.example.com/product/101",
      "action_type": "click",
      "action_target": "Add to Cart button",
      "element_id": "add-to-cart-btn",
      "coordinates": [650, 520],
      "screenshot": "screenshots/wa_001_step_04.png"
    }
  ],
  "success": true,
  "final_screenshot": "screenshots/wa_001_final.png"
}

Each step contains a screenshot, the action taken, the target element, and the click coordinates. Potato uses this information to render the visual overlays.
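
Before loading traces into Potato, it can help to check that every record carries the fields the display relies on. The following is a minimal sketch, not part of Potato: the helper names validate_trace and validate_jsonl are hypothetical, and the set of fields treated as required is an assumption based on the example trace above.

```python
import json

# Field names follow the example trace above. Which fields the converter
# strictly requires is an assumption; adjust to match your export.
REQUIRED_STEP_FIELDS = ("step", "url", "action_type", "screenshot")


def validate_trace(trace):
    """Return a list of human-readable problems found in one trace dict."""
    problems = []
    for key in ("trace_id", "task", "steps"):
        if key not in trace:
            problems.append(f"missing top-level field: {key}")
    for position, step in enumerate(trace.get("steps", [])):
        for field in REQUIRED_STEP_FIELDS:
            if field not in step:
                problems.append(f"step {position}: missing {field}")
        # Click actions need coordinates so an overlay has something to draw.
        if step.get("action_type") == "click":
            coords = step.get("coordinates")
            if not (isinstance(coords, list) and len(coords) == 2):
                problems.append(f"step {position}: click without [x, y] coordinates")
    return problems


def validate_jsonl(path):
    """Validate each non-empty line of a JSONL file; returns {line_number: problems}."""
    report = {}
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue
            problems = validate_trace(json.loads(line))
            if problems:
                report[line_number] = problems
    return report


example = {
    "trace_id": "wa_001",
    "task": "Find the cheapest laptop",
    "steps": [
        {"step": 0, "url": "http://shop.example.com/", "action_type": "click",
         "coordinates": [245, 82], "screenshot": "screenshots/wa_001_step_00.png"},
        {"step": 1, "url": "http://shop.example.com/electronics", "action_type": "click",
         "screenshot": "screenshots/wa_001_step_01.png"},
    ],
}
print(validate_trace(example))
```

Calling validate_jsonl("data/web_traces.jsonl") applies the same checks to a whole file and also surfaces records that were accidentally split across several lines, since json.loads fails on a partial line.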

Step 2: Configure the Project

Create config.yaml:

yaml

task_name: "WebArena Agent Evaluation"
task_dir: "."
 
data_files:
  - "data/web_traces.jsonl"
 
item_properties:
  id_key: trace_id
  text_key: task
 
# --- Agentic annotation with web display ---
agentic:
  enabled: true
  trace_converter: webarena
  display_type: web_agent
 
  web_agent_display:
    # Screenshot rendering
    screenshot_max_width: 900
    screenshot_quality: 85
 
    # SVG overlays
    overlay:
      enabled: true
      click_marker: "circle"
      click_color: "#ef4444"
      click_radius: 20
      type_highlight: "#3b82f6"
      scroll_indicator: true
 
    # Filmstrip navigation
    filmstrip:
      enabled: true
      thumbnail_width: 150
      show_action_labels: true
 
    # Additional display options
    show_url_bar: true
    show_action_description: true
    show_dom_snapshot: false
 
# --- Annotation Schemas ---
annotation_schemes:
  # Overall task evaluation
  - annotation_type: radio
    name: task_success
    description: "Did the agent complete the task successfully?"
    labels:
      - "Success"
      - "Partial Success"
      - "Failure"
    label_requirement:
      required: true
 
  - annotation_type: radio
    name: task_efficiency
    description: "Was the agent's navigation path efficient?"
    labels:
      - "Optimal path"
      - "Reasonable but not optimal"
      - "Inefficient (unnecessary steps)"
      - "Completely wrong direction"
    label_requirement:
      required: true
 
  # Per-step evaluation
  - annotation_type: per_turn_rating
    name: action_correctness
    target: agentic_steps
    description: "Was this action correct?"
    rating_type: radio
    labels:
      - "Correct"
      - "Acceptable (not optimal but progresses toward goal)"
      - "Incorrect"
      - "Unnecessary"
 
  - annotation_type: per_turn_rating
    name: action_error_type
    target: agentic_steps
    description: "What went wrong?"
    rating_type: multiselect
    labels:
      - "Wrong element clicked"
      - "Wrong page navigated to"
      - "Missed a closer/better option"
      - "Incorrect form input"
      - "Premature task completion"
      - "Unnecessary navigation"
      - "Failed to scroll to target"
      - "Interaction with wrong page section"
      - "Other"
    conditional:
      show_when:
        action_correctness: ["Incorrect", "Unnecessary"]
 
  - annotation_type: per_turn_rating
    name: action_notes
    target: agentic_steps
    description: "Notes on this step"
    rating_type: text
    label_requirement:
      required: false
 
output_annotation_dir: "output/"
output_annotation_format: "jsonl"
 
parquet_export:
  enabled: true
  output_dir: "output/parquet/"
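
A conditional schema only appears when its show_when labels match the parent schema's labels exactly, so a typo or a shortened label can silently hide a field. The sketch below checks this on the annotation_schemes list once the YAML has been loaded into Python (for example with a YAML parser of your choice). The function name check_conditionals is hypothetical and the check is an illustration, not a Potato feature.

```python
def check_conditionals(schemes):
    """Report show_when labels that do not exist in the schema they refer to."""
    by_name = {scheme["name"]: scheme for scheme in schemes}
    problems = []
    for scheme in schemes:
        show_when = scheme.get("conditional", {}).get("show_when", {})
        for parent_name, wanted_labels in show_when.items():
            parent = by_name.get(parent_name)
            if parent is None:
                problems.append(f"{scheme['name']}: unknown schema {parent_name}")
                continue
            for label in wanted_labels:
                if label not in parent.get("labels", []):
                    problems.append(
                        f"{scheme['name']}: {parent_name} has no label {label!r}"
                    )
    return problems


# A trimmed copy of two schemas from the config above.
schemes = [
    {"name": "action_correctness",
     "labels": ["Correct",
                "Acceptable (not optimal but progresses toward goal)",
                "Incorrect",
                "Unnecessary"]},
    {"name": "action_error_type",
     "labels": ["Wrong element clicked", "Other"],
     "conditional": {"show_when": {"action_correctness": ["Incorrect", "Unnecessary"]}}},
]
print(check_conditionals(schemes))
```

Note that the second correctness label is the full string "Acceptable (not optimal but progresses toward goal)"; any later code or condition that refers to it as just "Acceptable" will not match.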

Step 3: Understand the Web Agent Display

When you open a trace, the web agent display shows:

Main Screenshot View

The current step's screenshot is displayed at full width (up to 900px). An SVG overlay is drawn on top of it:

- A red circle at the click coordinates, showing exactly where the agent clicked
- A blue highlight around text input fields where the agent typed
- An arrow indicator for scroll actions, showing direction and magnitude
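
To make the click overlay concrete, here is a rough sketch of the geometry involved. It is not Potato's implementation: click_overlay_svg is a hypothetical helper, and it assumes the trace coordinates are pixel positions in the original screenshot, which then have to be scaled when the image is shrunk to the maximum display width.

```python
def click_overlay_svg(coordinates, image_size, max_width=900,
                      radius=20, color="#ef4444"):
    """Build an SVG string with a circle at the (scaled) click position.

    coordinates: [x, y] in original screenshot pixels (assumed).
    image_size:  (width, height) of the original screenshot.
    """
    x, y = coordinates
    width, height = image_size
    # Screenshots wider than max_width are shrunk; smaller ones are left alone.
    scale = min(1.0, max_width / width)
    shown_w, shown_h = round(width * scale), round(height * scale)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{shown_w}" height="{shown_h}">'
        f'<circle cx="{x * scale:.1f}" cy="{y * scale:.1f}" r="{radius}" '
        f'fill="none" stroke="{color}" stroke-width="3"/>'
        "</svg>"
    )


# A click at [245, 82] on an 1800x1200 screenshot shown at 900px wide.
print(click_overlay_svg([245, 82], (1800, 1200)))
```

The practical point for annotators is that a marker is only as trustworthy as the coordinate system behind it: if the circle looks consistently offset from the intended element, a mismatch between recorded and displayed resolution is a likely cause.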

Below the screenshot you see:

- A URL bar showing the page URL at this step
- An action description (e.g., "Click on 'Electronics category link' at coordinates [245, 82]")
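
An action description like the one quoted above can be derived directly from the step fields. The helper below, describe_action, is a hypothetical illustration using the field names from the example trace; Potato's own wording may differ.

```python
def describe_action(step):
    """Turn one trace step into a short human-readable sentence."""
    description = f"{step['action_type'].capitalize()} on '{step['action_target']}'"
    # Only some actions (such as clicks) carry coordinates.
    if "coordinates" in step:
        description += f" at coordinates {list(step['coordinates'])}"
    return description


step = {
    "action_type": "click",
    "action_target": "Electronics category link",
    "coordinates": [245, 82],
}
print(describe_action(step))
```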

Filmstrip

At the bottom of the display, a horizontal filmstrip shows thumbnails of all screenshots. Each thumbnail carries a small label indicating the action type (click, type, scroll). Click any thumbnail to jump to that step.

The filmstrip is especially valuable for long traces (10+ steps), where scrolling through the main view would be tedious.

Per-Step Annotation

Per-step annotation controls appear next to each screenshot. Rate the action and, if it is Incorrect or Unnecessary, select the error type.

Step 4: Annotation Workflow

The typical workflow for evaluating a web agent trace:

1. Read the task description. Understand what the agent was supposed to accomplish.
2. Use the filmstrip for an overview. Scan all screenshots quickly to get a feel for the agent's trajectory before rating individual steps.
3. Go through each step:
   - Look at the screenshot to understand the page state
   - Check the SVG overlay to see what the agent clicked
   - Read the action description
   - Rate the action as Correct, Acceptable, Incorrect, or Unnecessary
   - If the action is Incorrect or Unnecessary, select the error type(s)
4. Rate the overall trace. After reviewing all steps, rate task success and efficiency.
5. Submit and move on to the next trace.
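
The workflow implies a simple completeness rule that can be checked after the fact. The sketch below assumes that per-step labels are stored as a dict keyed by the step index as a string (JSON object keys are always strings) under the schema names from config.yaml. Treating every step rating as mandatory is a project convention assumed here, not something the config enforces, and missing_annotations is a hypothetical helper.

```python
NEEDS_ERROR_TYPE = {"Incorrect", "Unnecessary"}


def missing_annotations(annotations, num_steps):
    """List what is still missing from one trace's annotations.

    annotations: dict keyed by schema name, e.g. the "annotations" field
                 of one output record (structure assumed).
    num_steps:   number of steps in the trace.
    """
    problems = []
    correctness = annotations.get("action_correctness", {})
    error_types = annotations.get("action_error_type", {})
    for step in range(num_steps):
        label = correctness.get(str(step))
        if label is None:
            problems.append(f"step {step}: no correctness rating")
        elif label in NEEDS_ERROR_TYPE and not error_types.get(str(step)):
            problems.append(f"step {step}: rated {label} but no error type selected")
    for name in ("task_success", "task_efficiency"):
        if name not in annotations:
            problems.append(f"missing trace-level label: {name}")
    return problems


record = {
    "task_success": "Failure",
    "action_correctness": {"0": "Correct", "1": "Incorrect"},
    "action_error_type": {},
}
for problem in missing_annotations(record, num_steps=3):
    print(problem)
```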

What to Look For

Correct actions move the agent closer to the goal in a reasonable way. The agent clicked the right element, navigated to the right page, or entered the right information.

Acceptable actions are not the optimal choice but still make progress. For example, the agent browses through a category page instead of using the search bar: slower, but still viable.

Incorrect actions are mistakes: clicking the wrong element, navigating to an irrelevant page, or entering wrong information in a form.

Unnecessary actions do not contribute to the goal: clicking something and then immediately going back, scrolling past the target, or navigating to pages that are not relevant.

Step 5: Error Taxonomy

The configuration in Step 2 includes a purpose-built error taxonomy for web agent actions. Here is how to apply each category:

Error Type	Description	Example
Wrong element clicked	The agent clicked the wrong UI element	Clicked "Laptops" instead of "Tablets"
Wrong page navigated to	The agent landed on an irrelevant page	Navigated to "About Us" instead of the product listing
Missed a closer/better option	A better action was available	Used category browsing instead of the search bar
Incorrect form input	The agent entered wrong text in a form	Searched for "labtop" instead of "laptop"
Premature task completion	The agent declared success too early	Added the wrong item to the cart and stopped
Unnecessary navigation	The step does not contribute to the goal	Visited the homepage between category pages
Failed to scroll to target	The target was below the viewport	The element was not visible; the agent should have scrolled
Interaction with wrong page section	Right page but wrong area	Clicked the header instead of the main content

Step 6: Handling Complex Traces

Long Traces (15+ Steps)

For long traces, first use the filmstrip to identify suspicious steps. Look for:

- Steps where the URL changes unexpectedly (wrong navigation)
- Steps where the agent appears to go backward
- Repeated identical screenshots (the agent is stuck in a loop)

Then focus your detailed annotation on those steps.
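
Two of these signals, backtracking and loops, can be approximated from the action log alone before anyone opens the trace. The following is a heuristic sketch under the assumption that steps carry the url, action_type, and element_id fields from the example trace; flag_suspicious_steps is a hypothetical helper. It compares logged actions rather than screenshot pixels, so it only hints at where to look.

```python
def flag_suspicious_steps(steps):
    """Return {step_index: [reasons]} for steps worth a closer look."""
    flags = {}
    first_seen = {}            # url -> index of the first step on that url
    previous_signature = None
    for index, step in enumerate(steps):
        reasons = []
        url = step.get("url")
        signature = (url, step.get("action_type"), step.get("element_id"))
        if signature == previous_signature:
            # Same action on the same page twice in a row.
            reasons.append("repeats previous action (possible loop)")
        elif url in first_seen and steps[index - 1].get("url") != url:
            # The agent left this page earlier and has now come back.
            reasons.append(
                f"returns to URL first seen at step {first_seen[url]} "
                "(possible backtracking)"
            )
        first_seen.setdefault(url, index)
        previous_signature = signature
        if reasons:
            flags[index] = reasons
    return flags


steps = [
    {"url": "http://shop.example.com/", "action_type": "click", "element_id": "nav-electronics"},
    {"url": "http://shop.example.com/electronics", "action_type": "click", "element_id": "logo"},
    {"url": "http://shop.example.com/", "action_type": "click", "element_id": "nav-books"},
    {"url": "http://shop.example.com/", "action_type": "click", "element_id": "nav-books"},
]
for index, reasons in flag_suspicious_steps(steps).items():
    print(index, reasons)
```

Unexpected URL changes are harder to automate because "unexpected" depends on the task, so those still need a human pass over the filmstrip.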

Failed Traces

For traces where the agent fails, identify the first incorrect step. It is the most valuable annotation in the whole process for improving the agent. Mark it clearly and describe what the agent should have done instead.
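
Once traces are annotated, the first incorrect step can be pulled out programmatically. This sketch assumes the per-step labels form a dict keyed by the step index as a string; first_incorrect_step is a hypothetical helper. The keys are sorted numerically because plain string sorting would place "10" before "2".

```python
def first_incorrect_step(correctness):
    """Return the lowest step index labeled "Incorrect", or None if there is none."""
    for step in sorted(correctness, key=int):
        if correctness[step] == "Incorrect":
            return int(step)
    return None


labels = {"0": "Correct", "10": "Incorrect", "2": "Incorrect", "1": "Unnecessary"}
print(first_incorrect_step(labels))
```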

Ambiguous Actions

Some actions are hard to judge without knowing the full page content. If a DOM snapshot is available, enable it:

yaml

agentic:
  web_agent_display:
    show_dom_snapshot: true

This adds a collapsible panel showing the raw HTML, which helps when the screenshot alone is ambiguous (e.g., the agent clicked in a region with several overlapping elements).
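
Outside the annotation UI, the same snapshot can be searched for the element the agent targeted. The sketch below uses only Python's standard html.parser module and assumes that the trace's element_id corresponds to the HTML id attribute, which may not hold for every trace format. ElementFinder and find_element are hypothetical names.

```python
from html.parser import HTMLParser


class ElementFinder(HTMLParser):
    """Collect every start tag whose id attribute equals element_id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("id") == self.element_id:
            self.matches.append((tag, attributes))


def find_element(html, element_id):
    """Return a list of (tag, attributes) for elements with the given id."""
    finder = ElementFinder(element_id)
    finder.feed(html)
    return finder.matches


snapshot = (
    '<nav><a id="nav-electronics" href="/electronics">Electronics</a>'
    '<a id="nav-books" href="/books">Books</a></nav>'
)
print(find_element(snapshot, "nav-electronics"))
```

For a real trace, read the file named in the step's dom_snapshot field and pass its contents as html. An empty result suggests the id scheme in the log does not match the page markup.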

Step 7: Configuring for VisualWebArena

VisualWebArena traces include additional visual grounding information. The configuration is similar but uses a visual grounding overlay:

yaml

agentic:
  enabled: true
  trace_converter: webarena         # same converter handles both
  display_type: web_agent
 
  web_agent_display:
    screenshot_max_width: 1000
    overlay:
      enabled: true
      click_marker: "crosshair"     # crosshair is better for precise grounding
      click_color: "#ef4444"
      click_radius: 15
      bounding_box: true            # show element bounding box if available
      bounding_box_color: "#f59e0b"
    filmstrip:
      enabled: true
      thumbnail_width: 180
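
When bounding boxes are available, a quick geometric check can pre-flag clicks that fall outside the target element. This is a sketch under a stated assumption: the box is given as [left, top, width, height] in the same pixel space as the click coordinates. That layout is hypothetical, so check how your traces actually encode boxes before relying on it.

```python
def click_inside_box(coordinates, box):
    """Return True if the click lies within the element's bounding box.

    coordinates: [x, y] of the click.
    box:         [left, top, width, height] (assumed layout).
    """
    x, y = coordinates
    left, top, width, height = box
    return left <= x <= left + width and top <= y <= top + height


# Box spanning x 200..320 and y 60..100.
print(click_inside_box([245, 82], [200, 60, 120, 40]))
print(click_inside_box([400, 380], [200, 60, 120, 40]))
```

A click outside the box is not automatically wrong (the element may have moved or the box may be stale), but it is a useful cue for which steps deserve a careful look at the crosshair.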

Step 8: Analyze the Results

Action Correctness by Step Position

Web agent errors often cluster at specific points in a trace. Analyze where the errors occur:

python

import json
from collections import Counter, defaultdict

# Load one annotation record per line
annotations = []
with open("output/annotations.jsonl") as f:
    for line in f:
        if line.strip():
            annotations.append(json.loads(line))

# Collect per-step correctness labels by step position.
# A Counter accepts any label text, including the full
# "Acceptable (not optimal but progresses toward goal)" label from config.yaml,
# which a fixed dict keyed on "Acceptable" would reject with a KeyError.
step_labels = defaultdict(Counter)
for ann in annotations:
    correctness = ann["annotations"].get("action_correctness", {})
    for step_idx, label in correctness.items():
        step_labels[int(step_idx)][label] += 1

# Print error rate by step position
print("Error rate by step position:")
for pos in sorted(step_labels):
    counts = step_labels[pos]
    total = sum(counts.values())
    error_rate = (counts["Incorrect"] + counts["Unnecessary"]) / total
    print(f"  Step {pos}: {error_rate:.1%} error rate ({total} observations)")

Error Type Distribution

python

error_counts = {}
for ann in annotations:
    errors = ann["annotations"].get("action_error_type", {})
    for step_idx, error_list in errors.items():
        for error in error_list:
            error_counts[error] = error_counts.get(error, 0) + 1
 
print("\nError Type Distribution:")
for error, count in sorted(error_counts.items(), key=lambda x: -x[1]):
    print(f"  {error}: {count}")

Success Rate by Website

python

# If traces span multiple websites
website_success = {}
for ann in annotations:
    # Assuming website info is in the original trace data
    success = ann["annotations"]["task_success"]
    website = ann.get("metadata", {}).get("website", "unknown")
    if website not in website_success:
        website_success[website] = {"Success": 0, "Partial Success": 0, "Failure": 0}
    website_success[website][success] += 1
 
for website, counts in website_success.items():
    total = sum(counts.values())
    rate = counts["Success"] / total
    print(f"{website}: {rate:.1%} success rate")

Step 9: Scale the Evaluation

Multiple Annotators with Agreement

For research papers, assign multiple annotators per trace:

yaml

annotation_task_config:
  total_annotations_per_instance: 3
  assignment_strategy: random

Compute inter-annotator agreement on the task success label:

python

from sklearn.metrics import cohen_kappa_score
import pandas as pd

df = pd.read_parquet("output/parquet/annotations.parquet")
success = df[df["schema_name"] == "task_success"]
# One row per trace, one column per annotator
pivot = success.pivot(index="instance_id", columns="annotator", values="value")

# Pairwise kappa, computed only over traces that both annotators labeled
annotators = pivot.columns.tolist()
for i in range(len(annotators)):
    for j in range(i + 1, len(annotators)):
        pair = pivot[[annotators[i], annotators[j]]].dropna()
        kappa = cohen_kappa_score(pair[annotators[i]], pair[annotators[j]])
        print(f"Kappa ({annotators[i]} vs {annotators[j]}): {kappa:.3f}")
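
Pairwise Cohen's kappa yields one number per annotator pair. With three annotations per trace, a single summary statistic is often more convenient, and Fleiss' kappa is a common choice for that. It is a different statistic from the pairwise kappa in the code before this paragraph, offered here as a complementary sketch in plain Python. It assumes every trace has the same number of ratings; fleiss_kappa is an illustrative implementation, not a Potato or scikit-learn function.

```python
import math
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels from n raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreement_sum += (
            sum(c * c for c in counts.values()) - n_raters
        ) / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    # Chance agreement from the overall label proportions.
    expected = sum(
        (total / (n_items * n_raters)) ** 2 for total in category_totals.values()
    )
    if expected == 1.0:
        return math.nan  # only one label was ever used; kappa is undefined
    return (observed - expected) / (1 - expected)


ratings = [
    ["Success", "Success", "Success"],
    ["Failure", "Failure", "Failure"],
]
print(fleiss_kappa(ratings))
```

To feed it from the pivot table, one option is to keep only fully annotated traces, for example pivot.dropna().values.tolist().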

Combining with Solo Mode

For large-scale evaluations (500+ traces), use Solo Mode to handle the easy traces:

yaml

solo_mode:
  enabled: true
  llm:
    endpoint_type: openai
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
  accuracy_threshold: 0.90
 
agentic:
  enabled: true
  trace_converter: webarena
  display_type: web_agent

The human evaluator assesses the difficult traces; the LLM handles straightforward successes and obvious failures.
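
Before trusting LLM labels on the easy traces, it is worth measuring how often they match human labels on a shared subset. The sketch below shows the kind of check an accuracy threshold implies, assuming accuracy means the fraction of jointly labeled traces where the two labels agree. How Potato computes or applies accuracy_threshold internally is not covered here, and llm_accuracy is a hypothetical helper.

```python
def llm_accuracy(human_labels, llm_labels):
    """Fraction of traces, among those labeled by both, where the labels match.

    Both arguments map trace_id -> task_success label.
    """
    shared = human_labels.keys() & llm_labels.keys()
    if not shared:
        raise ValueError("no traces were labeled by both human and LLM")
    matches = sum(human_labels[t] == llm_labels[t] for t in shared)
    return matches / len(shared)


human = {"wa_001": "Success", "wa_002": "Failure", "wa_003": "Success", "wa_004": "Failure"}
llm = {"wa_001": "Success", "wa_002": "Failure", "wa_003": "Failure", "wa_005": "Success"}

accuracy = llm_accuracy(human, llm)
threshold = 0.90  # same value as accuracy_threshold in the config
print(f"accuracy={accuracy:.2f}, meets threshold: {accuracy >= threshold}")
```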

Summary

Evaluating web browsing agents requires seeing exactly what the agent saw and what it did. Potato's web agent display provides:

- Full screenshots with SVG overlays marking click targets, input fields, and scroll actions
- Filmstrip navigation for a quick overview of the steps and random access
- A URL bar that tracks the agent's navigation path
- Per-step annotation with a web-specific error taxonomy
- Flexible configuration for WebArena, VisualWebArena, and raw browser recordings

The key to effective web agent evaluation is the visual overlay: without seeing exactly where the agent clicked, evaluators cannot reliably judge action correctness.

Further Reading

- Agentic Annotation Documentation: full configuration reference
- Evaluating AI Agents: general agent evaluation guide
- Solo Mode: scale evaluation with human-LLM collaboration
- Parquet Export: export results for analysis