advanced, image

VisualWebArena: Visual Web Agent Evaluation

Evaluation of multimodal web agents on visually grounded web tasks. Annotators assess task completion, visual grounding accuracy, and visual reasoning capabilities by reviewing screenshot sequences and agent trajectories.

Configuration File: config.yaml

# VisualWebArena: Visual Web Agent Evaluation
# Based on "VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks" (Koh et al., ACL 2024)
# Task: Evaluate multimodal agent performance on visually grounded web tasks

annotation_task_name: "VisualWebArena Visual Agent Evaluation"
task_dir: "."

# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

# Display layout with screenshot gallery and agent trajectory
html_layout: |
  <div class="vwa-container" style="max-width: 900px; margin: 0 auto;">
    <div class="task-section" style="background: #e3f2fd; padding: 15px; border-radius: 8px; margin-bottom: 15px; border-left: 4px solid #1565c0;">
      <h3 style="margin-top: 0; color: #1565c0;">Task Instruction</h3>
      <div style="font-size: 16px; font-weight: bold;">{{text}}</div>
      <div style="margin-top: 8px; color: #555;"><strong>Website:</strong> {{website}}</div>
    </div>
    <div class="screenshot-gallery" style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #90a4ae;">
      <h3 style="margin-top: 0; color: #37474f;">Screenshot Sequence</h3>
      <div style="white-space: pre-wrap; font-size: 13px; line-height: 1.8; font-family: monospace; background: #fff; padding: 12px; border-radius: 6px;">{{screenshots}}</div>
    </div>
    <div class="trajectory-section" style="background: #fff8e1; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #f9a825;">
      <h3 style="margin-top: 0; color: #f57f17;">Agent Trajectory</h3>
      <div style="white-space: pre-wrap; font-family: monospace; font-size: 13px; line-height: 1.6;">{{trajectory}}</div>
    </div>
    <div class="final-state-section" style="background: #f3e5f5; padding: 15px; border-radius: 8px; border: 2px solid #7b1fa2;">
      <h3 style="margin-top: 0; color: #7b1fa2;">Final State</h3>
      <div style="white-space: pre-wrap; font-size: 14px;">{{final_state}}</div>
    </div>
  </div>

# Annotation schemes
annotation_schemes:
  # Task completion assessment
  - name: "task_completion"
    description: "Did the agent successfully complete the visual web task?"
    annotation_type: radio
    labels:
      - "Complete — task fully achieved"
      - "Partial — some subtasks done"
      - "Failed — task not completed"
      - "Wrong Task — completed different task"
    keyboard_shortcuts:
      "Complete — task fully achieved": "1"
      "Partial — some subtasks done": "2"
      "Failed — task not completed": "3"
      "Wrong Task — completed different task": "4"

  # Visual grounding assessment
  - name: "visual_grounding"
    description: "What visual grounding issues (if any) did the agent exhibit?"
    annotation_type: multiselect
    labels:
      - "Wrong Element Clicked"
      - "Missed Visual Cue"
      - "Incorrect Text Reading"
      - "Wrong Image Interpretation"
      - "Incorrect Layout Understanding"
      - "Correct Visual Grounding"

  # Visual reasoning assessment
  - name: "visual_reasoning"
    description: "Did the agent correctly reason about visual content on the page?"
    annotation_type: radio
    labels:
      - "Yes — correctly interpreted visual content"
      - "Partially — some visual understanding errors"
      - "No — fundamental visual misunderstanding"
    keyboard_shortcuts:
      "Yes — correctly interpreted visual content": "q"
      "Partially — some visual understanding errors": "w"
      "No — fundamental visual misunderstanding": "e"

  # Failure notes
  - name: "failure_notes"
    description: "Describe any visual grounding failures or issues observed."
    annotation_type: text
    required: false
    placeholder: "Describe visual grounding failures: which elements were misidentified, what visual cues were missed..."

# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 50
annotation_per_instance: 2
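
Because `annotation_per_instance: 2` gives each item two annotations, one possible follow-up is to check how often the two annotators agree on a radio scheme such as `task_completion`. The sketch below computes Cohen's kappa from two parallel lists of labels. It is not part of Potato: the function name `cohens_kappa` and the encoding of labels by their keyboard shortcut ("1" to "4") are assumptions made for illustration, and the code does not read Potato's output files, whose layout is not shown here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical task_completion labels, encoded by shortcut key (assumption).
annotator_1 = ["1", "1", "2", "3"]
annotator_2 = ["1", "2", "2", "3"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # prints 0.636
```

A value near 1 suggests the label definitions are being applied consistently, while a low value may indicate that options such as "Partial" and "Failed" need clearer guidelines.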

Sample Data: sample-data.json

[
  {
    "id": "vwa_001",
    "text": "Find the red sneakers shown in the homepage banner image and add them to the shopping cart in size 10.",
    "website": "E-commerce (Shopping)",
    "screenshots": "Screenshot 1: Homepage with large hero banner showing a lifestyle photo of a runner wearing red Nike Air Max sneakers on a trail. Navigation bar at top with categories: Men, Women, Kids, Sale. Search bar in upper right. Below the banner, a grid of 12 featured products with thumbnail images and prices.\n\nScreenshot 2: Search results page showing 8 sneaker products. The agent has typed 'red sneakers' in the search bar. Results include red Nike Air Max ($129.99), red Adidas Ultraboost ($149.99), red Puma RS-X ($89.99), and others. Each result has a product image, name, price, and star rating.\n\nScreenshot 3: Product detail page for Nike Air Max in red colorway. Large product image on left showing the shoe from side angle. Right side has title, price ($129.99), color selector with red highlighted, size dropdown menu expanded showing sizes 8-13, and 'Add to Cart' button.\n\nScreenshot 4: Same product page with size 10 selected in the dropdown. The 'Add to Cart' button is highlighted. A mini-cart popup in the upper right shows '1 item' with the red Nike Air Max thumbnail.",
    "trajectory": "Step 1: Observe homepage banner — identify red sneakers in hero image\nStep 2: Click search bar in upper right corner\nStep 3: Type 'red sneakers' and press Enter\nStep 4: Scan search results — identify Nike Air Max matching banner image\nStep 5: Click on Nike Air Max product listing\nStep 6: Click size dropdown menu\nStep 7: Select size '10' from dropdown options\nStep 8: Click 'Add to Cart' button\nStep 9: STOP — Task complete",
    "final_state": "Product page for Nike Air Max (Red) displayed. Size 10 selected. Cart indicator shows 1 item. Mini-cart popup confirms: Nike Air Max - Red - Size 10 - $129.99. Cart total: $129.99."
  },
  {
    "id": "vwa_002",
    "text": "On the travel website, find a hotel that looks similar to the one shown in the reference image (a beachfront resort with a pool) and book it for 2 nights.",
    "website": "Travel Booking",
    "screenshots": "Screenshot 1: Travel homepage with search form. A reference image is displayed at the top showing a tropical beachfront resort — white multi-story building, infinity pool facing the ocean, palm trees lining a white sand beach, thatched-roof cabanas by the pool.\n\nScreenshot 2: Hotel search results for 'Miami Beach' showing 6 hotels in a grid layout. Each has a large photo, name, star rating, price per night. Hotels include: The Palms Resort (beachfront with pool, $289/night), Ocean View Inn (city view, $159/night), Seaside Boutique (small beach, no pool, $199/night), Grand Beach Resort (large beachfront with infinity pool, $349/night).\n\nScreenshot 3: The agent has clicked on Grand Beach Resort. Detail page shows a photo carousel with 8 images — the main photo shows a white building with an infinity pool overlooking the ocean, closely matching the reference image. Room selection form visible below with date pickers and room type options.\n\nScreenshot 4: Booking confirmation page showing Grand Beach Resort, Deluxe Ocean View room, 2 nights, total $698. Check-in and check-out dates filled. Guest name form partially visible.",
    "trajectory": "Step 1: View reference image of beachfront resort with pool\nStep 2: Enter 'Miami Beach' in destination search field\nStep 3: Set check-in date to next available date\nStep 4: Set check-out date to 2 days later\nStep 5: Click 'Search' button\nStep 6: Scan hotel results — compare photos to reference image\nStep 7: Click on Grand Beach Resort (best visual match — beachfront, white building, infinity pool)\nStep 8: Review photo carousel to confirm visual similarity\nStep 9: Select 'Deluxe Ocean View' room type\nStep 10: Click 'Book Now' button\nStep 11: STOP — Task complete",
    "final_state": "Booking confirmation for Grand Beach Resort displayed. Room: Deluxe Ocean View. Duration: 2 nights. Total: $698. The selected hotel visually matches the reference image — beachfront location, white building, infinity pool facing the ocean."
  }
]

// ... and 6 more items
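
Each item has to supply every field that the `html_layout` references (`text`, `website`, `screenshots`, `trajectory`, `final_state`) plus the `id` key. The following is a minimal sketch of a pre-flight check, together with a helper that splits a `Step N: ...` trajectory string into steps. The helper names (`layout_placeholders`, `missing_fields`, `split_steps`) and the demo item are hypothetical; none of this is part of Potato itself.

```python
import re

# Placeholders copied from the html_layout in config.yaml.
LAYOUT_FIELDS = "{{text}} {{website}} {{screenshots}} {{trajectory}} {{final_state}}"


def layout_placeholders(layout):
    """Return the field names referenced as {{name}} in a layout string."""
    return set(re.findall(r"\{\{\s*(\w+)\s*\}\}", layout))


def missing_fields(item, layout, id_key="id"):
    """Return the fields the layout (plus the id key) needs but the item lacks."""
    required = layout_placeholders(layout) | {id_key}
    return required - set(item)


def split_steps(trajectory):
    """Split a 'Step N: ...' trajectory string into (number, action) pairs."""
    pairs = re.findall(r"^Step (\d+):(.*)$", trajectory, flags=re.MULTILINE)
    return [(int(number), action.strip()) for number, action in pairs]


# Hypothetical item that deliberately omits final_state.
demo_item = {
    "id": "demo_001",
    "text": "Example instruction",
    "website": "Example site",
    "screenshots": "Screenshot 1: example description",
    "trajectory": "Step 1: Click search bar\nStep 2: Type a query",
}

print(missing_fields(demo_item, LAYOUT_FIELDS))  # prints {'final_state'}
print(split_steps(demo_item["trajectory"]))
```

To run the same check on the real data, load `sample-data.json` with `json.load` and call `missing_fields` on each item.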

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/visualwebarena-visual-agent-eval
potato start config.yaml

Details

Annotation Types

radio, multiselect, text

Domain

Visual Web Agents, Multimodal Agents, GUI Agents

Use Cases

Agent Evaluation, Visual Grounding Assessment, Task Completion Analysis

Tags

visual-agent, web-agent, multimodal, visual-grounding, screenshot, trajectory, gui-agent
