
advanced, evaluation

WebArena: Realistic Web Agent Evaluation Benchmark

WebArena is a reproducible, self-hosted web environment with 812 tasks for testing autonomous language agents. This Potato config has annotators score whether an agent's actions completed each task.

About this dataset

WebArena is a standalone, self-hosted web environment for building and evaluating autonomous agents that operate browsers from natural-language instructions. It was created by Shuyan Zhou, Frank F. Xu, and collaborators at Carnegie Mellon University and presented at ICLR 2024.

The benchmark contains 812 tasks instantiated from 241 intent templates. Tasks run across five fully functional websites that mirror common categories: online shopping (OneStopShop), an e-commerce content management system, a social forum (Reddit), collaborative software development (GitLab), and a map (OpenStreetMap), plus supporting tools such as Wikipedia, a calculator, and a scratchpad.
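
To make the template-to-task relationship concrete, here is a minimal sketch of how a single intent template can expand into several concrete tasks. The template text, slot names, and slot values are hypothetical, and the Cartesian-product expansion is an assumption chosen for illustration; the benchmark's own instantiation process may differ.

```python
from itertools import product


def instantiate(template: str, slots: dict[str, list[str]]) -> list[str]:
    """Expand one intent template into every combination of its slot values."""
    names = list(slots)
    return [
        template.format(**dict(zip(names, values)))
        for values in product(*(slots[name] for name in names))
    ]


# Hypothetical template and slot values, for illustration only.
template = "Find the cheapest {product} listed in the {category} category."
slots = {
    "product": ["desk lamp", "USB cable"],
    "category": ["Office", "Electronics"],
}

tasks = instantiate(template, slots)
for task in tasks:
    print(task)
```

Two slots with two values each yield four tasks here; at the benchmark's scale, 812 tasks from 241 templates works out to roughly 3.4 tasks per template on average.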

Because the sites are self-hosted and sandboxed, runs are reproducible. WebArena measures success by functional correctness: each task ships a programmatic reward function that checks the resulting environment state or returned answer rather than matching an action sequence. In the original paper the best GPT-4 agent reached an end-to-end success rate of 14.41%, against 78.24% for humans.
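
The idea of scoring by outcome rather than by action sequence can be sketched as follows. This is not WebArena's evaluator: the `Outcome` structure, the state keys, and the helper names are all hypothetical, chosen only to show a reward that inspects the returned answer or the final state.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Outcome:
    """What a checker sees once the agent stops (hypothetical structure)."""
    answer: str = ""                                       # text the agent returned
    state: dict[str, str] = field(default_factory=dict)   # snapshot of site state


Reward = Callable[[Outcome], float]


def answer_matches(expected: str) -> Reward:
    """Reward 1.0 when the returned answer equals `expected`, ignoring case and padding."""
    def reward(outcome: Outcome) -> float:
        return 1.0 if outcome.answer.strip().lower() == expected.strip().lower() else 0.0
    return reward


def state_equals(key: str, expected: str) -> Reward:
    """Reward 1.0 when the final state holds `expected` under `key`."""
    def reward(outcome: Outcome) -> float:
        return 1.0 if outcome.state.get(key) == expected else 0.0
    return reward


def success_rate(rewards: list[float]) -> float:
    """Fraction of tasks with a positive reward; 0.0 for an empty list."""
    return sum(rewards) / len(rewards) if rewards else 0.0


# Two different routes to the same end state earn the same reward;
# an empty outcome earns none. The state key is made up for this example.
via_menu = Outcome(state={"project:test-project": "exists"})
via_url = Outcome(state={"project:test-project": "exists"})
repo_created = state_equals("project:test-project", "exists")

rewards = [repo_created(via_menu), repo_created(via_url), repo_created(Outcome())]
rate = success_rate(rewards)
print(rewards, rate)
```

Because the checker never looks at the steps taken, an agent that reaches the goal by an unusual route is not penalized, which is the property the success rates above rely on.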

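The Potato config below recasts this evaluation as a human annotation task. Reviewers read the task instruction, the agent's action trajectory, and the final state, then rate task completion (fully complete, partially complete, failed, or wrong task), explain any failure in free text, classify the failure type, and rate the trajectory's efficiency.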

Tasks: 812
Intent templates: 241
Website environments: 5 (shopping, CMS, Reddit, GitLab, map)
Success metric: Functional correctness (programmatic reward)
GPT-4 agent success: 14.41%
Human success: 78.24%

Configuration File: config.yaml

This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.

yaml

# WebArena: Web Agent Task Completion Evaluation
# Based on "WebArena: A Realistic Web Environment for Building Autonomous Agents" (Zhou et al., ICLR 2024)
# Task: Evaluate whether an agent's actions correctly complete a given web task

annotation_task_name: "WebArena Web Agent Evaluation"
task_dir: "."

# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

# Display layout showing task, trajectory, and final state
html_layout: |
  <div class="webarena-container">
    <div class="task-section" style="background: #e8f5e9; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
      <h3 style="margin-top: 0;">Task Instruction:</h3>
      <div class="task-text" style="font-size: 16px; font-weight: bold;">{{text}}</div>
    </div>
    <div class="website-info" style="background: #e8eaf6; padding: 10px; border-radius: 8px; margin-bottom: 15px;">
      <strong>Website Type:</strong> {{website_type}}
    </div>
    <div class="trajectory-section" style="background: #fff8e1; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #f9a825;">
      <h3 style="margin-top: 0; color: #f9a825;">Agent Action Trajectory:</h3>
      <div class="trajectory-text" style="white-space: pre-wrap; font-family: monospace; font-size: 13px; line-height: 1.6;">{{trajectory}}</div>
    </div>
    <div class="final-state-section" style="background: #f3e5f5; padding: 15px; border-radius: 8px; border: 2px solid #7b1fa2;">
      <h3 style="margin-top: 0; color: #7b1fa2;">Final State:</h3>
      <div class="final-state-text" style="white-space: pre-wrap; font-size: 14px;">{{final_state}}</div>
    </div>
  </div>

# Annotation schemes
annotation_schemes:
  # Task completion assessment
  - name: "task_completion"
    description: "Did the agent successfully complete the task?"
    annotation_type: radio
    labels:
      - "Fully Complete - task achieved correctly"
      - "Partially Complete - some subtasks done"
      - "Failed - task not completed"
      - "Wrong - completed a different task"
    keyboard_shortcuts:
      "Fully Complete - task achieved correctly": "1"
      "Partially Complete - some subtasks done": "2"
      "Failed - task not completed": "3"
      "Wrong - completed a different task": "4"

  # Failure explanation
  - name: "failure_explanation"
    description: "If the task was not fully completed, explain what went wrong. Leave blank for fully completed tasks."
    annotation_type: text
    required: false
    placeholder: "Describe the failure: what step went wrong, what the agent should have done instead..."

  # Failure type (if applicable)
  - name: "failure_type"
    description: "If the task was not fully completed, what type of failure occurred?"
    annotation_type: radio
    labels:
      - "Navigation error - went to wrong page/section"
      - "Element selection error - interacted with wrong element"
      - "Input error - typed wrong value or parameters"
      - "Sequence error - correct actions in wrong order"
      - "Missing step - skipped a required action"
      - "Premature termination - stopped too early"
      - "Infinite loop - repeated actions without progress"
      - "N/A - task was fully completed"
    keyboard_shortcuts:
      "Navigation error - went to wrong page/section": "q"
      "Element selection error - interacted with wrong element": "w"
      "Input error - typed wrong value or parameters": "e"
      "Sequence error - correct actions in wrong order": "r"
      "Missing step - skipped a required action": "t"
      "Premature termination - stopped too early": "y"
      "Infinite loop - repeated actions without progress": "u"
      "N/A - task was fully completed": "i"

  # Trajectory efficiency
  - name: "trajectory_efficiency"
    description: "How efficient was the agent's action trajectory?"
    annotation_type: radio
    labels:
      - "Optimal - minimal steps needed"
      - "Acceptable - some unnecessary steps but reasonable"
      - "Inefficient - many unnecessary steps"
      - "Very inefficient - excessive wandering or backtracking"
    keyboard_shortcuts:
      "Optimal - minimal steps needed": "a"
      "Acceptable - some unnecessary steps but reasonable": "s"
      "Inefficient - many unnecessary steps": "d"
      "Very inefficient - excessive wandering or backtracking": "f"

# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 100
annotation_per_instance: 2
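
With annotation_per_instance set to 2, each item receives two judgments, so inter-annotator agreement on task_completion is a natural follow-up check. The sketch below computes Cohen's kappa from two parallel label lists. It assumes the labels have already been extracted from the annotation output and that both lists come from the same pair of annotators; it does not assume anything about Potato's output file layout. The shortened label strings are placeholders.

```python
from collections import Counter


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Placeholder labels standing in for the task_completion options.
annotator_1 = ["Full", "Full", "Failed", "Failed"]
annotator_2 = ["Full", "Failed", "Failed", "Failed"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

Here the annotators agree on three of four items (0.75) against a chance agreement of 0.5, giving a kappa of 0.5.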

Sample Data: sample-data.json

json

[
  {
    "id": "wa_001",
    "text": "Find the most recent order on the shopping site and check its delivery status.",
    "trajectory": "Step 1: Navigate to homepage -> https://shop.example.com/\nStep 2: Click 'My Account' link in top navigation\nStep 3: Click 'Order History' tab\nStep 4: Click on the first order (Order #2024-1847, placed Dec 15 2024)\nStep 5: Scroll down to 'Delivery Status' section\nStep 6: STOP - Task complete",
    "final_state": "Page displays Order #2024-1847 details. Delivery status shows: 'In Transit - Expected delivery Dec 20, 2024'. Tracking number: 1Z999AA10123456784. Carrier: UPS.",
    "website_type": "E-commerce (Shopping)"
  },
  {
    "id": "wa_002",
    "text": "Create a new repository named 'test-project' with a README file on the code hosting platform.",
    "trajectory": "Step 1: Navigate to https://gitlab.example.com/\nStep 2: Click '+' button in top navigation bar\nStep 3: Click 'New project/repository' from dropdown\nStep 4: Select 'Create blank project'\nStep 5: Type 'test-project' in 'Project name' field\nStep 6: Check 'Initialize repository with a README' checkbox\nStep 7: Click 'Create project' button\nStep 8: STOP - Task complete",
    "final_state": "New repository page displayed at gitlab.example.com/user/test-project. Repository contains a single README.md file. Project visibility is set to 'Private' (default). Branch: main.",
    "website_type": "Code Hosting (GitLab)"
  }
]

// ... and 8 more items
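
Every placeholder in html_layout ({{text}}, {{website_type}}, {{trajectory}}, {{final_state}}) needs a matching key in each data item. The following is a small standalone check, not part of Potato, and whether Potato performs a similar validation itself is not assumed here. The function name, the abbreviated layout string, and the second, deliberately incomplete item are hypothetical.

```python
import re

# Matches {{name}} placeholders, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def missing_fields(layout: str, items: list[dict]) -> dict[str, list[str]]:
    """Map each item id to the layout placeholders that the item lacks."""
    needed = set(PLACEHOLDER.findall(layout))
    problems: dict[str, list[str]] = {}
    for item in items:
        absent = sorted(needed - set(item))
        if absent:
            problems[str(item.get("id", "?"))] = absent
    return problems


# Abbreviated layout using the same four placeholders as the config.
layout = (
    "<div>{{text}}</div><div>{{website_type}}</div>"
    "<div>{{trajectory}}</div><div>{{final_state}}</div>"
)
items = [
    {"id": "wa_001", "text": "t", "trajectory": "s", "final_state": "f", "website_type": "w"},
    {"id": "demo_missing", "text": "t"},  # hypothetical incomplete item
]

report = missing_fields(layout, items)
print(report)  # {'demo_missing': ['final_state', 'trajectory', 'website_type']}
```

To run it against the real files, load the items with `json.load` from sample-data.json and pass the html_layout string from config.yaml.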

Get This Design

Clone or download the design from the GitHub repository.

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/webarena-web-agent-eval
potato start config.yaml

Dataset & paper

Zhou et al., ICLR 2024


Citation (BibTeX)

bibtex

@inproceedings{zhou2024webarena,
  title={WebArena: A Realistic Web Environment for Building Autonomous Agents},
  author={Zhou, Shuyan and Xu, Frank F. and Zhu, Hao and Zhou, Xuhui and Lo, Robert and Sridhar, Abishek and Cheng, Xianyi and Ou, Tianyue and Bisk, Yonatan and Fried, Daniel and Alon, Uri and Neubig, Graham},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}

Details

Annotation Types

radio, text

Domain

Web Agents, Autonomous Systems, Evaluation

Use Cases

Agent Evaluation, Task Completion Assessment, Failure Analysis

Related Designs

AgentRewardBench Trajectory Scoring

Evaluate web agent trajectories by rating step-level quality across multiple dimensions, judging overall success, and identifying where automatic evaluators disagree with human judgment.

multirate, radio

FActScore: Atomic Factual Precision Evaluation for LLMs

FActScore breaks LLM-generated text into atomic facts and scores the percentage supported by Wikipedia, labeling each as supported, not-supported, or irrelevant. Includes dataset and paper links plus a Potato config to reproduce the annotation task.