WebArena: Realistic Web Agent Evaluation Benchmark
WebArena is a reproducible, self-hosted web environment with 812 tasks for testing autonomous language agents. This Potato config has annotators score whether an agent's actions completed each task.
About this dataset
WebArena is a standalone, self-hosted web environment for building and evaluating autonomous agents that operate browsers from natural-language instructions. It was created by Shuyan Zhou, Frank F. Xu, and collaborators at Carnegie Mellon University and presented at ICLR 2024.
The benchmark contains 812 tasks instantiated from 241 intent templates. Tasks run across five fully functional websites that mirror common categories: online shopping (OneStopShop), an e-commerce content management system, a social forum (Reddit), collaborative software development (GitLab), and a map (OpenStreetMap), plus supporting tools such as Wikipedia, a calculator, and a scratchpad.
Because the sites are self-hosted and sandboxed, runs are reproducible. WebArena measures success by functional correctness: each task ships a programmatic reward function that checks the resulting environment state or returned answer rather than matching an action sequence. In the original paper the best GPT-4 agent reached an end-to-end success rate of 14.41%, against 78.24% for humans.
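To make the contrast between functional correctness and action-sequence matching concrete, here is a minimal Python sketch. The function names, the `must_include` rule, and the state dictionary are illustrative assumptions, not WebArena's actual evaluator API.

```python
from typing import Iterable, Mapping


def reward_from_answer(answer: str, must_include: Iterable[str]) -> float:
    """Return 1.0 if the returned answer contains every required phrase (case-insensitive)."""
    normalized = answer.lower()
    return 1.0 if all(p.lower() in normalized for p in must_include) else 0.0


def reward_from_state(state: Mapping[str, object], expected: Mapping[str, object]) -> float:
    """Return 1.0 if every expected key/value pair is present in the final state."""
    return 1.0 if all(state.get(k) == v for k, v in expected.items()) else 0.0


# Hypothetical final state: any action sequence that ends here earns the same
# reward, because only the outcome is checked, not the path taken.
final_state = {"repo_name": "test-project", "has_readme": True}
print(reward_from_state(final_state, {"repo_name": "test-project", "has_readme": True}))  # 1.0
print(reward_from_answer("Status: In Transit", ["in transit"]))  # 1.0
```

Scoring the outcome rather than the steps means an agent is not penalized for taking a different but valid route to the same result.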
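The Potato config below recasts this evaluation as a human annotation task. For each item, reviewers read the task instruction, the agent's action trajectory, and the final page state, then rate task completion (fully complete, partially complete, failed, or wrong task), describe and classify any failure, and rate the trajectory's efficiency.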
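WebArena's own metric is binary, while the human labels in this config are graded. One way to compare the two is to count only the "Fully Complete" label as success. The sketch below assumes that strict mapping and uses made-up labels; it does not read Potato's real output files, whose layout is not described here.

```python
# Label string copied from the task_completion scheme in config.yaml.
FULLY_COMPLETE = "Fully Complete - task achieved correctly"


def success_rate(labels: list[str]) -> float:
    """Fraction of items labelled fully complete (strict binary success)."""
    if not labels:
        raise ValueError("no labels to score")
    return sum(label == FULLY_COMPLETE for label in labels) / len(labels)


# Hypothetical labels for four trajectories, for illustration only.
example_labels = [
    "Fully Complete - task achieved correctly",
    "Partially Complete - some subtasks done",
    "Failed - task not completed",
    "Fully Complete - task achieved correctly",
]
print(success_rate(example_labels))  # 0.5
```

Treating partial completion as failure is the stricter choice and matches an all-or-nothing reward; a project that wants partial credit would need its own weighting.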
- Tasks: 812
- Intent templates: 241
- Website environments: 5 (shopping, CMS, Reddit, GitLab, map)
- Success metric: functional correctness (programmatic reward)
- GPT-4 agent success: 14.41%
- Human success: 78.24%
Configuration File (config.yaml)
This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.
# WebArena: Web Agent Task Completion Evaluation
# Based on "WebArena: A Realistic Web Environment for Building Autonomous Agents" (Zhou et al., ICLR 2024)
# Task: Evaluate whether an agent's actions correctly complete a given web task
annotation_task_name: "WebArena Web Agent Evaluation"
task_dir: "."
# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Display layout showing task, trajectory, and final state
html_layout: |
  <div class="webarena-container">
    <div class="task-section" style="background: #e8f5e9; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
      <h3 style="margin-top: 0;">Task Instruction:</h3>
      <div class="task-text" style="font-size: 16px; font-weight: bold;">{{text}}</div>
    </div>
    <div class="website-info" style="background: #e8eaf6; padding: 10px; border-radius: 8px; margin-bottom: 15px;">
      <strong>Website Type:</strong> {{website_type}}
    </div>
    <div class="trajectory-section" style="background: #fff8e1; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #f9a825;">
      <h3 style="margin-top: 0; color: #f9a825;">Agent Action Trajectory:</h3>
      <div class="trajectory-text" style="white-space: pre-wrap; font-family: monospace; font-size: 13px; line-height: 1.6;">{{trajectory}}</div>
    </div>
    <div class="final-state-section" style="background: #f3e5f5; padding: 15px; border-radius: 8px; border: 2px solid #7b1fa2;">
      <h3 style="margin-top: 0; color: #7b1fa2;">Final State:</h3>
      <div class="final-state-text" style="white-space: pre-wrap; font-size: 14px;">{{final_state}}</div>
    </div>
  </div>
# Annotation schemes
annotation_schemes:
  # Task completion assessment
  - name: "task_completion"
    description: "Did the agent successfully complete the task?"
    annotation_type: radio
    labels:
      - "Fully Complete - task achieved correctly"
      - "Partially Complete - some subtasks done"
      - "Failed - task not completed"
      - "Wrong - completed a different task"
    keyboard_shortcuts:
      "Fully Complete - task achieved correctly": "1"
      "Partially Complete - some subtasks done": "2"
      "Failed - task not completed": "3"
      "Wrong - completed a different task": "4"
  # Failure explanation
  - name: "failure_explanation"
    description: "If the task was not fully completed, explain what went wrong. Leave blank for fully completed tasks."
    annotation_type: text
    required: false
    placeholder: "Describe the failure: what step went wrong, what the agent should have done instead..."
  # Failure type (if applicable)
  - name: "failure_type"
    description: "If the task failed or was partially complete, what type of failure occurred?"
    annotation_type: radio
    labels:
      - "Navigation error - went to wrong page/section"
      - "Element selection error - interacted with wrong element"
      - "Input error - typed wrong value or parameters"
      - "Sequence error - correct actions in wrong order"
      - "Missing step - skipped a required action"
      - "Premature termination - stopped too early"
      - "Infinite loop - repeated actions without progress"
      - "N/A - task was fully completed"
    keyboard_shortcuts:
      "Navigation error - went to wrong page/section": "q"
      "Element selection error - interacted with wrong element": "w"
      "Input error - typed wrong value or parameters": "e"
      "Sequence error - correct actions in wrong order": "r"
      "Missing step - skipped a required action": "t"
      "Premature termination - stopped too early": "y"
      "Infinite loop - repeated actions without progress": "u"
      "N/A - task was fully completed": "i"
  # Trajectory efficiency
  - name: "trajectory_efficiency"
    description: "How efficient was the agent's action trajectory?"
    annotation_type: radio
    labels:
      - "Optimal - minimal steps needed"
      - "Acceptable - some unnecessary steps but reasonable"
      - "Inefficient - many unnecessary steps"
      - "Very inefficient - excessive wandering or backtracking"
    keyboard_shortcuts:
      "Optimal - minimal steps needed": "a"
      "Acceptable - some unnecessary steps but reasonable": "s"
      "Inefficient - many unnecessary steps": "d"
      "Very inefficient - excessive wandering or backtracking": "f"
# User configuration
allow_all_users: true
# Task assignment
instances_per_annotator: 100
annotation_per_instance: 2
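The two task-assignment values above determine how many annotators a project needs. The Python sketch below estimates that number, assuming `instances_per_annotator` caps how many items one person labels and `annotation_per_instance` is the number of independent labels wanted per item. That reading is inferred from the key names, and `annotators_needed` is a hypothetical helper, not part of Potato.

```python
import math


def annotators_needed(num_items: int, annotation_per_instance: int, instances_per_annotator: int) -> int:
    """Minimum number of annotators required to cover every item the requested number of times."""
    total_annotations = num_items * annotation_per_instance
    return math.ceil(total_annotations / instances_per_annotator)


# With the settings in this config (2 labels per item, up to 100 items per annotator):
print(annotators_needed(10, 2, 100))   # 1  (a 10-item sample file)
print(annotators_needed(812, 2, 100))  # 17 (all 812 WebArena tasks)
```

Note that a single annotator cannot supply both labels for the same item, so in practice at least two people are needed even when the arithmetic says one.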
Sample Data (sample-data.json)
[
  {
    "id": "wa_001",
    "text": "Find the most recent order on the shopping site and check its delivery status.",
    "trajectory": "Step 1: Navigate to homepage -> https://shop.example.com/\nStep 2: Click 'My Account' link in top navigation\nStep 3: Click 'Order History' tab\nStep 4: Click on the first order (Order #2024-1847, placed Dec 15 2024)\nStep 5: Scroll down to 'Delivery Status' section\nStep 6: STOP - Task complete",
    "final_state": "Page displays Order #2024-1847 details. Delivery status shows: 'In Transit - Expected delivery Dec 20, 2024'. Tracking number: 1Z999AA10123456784. Carrier: UPS.",
    "website_type": "E-commerce (Shopping)"
  },
  {
    "id": "wa_002",
    "text": "Create a new repository named 'test-project' with a README file on the code hosting platform.",
    "trajectory": "Step 1: Navigate to https://gitlab.example.com/\nStep 2: Click '+' button in top navigation bar\nStep 3: Click 'New project/repository' from dropdown\nStep 4: Select 'Create blank project'\nStep 5: Type 'test-project' in 'Project name' field\nStep 6: Check 'Initialize repository with a README' checkbox\nStep 7: Click 'Create project' button\nStep 8: STOP - Task complete",
    "final_state": "New repository page displayed at gitlab.example.com/user/test-project. Repository contains a single README.md file. Project visibility is set to 'Private' (default). Branch: main.",
    "website_type": "Code Hosting (GitLab)"
  }
]
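Each record has to supply every field that the `html_layout` in config.yaml references, and the `trajectory` field packs its steps into one newline-delimited string. The Python sketch below checks both properties on an abbreviated copy of the first record. The helper names `missing_fields` and `split_steps` are hypothetical, and the placeholder list is copied by hand from the layout rather than parsed from the config file.

```python
import json
import re

# Placeholders as they appear in the html_layout of config.yaml.
LAYOUT = "{{text}} {{website_type}} {{trajectory}} {{final_state}}"


def missing_fields(record: dict, layout: str = LAYOUT) -> list[str]:
    """Return placeholder names used in the layout that the record does not provide."""
    names = re.findall(r"\{\{(\w+)\}\}", layout)
    return [name for name in names if name not in record]


def split_steps(trajectory: str) -> list[str]:
    """Split a newline-delimited trajectory string into individual steps."""
    return [line.strip() for line in trajectory.split("\n") if line.strip()]


# Abbreviated copy of record wa_001 (trajectory and final state shortened).
record = json.loads("""
{
  "id": "wa_001",
  "text": "Find the most recent order on the shopping site and check its delivery status.",
  "trajectory": "Step 1: Navigate to homepage\\nStep 2: Click 'My Account' link\\nStep 3: STOP - Task complete",
  "final_state": "Page displays order details.",
  "website_type": "E-commerce (Shopping)"
}
""")
print(missing_fields(record))                    # []
print(len(split_steps(record["trajectory"])))    # 3
```

Running a check like this over the whole data file before starting annotation catches records that would render with empty panels.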
(sample-data.json contains 8 more items, not shown here.)
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/webarena-web-agent-eval
potato start config.yaml
Dataset & paper
Zhou et al., ICLR 2024
Citation (BibTeX)
@inproceedings{zhou2024webarena,
title={WebArena: A Realistic Web Environment for Building Autonomous Agents},
author={Zhou, Shuyan and Xu, Frank F. and Zhu, Hao and Zhou, Xuhui and Lo, Robert and Sridhar, Abishek and Cheng, Xianyi and Ou, Tianyue and Bisk, Yonatan and Fried, Daniel and Alon, Uri and Neubig, Graham},
booktitle={International Conference on Learning Representations (ICLR)},
year={2024}
}