WebArena: Web Agent Task Completion Evaluation
Evaluation of autonomous web agent task completion. Annotators verify whether an agent's sequence of web actions correctly completes a given task, assess partial completion, and identify failure modes.
Configuration File: config.yaml
# WebArena: Web Agent Task Completion Evaluation
# Based on "WebArena: A Realistic Web Environment for Building Autonomous Agents" (Zhou et al., ICLR 2024)
# Task: Evaluate whether an agent's actions correctly complete a given web task
annotation_task_name: "WebArena Web Agent Evaluation"
task_dir: "."
# Data configuration
data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"
# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Display layout showing task, trajectory, and final state
html_layout: |
  <div class="webarena-container">
    <div class="task-section" style="background: #e8f5e9; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
      <h3 style="margin-top: 0;">Task Instruction:</h3>
      <div class="task-text" style="font-size: 16px; font-weight: bold;">{{text}}</div>
    </div>
    <div class="website-info" style="background: #e8eaf6; padding: 10px; border-radius: 8px; margin-bottom: 15px;">
      <strong>Website Type:</strong> {{website_type}}
    </div>
    <div class="trajectory-section" style="background: #fff8e1; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #f9a825;">
      <h3 style="margin-top: 0; color: #f9a825;">Agent Action Trajectory:</h3>
      <div class="trajectory-text" style="white-space: pre-wrap; font-family: monospace; font-size: 13px; line-height: 1.6;">{{trajectory}}</div>
    </div>
    <div class="final-state-section" style="background: #f3e5f5; padding: 15px; border-radius: 8px; border: 2px solid #7b1fa2;">
      <h3 style="margin-top: 0; color: #7b1fa2;">Final State:</h3>
      <div class="final-state-text" style="white-space: pre-wrap; font-size: 14px;">{{final_state}}</div>
    </div>
  </div>
# Annotation schemes
annotation_schemes:
  # Task completion assessment
  - name: "task_completion"
    description: "Did the agent successfully complete the task?"
    annotation_type: radio
    labels:
      - "Fully Complete - task achieved correctly"
      - "Partially Complete - some subtasks done"
      - "Failed - task not completed"
      - "Wrong - completed a different task"
    keyboard_shortcuts:
      "Fully Complete - task achieved correctly": "1"
      "Partially Complete - some subtasks done": "2"
      "Failed - task not completed": "3"
      "Wrong - completed a different task": "4"

  # Failure explanation
  - name: "failure_explanation"
    description: "If the task was not fully completed, explain what went wrong. Leave blank for fully completed tasks."
    annotation_type: text
    required: false
    placeholder: "Describe the failure: what step went wrong, what the agent should have done instead..."

  # Failure type (if applicable)
  - name: "failure_type"
    description: "If the task failed or was partially complete, what type of failure occurred?"
    annotation_type: radio
    labels:
      - "Navigation error - went to wrong page/section"
      - "Element selection error - interacted with wrong element"
      - "Input error - typed wrong value or parameters"
      - "Sequence error - correct actions in wrong order"
      - "Missing step - skipped a required action"
      - "Premature termination - stopped too early"
      - "Infinite loop - repeated actions without progress"
      - "N/A - task was fully completed"
    keyboard_shortcuts:
      "Navigation error - went to wrong page/section": "q"
      "Element selection error - interacted with wrong element": "w"
      "Input error - typed wrong value or parameters": "e"
      "Sequence error - correct actions in wrong order": "r"
      "Missing step - skipped a required action": "t"
      "Premature termination - stopped too early": "y"
      "Infinite loop - repeated actions without progress": "u"
      "N/A - task was fully completed": "i"

  # Trajectory efficiency
  - name: "trajectory_efficiency"
    description: "How efficient was the agent's action trajectory?"
    annotation_type: radio
    labels:
      - "Optimal - minimal steps needed"
      - "Acceptable - some unnecessary steps but reasonable"
      - "Inefficient - many unnecessary steps"
      - "Very inefficient - excessive wandering or backtracking"
    keyboard_shortcuts:
      "Optimal - minimal steps needed": "a"
      "Acceptable - some unnecessary steps but reasonable": "s"
      "Inefficient - many unnecessary steps": "d"
      "Very inefficient - excessive wandering or backtracking": "f"
# User configuration
allow_all_users: true
# Task assignment
instances_per_annotator: 100
annotation_per_instance: 2
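Because `annotation_per_instance: 2` assigns each item to two annotators, the collected `task_completion` labels can be checked for inter-annotator agreement. A minimal sketch using Cohen's kappa follows; the six label pairs below are hypothetical, not real annotation output:

```python
from collections import Counter

# Hypothetical task_completion labels from two annotators over six items
# (annotation_per_instance: 2 means each item is labeled twice).
annotator_a = ["Fully Complete", "Failed", "Fully Complete",
               "Partially Complete", "Failed", "Fully Complete"]
annotator_b = ["Fully Complete", "Failed", "Partially Complete",
               "Partially Complete", "Failed", "Fully Complete"]

def cohens_kappa(a, b):
    """Chance-corrected agreement between two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[lbl] * counts_b[lbl] for lbl in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

print(round(cohens_kappa(annotator_a, annotator_b), 3))  # 0.75
```

With five of six pairs agreeing and a chance-agreement rate of one third, kappa works out to 0.75, which would usually be read as substantial agreement.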
Sample Data: sample-data.json
[
{
"id": "wa_001",
"text": "Find the most recent order on the shopping site and check its delivery status.",
"trajectory": "Step 1: Navigate to homepage -> https://shop.example.com/\nStep 2: Click 'My Account' link in top navigation\nStep 3: Click 'Order History' tab\nStep 4: Click on the first order (Order #2024-1847, placed Dec 15 2024)\nStep 5: Scroll down to 'Delivery Status' section\nStep 6: STOP - Task complete",
"final_state": "Page displays Order #2024-1847 details. Delivery status shows: 'In Transit - Expected delivery Dec 20, 2024'. Tracking number: 1Z999AA10123456784. Carrier: UPS.",
"website_type": "E-commerce (Shopping)"
},
{
"id": "wa_002",
"text": "Create a new repository named 'test-project' with a README file on the code hosting platform.",
"trajectory": "Step 1: Navigate to https://gitlab.example.com/\nStep 2: Click '+' button in top navigation bar\nStep 3: Click 'New project/repository' from dropdown\nStep 4: Select 'Create blank project'\nStep 5: Type 'test-project' in 'Project name' field\nStep 6: Check 'Initialize repository with a README' checkbox\nStep 7: Click 'Create project' button\nStep 8: STOP - Task complete",
"final_state": "New repository page displayed at gitlab.example.com/user/test-project. Repository contains a single README.md file. Project visibility is set to 'Private' (default). Branch: main.",
"website_type": "Code Hosting (GitLab)"
}
]
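Each item must carry the keys named in `item_properties` (`id`, `text`) plus the fields the `html_layout` template interpolates (`trajectory`, `final_state`, `website_type`). A minimal validation sketch, with the required field set hard-coded from the files above and the second sample item deliberately malformed for illustration:

```python
import json

# Fields assumed by config.yaml: id_key/text_key plus the html_layout
# template variables ({{trajectory}}, {{final_state}}, {{website_type}}).
REQUIRED_FIELDS = {"id", "text", "trajectory", "final_state", "website_type"}

def validate_items(raw_json):
    """Return a list of (item_id, missing_fields) for incomplete items."""
    problems = []
    for item in json.loads(raw_json):
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            problems.append((item.get("id", "<no id>"), sorted(missing)))
    return problems

sample = json.dumps([
    {"id": "wa_001", "text": "...", "trajectory": "...",
     "final_state": "...", "website_type": "E-commerce (Shopping)"},
    {"id": "wa_999", "text": "item missing template fields"},
])
print(validate_items(sample))
# [('wa_999', ['final_state', 'trajectory', 'website_type'])]
```

Running such a check before `potato start` catches items that would render with empty template slots.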
// ... and 8 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/webarena-web-agent-eval
potato start config.yaml
Found an issue or want to improve this design?
Open an Issue

Related Designs
AgentRewardBench Trajectory Scoring
Evaluate web agent trajectories by rating step-level quality across multiple dimensions, judging overall success, and identifying where automatic evaluators disagree with human judgment.
FActScore: Fine-grained Atomic Evaluation of Factual Precision
Atomic fact evaluation in LLM-generated text. Annotators decompose generated text into atomic facts and verify each fact as supported, not-supported, or irrelevant against a reference source. Based on the FActScore framework for evaluating factual precision in long-form text generation.
GPQA - Graduate-Level Expert QA Evaluation
Expert-level question answering evaluation on graduate-level science questions from the GPQA benchmark (Rein et al., ICLR 2024). Questions span physics, chemistry, and biology, designed to be answerable only by domain experts.