WebArena: Web Agent Task Completion Evaluation
Evaluation of autonomous web agent task completions. Annotators verify whether an agent's sequence of web actions correctly completes a given task, assess partial completion, and identify failure modes.
Configuration file: config.yaml
# WebArena: Web Agent Task Completion Evaluation
# Based on "WebArena: A Realistic Web Environment for Building Autonomous Agents" (Zhou et al., ICLR 2024)
# Task: Evaluate whether an agent's actions correctly complete a given web task
annotation_task_name: "WebArena Web Agent Evaluation"
task_dir: "."
# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Display layout showing task, trajectory, and final state
html_layout: |
  <div class="webarena-container">
    <div class="task-section" style="background: #e8f5e9; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
      <h3 style="margin-top: 0;">Task Instruction:</h3>
      <div class="task-text" style="font-size: 16px; font-weight: bold;">{{text}}</div>
    </div>
    <div class="website-info" style="background: #e8eaf6; padding: 10px; border-radius: 8px; margin-bottom: 15px;">
      <strong>Website Type:</strong> {{website_type}}
    </div>
    <div class="trajectory-section" style="background: #fff8e1; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #f9a825;">
      <h3 style="margin-top: 0; color: #f9a825;">Agent Action Trajectory:</h3>
      <div class="trajectory-text" style="white-space: pre-wrap; font-family: monospace; font-size: 13px; line-height: 1.6;">{{trajectory}}</div>
    </div>
    <div class="final-state-section" style="background: #f3e5f5; padding: 15px; border-radius: 8px; border: 2px solid #7b1fa2;">
      <h3 style="margin-top: 0; color: #7b1fa2;">Final State:</h3>
      <div class="final-state-text" style="white-space: pre-wrap; font-size: 14px;">{{final_state}}</div>
    </div>
  </div>
# Annotation schemes
annotation_schemes:
  # Task completion assessment
  - name: "task_completion"
    description: "Did the agent successfully complete the task?"
    annotation_type: radio
    labels:
      - "Fully Complete - task achieved correctly"
      - "Partially Complete - some subtasks done"
      - "Failed - task not completed"
      - "Wrong - completed a different task"
    keyboard_shortcuts:
      "Fully Complete - task achieved correctly": "1"
      "Partially Complete - some subtasks done": "2"
      "Failed - task not completed": "3"
      "Wrong - completed a different task": "4"

  # Failure explanation
  - name: "failure_explanation"
    description: "If the task was not fully completed, explain what went wrong. Leave blank for fully completed tasks."
    annotation_type: text
    required: false
    placeholder: "Describe the failure: what step went wrong, what the agent should have done instead..."

  # Failure type (if applicable)
  - name: "failure_type"
    description: "If the task failed or was partially complete, what type of failure occurred?"
    annotation_type: radio
    labels:
      - "Navigation error - went to wrong page/section"
      - "Element selection error - interacted with wrong element"
      - "Input error - typed wrong value or parameters"
      - "Sequence error - correct actions in wrong order"
      - "Missing step - skipped a required action"
      - "Premature termination - stopped too early"
      - "Infinite loop - repeated actions without progress"
      - "N/A - task was fully completed"
    keyboard_shortcuts:
      "Navigation error - went to wrong page/section": "q"
      "Element selection error - interacted with wrong element": "w"
      "Input error - typed wrong value or parameters": "e"
      "Sequence error - correct actions in wrong order": "r"
      "Missing step - skipped a required action": "t"
      "Premature termination - stopped too early": "y"
      "Infinite loop - repeated actions without progress": "u"
      "N/A - task was fully completed": "i"

  # Trajectory efficiency
  - name: "trajectory_efficiency"
    description: "How efficient was the agent's action trajectory?"
    annotation_type: radio
    labels:
      - "Optimal - minimal steps needed"
      - "Acceptable - some unnecessary steps but reasonable"
      - "Inefficient - many unnecessary steps"
      - "Very inefficient - excessive wandering or backtracking"
    keyboard_shortcuts:
      "Optimal - minimal steps needed": "a"
      "Acceptable - some unnecessary steps but reasonable": "s"
      "Inefficient - many unnecessary steps": "d"
      "Very inefficient - excessive wandering or backtracking": "f"
# User configuration
allow_all_users: true
# Task assignment
instances_per_annotator: 100
annotation_per_instance: 2
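Because every radio label above is paired with a keyboard shortcut, edits to the config can silently introduce shortcut collisions within a scheme. A minimal sketch (plain Python; the shortcut maps are transcribed from the radio schemes in the config above, and `find_shortcut_collisions` is a hypothetical helper, not part of Potato) that flags any scheme with a repeated key:

```python
# Shortcut maps transcribed from the radio schemes in config.yaml above.
schemes = {
    "task_completion": {
        "Fully Complete - task achieved correctly": "1",
        "Partially Complete - some subtasks done": "2",
        "Failed - task not completed": "3",
        "Wrong - completed a different task": "4",
    },
    "failure_type": {
        "Navigation error - went to wrong page/section": "q",
        "Element selection error - interacted with wrong element": "w",
        "Input error - typed wrong value or parameters": "e",
        "Sequence error - correct actions in wrong order": "r",
        "Missing step - skipped a required action": "t",
        "Premature termination - stopped too early": "y",
        "Infinite loop - repeated actions without progress": "u",
        "N/A - task was fully completed": "i",
    },
    "trajectory_efficiency": {
        "Optimal - minimal steps needed": "a",
        "Acceptable - some unnecessary steps but reasonable": "s",
        "Inefficient - many unnecessary steps": "d",
        "Very inefficient - excessive wandering or backtracking": "f",
    },
}

def find_shortcut_collisions(schemes):
    """Return {scheme_name: [duplicated shortcuts]} for schemes with repeats."""
    collisions = {}
    for name, mapping in schemes.items():
        keys = list(mapping.values())
        dupes = sorted({k for k in keys if keys.count(k) > 1})
        if dupes:
            collisions[name] = dupes
    return collisions

print(find_shortcut_collisions(schemes))  # {} means every scheme is collision-free
```

As configured, the schemes use disjoint key groups (digits, the qwerty row, and the home row), so shortcuts are also unique across schemes.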
Sample data: sample-data.json
[
  {
    "id": "wa_001",
    "text": "Find the most recent order on the shopping site and check its delivery status.",
    "trajectory": "Step 1: Navigate to homepage -> https://shop.example.com/\nStep 2: Click 'My Account' link in top navigation\nStep 3: Click 'Order History' tab\nStep 4: Click on the first order (Order #2024-1847, placed Dec 15 2024)\nStep 5: Scroll down to 'Delivery Status' section\nStep 6: STOP - Task complete",
    "final_state": "Page displays Order #2024-1847 details. Delivery status shows: 'In Transit - Expected delivery Dec 20, 2024'. Tracking number: 1Z999AA10123456784. Carrier: UPS.",
    "website_type": "E-commerce (Shopping)"
  },
  {
    "id": "wa_002",
    "text": "Create a new repository named 'test-project' with a README file on the code hosting platform.",
    "trajectory": "Step 1: Navigate to https://gitlab.example.com/\nStep 2: Click '+' button in top navigation bar\nStep 3: Click 'New project/repository' from dropdown\nStep 4: Select 'Create blank project'\nStep 5: Type 'test-project' in 'Project name' field\nStep 6: Check 'Initialize repository with a README' checkbox\nStep 7: Click 'Create project' button\nStep 8: STOP - Task complete",
    "final_state": "New repository page displayed at gitlab.example.com/user/test-project. Repository contains a single README.md file. Project visibility is set to 'Private' (default). Branch: main.",
    "website_type": "Code Hosting (GitLab)"
  }
]
// ... and 8 more items
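Before launching the task, a quick sanity check over sample-data.json catches items missing a field that the `html_layout` template references. A minimal sketch (plain Python; `check_items` is a hypothetical helper, and the items below are abbreviated copies of the two shown above, not the full file):

```python
# Fields the html_layout template references via {{...}} placeholders.
REQUIRED = {"id", "text", "trajectory", "final_state", "website_type"}

def check_items(items):
    """Validate each item's fields and return {id: number of trajectory steps}."""
    steps = {}
    for item in items:
        missing = REQUIRED - item.keys()
        if missing:
            raise ValueError(f"{item.get('id', '?')} is missing: {sorted(missing)}")
        steps[item["id"]] = len(item["trajectory"].splitlines())
    return steps

# Abbreviated copies of the two items above; in practice load the full file,
# e.g. items = json.load(open("sample-data.json")).
items = [
    {"id": "wa_001",
     "text": "Find the most recent order on the shopping site and check its delivery status.",
     "trajectory": "Step 1: Navigate to homepage\nStep 2: Click 'My Account'\n"
                   "Step 3: Click 'Order History'\nStep 4: Open Order #2024-1847\n"
                   "Step 5: Scroll to 'Delivery Status'\nStep 6: STOP - Task complete",
     "final_state": "Order #2024-1847 shows 'In Transit - Expected delivery Dec 20, 2024'.",
     "website_type": "E-commerce (Shopping)"},
    {"id": "wa_002",
     "text": "Create a new repository named 'test-project' with a README file.",
     "trajectory": "\n".join(f"Step {i}: ..." for i in range(1, 9)),
     "final_state": "New repository page with a single README.md file.",
     "website_type": "Code Hosting (GitLab)"},
]

print(check_items(items))  # {'wa_001': 6, 'wa_002': 8}
```

The step counts also give a rough first look at trajectory length before annotators judge efficiency.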
Get this design: clone or download from the repository.
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/webarena-web-agent-eval
potato start config.yaml
Found a problem or want to improve this design? Open an issue.
Related designs
FActScore: Fine-grained Atomic Evaluation of Factual Precision
Atomic fact evaluation in LLM-generated text. Annotators decompose generated text into atomic facts and verify each fact as supported, not-supported, or irrelevant against a reference source. Based on the FActScore framework for evaluating factual precision in long-form text generation.
GPQA - Graduate-Level Expert QA Evaluation
Expert-level question answering evaluation on graduate-level science questions from the GPQA benchmark (Rein et al., ICLR 2024). Questions span physics, chemistry, and biology, designed to be answerable only by domain experts.
SWE-bench: Code Agent Solution Evaluation
Evaluation of code agent solutions to real GitHub issues. Annotators review code patches generated by AI agents, assess correctness, check test compatibility, and evaluate code quality.