WebArena: Web Agent Task Completion Evaluation
Evaluation of autonomous web agent task completion. Annotators verify whether an agent's sequence of web actions correctly completes a given task, assess partial completion, and identify failure modes.
Configuration File: config.yaml
# WebArena: Web Agent Task Completion Evaluation
# Based on "WebArena: A Realistic Web Environment for Building Autonomous Agents" (Zhou et al., ICLR 2024)
# Task: Evaluate whether an agent's actions correctly complete a given web task
annotation_task_name: "WebArena Web Agent Evaluation"
task_dir: "."
# Data configuration
data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"
# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Display layout showing task, trajectory, and final state
html_layout: |
  <div class="webarena-container">
    <div class="task-section" style="background: #e8f5e9; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
      <h3 style="margin-top: 0;">Task Instruction:</h3>
      <div class="task-text" style="font-size: 16px; font-weight: bold;">{{text}}</div>
    </div>
    <div class="website-info" style="background: #e8eaf6; padding: 10px; border-radius: 8px; margin-bottom: 15px;">
      <strong>Website Type:</strong> {{website_type}}
    </div>
    <div class="trajectory-section" style="background: #fff8e1; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #f9a825;">
      <h3 style="margin-top: 0; color: #f9a825;">Agent Action Trajectory:</h3>
      <div class="trajectory-text" style="white-space: pre-wrap; font-family: monospace; font-size: 13px; line-height: 1.6;">{{trajectory}}</div>
    </div>
    <div class="final-state-section" style="background: #f3e5f5; padding: 15px; border-radius: 8px; border: 2px solid #7b1fa2;">
      <h3 style="margin-top: 0; color: #7b1fa2;">Final State:</h3>
      <div class="final-state-text" style="white-space: pre-wrap; font-size: 14px;">{{final_state}}</div>
    </div>
  </div>
# Annotation schemes
annotation_schemes:
  # Task completion assessment
  - name: "task_completion"
    description: "Did the agent successfully complete the task?"
    annotation_type: radio
    labels:
      - "Fully Complete - task achieved correctly"
      - "Partially Complete - some subtasks done"
      - "Failed - task not completed"
      - "Wrong - completed a different task"
    keyboard_shortcuts:
      "Fully Complete - task achieved correctly": "1"
      "Partially Complete - some subtasks done": "2"
      "Failed - task not completed": "3"
      "Wrong - completed a different task": "4"

  # Failure explanation
  - name: "failure_explanation"
    description: "If the task was not fully completed, explain what went wrong. Leave blank for fully completed tasks."
    annotation_type: text
    required: false
    placeholder: "Describe the failure: what step went wrong, what the agent should have done instead..."

  # Failure type (if applicable)
  - name: "failure_type"
    description: "If the task failed or was partially complete, what type of failure occurred?"
    annotation_type: radio
    labels:
      - "Navigation error - went to wrong page/section"
      - "Element selection error - interacted with wrong element"
      - "Input error - typed wrong value or parameters"
      - "Sequence error - correct actions in wrong order"
      - "Missing step - skipped a required action"
      - "Premature termination - stopped too early"
      - "Infinite loop - repeated actions without progress"
      - "N/A - task was fully completed"
    keyboard_shortcuts:
      "Navigation error - went to wrong page/section": "q"
      "Element selection error - interacted with wrong element": "w"
      "Input error - typed wrong value or parameters": "e"
      "Sequence error - correct actions in wrong order": "r"
      "Missing step - skipped a required action": "t"
      "Premature termination - stopped too early": "y"
      "Infinite loop - repeated actions without progress": "u"
      "N/A - task was fully completed": "i"

  # Trajectory efficiency
  - name: "trajectory_efficiency"
    description: "How efficient was the agent's action trajectory?"
    annotation_type: radio
    labels:
      - "Optimal - minimal steps needed"
      - "Acceptable - some unnecessary steps but reasonable"
      - "Inefficient - many unnecessary steps"
      - "Very inefficient - excessive wandering or backtracking"
    keyboard_shortcuts:
      "Optimal - minimal steps needed": "a"
      "Acceptable - some unnecessary steps but reasonable": "s"
      "Inefficient - many unnecessary steps": "d"
      "Very inefficient - excessive wandering or backtracking": "f"
# User configuration
allow_all_users: true
# Task assignment
instances_per_annotator: 100
annotation_per_instance: 2
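Because `annotation_per_instance: 2` assigns each item to two annotators, the collected `task_completion` labels can be checked for inter-annotator agreement. A minimal sketch using Cohen's kappa follows; the six label pairs below are hypothetical, not real annotation output:

```python
from collections import Counter

# Hypothetical task_completion labels from two annotators over six items
# (annotation_per_instance: 2 means each item is labeled twice).
annotator_a = ["Fully Complete", "Failed", "Fully Complete",
               "Partially Complete", "Failed", "Fully Complete"]
annotator_b = ["Fully Complete", "Failed", "Partially Complete",
               "Partially Complete", "Failed", "Fully Complete"]

def cohens_kappa(a, b):
    """Chance-corrected agreement between two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[lbl] * counts_b[lbl] for lbl in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

print(round(cohens_kappa(annotator_a, annotator_b), 3))  # 0.75
```

With five of six pairs agreeing and a chance-agreement rate of one third, kappa works out to 0.75, which would usually be read as substantial agreement.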
Sample Data: sample-data.json
[
{
"id": "wa_001",
"text": "Find the most recent order on the shopping site and check its delivery status.",
"trajectory": "Step 1: Navigate to homepage -> https://shop.example.com/\nStep 2: Click 'My Account' link in top navigation\nStep 3: Click 'Order History' tab\nStep 4: Click on the first order (Order #2024-1847, placed Dec 15 2024)\nStep 5: Scroll down to 'Delivery Status' section\nStep 6: STOP - Task complete",
"final_state": "Page displays Order #2024-1847 details. Delivery status shows: 'In Transit - Expected delivery Dec 20, 2024'. Tracking number: 1Z999AA10123456784. Carrier: UPS.",
"website_type": "E-commerce (Shopping)"
},
{
"id": "wa_002",
"text": "Create a new repository named 'test-project' with a README file on the code hosting platform.",
"trajectory": "Step 1: Navigate to https://gitlab.example.com/\nStep 2: Click '+' button in top navigation bar\nStep 3: Click 'New project/repository' from dropdown\nStep 4: Select 'Create blank project'\nStep 5: Type 'test-project' in 'Project name' field\nStep 6: Check 'Initialize repository with a README' checkbox\nStep 7: Click 'Create project' button\nStep 8: STOP - Task complete",
"final_state": "New repository page displayed at gitlab.example.com/user/test-project. Repository contains a single README.md file. Project visibility is set to 'Private' (default). Branch: main.",
"website_type": "Code Hosting (GitLab)"
}
]
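Each item must carry the keys named in `item_properties` (`id`, `text`) plus the fields the `html_layout` template interpolates (`trajectory`, `final_state`, `website_type`). A minimal validation sketch, with the required field set hard-coded from the files above and the second sample item deliberately malformed for illustration:

```python
import json

# Fields assumed by config.yaml: id_key/text_key plus the html_layout
# template variables ({{trajectory}}, {{final_state}}, {{website_type}}).
REQUIRED_FIELDS = {"id", "text", "trajectory", "final_state", "website_type"}

def validate_items(raw_json):
    """Return a list of (item_id, missing_fields) for incomplete items."""
    problems = []
    for item in json.loads(raw_json):
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            problems.append((item.get("id", "<no id>"), sorted(missing)))
    return problems

sample = json.dumps([
    {"id": "wa_001", "text": "...", "trajectory": "...",
     "final_state": "...", "website_type": "E-commerce (Shopping)"},
    {"id": "wa_999", "text": "item missing template fields"},
])
print(validate_items(sample))
# [('wa_999', ['final_state', 'trajectory', 'website_type'])]
```

Running such a check before `potato start` catches items that would render with empty template slots.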
// ... and 8 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/webarena-web-agent-eval
potato start config.yaml
Found an issue or want to improve this design?
Open an Issue

Related Designs
AgentRewardBench Trajectory Scoring
Evaluate web agent trajectories by rating step-level quality across multiple dimensions, judging overall success, and identifying where automatic evaluators disagree with human judgment.
FActScore: Fine-grained Atomic Evaluation of Factual Precision
Atomic fact evaluation in LLM-generated text. Annotators decompose generated text into atomic facts and verify each fact as supported, not-supported, or irrelevant against a reference source. Based on the FActScore framework for evaluating factual precision in long-form text generation.
GPQA - Graduate-Level Expert QA Evaluation
Expert-level question answering evaluation on graduate-level science questions from the GPQA benchmark (Rein et al., ICLR 2024). Questions span physics, chemistry, and biology, designed to be answerable only by domain experts.