AgentRewardBench Trajectory Scoring

Evaluate web agent trajectories by rating step-level quality across multiple dimensions, judging overall success, and identifying where automatic evaluators disagree with human judgment.

Configuration File: config.yaml

This Potato config reproduces the annotation task. Save it as config.yaml alongside sample-data.json, which the config lists under data_files, and run potato start config.yaml to try it.

yaml

# AgentRewardBench Trajectory Scoring
# Based on "AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories" (Lù et al., arXiv 2025)
# Task: Rate web agent trajectory steps, judge overall success, and identify evaluator disagreements

annotation_task_name: "AgentRewardBench Trajectory Scoring"
task_dir: "."

data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

html_layout: |
  <div class="container" style="font-family: Arial, sans-serif; max-width: 1100px; margin: 0 auto;">
    <div style="display: flex; gap: 20px;">
      <div style="flex: 2;">
        <div style="background: #e8f4fd; padding: 14px; border-radius: 8px; margin-bottom: 16px;">
          <h3 style="margin: 0 0 8px 0; color: #1a5276;">Task Description</h3>
          <p style="margin: 0; font-size: 15px;">{{text}}</p>
          <p style="margin: 8px 0 0 0; font-size: 13px; color: #555;"><strong>Website:</strong> {{website}}</p>
        </div>
        <div style="background: #fafafa; border: 1px solid #ddd; padding: 14px; border-radius: 8px; margin-bottom: 16px;">
          <h3 style="margin: 0 0 10px 0; color: #2c3e50;">Agent Trajectory</h3>
          <div style="font-size: 14px; line-height: 1.7; white-space: pre-wrap;">{{trajectory}}</div>
        </div>
      </div>
      <div style="flex: 1; min-width: 260px;">
        <div style="background: #fff3cd; border: 1px solid #ffc107; padding: 14px; border-radius: 8px; position: sticky; top: 10px;">
          <h4 style="margin: 0 0 10px 0; color: #856404;">Automatic Evaluator Scores</h4>
          <div style="font-size: 13px; line-height: 1.8; white-space: pre-wrap;">{{auto_scores}}</div>
        </div>
      </div>
    </div>
  </div>

annotation_schemes:
  - annotation_type: multirate
    name: step_scores
    description: "Rate each dimension of the agent trajectory on a 5-point scale."
    labels:
      - "1 - Very Poor"
      - "2 - Poor"
      - "3 - Average"
      - "4 - Good"
      - "5 - Excellent"
    options:
      - "Step Correctness"
      - "Efficiency"
      - "Goal Alignment"
      - "Recovery Quality"
      - "Final Success"

  - name: overall_success
    description: "Judge the overall success of the agent in completing the task."
    annotation_type: radio
    labels:
      - "Success"
      - "Partial Success"
      - "Failure"
    keyboard_shortcuts:
      "Success": "1"
      "Partial Success": "2"
      "Failure": "3"

  - name: evaluator_disagreement
    description: "Where do automatic evaluators disagree with your judgment? Explain any differences."
    annotation_type: text

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
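
The html_layout above pulls four fields from each data item through {{...}} placeholders (text, website, trajectory, auto_scores), and item_properties names the ID key. Before launching the task it can help to confirm that every record actually carries those fields. The following is a minimal sketch of such a check, not part of Potato itself: the function name missing_fields, the abbreviated layout string, and the two sample records are all hypothetical stand-ins for illustration.

```python
import re

# Finds placeholders such as {{text}} or {{ website }} in a layout string.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def missing_fields(layout: str, records: list[dict], id_key: str = "id") -> dict[str, list[str]]:
    """Map each incomplete record's id (or its index) to the fields it lacks."""
    required = set(PLACEHOLDER.findall(layout)) | {id_key}
    problems = {}
    for index, record in enumerate(records):
        missing = sorted(required - set(record))
        if missing:
            problems[str(record.get(id_key, f"#{index}"))] = missing
    return problems


# Abbreviated stand-in for the html_layout in config.yaml (assumption: only
# the placeholders matter for this check, not the surrounding markup).
layout = "<p>{{text}}</p><p>{{website}}</p><div>{{trajectory}}</div><div>{{auto_scores}}</div>"

# Hypothetical records; the second one deliberately omits two fields.
records = [
    {"id": "arb-001", "text": "...", "website": "shopping.example.com",
     "trajectory": "Step 1: ...", "auto_scores": "Rule-Based: Partial (0.60)"},
    {"id": "arb-002", "text": "...", "website": "forum.devtalk.example.org"},
]

print(missing_fields(layout, records))
```

In practice the layout string and records would be read from config.yaml and sample-data.json; they are inlined here so the sketch runs without extra dependencies.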

Sample Data: sample-data.json

json

[
  {
    "id": "arb-001",
    "text": "Find the cheapest wireless noise-cancelling headphones with at least 4-star rating and add them to cart.",
    "website": "shopping.example.com",
    "trajectory": "Step 1: Navigate to shopping.example.com\nStep 2: Click on 'Electronics' category\nStep 3: Type 'wireless noise-cancelling headphones' in search bar\nStep 4: Click 'Sort by Price: Low to High'\nStep 5: Scroll down and check ratings filter >= 4 stars\nStep 6: Click on 'SoundMax Pro NC-200' ($49.99, 4.3 stars)\nStep 7: Click 'Add to Cart'\nStep 8: Verify item appears in cart with correct price",
    "auto_scores": "GPT-4-Judge: Success (0.95)\nClaude-Judge: Success (0.91)\nRule-Based: Partial (0.60)\nReward Model v2: Success (0.88)"
  },
  {
    "id": "arb-002",
    "text": "Post a reply to the thread about Python async best practices in the programming forum.",
    "website": "forum.devtalk.example.org",
    "trajectory": "Step 1: Navigate to forum.devtalk.example.org\nStep 2: Click 'Programming Languages' section\nStep 3: Click 'Python' subsection\nStep 4: Search for 'async best practices'\nStep 5: Click on thread 'Best practices for asyncio in production?'\nStep 6: Scroll to bottom of thread\nStep 7: Click 'Reply' button\nStep 8: Type response about using asyncio.gather for concurrent tasks\nStep 9: Click 'Submit Reply'\nStep 10: Verify reply appears in thread",
    "auto_scores": "GPT-4-Judge: Success (0.92)\nClaude-Judge: Success (0.89)\nRule-Based: Success (0.85)\nReward Model v2: Success (0.90)"
  }
]

// ... and 6 more items
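
Each auto_scores value packs one evaluator per line in the form "Name: Verdict (score)". To compare those verdicts against a human overall_success label, the string can be parsed into structured records first. The sketch below is one way to do that under stated assumptions: the helper names are hypothetical, the "Failure" verdict does not appear in the two sample items shown and is assumed, and the mapping from the short verdict "Partial" to the radio label "Partial Success" is an assumption rather than something the config defines.

```python
import re

# Matches lines such as "GPT-4-Judge: Success (0.95)".
SCORE_LINE = re.compile(r"^(?P<name>[^:]+):\s*(?P<verdict>.+?)\s*\((?P<score>\d*\.?\d+)\)$")

# Assumption: evaluator verdicts map onto the overall_success radio labels like this.
VERDICT_TO_LABEL = {"Success": "Success", "Partial": "Partial Success", "Failure": "Failure"}


def parse_auto_scores(raw: str) -> dict[str, tuple[str, float]]:
    """Turn the newline-separated auto_scores string into {evaluator: (verdict, score)}."""
    parsed = {}
    for line in raw.splitlines():
        match = SCORE_LINE.match(line.strip())
        if match:  # lines that do not fit the pattern are skipped
            parsed[match["name"].strip()] = (match["verdict"], float(match["score"]))
    return parsed


def disagreeing_evaluators(raw: str, human_label: str) -> list[str]:
    """List evaluators whose verdict differs from the human overall_success label."""
    return [
        name
        for name, (verdict, _score) in parse_auto_scores(raw).items()
        if VERDICT_TO_LABEL.get(verdict, verdict) != human_label
    ]


# The auto_scores string from sample item arb-001.
auto_scores = (
    "GPT-4-Judge: Success (0.95)\nClaude-Judge: Success (0.91)\n"
    "Rule-Based: Partial (0.60)\nReward Model v2: Success (0.88)"
)

print(disagreeing_evaluators(auto_scores, "Success"))  # ['Rule-Based']
```

A list like this could serve as a starting point for the free-text evaluator_disagreement field, where the annotator explains why a given evaluator differs.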

Get This Design

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/agentrewardbench-trajectory-scoring
potato start config.yaml

Dataset & paper

Lù et al., arXiv 2025

Citation (BibTeX)

bibtex

@article{lu2025agentrewardbench,
  title={AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories},
  author={L\`{u}, Xing Han and Kazemnejad, Amirhossein and Meade, Nicholas and Patel, Arkil and Shin, Dongchan and Zambrano, Alejandra and Sta\'{n}czak, Karolina and Shaw, Peter and Pal, Christopher J. and Reddy, Siva},
  journal={arXiv preprint arXiv:2504.08942},
  year={2025}
}

Details

Annotation Types

multirate, radio, text

Domain

Agentic AI, Web Agents, Evaluation

Use Cases

Agent Evaluation, Trajectory Scoring

Related Designs

TrajEval Staged Evaluation

Evaluate code agent trajectories decomposed into search, edit, and verification stages, rating the quality of each stage and determining an overall pass/fail verdict.

multirate, radio

WebArena: Realistic Web Agent Evaluation Benchmark

WebArena is a reproducible, self-hosted web environment with 812 tasks for testing autonomous language agents. This Potato config has annotators score whether an agent's actions completed each task.

radio, text

DevBench Repository Evaluation

Evaluate AI-generated repositories across the full software development lifecycle. Annotators rate architecture design, code quality, test coverage, documentation, and dependency management for generated projects.