AgentRewardBench Trajectory Scoring
Evaluate web agent trajectories by rating step-level quality across multiple dimensions, judging overall success, and identifying where automatic evaluators disagree with human judgment.
Configuration File: config.yaml
This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.
# AgentRewardBench Trajectory Scoring
# Based on "AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories" (Lù et al., arXiv 2025)
# Task: Rate web agent trajectory steps, judge overall success, and identify evaluator disagreements
annotation_task_name: "AgentRewardBench Trajectory Scoring"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="font-family: Arial, sans-serif; max-width: 1100px; margin: 0 auto;">
    <div style="display: flex; gap: 20px;">
      <div style="flex: 2;">
        <div style="background: #e8f4fd; padding: 14px; border-radius: 8px; margin-bottom: 16px;">
          <h3 style="margin: 0 0 8px 0; color: #1a5276;">Task Description</h3>
          <p style="margin: 0; font-size: 15px;">{{text}}</p>
          <p style="margin: 8px 0 0 0; font-size: 13px; color: #555;"><strong>Website:</strong> {{website}}</p>
        </div>
        <div style="background: #fafafa; border: 1px solid #ddd; padding: 14px; border-radius: 8px; margin-bottom: 16px;">
          <h3 style="margin: 0 0 10px 0; color: #2c3e50;">Agent Trajectory</h3>
          <div style="font-size: 14px; line-height: 1.7; white-space: pre-wrap;">{{trajectory}}</div>
        </div>
      </div>
      <div style="flex: 1; min-width: 260px;">
        <div style="background: #fff3cd; border: 1px solid #ffc107; padding: 14px; border-radius: 8px; position: sticky; top: 10px;">
          <h4 style="margin: 0 0 10px 0; color: #856404;">Automatic Evaluator Scores</h4>
          <div style="font-size: 13px; line-height: 1.8; white-space: pre-wrap;">{{auto_scores}}</div>
        </div>
      </div>
    </div>
  </div>
annotation_schemes:
  - annotation_type: multirate
    name: step_scores
    description: "Rate each dimension of the agent trajectory on a 5-point scale."
    labels:
      - "1 - Very Poor"
      - "2 - Poor"
      - "3 - Average"
      - "4 - Good"
      - "5 - Excellent"
    options:
      - "Step Correctness"
      - "Efficiency"
      - "Goal Alignment"
      - "Recovery Quality"
      - "Final Success"
  - name: overall_success
    description: "Judge the overall success of the agent in completing the task."
    annotation_type: radio
    labels:
      - "Success"
      - "Partial Success"
      - "Failure"
    keyboard_shortcuts:
      "Success": "1"
      "Partial Success": "2"
      "Failure": "3"
  - name: evaluator_disagreement
    description: "Where do automatic evaluators disagree with your judgment? Explain any differences."
    annotation_type: text
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
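Because annotation_per_instance is set to 2, each trajectory ends up with two overall_success labels, so agreement between the two annotators can be measured. The following is a minimal sketch of Cohen's kappa in plain Python. It assumes the labels have already been extracted from Potato's output into two aligned lists; the function name cohens_kappa and the example labels are hypothetical and not part of the config or the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's marginals.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label]
                   for label in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical overall_success labels from two annotators on the same four items.
annotator_1 = ["Success", "Success", "Failure", "Partial Success"]
annotator_2 = ["Success", "Partial Success", "Failure", "Partial Success"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because "Success" may dominate the label distribution.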
Sample Data: sample-data.json
[
  {
    "id": "arb-001",
    "text": "Find the cheapest wireless noise-cancelling headphones with at least 4-star rating and add them to cart.",
    "website": "shopping.example.com",
    "trajectory": "Step 1: Navigate to shopping.example.com\nStep 2: Click on 'Electronics' category\nStep 3: Type 'wireless noise-cancelling headphones' in search bar\nStep 4: Click 'Sort by Price: Low to High'\nStep 5: Scroll down and check ratings filter >= 4 stars\nStep 6: Click on 'SoundMax Pro NC-200' ($49.99, 4.3 stars)\nStep 7: Click 'Add to Cart'\nStep 8: Verify item appears in cart with correct price",
    "auto_scores": "GPT-4-Judge: Success (0.95)\nClaude-Judge: Success (0.91)\nRule-Based: Partial (0.60)\nReward Model v2: Success (0.88)"
  },
  {
    "id": "arb-002",
    "text": "Post a reply to the thread about Python async best practices in the programming forum.",
    "website": "forum.devtalk.example.org",
    "trajectory": "Step 1: Navigate to forum.devtalk.example.org\nStep 2: Click 'Programming Languages' section\nStep 3: Click 'Python' subsection\nStep 4: Search for 'async best practices'\nStep 5: Click on thread 'Best practices for asyncio in production?'\nStep 6: Scroll to bottom of thread\nStep 7: Click 'Reply' button\nStep 8: Type response about using asyncio.gather for concurrent tasks\nStep 9: Click 'Submit Reply'\nStep 10: Verify reply appears in thread",
    "auto_scores": "GPT-4-Judge: Success (0.92)\nClaude-Judge: Success (0.89)\nRule-Based: Success (0.85)\nReward Model v2: Success (0.90)"
  }
]
// ... and 6 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/agentrewardbench-trajectory-scoring
potato start config.yaml
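Once annotations are collected, the free-text evaluator_disagreement answers can be complemented by a mechanical comparison between a human overall_success label and the auto_scores string shown in the sample data. The sketch below assumes each auto_scores line follows the "Name: Verdict (score)" pattern seen in the two sample items, and that the verdict "Partial" corresponds to the "Partial Success" radio label. Both are assumptions, and the helper names parse_auto_scores and disagreeing_evaluators are hypothetical.

```python
import re

# One evaluator per line, e.g. "Rule-Based: Partial (0.60)" (format assumed from the samples).
LINE_RE = re.compile(
    r"^(?P<name>.+?):\s*(?P<verdict>.+?)\s*\((?P<score>\d+(?:\.\d+)?)\)\s*$"
)

# Assumption: "Partial" in auto_scores maps to the "Partial Success" radio label.
VERDICT_TO_LABEL = {"Partial": "Partial Success"}


def parse_auto_scores(auto_scores):
    """Return {evaluator_name: (label, score)} parsed from an auto_scores string."""
    parsed = {}
    for line in auto_scores.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed format
        verdict = match.group("verdict")
        label = VERDICT_TO_LABEL.get(verdict, verdict)
        parsed[match.group("name")] = (label, float(match.group("score")))
    return parsed


def disagreeing_evaluators(auto_scores, human_label):
    """List evaluators whose verdict differs from the human overall_success label."""
    return [name
            for name, (label, _score) in parse_auto_scores(auto_scores).items()
            if label != human_label]


# auto_scores value copied from sample item arb-001; the human label is hypothetical.
auto_scores = ("GPT-4-Judge: Success (0.95)\nClaude-Judge: Success (0.91)\n"
               "Rule-Based: Partial (0.60)\nReward Model v2: Success (0.88)")
flagged = disagreeing_evaluators(auto_scores, "Success")
print(flagged)
```

For arb-001 with a human label of "Success", only the rule-based evaluator is flagged. Lines that do not match the assumed pattern are skipped rather than guessed at.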
Dataset & paper
Lù et al., arXiv 2025
Citation (BibTeX)
@article{lu2025agentrewardbench,
  title={AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories},
  author={L\`{u}, Xing Han and Kazemnejad, Amirhossein and Meade, Nicholas and Patel, Arkil and Shin, Dongchan and Zambrano, Alejandra and Sta\'{n}czak, Karolina and Shaw, Peter and Pal, Christopher J. and Reddy, Siva},
  journal={arXiv preprint arXiv:2504.08942},
  year={2025}
}
Related Designs
TrajEval Staged Evaluation
Evaluate code agent trajectories decomposed into search, edit, and verification stages, rating quality of each stage and determining overall pass/fail verdict.
WebArena: Realistic Web Agent Evaluation Benchmark
WebArena is a reproducible, self-hosted web environment with 812 tasks for testing autonomous language agents. This Potato config has annotators score whether an agent's actions completed each task.
DevBench Repository Evaluation
Evaluate AI-generated repositories across the full software development lifecycle. Annotators rate architecture design, code quality, test coverage, documentation, and dependency management for generated projects.