AgentRewardBench Trajectory Scoring
Evaluate web agent trajectories by rating step-level quality across multiple dimensions, judging overall success, and identifying where automatic evaluators disagree with human judgment.
Configuration File (config.yaml)
# AgentRewardBench Trajectory Scoring
# Based on "AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories" (Lù et al., arXiv 2025)
# Task: Rate web agent trajectory steps, judge overall success, and identify evaluator disagreements
annotation_task_name: "AgentRewardBench Trajectory Scoring"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="font-family: Arial, sans-serif; max-width: 1100px; margin: 0 auto;">
    <div style="display: flex; gap: 20px;">
      <div style="flex: 2;">
        <div style="background: #e8f4fd; padding: 14px; border-radius: 8px; margin-bottom: 16px;">
          <h3 style="margin: 0 0 8px 0; color: #1a5276;">Task Description</h3>
          <p style="margin: 0; font-size: 15px;">{{text}}</p>
          <p style="margin: 8px 0 0 0; font-size: 13px; color: #555;"><strong>Website:</strong> {{website}}</p>
        </div>
        <div style="background: #fafafa; border: 1px solid #ddd; padding: 14px; border-radius: 8px; margin-bottom: 16px;">
          <h3 style="margin: 0 0 10px 0; color: #2c3e50;">Agent Trajectory</h3>
          <div style="font-size: 14px; line-height: 1.7; white-space: pre-wrap;">{{trajectory}}</div>
        </div>
      </div>
      <div style="flex: 1; min-width: 260px;">
        <div style="background: #fff3cd; border: 1px solid #ffc107; padding: 14px; border-radius: 8px; position: sticky; top: 10px;">
          <h4 style="margin: 0 0 10px 0; color: #856404;">Automatic Evaluator Scores</h4>
          <div style="font-size: 13px; line-height: 1.8; white-space: pre-wrap;">{{auto_scores}}</div>
        </div>
      </div>
    </div>
  </div>
annotation_schemes:
  - annotation_type: multirate
    name: step_scores
    description: "Rate each dimension of the agent trajectory on a 5-point scale."
    labels:
      - "1 - Very Poor"
      - "2 - Poor"
      - "3 - Average"
      - "4 - Good"
      - "5 - Excellent"
    options:
      - "Step Correctness"
      - "Efficiency"
      - "Goal Alignment"
      - "Recovery Quality"
      - "Final Success"
  - name: overall_success
    description: "Judge the overall success of the agent in completing the task."
    annotation_type: radio
    labels:
      - "Success"
      - "Partial Success"
      - "Failure"
    keyboard_shortcuts:
      "Success": "1"
      "Partial Success": "2"
      "Failure": "3"
  - name: evaluator_disagreement
    description: "Where do automatic evaluators disagree with your judgment? Explain any differences."
    annotation_type: text

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
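Before launching Potato, a quick sanity check on the parsed config can catch missing keys or label lists. This is an illustrative sketch, not part of Potato itself; `validate_config` is a hypothetical helper that takes the config already parsed into a dict (e.g. via PyYAML):

```python
def validate_config(cfg):
    """Return a list of problems found in a Potato-style config dict.

    Hypothetical helper for illustration; the keys checked mirror the
    config.yaml above, not Potato's full schema.
    """
    problems = []
    # Top-level keys the design above relies on.
    for key in ("annotation_task_name", "data_files", "annotation_schemes"):
        if key not in cfg:
            problems.append(f"missing top-level key: {key}")
    # Each scheme needs a name and a type; choice-style schemes need labels.
    for i, scheme in enumerate(cfg.get("annotation_schemes", [])):
        if "name" not in scheme:
            problems.append(f"scheme {i} has no name")
        if "annotation_type" not in scheme:
            problems.append(f"scheme {i} has no annotation_type")
        if scheme.get("annotation_type") in ("radio", "multirate") and not scheme.get("labels"):
            problems.append(f"scheme {i} needs labels")
    return problems
```

An empty return list means the basic structure above is in place; anything else is worth fixing before `potato start`.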
Sample Data (sample-data.json)
[
{
"id": "arb-001",
"text": "Find the cheapest wireless noise-cancelling headphones with at least 4-star rating and add them to cart.",
"website": "shopping.example.com",
"trajectory": "Step 1: Navigate to shopping.example.com\nStep 2: Click on 'Electronics' category\nStep 3: Type 'wireless noise-cancelling headphones' in search bar\nStep 4: Click 'Sort by Price: Low to High'\nStep 5: Scroll down and check ratings filter >= 4 stars\nStep 6: Click on 'SoundMax Pro NC-200' ($49.99, 4.3 stars)\nStep 7: Click 'Add to Cart'\nStep 8: Verify item appears in cart with correct price",
"auto_scores": "GPT-4-Judge: Success (0.95)\nClaude-Judge: Success (0.91)\nRule-Based: Partial (0.60)\nReward Model v2: Success (0.88)"
},
{
"id": "arb-002",
"text": "Post a reply to the thread about Python async best practices in the programming forum.",
"website": "forum.devtalk.example.org",
"trajectory": "Step 1: Navigate to forum.devtalk.example.org\nStep 2: Click 'Programming Languages' section\nStep 3: Click 'Python' subsection\nStep 4: Search for 'async best practices'\nStep 5: Click on thread 'Best practices for asyncio in production?'\nStep 6: Scroll to bottom of thread\nStep 7: Click 'Reply' button\nStep 8: Type response about using asyncio.gather for concurrent tasks\nStep 9: Click 'Submit Reply'\nStep 10: Verify reply appears in thread",
"auto_scores": "GPT-4-Judge: Success (0.92)\nClaude-Judge: Success (0.89)\nRule-Based: Success (0.85)\nReward Model v2: Success (0.90)"
}
]
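The `auto_scores` field packs each evaluator's verdict and confidence into one `Name: Verdict (score)` line. For downstream analysis it helps to parse this into structured tuples; `parse_auto_scores` below is a hypothetical helper, assuming the exact format shown in the sample data:

```python
import re

def parse_auto_scores(raw):
    """Parse an auto_scores string ("Name: Verdict (0.95)" per line)
    into a list of (evaluator, verdict, confidence) tuples.

    Hypothetical helper; assumes the format used in sample-data.json.
    """
    # Lazy match up to the first colon so evaluator names may contain spaces.
    pattern = re.compile(r"^(.+?):\s*(\w[\w ]*?)\s*\(([\d.]+)\)$")
    results = []
    for line in raw.splitlines():
        m = pattern.match(line.strip())
        if m:
            results.append((m.group(1), m.group(2), float(m.group(3))))
    return results
```

For arb-001 this yields four tuples, including `("Rule-Based", "Partial", 0.6)`, the one evaluator that dissents from the others.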
// ... and 6 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/agentrewardbench-trajectory-scoring
potato start config.yaml
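Once annotations are collected, the `evaluator_disagreement` free-text notes can be cross-checked programmatically by comparing each automatic verdict against the human `overall_success` label. A minimal sketch; `evaluator_disagreements` is hypothetical and assumes you have already extracted the auto verdicts and the human label from the output JSON (whose exact layout depends on your Potato version):

```python
def evaluator_disagreements(auto_verdicts, human_verdict):
    """List evaluators whose verdict differs from the human label.

    auto_verdicts: {evaluator_name: verdict_string}, e.g. from auto_scores.
    human_verdict: the annotator's overall_success label.
    Verdicts are normalized on the first word so "Partial" from an
    automatic evaluator matches the "Partial Success" radio label.
    """
    def norm(v):
        return v.lower().split()[0]  # "Partial Success" -> "partial"
    return sorted(e for e, v in auto_verdicts.items() if norm(v) != norm(human_verdict))
```

Applied to arb-001, a human "Success" label would flag only the rule-based evaluator, matching the paper's interest in where automatic evaluators and humans diverge.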
Found an issue or want to improve this design? Open an issue on the repository.

Related Designs
TrajEval Staged Evaluation
Evaluate code agent trajectories decomposed into search, edit, and verification stages, rating quality of each stage and determining overall pass/fail verdict.
WebArena: Web Agent Task Completion Evaluation
Evaluation of autonomous web agent task completions. Annotators verify whether an agent's sequence of web actions correctly completes a given task, assess partial completion, and identify failure modes.
DevBench Repository Evaluation
Evaluate AI-generated repositories across the full software development lifecycle. Annotators rate architecture design, code quality, test coverage, documentation, and dependency management for generated projects.