AgentBoard Progress Scoring
Assess multi-turn LLM agent progress by identifying achieved milestones, scoring overall progress, categorizing the agent environment, and noting partial progress observations.
Configuration File: config.yaml
# AgentBoard Progress Scoring
# Based on "AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents" (Ma et al., NeurIPS 2024)
# Task: Evaluate multi-turn agent progress through milestone tracking and progress scoring
annotation_task_name: "AgentBoard Progress Scoring"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="font-family: Arial, sans-serif; max-width: 1000px; margin: 0 auto;">
    <div style="background: #e8f4fd; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h3 style="margin: 0 0 8px 0; color: #1a5276;">Task Description</h3>
      <p style="margin: 0; font-size: 15px;">{{text}}</p>
      <span style="display: inline-block; margin-top: 8px; background: #2c3e50; color: #fff; padding: 3px 10px; border-radius: 12px; font-size: 12px;">{{environment}}</span>
    </div>
    <div style="background: #eafaf1; border: 1px solid #27ae60; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h4 style="margin: 0 0 8px 0; color: #1e8449;">Milestones Checklist</h4>
      <p style="margin: 0; font-size: 14px; line-height: 1.8; white-space: pre-wrap;">{{milestones}}</p>
    </div>
    <details style="margin-bottom: 14px;" open>
      <summary style="cursor: pointer; font-weight: bold; font-size: 15px; padding: 8px; background: #f5f5f5; border-radius: 6px;">Agent Trajectory</summary>
      <div style="padding: 12px; border: 1px solid #ddd; border-top: none; border-radius: 0 0 6px 6px; font-size: 14px; line-height: 1.7; white-space: pre-wrap;">{{trajectory}}</div>
    </details>
    <div style="background: #fdf2e9; border: 1px solid #e67e22; padding: 14px; border-radius: 8px;">
      <h4 style="margin: 0 0 8px 0; color: #a04000;">Final State</h4>
      <p style="margin: 0; font-size: 14px; white-space: pre-wrap;">{{final_state}}</p>
    </div>
  </div>
annotation_schemes:
  - name: milestones_reached
    description: "Select all subtask milestones that the agent successfully achieved."
    annotation_type: multiselect
    labels:
      - "Milestone 1"
      - "Milestone 2"
      - "Milestone 3"
      - "Milestone 4"
      - "Milestone 5"
  - name: progress_score
    description: "Rate the overall progress of the agent toward completing the task."
    annotation_type: likert
    min_label: "1 - No Progress"
    max_label: "5 - Complete"
    size: 5
  - name: agent_category
    description: "Which environment category does this agent task belong to?"
    annotation_type: radio
    labels:
      - "Web Shopping"
      - "Web Browsing"
      - "Tool Use"
      - "Game"
      - "Embodied"
      - "Reasoning"
    keyboard_shortcuts:
      "Web Shopping": "1"
      "Web Browsing": "2"
      "Tool Use": "3"
      "Game": "4"
      "Embodied": "5"
      "Reasoning": "6"
  - name: progress_notes
    description: "Notes on partial progress, missed milestones, or interesting agent behavior."
    annotation_type: text
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
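Once annotations are exported, the milestones_reached selection and the progress_score rating can be put on a common scale and compared across the two annotators assigned to each item. The following is a minimal Python sketch, not part of Potato or AgentBoard: the record shape (a list of selected labels plus an integer score), the helper names, and the choice of Jaccard overlap as an agreement measure are all assumptions made for illustration.

```python
from typing import Iterable

TOTAL_MILESTONES = 5  # matches the five "Milestone N" labels in config.yaml


def milestone_fraction(selected: Iterable[str], total: int = TOTAL_MILESTONES) -> float:
    """Fraction of the checklist an annotator marked as reached."""
    return len(set(selected)) / total


def likert_to_unit(score: int, size: int = 5) -> float:
    """Map a 1..size Likert rating onto [0, 1] for comparison with the fraction."""
    return (score - 1) / (size - 1)


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Overlap between two annotators' selections (1.0 when both are empty)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical annotations from two annotators for the same item.
ann_a = {"milestones_reached": ["Milestone 1", "Milestone 2", "Milestone 3"], "progress_score": 3}
ann_b = {"milestones_reached": ["Milestone 1", "Milestone 2"], "progress_score": 3}

print(milestone_fraction(ann_a["milestones_reached"]))
print(likert_to_unit(ann_a["progress_score"]))
print(jaccard(ann_a["milestones_reached"], ann_b["milestones_reached"]))
```

A large gap between the milestone fraction and the rescaled Likert score for the same annotator may be worth a look in progress_notes, since it suggests the checklist and the overall impression disagree.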
Sample Data: sample-data.json
[
  {
    "id": "ab-001",
    "text": "Purchase a red cotton t-shirt in size Medium from an online clothing store, applying any available discount code.",
    "environment": "Web Shopping",
    "milestones": "1. Navigate to clothing store\n2. Search for red cotton t-shirt\n3. Select size Medium\n4. Apply discount code\n5. Complete checkout",
    "trajectory": "Turn 1: Navigate to shop.example.com\nTurn 2: Search 'red cotton t-shirt'\nTurn 3: Filter by color: Red, material: Cotton\nTurn 4: Select first result 'Classic Cotton Tee - Red' ($24.99)\nTurn 5: Choose size: Medium\nTurn 6: Add to cart\nTurn 7: Go to checkout\nTurn 8: Look for discount code field\nTurn 9: Try code 'SAVE10' - Invalid\nTurn 10: Try code 'WELCOME' - 15% off applied\nTurn 11: Confirm order details\nTurn 12: Click 'Place Order'",
    "final_state": "Order confirmed. Red cotton t-shirt (M) purchased for $21.24 after 15% discount. Order #SH-90234."
  },
  {
    "id": "ab-002",
    "text": "Use the Wikipedia API to find the population of the 5 largest cities in Japan and calculate their total population.",
    "environment": "Tool Use",
    "milestones": "1. Identify the 5 largest cities in Japan\n2. Query population for each city\n3. Retrieve correct population figures\n4. Calculate total population\n5. Present results clearly",
    "trajectory": "Turn 1: Call wikipedia_search('largest cities in Japan')\nTurn 2: Parse results - identified Tokyo, Yokohama, Osaka, Nagoya, Sapporo\nTurn 3: Call wikipedia_page('Tokyo') - Population: 13,960,000\nTurn 4: Call wikipedia_page('Yokohama') - Population: 3,749,000\nTurn 5: Call wikipedia_page('Osaka') - Population: 2,753,000\nTurn 6: Call wikipedia_page('Nagoya') - Population: 2,296,000\nTurn 7: Call wikipedia_page('Sapporo') - Population: 1,973,000\nTurn 8: Calculate total: 24,731,000",
    "final_state": "Total population of 5 largest Japanese cities: 24,731,000. All data retrieved successfully from Wikipedia API."
  }
]
// ... and 6 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/agentboard-progress-scoring
potato start config.yaml
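Before launching, it can help to confirm that every data item supplies the fields the html_layout template references ({{text}}, {{environment}}, {{milestones}}, {{trajectory}}, {{final_state}}). The Python sketch below is an illustrative pre-flight check and not a Potato feature; the helper names and the inline sample item are hypothetical, and it assumes placeholders use the plain {{name}} form seen in config.yaml.

```python
import re

# Matches {{name}} placeholders such as {{text}} or {{trajectory}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def layout_fields(html_layout):
    """Collect the field names referenced by a layout template."""
    return set(PLACEHOLDER.findall(html_layout))


def missing_fields(items, required):
    """Return {item id: sorted missing keys} for items lacking a required key."""
    report = {}
    for item in items:
        missing = sorted(required - set(item))
        if missing:
            report[item.get("id", "<no id>")] = missing
    return report


def split_milestones(text):
    """Turn '1. Foo\\n2. Bar' into ['Foo', 'Bar']."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return [re.sub(r"^\d+\.\s*", "", line) for line in lines]


# Hypothetical, shortened stand-ins for html_layout and sample-data.json.
layout = "<p>{{text}}</p><span>{{environment}}</span><p>{{milestones}}</p>"
items = [
    {"id": "ab-001", "text": "Buy a shirt", "environment": "Web Shopping",
     "milestones": "1. Navigate to clothing store\n2. Complete checkout"},
    {"id": "ab-xyz", "text": "Item with no milestones field"},
]

required = layout_fields(layout)
print(missing_fields(items, required))
print(split_milestones(items[0]["milestones"]))
```

With the real files, the layout string would come from the html_layout key of config.yaml and the items from json.load on sample-data.json. Counting the entries returned by split_milestones is also a quick way to spot items whose checklist does not have the five entries that the milestones_reached labels assume.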
Related Designs
tau-bench Agent Evaluation
Evaluate tool-agent-user interactions in customer service domains by judging task success, conversation quality, tool use correctness, and providing evaluation rationale.
OSWorld: Desktop Agent Task Evaluation
Evaluation of multimodal agents performing open-ended tasks in real desktop environments. Annotators assess task success, identify OS-level actions, rate efficiency, and analyze failures across Ubuntu, Windows, and macOS environments.
RefactorBench Multi-File Evaluation
Evaluate multi-file refactoring operations generated by coding agents. Annotators assess whether refactorings preserve behavior, identify the types of refactoring applied, rate code improvement, and provide detailed review comments.