AgentBoard Progress Scoring
Assess multi-turn LLM agent progress by identifying achieved milestones, scoring overall progress, categorizing the agent environment, and noting partial progress observations.
Configuration File: config.yaml
This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.
# AgentBoard Progress Scoring
# Based on "AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents" (Ma et al., NeurIPS 2024)
# Task: Evaluate multi-turn agent progress through milestone tracking and progress scoring
annotation_task_name: "AgentBoard Progress Scoring"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="font-family: Arial, sans-serif; max-width: 1000px; margin: 0 auto;">
    <div style="background: #e8f4fd; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h3 style="margin: 0 0 8px 0; color: #1a5276;">Task Description</h3>
      <p style="margin: 0; font-size: 15px;">{{text}}</p>
      <span style="display: inline-block; margin-top: 8px; background: #2c3e50; color: #fff; padding: 3px 10px; border-radius: 12px; font-size: 12px;">{{environment}}</span>
    </div>
    <div style="background: #eafaf1; border: 1px solid #27ae60; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h4 style="margin: 0 0 8px 0; color: #1e8449;">Milestones Checklist</h4>
      <p style="margin: 0; font-size: 14px; line-height: 1.8; white-space: pre-wrap;">{{milestones}}</p>
    </div>
    <details style="margin-bottom: 14px;" open>
      <summary style="cursor: pointer; font-weight: bold; font-size: 15px; padding: 8px; background: #f5f5f5; border-radius: 6px;">Agent Trajectory</summary>
      <div style="padding: 12px; border: 1px solid #ddd; border-top: none; border-radius: 0 0 6px 6px; font-size: 14px; line-height: 1.7; white-space: pre-wrap;">{{trajectory}}</div>
    </details>
    <div style="background: #fdf2e9; border: 1px solid #e67e22; padding: 14px; border-radius: 8px;">
      <h4 style="margin: 0 0 8px 0; color: #a04000;">Final State</h4>
      <p style="margin: 0; font-size: 14px; white-space: pre-wrap;">{{final_state}}</p>
    </div>
  </div>
annotation_schemes:
  - name: milestones_reached
    description: "Select all subtask milestones that the agent successfully achieved."
    annotation_type: multiselect
    labels:
      - "Milestone 1"
      - "Milestone 2"
      - "Milestone 3"
      - "Milestone 4"
      - "Milestone 5"
  - name: progress_score
    description: "Rate the overall progress of the agent toward completing the task."
    annotation_type: likert
    min_label: "1 - No Progress"
    max_label: "5 - Complete"
    size: 5
  - name: agent_category
    description: "Which environment category does this agent task belong to?"
    annotation_type: radio
    labels:
      - "Web Shopping"
      - "Web Browsing"
      - "Tool Use"
      - "Game"
      - "Embodied"
      - "Reasoning"
    keyboard_shortcuts:
      "Web Shopping": "1"
      "Web Browsing": "2"
      "Tool Use": "3"
      "Game": "4"
      "Embodied": "5"
      "Reasoning": "6"
  - name: progress_notes
    description: "Notes on partial progress, missed milestones, or interesting agent behavior."
    annotation_type: text
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
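The milestones_reached and progress_score schemes above yield, per annotator, a set of selected milestone labels and a 1-5 rating. The following is a minimal Python sketch of how those might be combined into a milestone-based progress rate and averaged over the two annotators assigned to each item. The record layout (a dict with milestones_reached as a list of labels and progress_score as an integer) is an assumption made for illustration; Potato's actual output files may be structured differently, and the function names are hypothetical.

```python
from statistics import mean

# Labels copied from the milestones_reached scheme in config.yaml.
MILESTONE_LABELS = [f"Milestone {i}" for i in range(1, 6)]


def progress_rate(selected, total_milestones=len(MILESTONE_LABELS)):
    """Fraction of milestones marked as achieved (0.0 to 1.0)."""
    # Ignore anything that is not one of the configured labels.
    valid = set(selected) & set(MILESTONE_LABELS[:total_milestones])
    return len(valid) / total_milestones


def summarize_item(annotations):
    """Average progress rate and Likert score over an item's annotators.

    `annotations` is a hypothetical list of dicts, one per annotator.
    """
    return {
        "progress_rate": mean(progress_rate(a["milestones_reached"]) for a in annotations),
        "progress_score": mean(a["progress_score"] for a in annotations),
    }


# Hypothetical annotations from two annotators for one item.
example = [
    {"milestones_reached": ["Milestone 1", "Milestone 2", "Milestone 3"], "progress_score": 3},
    {"milestones_reached": ["Milestone 1", "Milestone 2"], "progress_score": 3},
]
summary = summarize_item(example)
```

Dividing by the fixed label count keeps the rate comparable across items. For tasks with fewer than five milestones, passing a smaller total_milestones avoids penalizing the agent for labels that do not apply.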
Sample Data: sample-data.json
[
  {
    "id": "ab-001",
    "text": "Purchase a red cotton t-shirt in size Medium from an online clothing store, applying any available discount code.",
    "environment": "Web Shopping",
    "milestones": "1. Navigate to clothing store\n2. Search for red cotton t-shirt\n3. Select size Medium\n4. Apply discount code\n5. Complete checkout",
    "trajectory": "Turn 1: Navigate to shop.example.com\nTurn 2: Search 'red cotton t-shirt'\nTurn 3: Filter by color: Red, material: Cotton\nTurn 4: Select first result 'Classic Cotton Tee - Red' ($24.99)\nTurn 5: Choose size: Medium\nTurn 6: Add to cart\nTurn 7: Go to checkout\nTurn 8: Look for discount code field\nTurn 9: Try code 'SAVE10' - Invalid\nTurn 10: Try code 'WELCOME' - 15% off applied\nTurn 11: Confirm order details\nTurn 12: Click 'Place Order'",
    "final_state": "Order confirmed. Red cotton t-shirt (M) purchased for $21.24 after 15% discount. Order #SH-90234."
  },
  {
    "id": "ab-002",
    "text": "Use the Wikipedia API to find the population of the 5 largest cities in Japan and calculate their total population.",
    "environment": "Tool Use",
    "milestones": "1. Identify the 5 largest cities in Japan\n2. Query population for each city\n3. Retrieve correct population figures\n4. Calculate total population\n5. Present results clearly",
    "trajectory": "Turn 1: Call wikipedia_search('largest cities in Japan')\nTurn 2: Parse results - identified Tokyo, Yokohama, Osaka, Nagoya, Sapporo\nTurn 3: Call wikipedia_page('Tokyo') - Population: 13,960,000\nTurn 4: Call wikipedia_page('Yokohama') - Population: 3,749,000\nTurn 5: Call wikipedia_page('Osaka') - Population: 2,753,000\nTurn 6: Call wikipedia_page('Nagoya') - Population: 2,296,000\nTurn 7: Call wikipedia_page('Sapporo') - Population: 1,973,000\nTurn 8: Calculate total: 24,731,000",
    "final_state": "Total population of 5 largest Japanese cities: 24,731,000. All data retrieved successfully from Wikipedia API."
  }
]
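In the sample items above, milestones and trajectory are single strings with one numbered entry per line. A small Python sketch of splitting them into lists is given here; it may help when checking that an item has no more milestones than the five "Milestone N" labels in the config. The helper names are hypothetical, and the regular expressions assume the "N. " and "Turn N: " prefixes seen in the two sample items.

```python
import re


def parse_milestones(milestones):
    """Split a newline-separated, numbered milestone string into descriptions."""
    return [
        re.sub(r"^\d+\.\s*", "", line)
        for line in milestones.split("\n")
        if line.strip()
    ]


def parse_trajectory(trajectory):
    """Split a newline-separated trajectory string into (turn_number, action) pairs."""
    turns = []
    for line in trajectory.split("\n"):
        match = re.match(r"Turn (\d+):\s*(.*)", line)
        if match:
            turns.append((int(match.group(1)), match.group(2)))
    return turns


# Values shortened from sample item ab-001.
item = {
    "milestones": "1. Navigate to clothing store\n2. Search for red cotton t-shirt\n3. Select size Medium",
    "trajectory": "Turn 1: Navigate to shop.example.com\nTurn 2: Search 'red cotton t-shirt'",
}
milestones = parse_milestones(item["milestones"])
turns = parse_trajectory(item["trajectory"])
```

Lines that do not match the assumed "Turn N: " prefix are skipped rather than raising an error, so malformed entries would need a separate check.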
// ... and 6 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/agentboard-progress-scoring
potato start config.yaml
Dataset & paper
Ma et al., NeurIPS 2024
Citation (BibTeX)
@inproceedings{ma2024agentboard,
  title={AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents},
  author={Ma, Chang and Zhang, Junlei and Zhu, Zhihao and Yang, Cheng and Yang, Yujiu and Jin, Yaohui and Lan, Zhenzhong and Kong, Lingpeng and He, Junxian},
  booktitle={Advances in Neural Information Processing Systems},
  year={2024}
}
Related Designs
tau-bench Agent Evaluation
Evaluate tool-agent-user interactions in customer service domains by judging task success, conversation quality, tool use correctness, and providing evaluation rationale.
OSWorld: Desktop Agent Task Evaluation
Evaluation of multimodal agents performing open-ended tasks in real desktop environments. Annotators assess task success, identify OS-level actions, rate efficiency, and analyze failures across Ubuntu, Windows, and macOS environments.
RefactorBench Multi-File Evaluation
Evaluate multi-file refactoring operations generated by coding agents. Annotators assess whether refactorings preserve behavior, identify the types of refactoring applied, rate code improvement, and provide detailed review comments.