AgentBoard Progress Scoring

Assess the progress of multi-turn LLM agents by identifying which milestones were achieved, scoring overall progress, categorizing the agent environment, and recording notes on partial progress.

Configuration File: config.yaml

This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.

yaml

# AgentBoard Progress Scoring
# Based on "AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents" (Ma et al., NeurIPS 2024)
# Task: Evaluate multi-turn agent progress through milestone tracking and progress scoring

annotation_task_name: "AgentBoard Progress Scoring"
task_dir: "."

data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

html_layout: |
  <div class="container" style="font-family: Arial, sans-serif; max-width: 1000px; margin: 0 auto;">
    <div style="background: #e8f4fd; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h3 style="margin: 0 0 8px 0; color: #1a5276;">Task Description</h3>
      <p style="margin: 0; font-size: 15px;">{{text}}</p>
      <span style="display: inline-block; margin-top: 8px; background: #2c3e50; color: #fff; padding: 3px 10px; border-radius: 12px; font-size: 12px;">{{environment}}</span>
    </div>
    <div style="background: #eafaf1; border: 1px solid #27ae60; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h4 style="margin: 0 0 8px 0; color: #1e8449;">Milestones Checklist</h4>
      <p style="margin: 0; font-size: 14px; line-height: 1.8; white-space: pre-wrap;">{{milestones}}</p>
    </div>
    <details style="margin-bottom: 14px;" open>
      <summary style="cursor: pointer; font-weight: bold; font-size: 15px; padding: 8px; background: #f5f5f5; border-radius: 6px;">Agent Trajectory</summary>
      <div style="padding: 12px; border: 1px solid #ddd; border-top: none; border-radius: 0 0 6px 6px; font-size: 14px; line-height: 1.7; white-space: pre-wrap;">{{trajectory}}</div>
    </details>
    <div style="background: #fdf2e9; border: 1px solid #e67e22; padding: 14px; border-radius: 8px;">
      <h4 style="margin: 0 0 8px 0; color: #a04000;">Final State</h4>
      <p style="margin: 0; font-size: 14px; white-space: pre-wrap;">{{final_state}}</p>
    </div>
  </div>

annotation_schemes:
  - name: milestones_reached
    description: "Select all subtask milestones that the agent successfully achieved."
    annotation_type: multiselect
    labels:
      - "Milestone 1"
      - "Milestone 2"
      - "Milestone 3"
      - "Milestone 4"
      - "Milestone 5"

  - name: progress_score
    description: "Rate the overall progress of the agent toward completing the task."
    annotation_type: likert
    min_label: "1 - No Progress"
    max_label: "5 - Complete"
    size: 5

  - name: agent_category
    description: "Which environment category does this agent task belong to?"
    annotation_type: radio
    labels:
      - "Web Shopping"
      - "Web Browsing"
      - "Tool Use"
      - "Game"
      - "Embodied"
      - "Reasoning"
    keyboard_shortcuts:
      "Web Shopping": "1"
      "Web Browsing": "2"
      "Tool Use": "3"
      "Game": "4"
      "Embodied": "5"
      "Reasoning": "6"

  - name: progress_notes
    description: "Notes on partial progress, missed milestones, or interesting agent behavior."
    annotation_type: text

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
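
The milestones_reached and progress_score schemes capture related signals, and annotation_per_instance: 2 means each item receives two judgments. As a rough sketch of how these could be post-processed, the Python below turns a set of selected milestone labels into a fractional progress rate and measures how much two annotators' selections overlap. The function names and the in-memory annotator lists are hypothetical, and nothing here assumes how Potato lays out files under annotation_output/.

```python
from typing import Iterable

# Labels copied from the milestones_reached scheme in config.yaml.
MILESTONE_LABELS = [f"Milestone {i}" for i in range(1, 6)]


def progress_rate(selected: Iterable[str], total: int = len(MILESTONE_LABELS)) -> float:
    """Fraction of milestones marked as reached, in [0, 1]."""
    # Ignore anything that is not one of the configured labels.
    reached = set(selected) & set(MILESTONE_LABELS)
    return len(reached) / total


def jaccard_agreement(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard overlap between two annotators' milestone selections."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # both selected nothing: treat as full agreement
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical selections from two annotators for the same item.
annotator_1 = ["Milestone 1", "Milestone 2", "Milestone 3"]
annotator_2 = ["Milestone 1", "Milestone 2", "Milestone 3", "Milestone 4"]

print(progress_rate(annotator_1))                   # 0.6
print(progress_rate(annotator_2))                   # 0.8
print(jaccard_agreement(annotator_1, annotator_2))  # 0.75
```

A milestone-based rate like this can be compared against the 1 to 5 Likert rating to spot items where the two signals disagree; the total argument can be lowered for items whose checklist has fewer than five milestones.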

Sample Data: sample-data.json

json

[
  {
    "id": "ab-001",
    "text": "Purchase a red cotton t-shirt in size Medium from an online clothing store, applying any available discount code.",
    "environment": "Web Shopping",
    "milestones": "1. Navigate to clothing store\n2. Search for red cotton t-shirt\n3. Select size Medium\n4. Apply discount code\n5. Complete checkout",
    "trajectory": "Turn 1: Navigate to shop.example.com\nTurn 2: Search 'red cotton t-shirt'\nTurn 3: Filter by color: Red, material: Cotton\nTurn 4: Select first result 'Classic Cotton Tee - Red' ($24.99)\nTurn 5: Choose size: Medium\nTurn 6: Add to cart\nTurn 7: Go to checkout\nTurn 8: Look for discount code field\nTurn 9: Try code 'SAVE10' - Invalid\nTurn 10: Try code 'WELCOME' - 15% off applied\nTurn 11: Confirm order details\nTurn 12: Click 'Place Order'",
    "final_state": "Order confirmed. Red cotton t-shirt (M) purchased for $21.24 after 15% discount. Order #SH-90234."
  },
  {
    "id": "ab-002",
    "text": "Use the Wikipedia API to find the population of the 5 largest cities in Japan and calculate their total population.",
    "environment": "Tool Use",
    "milestones": "1. Identify the 5 largest cities in Japan\n2. Query population for each city\n3. Retrieve correct population figures\n4. Calculate total population\n5. Present results clearly",
    "trajectory": "Turn 1: Call wikipedia_search('largest cities in Japan')\nTurn 2: Parse results - identified Tokyo, Yokohama, Osaka, Nagoya, Sapporo\nTurn 3: Call wikipedia_page('Tokyo') - Population: 13,960,000\nTurn 4: Call wikipedia_page('Yokohama') - Population: 3,749,000\nTurn 5: Call wikipedia_page('Osaka') - Population: 2,753,000\nTurn 6: Call wikipedia_page('Nagoya') - Population: 2,296,000\nTurn 7: Call wikipedia_page('Sapporo') - Population: 1,973,000\nTurn 8: Calculate total: 24,731,000",
    "final_state": "Total population of 5 largest Japanese cities: 24,731,000. All data retrieved successfully from Wikipedia API."
  }
]

// ... and 6 more items
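
Each item has to supply every field the config refers to: the id_key and text_key values plus the {{environment}}, {{milestones}}, {{trajectory}}, and {{final_state}} placeholders in html_layout. The following is a minimal sketch of a pre-flight check, assuming items shaped like the sample above. The helper names are hypothetical, the embedded item is a shortened copy of ab-001, and the one-to-one mapping between checklist lines and the "Milestone N" labels is an assumption rather than something the config enforces.

```python
import json
import re

# Keys the config relies on: id_key, text_key, and the html_layout placeholders.
REQUIRED_KEYS = {"id", "text", "environment", "milestones", "trajectory", "final_state"}

# Shortened copy of item ab-001 (trajectory and final_state abbreviated).
SAMPLE_JSON = """
[
  {
    "id": "ab-001",
    "text": "Purchase a red cotton t-shirt in size Medium from an online clothing store, applying any available discount code.",
    "environment": "Web Shopping",
    "milestones": "1. Navigate to clothing store\\n2. Search for red cotton t-shirt\\n3. Select size Medium\\n4. Apply discount code\\n5. Complete checkout",
    "trajectory": "Turn 1: Navigate to shop.example.com\\nTurn 2: Search 'red cotton t-shirt'",
    "final_state": "Order confirmed."
  }
]
"""


def missing_keys(item: dict) -> set:
    """Return the required keys that an item does not provide."""
    return REQUIRED_KEYS - set(item)


def split_milestones(raw: str) -> list:
    """Split a newline-separated, numbered checklist into plain milestone strings."""
    lines = [line.strip() for line in raw.split("\n") if line.strip()]
    # Drop the leading "1. ", "2. ", ... numbering from each line.
    return [re.sub(r"^\d+\.\s*", "", line) for line in lines]


items = json.loads(SAMPLE_JSON)
for item in items:
    problems = missing_keys(item)
    if problems:
        print(f"{item.get('id', '?')}: missing {sorted(problems)}")
    else:
        count = len(split_milestones(item["milestones"]))
        print(f"{item['id']}: {count} milestones")
```

To check the real file, replace SAMPLE_JSON with the contents of sample-data.json read from disk. Items whose checklist has more than five lines would not fit the five fixed labels in milestones_reached, so the count is worth verifying before annotation starts.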

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/agentboard-progress-scoring
potato start config.yaml

Dataset & paper

Ma et al., NeurIPS 2024

Citation (BibTeX)

bibtex

@inproceedings{ma2024agentboard,
  title={AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents},
  author={Ma, Chang and Zhang, Junlei and Zhu, Zhihao and Yang, Cheng and Yang, Yujiu and Jin, Yaohui and Lan, Zhenzhong and Kong, Lingpeng and He, Junxian},
  booktitle={Advances in Neural Information Processing Systems},
  year={2024}
}

Details

Annotation Types

multiselect, likert, radio, text

Domain

Agentic AI, LLM Agents, Benchmarking

Use Cases

Agent Evaluation, Progress Assessment

Related Designs

tau-bench Agent Evaluation

Evaluate tool-agent-user interactions in customer service domains by judging task success, conversation quality, tool use correctness, and providing evaluation rationale.

radio, likert

OSWorld: Desktop Agent Task Evaluation

Evaluation of multimodal agents performing open-ended tasks in real desktop environments. Annotators assess task success, identify OS-level actions, rate efficiency, and analyze failures across Ubuntu, Windows, and macOS environments.

radio, multiselect

RefactorBench Multi-File Evaluation

Evaluate multi-file refactoring operations generated by coding agents. Annotators assess whether refactorings preserve behavior, identify the types of refactoring applied, rate code improvement, and provide detailed review comments.