AgentBoard Progress Scoring

Assess multi-turn LLM agent progress by identifying achieved milestones, scoring overall progress, categorizing the agent environment, and noting partial progress observations.

Configuration File: config.yaml

# AgentBoard Progress Scoring
# Based on "AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents" (Ma et al., NeurIPS 2024)
# Task: Evaluate multi-turn agent progress through milestone tracking and progress scoring

annotation_task_name: "AgentBoard Progress Scoring"
task_dir: "."

data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

html_layout: |
  <div class="container" style="font-family: Arial, sans-serif; max-width: 1000px; margin: 0 auto;">
    <div style="background: #e8f4fd; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h3 style="margin: 0 0 8px 0; color: #1a5276;">Task Description</h3>
      <p style="margin: 0; font-size: 15px;">{{text}}</p>
      <span style="display: inline-block; margin-top: 8px; background: #2c3e50; color: #fff; padding: 3px 10px; border-radius: 12px; font-size: 12px;">{{environment}}</span>
    </div>
    <div style="background: #eafaf1; border: 1px solid #27ae60; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h4 style="margin: 0 0 8px 0; color: #1e8449;">Milestones Checklist</h4>
      <p style="margin: 0; font-size: 14px; line-height: 1.8; white-space: pre-wrap;">{{milestones}}</p>
    </div>
    <details style="margin-bottom: 14px;" open>
      <summary style="cursor: pointer; font-weight: bold; font-size: 15px; padding: 8px; background: #f5f5f5; border-radius: 6px;">Agent Trajectory</summary>
      <div style="padding: 12px; border: 1px solid #ddd; border-top: none; border-radius: 0 0 6px 6px; font-size: 14px; line-height: 1.7; white-space: pre-wrap;">{{trajectory}}</div>
    </details>
    <div style="background: #fdf2e9; border: 1px solid #e67e22; padding: 14px; border-radius: 8px;">
      <h4 style="margin: 0 0 8px 0; color: #a04000;">Final State</h4>
      <p style="margin: 0; font-size: 14px; white-space: pre-wrap;">{{final_state}}</p>
    </div>
  </div>

annotation_schemes:
  - name: milestones_reached
    description: "Select all subtask milestones that the agent successfully achieved."
    annotation_type: multiselect
    labels:
      - "Milestone 1"
      - "Milestone 2"
      - "Milestone 3"
      - "Milestone 4"
      - "Milestone 5"

  - name: progress_score
    description: "Rate the overall progress of the agent toward completing the task."
    annotation_type: likert
    min_label: "1 - No Progress"
    max_label: "5 - Complete"
    size: 5

  - name: agent_category
    description: "Which environment category does this agent task belong to?"
    annotation_type: radio
    labels:
      - "Web Shopping"
      - "Web Browsing"
      - "Tool Use"
      - "Game"
      - "Embodied"
      - "Reasoning"
    keyboard_shortcuts:
      "Web Shopping": "1"
      "Web Browsing": "2"
      "Tool Use": "3"
      "Game": "4"
      "Embodied": "5"
      "Reasoning": "6"

  - name: progress_notes
    description: "Notes on partial progress, missed milestones, or interesting agent behavior."
    annotation_type: text

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
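
The `milestones_reached` and `progress_score` schemes above capture two views of the same quantity, so it can be useful to compare them after annotation. The following is a minimal Python sketch, not part of the showcase itself: it turns the selected milestone labels into a fraction of milestones reached and rescales the Likert rating to the same 0 to 1 range. The `annotation` dictionary is a hypothetical record; the actual layout of the files written to `annotation_output/` may differ and should be checked before reusing this.

```python
from typing import Iterable, List

# Labels copied from the `milestones_reached` scheme in config.yaml.
MILESTONE_LABELS: List[str] = [f"Milestone {i}" for i in range(1, 6)]


def progress_rate(selected: Iterable[str]) -> float:
    """Fraction of the configured milestones marked as reached (0.0 to 1.0)."""
    # A set ignores duplicate selections; unknown labels are dropped.
    reached = {label for label in selected if label in MILESTONE_LABELS}
    return len(reached) / len(MILESTONE_LABELS)


def likert_to_unit(score: int, size: int = 5) -> float:
    """Rescale a 1..size Likert rating to the 0..1 range."""
    if not 1 <= score <= size:
        raise ValueError(f"score must be between 1 and {size}, got {score}")
    return (score - 1) / (size - 1)


# Hypothetical annotation record (assumed shape, for illustration only).
annotation = {
    "milestones_reached": ["Milestone 1", "Milestone 2", "Milestone 3"],
    "progress_score": 3,
}

rate = progress_rate(annotation["milestones_reached"])
likert = likert_to_unit(annotation["progress_score"])
print(f"milestone rate={rate:.2f}, likert={likert:.2f}, gap={abs(rate - likert):.2f}")
```

A large gap between the two values for the same item may point to an annotator who ticked milestones without adjusting the overall rating, or to a task where the milestones do not carry equal weight.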

Sample Data: sample-data.json

[
  {
    "id": "ab-001",
    "text": "Purchase a red cotton t-shirt in size Medium from an online clothing store, applying any available discount code.",
    "environment": "Web Shopping",
    "milestones": "1. Navigate to clothing store\n2. Search for red cotton t-shirt\n3. Select size Medium\n4. Apply discount code\n5. Complete checkout",
    "trajectory": "Turn 1: Navigate to shop.example.com\nTurn 2: Search 'red cotton t-shirt'\nTurn 3: Filter by color: Red, material: Cotton\nTurn 4: Select first result 'Classic Cotton Tee - Red' ($24.99)\nTurn 5: Choose size: Medium\nTurn 6: Add to cart\nTurn 7: Go to checkout\nTurn 8: Look for discount code field\nTurn 9: Try code 'SAVE10' - Invalid\nTurn 10: Try code 'WELCOME' - 15% off applied\nTurn 11: Confirm order details\nTurn 12: Click 'Place Order'",
    "final_state": "Order confirmed. Red cotton t-shirt (M) purchased for $21.24 after 15% discount. Order #SH-90234."
  },
  {
    "id": "ab-002",
    "text": "Use the Wikipedia API to find the population of the 5 largest cities in Japan and calculate their total population.",
    "environment": "Tool Use",
    "milestones": "1. Identify the 5 largest cities in Japan\n2. Query population for each city\n3. Retrieve correct population figures\n4. Calculate total population\n5. Present results clearly",
    "trajectory": "Turn 1: Call wikipedia_search('largest cities in Japan')\nTurn 2: Parse results - identified Tokyo, Yokohama, Osaka, Nagoya, Sapporo\nTurn 3: Call wikipedia_page('Tokyo') - Population: 13,960,000\nTurn 4: Call wikipedia_page('Yokohama') - Population: 3,749,000\nTurn 5: Call wikipedia_page('Osaka') - Population: 2,753,000\nTurn 6: Call wikipedia_page('Nagoya') - Population: 2,296,000\nTurn 7: Call wikipedia_page('Sapporo') - Population: 1,973,000\nTurn 8: Calculate total: 24,731,000",
    "final_state": "Total population of 5 largest Japanese cities: 24,731,000. All data retrieved successfully from Wikipedia API."
  }
]

// ... and 6 more items
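
Each item has to supply every placeholder that `html_layout` references (`{{text}}`, `{{environment}}`, `{{milestones}}`, `{{trajectory}}`, `{{final_state}}`), and the `milestones` string should list five entries to line up with the five "Milestone N" labels. The sketch below shows one way to check this before starting annotation. The helper names `split_milestones` and `check_item` are made up for this example, and the `trajectory` value is shortened from item `ab-001` to keep the block compact.

```python
import re
from typing import Dict, List

# Keys used by item_properties and the html_layout placeholders in config.yaml.
REQUIRED_KEYS = {"id", "text", "environment", "milestones", "trajectory", "final_state"}
EXPECTED_MILESTONES = 5  # matches the five labels in `milestones_reached`


def split_milestones(raw: str) -> List[str]:
    """Turn '1. Foo\\n2. Bar' into ['Foo', 'Bar']."""
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    return [re.sub(r"^\d+\.\s*", "", line) for line in lines]


def check_item(item: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the item looks fine."""
    problems = []
    missing = REQUIRED_KEYS - item.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    count = len(split_milestones(item.get("milestones", "")))
    if count != EXPECTED_MILESTONES:
        problems.append(f"expected {EXPECTED_MILESTONES} milestones, found {count}")
    return problems


# Item ab-001 from sample-data.json, with the trajectory shortened for this example.
item = {
    "id": "ab-001",
    "text": "Purchase a red cotton t-shirt in size Medium from an online clothing "
            "store, applying any available discount code.",
    "environment": "Web Shopping",
    "milestones": "1. Navigate to clothing store\n2. Search for red cotton t-shirt\n"
                  "3. Select size Medium\n4. Apply discount code\n5. Complete checkout",
    "trajectory": "Turn 1: Navigate to shop.example.com\nTurn 2: Search 'red cotton t-shirt'",
    "final_state": "Order confirmed. Red cotton t-shirt (M) purchased for $21.24 "
                   "after 15% discount. Order #SH-90234.",
}

print(check_item(item) or "item looks consistent with config.yaml")
```

To check the full file, load it with `json.load` and call `check_item` on each element of the resulting list.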

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/agentboard-progress-scoring
potato start config.yaml

Details

Annotation Types

multiselect, likert, radio, text

Domain

Agentic AI, LLM Agents, Benchmarking

Use Cases

Agent Evaluation, Progress Assessment

Tags

multi-turn-agents, progress-scoring, milestones, llm-benchmarking
