tau-bench Agent Evaluation
Evaluate tool-agent-user interactions in customer service domains: judge task success, rate conversation quality, flag tool-use issues, and explain the evaluation rationale.
Configuration File: config.yaml
# tau-bench Agent Evaluation
# Based on "tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains" (Yao et al., ICLR 2025)
# Task: Evaluate tool-agent-user interactions in customer service domains
annotation_task_name: "tau-bench Agent Evaluation"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="font-family: Arial, sans-serif; max-width: 1000px; margin: 0 auto;">
    <div style="display: flex; gap: 12px; margin-bottom: 14px;">
      <span style="background: #2c3e50; color: #fff; padding: 4px 12px; border-radius: 12px; font-size: 13px;">{{domain}}</span>
    </div>
    <div style="background: #e8f4fd; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h3 style="margin: 0 0 8px 0; color: #1a5276;">Task Specification</h3>
      <p style="margin: 0; font-size: 15px;">{{text}}</p>
    </div>
    <div style="background: #fafafa; border: 1px solid #ddd; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h3 style="margin: 0 0 10px 0; color: #2c3e50;">Conversation</h3>
      <div style="font-size: 14px; line-height: 1.8; white-space: pre-wrap;">{{conversation}}</div>
    </div>
    <div style="background: #f0f0f0; border: 1px solid #999; padding: 14px; border-radius: 8px;">
      <h4 style="margin: 0 0 8px 0; color: #333;">Final Outcome</h4>
      <p style="margin: 0; font-size: 14px;">{{final_outcome}}</p>
    </div>
  </div>
annotation_schemes:
  - name: task_success
    description: "Did the agent successfully complete the customer's request?"
    annotation_type: radio
    labels:
      - "Success"
      - "Partial Success"
      - "Failure"
    keyboard_shortcuts:
      "Success": "1"
      "Partial Success": "2"
      "Failure": "3"
  - name: conversation_quality
    description: "Rate the overall quality of the agent's conversation with the user."
    annotation_type: likert
    min_label: "Very Poor"
    max_label: "Excellent"
    size: 5
  - name: tool_use_issues
    description: "Select all tool use issues observed in the conversation."
    annotation_type: multiselect
    labels:
      - "Tool Call Error"
      - "Wrong Arguments"
      - "Missing Tool Call"
      - "Redundant Tool Call"
      - "Correct Tool Use"
  - name: eval_rationale
    description: "Explain your evaluation: what did the agent do well or poorly?"
    annotation_type: text
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
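Every `{{field}}` placeholder in `html_layout` must exist as a key in each data item, alongside the configured `id_key` and `text_key`. A minimal pre-flight check (a sketch, not part of Potato itself; the required-key list is read off the layout above) could look like:

```python
import re

# Fields referenced by the html_layout above, plus the configured id_key/text_key.
REQUIRED_KEYS = {"id", "text", "domain", "conversation", "final_outcome"}

def extract_placeholders(layout: str) -> set:
    """Collect {{field}} names from an HTML layout string."""
    return set(re.findall(r"\{\{(\w+)\}\}", layout))

def missing_keys(item: dict) -> set:
    """Keys the layout needs but the data item does not provide."""
    return REQUIRED_KEYS - item.keys()
```

Run this over every object in `sample-data.json` before starting the server; an empty result from `missing_keys` means the item will render with no blank panels.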
Sample Data: sample-data.json
[
{
"id": "tau-001",
"text": "Customer wants to return a defective blender purchased 5 days ago and get a full refund to their original payment method.",
"domain": "Retail",
"conversation": "[User] Hi, I bought a blender last week and it stopped working after two uses. I'd like to return it.\n\n[Agent] I'm sorry to hear that! I'd be happy to help you with a return. Could you provide your order number?\n\n[User] Sure, it's ORD-2024-88431.\n\n[Tool Call] lookup_order(order_id=\"ORD-2024-88431\")\n[Tool Result] {\"order_id\": \"ORD-2024-88431\", \"product\": \"BlendMax 3000\", \"price\": 79.99, \"date\": \"2026-03-24\", \"status\": \"delivered\", \"payment\": \"visa_ending_4521\"}\n\n[Agent] I found your order for the BlendMax 3000 ($79.99), purchased on March 24th. Since it's within our 30-day return window, I can process a return for you. Would you like a refund to your Visa ending in 4521?\n\n[User] Yes, please. Full refund.\n\n[Tool Call] initiate_return(order_id=\"ORD-2024-88431\", reason=\"defective\", refund_method=\"original_payment\")\n[Tool Result] {\"return_id\": \"RET-55012\", \"status\": \"approved\", \"refund_amount\": 79.99, \"estimated_refund\": \"3-5 business days\"}\n\n[Agent] Your return has been approved! Return ID is RET-55012. You'll receive a $79.99 refund to your Visa ending in 4521 within 3-5 business days. You'll also receive a prepaid shipping label via email. Is there anything else I can help with?\n\n[User] No, that's all. Thanks!",
"final_outcome": "Return initiated successfully. Refund of $79.99 approved to original payment method. Prepaid label sent."
},
{
"id": "tau-002",
"text": "Customer wants to change their flight from NYC to LA from March 30 to April 2, keeping the same flight time and seat preference.",
"domain": "Airline",
"conversation": "[User] I need to change my flight. Booking reference is FLT-AX7829.\n\n[Agent] Of course! Let me pull up your booking.\n\n[Tool Call] get_booking(reference=\"FLT-AX7829\")\n[Tool Result] {\"reference\": \"FLT-AX7829\", \"route\": \"JFK-LAX\", \"date\": \"2026-03-30\", \"time\": \"08:15\", \"seat\": \"14A\", \"class\": \"economy\", \"passenger\": \"Jordan Rivera\", \"fare\": 289.00}\n\n[Agent] I see your booking, Jordan. You're currently on the JFK to LAX flight on March 30 at 8:15 AM, seat 14A in economy. What date would you like to change to?\n\n[User] April 2nd, same time if possible. And I'd like to keep my window seat.\n\n[Tool Call] search_flights(route=\"JFK-LAX\", date=\"2026-04-02\", time_preference=\"08:00-09:00\", class=\"economy\")\n[Tool Result] {\"flights\": [{\"flight\": \"AX-201\", \"time\": \"08:15\", \"seats_available\": [\"7A\", \"7F\", \"22A\"], \"fare\": 289.00}, {\"flight\": \"AX-205\", \"time\": \"08:45\", \"seats_available\": [\"3C\", \"15F\"], \"fare\": 315.00}]}\n\n[Agent] Great news! There's a flight on April 2nd at 8:15 AM (same time) with window seats available. I can put you in seat 7A. The fare is the same at $289, so no additional charge. Shall I proceed?\n\n[User] Yes, that works perfectly.\n\n[Tool Call] modify_booking(reference=\"FLT-AX7829\", new_flight=\"AX-201\", new_date=\"2026-04-02\", new_seat=\"7A\")\n[Tool Result] {\"status\": \"confirmed\", \"new_reference\": \"FLT-AX7829-M1\", \"change_fee\": 0.00}\n\n[Agent] All set! Your flight has been changed to April 2nd at 8:15 AM, seat 7A (window). No change fee since the fare is the same. Your updated confirmation is FLT-AX7829-M1. Anything else?\n\n[User] Nope, you're great. Thanks!",
"final_outcome": "Flight successfully changed to April 2nd. Same time, window seat, no additional charge."
}
]
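The `conversation` strings follow a simple bracketed-turn convention (`[User]`, `[Agent]`, `[Tool Call]`, `[Tool Result]`). If you generate your own items, a small parser (a sketch; the role names are simply those used in the samples above) can split a transcript into turns for sanity-checking:

```python
import re

# Role markers used in the sample transcripts above.
MARKER = re.compile(r"\[(User|Agent|Tool Call|Tool Result)\]")

def parse_turns(conversation: str) -> list:
    """Split a bracketed transcript into (role, text) pairs."""
    matches = list(MARKER.finditer(conversation))
    turns = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(conversation)
        turns.append((m.group(1), conversation[m.end():end].strip()))
    return turns

def count_tool_calls(conversation: str) -> int:
    """Number of [Tool Call] turns in a transcript."""
    return sum(1 for role, _ in parse_turns(conversation) if role == "Tool Call")
```

This is handy for quick dataset statistics (e.g. tool calls per conversation) and for catching typos in role markers before annotators see them.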
// ... and 6 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/tau-bench-agent-eval
potato start config.yaml
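Because `annotation_per_instance: 2` assigns each item to two annotators, you can report chance-corrected agreement on the categorical schemes once annotation finishes. A minimal Cohen's kappa sketch (assuming you have already paired up the two annotators' `task_success` labels from `annotation_output/`):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' paired categorical labels."""
    assert len(labels_a) == len(labels_b) and labels_a, "need paired labels"
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

For the Likert `conversation_quality` scheme, a weighted kappa or Krippendorff's alpha is a better fit, since it credits near-misses between adjacent ratings.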
Found an issue or want to improve this design?
Open an Issue

Related Designs
AgentBoard Progress Scoring
Assess multi-turn LLM agent progress by identifying achieved milestones, scoring overall progress, categorizing the agent environment, and noting partial progress observations.
OSWorld: Desktop Agent Task Evaluation
Evaluation of multimodal agents performing open-ended tasks in real desktop environments. Annotators assess task success, identify OS-level actions, rate efficiency, and analyze failures across Ubuntu, Windows, and macOS environments.
RefactorBench Multi-File Evaluation
Evaluate multi-file refactoring operations generated by coding agents. Annotators assess whether refactorings preserve behavior, identify the types of refactoring applied, rate code improvement, and provide detailed review comments.