tau-bench Agent Evaluation
Evaluate tool-agent-user interactions in customer service domains by judging task success, rating conversation quality, flagging tool use issues, and writing an evaluation rationale.
Configuration File: config.yaml
This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.
# tau-bench Agent Evaluation
# Based on "tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains" (Yao et al., ICLR 2025)
# Task: Evaluate tool-agent-user interactions in customer service domains
annotation_task_name: "tau-bench Agent Evaluation"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="font-family: Arial, sans-serif; max-width: 1000px; margin: 0 auto;">
    <div style="display: flex; gap: 12px; margin-bottom: 14px;">
      <span style="background: #2c3e50; color: #fff; padding: 4px 12px; border-radius: 12px; font-size: 13px;">{{domain}}</span>
    </div>
    <div style="background: #e8f4fd; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h3 style="margin: 0 0 8px 0; color: #1a5276;">Task Specification</h3>
      <p style="margin: 0; font-size: 15px;">{{text}}</p>
    </div>
    <div style="background: #fafafa; border: 1px solid #ddd; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h3 style="margin: 0 0 10px 0; color: #2c3e50;">Conversation</h3>
      <div style="font-size: 14px; line-height: 1.8; white-space: pre-wrap;">{{conversation}}</div>
    </div>
    <div style="background: #f0f0f0; border: 1px solid #999; padding: 14px; border-radius: 8px;">
      <h4 style="margin: 0 0 8px 0; color: #333;">Final Outcome</h4>
      <p style="margin: 0; font-size: 14px;">{{final_outcome}}</p>
    </div>
  </div>
annotation_schemes:
  - name: task_success
    description: "Did the agent successfully complete the customer's request?"
    annotation_type: radio
    labels:
      - "Success"
      - "Partial Success"
      - "Failure"
    keyboard_shortcuts:
      "Success": "1"
      "Partial Success": "2"
      "Failure": "3"
  - name: conversation_quality
    description: "Rate the overall quality of the agent's conversation with the user."
    annotation_type: likert
    min_label: "Very Poor"
    max_label: "Excellent"
    size: 5
  - name: tool_use_issues
    description: "Select all tool use issues observed in the conversation."
    annotation_type: multiselect
    labels:
      - "Tool Call Error"
      - "Wrong Arguments"
      - "Missing Tool Call"
      - "Redundant Tool Call"
      - "Correct Tool Use"
  - name: eval_rationale
    description: "Explain your evaluation: what did the agent do well or poorly?"
    annotation_type: text
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
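The layout above pulls four fields from each data item through `{{...}}` placeholders, and `item_properties` names the `id` and `text` keys. A small pre-flight check can catch items that lack one of those fields before annotation starts. The following is a minimal sketch, not part of Potato itself: the helper names `template_fields` and `missing_fields` are hypothetical, and `HTML_LAYOUT` is an abbreviated stand-in for the real template that keeps only the placeholders.

```python
import re

# Abbreviated stand-in for the html_layout template (placeholders only).
HTML_LAYOUT = """
<span>{{domain}}</span>
<p>{{text}}</p>
<div>{{conversation}}</div>
<p>{{final_outcome}}</p>
"""


def template_fields(layout):
    """Return the set of {{placeholder}} names used in a layout string."""
    return set(re.findall(r"\{\{\s*(\w+)\s*\}\}", layout))


def missing_fields(item, layout, id_key="id"):
    """Return placeholder names (plus id_key) that the item does not define."""
    required = template_fields(layout) | {id_key}
    return required - set(item)


# Hypothetical item that lacks three of the fields the layout expects.
broken = {"id": "tau-x", "text": "..."}
print(sorted(missing_fields(broken, HTML_LAYOUT)))
# ['conversation', 'domain', 'final_outcome']
```

Running the same check over every entry loaded from sample-data.json (for example with `json.load`) should return an empty set for each well-formed item.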
Sample Data: sample-data.json
[
{
"id": "tau-001",
"text": "Customer wants to return a defective blender purchased 5 days ago and get a full refund to their original payment method.",
"domain": "Retail",
"conversation": "[User] Hi, I bought a blender last week and it stopped working after two uses. I'd like to return it.\n\n[Agent] I'm sorry to hear that! I'd be happy to help you with a return. Could you provide your order number?\n\n[User] Sure, it's ORD-2024-88431.\n\n[Tool Call] lookup_order(order_id=\"ORD-2024-88431\")\n[Tool Result] {\"order_id\": \"ORD-2024-88431\", \"product\": \"BlendMax 3000\", \"price\": 79.99, \"date\": \"2026-03-24\", \"status\": \"delivered\", \"payment\": \"visa_ending_4521\"}\n\n[Agent] I found your order for the BlendMax 3000 ($79.99), purchased on March 24th. Since it's within our 30-day return window, I can process a return for you. Would you like a refund to your Visa ending in 4521?\n\n[User] Yes, please. Full refund.\n\n[Tool Call] initiate_return(order_id=\"ORD-2024-88431\", reason=\"defective\", refund_method=\"original_payment\")\n[Tool Result] {\"return_id\": \"RET-55012\", \"status\": \"approved\", \"refund_amount\": 79.99, \"estimated_refund\": \"3-5 business days\"}\n\n[Agent] Your return has been approved! Return ID is RET-55012. You'll receive a $79.99 refund to your Visa ending in 4521 within 3-5 business days. You'll also receive a prepaid shipping label via email. Is there anything else I can help with?\n\n[User] No, that's all. Thanks!",
"final_outcome": "Return initiated successfully. Refund of $79.99 approved to original payment method. Prepaid label sent."
},
{
"id": "tau-002",
"text": "Customer wants to change their flight from NYC to LA from March 30 to April 2, keeping the same flight time and seat preference.",
"domain": "Airline",
"conversation": "[User] I need to change my flight. Booking reference is FLT-AX7829.\n\n[Agent] Of course! Let me pull up your booking.\n\n[Tool Call] get_booking(reference=\"FLT-AX7829\")\n[Tool Result] {\"reference\": \"FLT-AX7829\", \"route\": \"JFK-LAX\", \"date\": \"2026-03-30\", \"time\": \"08:15\", \"seat\": \"14A\", \"class\": \"economy\", \"passenger\": \"Jordan Rivera\", \"fare\": 289.00}\n\n[Agent] I see your booking, Jordan. You're currently on the JFK to LAX flight on March 30 at 8:15 AM, seat 14A in economy. What date would you like to change to?\n\n[User] April 2nd, same time if possible. And I'd like to keep my window seat.\n\n[Tool Call] search_flights(route=\"JFK-LAX\", date=\"2026-04-02\", time_preference=\"08:00-09:00\", class=\"economy\")\n[Tool Result] {\"flights\": [{\"flight\": \"AX-201\", \"time\": \"08:15\", \"seats_available\": [\"7A\", \"7F\", \"22A\"], \"fare\": 289.00}, {\"flight\": \"AX-205\", \"time\": \"08:45\", \"seats_available\": [\"3C\", \"15F\"], \"fare\": 315.00}]}\n\n[Agent] Great news! There's a flight on April 2nd at 8:15 AM (same time) with window seats available. I can put you in seat 7A. The fare is the same at $289, so no additional charge. Shall I proceed?\n\n[User] Yes, that works perfectly.\n\n[Tool Call] modify_booking(reference=\"FLT-AX7829\", new_flight=\"AX-201\", new_date=\"2026-04-02\", new_seat=\"7A\")\n[Tool Result] {\"status\": \"confirmed\", \"new_reference\": \"FLT-AX7829-M1\", \"change_fee\": 0.00}\n\n[Agent] All set! Your flight has been changed to April 2nd at 8:15 AM, seat 7A (window). No change fee since the fare is the same. Your updated confirmation is FLT-AX7829-M1. Anything else?\n\n[User] Nope, you're great. Thanks!",
"final_outcome": "Flight successfully changed to April 2nd. Same time, window seat, no additional charge."
}
]
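The `conversation` field in the items above is a single string in which each turn starts with a bracketed tag: `[User]`, `[Agent]`, `[Tool Call]`, or `[Tool Result]`. When reviewing tool use, it can help to split that string into turns and list the tools that were called. The sketch below assumes every turn begins with one of those four tags at the start of a line, as in the two samples shown; `parse_turns` and `tool_names` are hypothetical helpers, and the sample string is an abbreviated copy of item tau-002.

```python
import re
from collections import Counter

# Speaker tags as they appear in the sample conversations.
TAG = re.compile(r"^\[(User|Agent|Tool Call|Tool Result)\]\s*", re.MULTILINE)


def parse_turns(conversation):
    """Split a conversation string into (speaker, text) tuples."""
    matches = list(TAG.finditer(conversation))
    turns = []
    for i, m in enumerate(matches):
        # A turn runs from the end of its tag to the start of the next tag.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(conversation)
        turns.append((m.group(1), conversation[m.end():end].strip()))
    return turns


def tool_names(turns):
    """Return the function name of each '[Tool Call]' turn, in order."""
    return [text.split("(", 1)[0] for speaker, text in turns if speaker == "Tool Call"]


# Abbreviated excerpt of tau-002 (tool result shortened for illustration).
sample = (
    "[User] I need to change my flight. Booking reference is FLT-AX7829.\n\n"
    "[Agent] Of course! Let me pull up your booking.\n\n"
    "[Tool Call] get_booking(reference=\"FLT-AX7829\")\n"
    "[Tool Result] {\"reference\": \"FLT-AX7829\", \"route\": \"JFK-LAX\"}\n\n"
    "[Agent] I see your booking, Jordan."
)

turns = parse_turns(sample)
print(Counter(speaker for speaker, _ in turns))
print(tool_names(turns))  # ['get_booking']
```

Bracketed lists inside tool results, such as seat arrays, are not mistaken for tags because the pattern only matches the four named speakers at the start of a line.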
// ... and 6 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/tau-bench-agent-eval
potato start config.yaml
Dataset & paper
Yao et al., ICLR 2025
Citation (BibTeX)
@inproceedings{yao2025taubench,
  title={$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains},
  author={Yao, Shunyu and Shinn, Noah and Razavi, Pedram and Narasimhan, Karthik},
  booktitle={International Conference on Learning Representations},
  year={2025}
}