tau-bench Agent Evaluation
Evaluate tool-agent-user interactions in customer service domains: judge task success, rate conversation quality, flag tool-use issues, and explain the evaluation rationale.
Configuration File: config.yaml
# tau-bench Agent Evaluation
# Based on "tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains" (Yao et al., ICLR 2025)
# Task: Evaluate tool-agent-user interactions in customer service domains
annotation_task_name: "tau-bench Agent Evaluation"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="font-family: Arial, sans-serif; max-width: 1000px; margin: 0 auto;">
    <div style="display: flex; gap: 12px; margin-bottom: 14px;">
      <span style="background: #2c3e50; color: #fff; padding: 4px 12px; border-radius: 12px; font-size: 13px;">{{domain}}</span>
    </div>
    <div style="background: #e8f4fd; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h3 style="margin: 0 0 8px 0; color: #1a5276;">Task Specification</h3>
      <p style="margin: 0; font-size: 15px;">{{text}}</p>
    </div>
    <div style="background: #fafafa; border: 1px solid #ddd; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h3 style="margin: 0 0 10px 0; color: #2c3e50;">Conversation</h3>
      <div style="font-size: 14px; line-height: 1.8; white-space: pre-wrap;">{{conversation}}</div>
    </div>
    <div style="background: #f0f0f0; border: 1px solid #999; padding: 14px; border-radius: 8px;">
      <h4 style="margin: 0 0 8px 0; color: #333;">Final Outcome</h4>
      <p style="margin: 0; font-size: 14px;">{{final_outcome}}</p>
    </div>
  </div>
annotation_schemes:
  - name: task_success
    description: "Did the agent successfully complete the customer's request?"
    annotation_type: radio
    labels:
      - "Success"
      - "Partial Success"
      - "Failure"
    keyboard_shortcuts:
      "Success": "1"
      "Partial Success": "2"
      "Failure": "3"
  - name: conversation_quality
    description: "Rate the overall quality of the agent's conversation with the user."
    annotation_type: likert
    min_label: "Very Poor"
    max_label: "Excellent"
    size: 5
  - name: tool_use_issues
    description: "Select all tool use issues observed in the conversation."
    annotation_type: multiselect
    labels:
      - "Tool Call Error"
      - "Wrong Arguments"
      - "Missing Tool Call"
      - "Redundant Tool Call"
      - "Correct Tool Use"
  - name: eval_rationale
    description: "Explain your evaluation: what did the agent do well or poorly?"
    annotation_type: text
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
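Every `{{field}}` placeholder in `html_layout` must exist as a key in each data item, alongside the configured `id_key` and `text_key`. A minimal pre-flight check (a sketch, not part of Potato itself; the required-key list is read off the layout above) could look like:

```python
import re

# Fields referenced by the html_layout above, plus the configured id_key/text_key.
REQUIRED_KEYS = {"id", "text", "domain", "conversation", "final_outcome"}

def extract_placeholders(layout: str) -> set:
    """Collect {{field}} names from an HTML layout string."""
    return set(re.findall(r"\{\{(\w+)\}\}", layout))

def missing_keys(item: dict) -> set:
    """Keys the layout needs but the data item does not provide."""
    return REQUIRED_KEYS - item.keys()
```

Run this over every object in `sample-data.json` before starting the server; an empty result from `missing_keys` means the item will render with no blank panels.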
Sample Data: sample-data.json
[
{
"id": "tau-001",
"text": "Customer wants to return a defective blender purchased 5 days ago and get a full refund to their original payment method.",
"domain": "Retail",
"conversation": "[User] Hi, I bought a blender last week and it stopped working after two uses. I'd like to return it.\n\n[Agent] I'm sorry to hear that! I'd be happy to help you with a return. Could you provide your order number?\n\n[User] Sure, it's ORD-2024-88431.\n\n[Tool Call] lookup_order(order_id=\"ORD-2024-88431\")\n[Tool Result] {\"order_id\": \"ORD-2024-88431\", \"product\": \"BlendMax 3000\", \"price\": 79.99, \"date\": \"2026-03-24\", \"status\": \"delivered\", \"payment\": \"visa_ending_4521\"}\n\n[Agent] I found your order for the BlendMax 3000 ($79.99), purchased on March 24th. Since it's within our 30-day return window, I can process a return for you. Would you like a refund to your Visa ending in 4521?\n\n[User] Yes, please. Full refund.\n\n[Tool Call] initiate_return(order_id=\"ORD-2024-88431\", reason=\"defective\", refund_method=\"original_payment\")\n[Tool Result] {\"return_id\": \"RET-55012\", \"status\": \"approved\", \"refund_amount\": 79.99, \"estimated_refund\": \"3-5 business days\"}\n\n[Agent] Your return has been approved! Return ID is RET-55012. You'll receive a $79.99 refund to your Visa ending in 4521 within 3-5 business days. You'll also receive a prepaid shipping label via email. Is there anything else I can help with?\n\n[User] No, that's all. Thanks!",
"final_outcome": "Return initiated successfully. Refund of $79.99 approved to original payment method. Prepaid label sent."
},
{
"id": "tau-002",
"text": "Customer wants to change their flight from NYC to LA from March 30 to April 2, keeping the same flight time and seat preference.",
"domain": "Airline",
"conversation": "[User] I need to change my flight. Booking reference is FLT-AX7829.\n\n[Agent] Of course! Let me pull up your booking.\n\n[Tool Call] get_booking(reference=\"FLT-AX7829\")\n[Tool Result] {\"reference\": \"FLT-AX7829\", \"route\": \"JFK-LAX\", \"date\": \"2026-03-30\", \"time\": \"08:15\", \"seat\": \"14A\", \"class\": \"economy\", \"passenger\": \"Jordan Rivera\", \"fare\": 289.00}\n\n[Agent] I see your booking, Jordan. You're currently on the JFK to LAX flight on March 30 at 8:15 AM, seat 14A in economy. What date would you like to change to?\n\n[User] April 2nd, same time if possible. And I'd like to keep my window seat.\n\n[Tool Call] search_flights(route=\"JFK-LAX\", date=\"2026-04-02\", time_preference=\"08:00-09:00\", class=\"economy\")\n[Tool Result] {\"flights\": [{\"flight\": \"AX-201\", \"time\": \"08:15\", \"seats_available\": [\"7A\", \"7F\", \"22A\"], \"fare\": 289.00}, {\"flight\": \"AX-205\", \"time\": \"08:45\", \"seats_available\": [\"3C\", \"15F\"], \"fare\": 315.00}]}\n\n[Agent] Great news! There's a flight on April 2nd at 8:15 AM (same time) with window seats available. I can put you in seat 7A. The fare is the same at $289, so no additional charge. Shall I proceed?\n\n[User] Yes, that works perfectly.\n\n[Tool Call] modify_booking(reference=\"FLT-AX7829\", new_flight=\"AX-201\", new_date=\"2026-04-02\", new_seat=\"7A\")\n[Tool Result] {\"status\": \"confirmed\", \"new_reference\": \"FLT-AX7829-M1\", \"change_fee\": 0.00}\n\n[Agent] All set! Your flight has been changed to April 2nd at 8:15 AM, seat 7A (window). No change fee since the fare is the same. Your updated confirmation is FLT-AX7829-M1. Anything else?\n\n[User] Nope, you're great. Thanks!",
"final_outcome": "Flight successfully changed to April 2nd. Same time, window seat, no additional charge."
}
]
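The `conversation` strings follow a simple bracketed-turn convention (`[User]`, `[Agent]`, `[Tool Call]`, `[Tool Result]`). If you generate your own items, a small parser (a sketch; the role names are simply those used in the samples above) can split a transcript into turns for sanity-checking:

```python
import re

# Role markers used in the sample transcripts above.
MARKER = re.compile(r"\[(User|Agent|Tool Call|Tool Result)\]")

def parse_turns(conversation: str) -> list:
    """Split a bracketed transcript into (role, text) pairs."""
    matches = list(MARKER.finditer(conversation))
    turns = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(conversation)
        turns.append((m.group(1), conversation[m.end():end].strip()))
    return turns

def count_tool_calls(conversation: str) -> int:
    """Number of [Tool Call] turns in a transcript."""
    return sum(1 for role, _ in parse_turns(conversation) if role == "Tool Call")
```

This is handy for quick dataset statistics (e.g. tool calls per conversation) and for catching typos in role markers before annotators see them.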
// ... and 6 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/tau-bench-agent-eval
potato start config.yaml
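Because `annotation_per_instance: 2` assigns each item to two annotators, you can report chance-corrected agreement on the categorical schemes once annotation finishes. A minimal Cohen's kappa sketch (assuming you have already paired up the two annotators' `task_success` labels from `annotation_output/`):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' paired categorical labels."""
    assert len(labels_a) == len(labels_b) and labels_a, "need paired labels"
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

For the Likert `conversation_quality` scheme, a weighted kappa or Krippendorff's alpha is a better fit, since it credits near-misses between adjacent ratings.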
Found an issue or want to improve this design?
Open an Issue

Related Designs
AgentBoard Progress Scoring
Assess multi-turn LLM agent progress by identifying achieved milestones, scoring overall progress, categorizing the agent environment, and noting partial progress observations.
OSWorld: Desktop Agent Task Evaluation
Evaluation of multimodal agents performing open-ended tasks in real desktop environments. Annotators assess task success, identify OS-level actions, rate efficiency, and analyze failures across Ubuntu, Windows, and macOS environments.
RefactorBench Multi-File Evaluation
Evaluate multi-file refactoring operations generated by coding agents. Annotators assess whether refactorings preserve behavior, identify the types of refactoring applied, rate code improvement, and provide detailed review comments.