advanced, survey

tau-bench Agent Evaluation

Evaluate tool-agent-user interactions in customer service domains by judging task success, conversation quality, and tool-use correctness, and by writing a rationale for each evaluation.

Configuration File: config.yaml

# tau-bench Agent Evaluation
# Based on "tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains" (Yao et al., ICLR 2025)
# Task: Evaluate tool-agent-user interactions in customer service domains

annotation_task_name: "tau-bench Agent Evaluation"
task_dir: "."

data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

html_layout: |
  <div class="container" style="font-family: Arial, sans-serif; max-width: 1000px; margin: 0 auto;">
    <div style="display: flex; gap: 12px; margin-bottom: 14px;">
      <span style="background: #2c3e50; color: #fff; padding: 4px 12px; border-radius: 12px; font-size: 13px;">{{domain}}</span>
    </div>
    <div style="background: #e8f4fd; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h3 style="margin: 0 0 8px 0; color: #1a5276;">Task Specification</h3>
      <p style="margin: 0; font-size: 15px;">{{text}}</p>
    </div>
    <div style="background: #fafafa; border: 1px solid #ddd; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h3 style="margin: 0 0 10px 0; color: #2c3e50;">Conversation</h3>
      <div style="font-size: 14px; line-height: 1.8; white-space: pre-wrap;">{{conversation}}</div>
    </div>
    <div style="background: #f0f0f0; border: 1px solid #999; padding: 14px; border-radius: 8px;">
      <h4 style="margin: 0 0 8px 0; color: #333;">Final Outcome</h4>
      <p style="margin: 0; font-size: 14px;">{{final_outcome}}</p>
    </div>
  </div>

annotation_schemes:
  - name: task_success
    description: "Did the agent successfully complete the customer's request?"
    annotation_type: radio
    labels:
      - "Success"
      - "Partial Success"
      - "Failure"
    keyboard_shortcuts:
      "Success": "1"
      "Partial Success": "2"
      "Failure": "3"

  - name: conversation_quality
    description: "Rate the overall quality of the agent's conversation with the user."
    annotation_type: likert
    min_label: "Very Poor"
    max_label: "Excellent"
    size: 5

  - name: tool_use_issues
    description: "Select every tool use issue observed in the conversation, or 'Correct Tool Use' if there are none."
    annotation_type: multiselect
    labels:
      - "Tool Call Error"
      - "Wrong Arguments"
      - "Missing Tool Call"
      - "Redundant Tool Call"
      - "Correct Tool Use"

  - name: eval_rationale
    description: "Explain your evaluation: what did the agent do well or poorly?"
    annotation_type: text

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
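
The `html_layout` above references `{{domain}}`, `{{text}}`, `{{conversation}}`, and `{{final_outcome}}`, so every item in the data file needs those fields in addition to the `id` key. The sketch below shows one way to check an item before starting the server. It assumes the placeholders are plain `{{name}}` tokens, and `missing_fields` is a hypothetical helper written for this example, not part of Potato.

```python
import re

# Matches plain {{name}} placeholders; assumes no filters or expressions inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def missing_fields(layout, item, id_key="id"):
    """Return the sorted field names that the layout (or id_key) needs but the item lacks."""
    required = set(PLACEHOLDER.findall(layout)) | {id_key}
    return sorted(required - set(item))

# Abbreviated stand-in for the html_layout value in config.yaml.
layout = (
    "<span>{{domain}}</span><p>{{text}}</p>"
    "<div>{{conversation}}</div><p>{{final_outcome}}</p>"
)

# Field values are elided here; only the keys matter for the check.
complete = {"id": "tau-001", "text": "...", "domain": "Retail",
            "conversation": "...", "final_outcome": "..."}
# Hypothetical malformed item, for illustration only.
incomplete = {"id": "example-item", "text": "..."}

print(missing_fields(layout, complete))    # []
print(missing_fields(layout, incomplete))  # ['conversation', 'domain', 'final_outcome']
```

Running a check like this over every item in sample-data.json would catch an item whose missing field might otherwise leave a panel of the layout empty.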

Sample Data: sample-data.json

[
  {
    "id": "tau-001",
    "text": "Customer wants to return a defective blender purchased 5 days ago and get a full refund to their original payment method.",
    "domain": "Retail",
    "conversation": "[User] Hi, I bought a blender last week and it stopped working after two uses. I'd like to return it.\n\n[Agent] I'm sorry to hear that! I'd be happy to help you with a return. Could you provide your order number?\n\n[User] Sure, it's ORD-2024-88431.\n\n[Tool Call] lookup_order(order_id=\"ORD-2024-88431\")\n[Tool Result] {\"order_id\": \"ORD-2024-88431\", \"product\": \"BlendMax 3000\", \"price\": 79.99, \"date\": \"2026-03-24\", \"status\": \"delivered\", \"payment\": \"visa_ending_4521\"}\n\n[Agent] I found your order for the BlendMax 3000 ($79.99), purchased on March 24th. Since it's within our 30-day return window, I can process a return for you. Would you like a refund to your Visa ending in 4521?\n\n[User] Yes, please. Full refund.\n\n[Tool Call] initiate_return(order_id=\"ORD-2024-88431\", reason=\"defective\", refund_method=\"original_payment\")\n[Tool Result] {\"return_id\": \"RET-55012\", \"status\": \"approved\", \"refund_amount\": 79.99, \"estimated_refund\": \"3-5 business days\"}\n\n[Agent] Your return has been approved! Return ID is RET-55012. You'll receive a $79.99 refund to your Visa ending in 4521 within 3-5 business days. You'll also receive a prepaid shipping label via email. Is there anything else I can help with?\n\n[User] No, that's all. Thanks!",
    "final_outcome": "Return initiated successfully. Refund of $79.99 approved to original payment method. Prepaid label sent."
  },
  {
    "id": "tau-002",
    "text": "Customer wants to change their flight from NYC to LA from March 30 to April 2, keeping the same flight time and seat preference.",
    "domain": "Airline",
    "conversation": "[User] I need to change my flight. Booking reference is FLT-AX7829.\n\n[Agent] Of course! Let me pull up your booking.\n\n[Tool Call] get_booking(reference=\"FLT-AX7829\")\n[Tool Result] {\"reference\": \"FLT-AX7829\", \"route\": \"JFK-LAX\", \"date\": \"2026-03-30\", \"time\": \"08:15\", \"seat\": \"14A\", \"class\": \"economy\", \"passenger\": \"Jordan Rivera\", \"fare\": 289.00}\n\n[Agent] I see your booking, Jordan. You're currently on the JFK to LAX flight on March 30 at 8:15 AM, seat 14A in economy. What date would you like to change to?\n\n[User] April 2nd, same time if possible. And I'd like to keep my window seat.\n\n[Tool Call] search_flights(route=\"JFK-LAX\", date=\"2026-04-02\", time_preference=\"08:00-09:00\", class=\"economy\")\n[Tool Result] {\"flights\": [{\"flight\": \"AX-201\", \"time\": \"08:15\", \"seats_available\": [\"7A\", \"7F\", \"22A\"], \"fare\": 289.00}, {\"flight\": \"AX-205\", \"time\": \"08:45\", \"seats_available\": [\"3C\", \"15F\"], \"fare\": 315.00}]}\n\n[Agent] Great news! There's a flight on April 2nd at 8:15 AM (same time) with window seats available. I can put you in seat 7A. The fare is the same at $289, so no additional charge. Shall I proceed?\n\n[User] Yes, that works perfectly.\n\n[Tool Call] modify_booking(reference=\"FLT-AX7829\", new_flight=\"AX-201\", new_date=\"2026-04-02\", new_seat=\"7A\")\n[Tool Result] {\"status\": \"confirmed\", \"new_reference\": \"FLT-AX7829-M1\", \"change_fee\": 0.00}\n\n[Agent] All set! Your flight has been changed to April 2nd at 8:15 AM, seat 7A (window). No change fee since the fare is the same. Your updated confirmation is FLT-AX7829-M1. Anything else?\n\n[User] Nope, you're great. Thanks!",
    "final_outcome": "Flight successfully changed to April 2nd. Same time, window seat, no additional charge."
  }
]

// ... and 6 more items
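
Each `conversation` value is a single string in which every turn starts with a `[User]`, `[Agent]`, `[Tool Call]`, or `[Tool Result]` tag. When analyzing annotations later, it can help to split that string into turns, for example to count tool calls per item. The following is a minimal sketch based only on the format visible in the two items above. `parse_turns` is a hypothetical helper, and the excerpt is an abbreviated version of tau-001.

```python
import re
from collections import Counter

# Speaker tags as they appear in the sample data's "conversation" strings.
TURN = re.compile(r"^\[(User|Agent|Tool Call|Tool Result)\]\s*(.*)$")

def parse_turns(conversation):
    """Split a conversation string into (speaker, content) pairs."""
    turns = []
    for line in conversation.split("\n"):
        match = TURN.match(line)
        if match:
            turns.append([match.group(1), match.group(2)])
        elif line.strip() and turns:
            # Untagged non-blank line: treat it as a continuation of the previous turn.
            turns[-1][1] += "\n" + line
    return [tuple(turn) for turn in turns]

# Abbreviated excerpt of tau-001, written as the decoded string (as after json.load).
conversation = (
    "[User] Sure, it's ORD-2024-88431.\n\n"
    '[Tool Call] lookup_order(order_id="ORD-2024-88431")\n'
    '[Tool Result] {"order_id": "ORD-2024-88431", "status": "delivered"}\n\n'
    "[Agent] I found your order for the BlendMax 3000 ($79.99)."
)

turns = parse_turns(conversation)
counts = Counter(speaker for speaker, _ in turns)
print(len(turns))           # 4
print(counts["Tool Call"])  # 1
```

The parser keys on a tag at the start of a line, so a message that itself begins a line with one of these bracketed tags would be split incorrectly. That case does not occur in the two items shown.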

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/tau-bench-agent-eval
potato start config.yaml

Details

Annotation Types

radio, likert, multiselect, text

Domain

Agentic AI, Customer Service, Tool Use

Use Cases

Agent Evaluation, Conversation Quality

Tags

tool-use, customer-service, user-interaction, retail, airline
