Evaluating Tool Use and Function Calling
How to annotate and evaluate an agent's tool calls and function calling across trace formats (OpenAI, Anthropic, ReAct, LangChain) with Potato per-turn ratings.
When an agent calls tools, search, a calculator, an API, a database, each call is a decision you can evaluate: was this the right tool? were the arguments correct? was the result used properly? Tool-use evaluation turns those decisions into per-step labels over an agent's trace.
This is the human-judgment complement to automated function calling benchmarks: a call can be syntactically valid yet wrong for the task.
What to judge at each tool call
- Tool selection: was this the appropriate tool, or should it have used another (or none)?
- Arguments: were the parameters correct and complete?
- Necessity: was the call needed, or redundant?
- Result handling: did the agent correctly interpret and use the output?
Reading traces from any framework
Potato converts 13 trace formats into a common step view, so you can evaluate tool use regardless of how the agent was built: OpenAI and Anthropic tool/function calls, ReAct thought-action-observation traces, LangChain, LangFuse, and more. See Agentic Annotation.
Per-step rating setup
Attach a rating to each step (each tool call) with a conditional follow-up for the failures:
annotation_schemes:
- annotation_type: per_turn_rating
name: tool_call_correctness
description: "For each tool call, judge whether it was the right call."
target: agentic_steps
rating_type: radio
labels: ["Correct", "Wrong tool", "Wrong arguments", "Unnecessary"]
- annotation_type: text
name: notes
description: "If not correct, what should it have done?"
label_requirement:
required: falseQuality considerations
- Show the tool's output, not just the call, annotators can't judge result handling otherwise.
- Pretty-print JSON arguments and responses so they're readable (Potato does this in the agent trace display).
- Distinguish "wrong tool" from "right tool, wrong arguments", they point to different model fixes.