Questa pagina non è ancora disponibile nella tua lingua. Viene mostrata la versione in inglese.

Evaluating Tool Use and Function Calling

How to annotate and evaluate an agent's tool calls and function calling across trace formats (OpenAI, Anthropic, ReAct, LangChain) with Potato per-turn ratings.

When an agent calls tools, search, a calculator, an API, a database, each call is a decision you can evaluate: was this the right tool? were the arguments correct? was the result used properly? Tool-use evaluation turns those decisions into per-step labels over an agent's trace.

This is the human-judgment complement to automated function calling benchmarks: a call can be syntactically valid yet wrong for the task.

What to judge at each tool call

Tool selection: was this the appropriate tool, or should it have used another (or none)?
Arguments: were the parameters correct and complete?
Necessity: was the call needed, or redundant?
Result handling: did the agent correctly interpret and use the output?

Reading traces from any framework

Potato converts 13 trace formats into a common step view, so you can evaluate tool use regardless of how the agent was built: OpenAI and Anthropic tool/function calls, ReAct thought-action-observation traces, LangChain, LangFuse, and more. See Agentic Annotation.

Per-step rating setup

Attach a rating to each step (each tool call) with a conditional follow-up for the failures:

yaml

annotation_schemes:
  - annotation_type: per_turn_rating
    name: tool_call_correctness
    description: "For each tool call, judge whether it was the right call."
    target: agentic_steps
    rating_type: radio
    labels: ["Correct", "Wrong tool", "Wrong arguments", "Unnecessary"]
  - annotation_type: text
    name: notes
    description: "If not correct, what should it have done?"
    label_requirement:
      required: false

Quality considerations

Show the tool's output, not just the call, annotators can't judge result handling otherwise.
Pretty-print JSON arguments and responses so they're readable (Potato does this in the agent trace display).
Distinguish "wrong tool" from "right tool, wrong arguments", they point to different model fixes.

Evaluating Tool Use and Function Calling

What to judge at each tool call

Reading traces from any framework

Per-step rating setup

Quality considerations

Further reading