# Evaluating Tool Use and Function Calling

Source: https://www.potatoannotator.com/docs/guides/tool-use-evaluation

**When an agent calls tools, search, a calculator, an API, a database, each call is a decision you can evaluate: was this the right tool? were the arguments correct? was the result used properly?** Tool-use evaluation turns those decisions into per-step labels over an agent's trace.

This is the human-judgment complement to automated [function calling](https://en.wikipedia.org/wiki/Function_(computer_programming)) benchmarks: a call can be syntactically valid yet wrong for the task.

## What to judge at each tool call

- **Tool selection**: was this the appropriate tool, or should it have used another (or none)?
- **Arguments**: were the parameters correct and complete?
- **Necessity**: was the call needed, or redundant?
- **Result handling**: did the agent correctly interpret and use the output?

## Reading traces from any framework

Potato converts 13 trace formats into a common step view, so you can evaluate tool use regardless of how the agent was built: OpenAI and Anthropic tool/function calls, [ReAct](https://arxiv.org/abs/2210.03629) thought-action-observation traces, LangChain, LangFuse, and more. See [Agentic Annotation](/docs/features/agentic-annotation).

## Per-step rating setup

Attach a rating to each step (each tool call) with a conditional follow-up for the failures:

```yaml
annotation_schemes:
  - annotation_type: per_turn_rating
    name: tool_call_correctness
    description: "For each tool call, judge whether it was the right call."
    target: agentic_steps
    rating_type: radio
    labels: ["Correct", "Wrong tool", "Wrong arguments", "Unnecessary"]
  - annotation_type: text
    name: notes
    description: "If not correct, what should it have done?"
    label_requirement:
      required: false
```

## Quality considerations

- Show the tool's *output*, not just the call, annotators can't judge result handling otherwise.
- Pretty-print JSON arguments and responses so they're readable (Potato does this in the agent trace display).
- Distinguish "wrong tool" from "right tool, wrong arguments", they point to different model fixes.

## Further reading

- [How to Evaluate AI Agents](/docs/guides/evaluating-ai-agents)
- [Annotating Agent Trajectories](/docs/guides/agent-trajectory-annotation)
- [Agentic Annotation feature reference](/docs/features/agentic-annotation)