# Coding-Agent Evaluation

Source: https://www.potatoannotator.com/docs/guides/coding-agent-evaluation

**A coding agent edits files, runs commands, and reads output to solve a programming task. Evaluating one is like reviewing a pull request that also includes the terminal session: you judge the code changes and the steps that produced them.** Potato renders unified diffs and terminal blocks so annotators can review a coding run the way they'd review a PR.

Human review complements automated benchmarks such as [SWE-bench](https://www.swebench.com/): it catches the plausible-but-wrong patch that passes a weak test suite.
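To make the "plausible-but-wrong patch" concrete, here is a minimal Python sketch. Everything in it is a hypothetical illustration (the leap-year task, the function names, and both test helpers); none of it comes from SWE-bench or Potato. The patched function passes a two-case test but ignores the century rules, which is the kind of gap a human reviewer reading the diff is likely to notice.

```python
def is_leap_year_patched(year: int) -> bool:
    # Hypothetical agent patch: looks reasonable, but ignores century rules.
    return year % 4 == 0


def is_leap_year_correct(year: int) -> bool:
    # Gregorian rule: divisible by 4, except centuries not divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def weak_test(fn) -> bool:
    # A weak test: only two easy cases, no century years.
    return fn(2024) is True and fn(2023) is False


def stronger_test(fn) -> bool:
    # Adds the century cases that expose the wrong patch.
    return weak_test(fn) and fn(1900) is False and fn(2000) is True


if __name__ == "__main__":
    print("weak test, patched:    ", weak_test(is_leap_year_patched))
    print("stronger test, patched:", stronger_test(is_leap_year_patched))
    print("stronger test, correct:", stronger_test(is_leap_year_correct))
```

An automated benchmark that only runs `weak_test` would mark the patched version as solved; an annotator rating the same run would more plausibly label it "Partially solved".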

## What the annotator reviews

- **Diffs**: color-coded unified diffs of each file change, with line numbers and a file-tree sidebar.
- **Commands and output**: terminal blocks showing what the agent ran and what came back.
- **Reasoning**: the agent's thoughts between actions.

Potato reads coding-agent traces in several formats, including SWE-bench, [Aider](https://aider.chat/), and Claude Code. For details, see [Coding Agent Annotation](/docs/features/coding-agent-annotation) and [Code Review Annotation](/docs/features/code-review-annotation).
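For readers unfamiliar with the unified diff format that annotators see, the sketch below produces one with Python's standard-library `difflib`. It only illustrates what a unified diff looks like; it is not Potato's trace format or rendering code, and the helper name `make_unified_diff` and the file `calc.py` are made up for the example.

```python
import difflib


def make_unified_diff(before: str, after: str, path: str) -> str:
    """Return a unified diff between two versions of one file's text."""
    diff_lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff_lines)


if __name__ == "__main__":
    # Hypothetical file contents before and after an agent's edit.
    before = "def add(a, b):\n    return a - b\n"
    after = "def add(a, b):\n    return a + b\n"
    print(make_unified_diff(before, after, "calc.py"))
```

Removed lines start with `-` and added lines with `+`, which is what the color coding in the diff view reflects.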

## What to judge

- **Correctness**: does the change solve the task without breaking other things?
- **Step quality**: was each edit or command sensible, or was the agent flailing?
- **Efficiency**: did the agent reach its result by a reasonable path?

```yaml
annotation_schemes:
  - annotation_type: per_turn_rating
    name: step_correctness
    description: "Judge each edit or command."
    target: agentic_steps
    rating_type: radio
    labels: ["Correct", "Partially correct", "Incorrect", "Unnecessary"]
  - annotation_type: radio
    name: overall
    description: "Does the final change solve the task?"
    labels: ["Solved", "Partially solved", "Not solved"]
```
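Once annotators have rated each step, the labels can be summarized per run. The sketch below is one possible way to do that in Python, assuming the per-step ratings have already been loaded as a plain list of label strings in step order. That input shape and the function name `summarize_steps` are assumptions for illustration, not Potato's export format or API.

```python
from collections import Counter

# Labels copied from the step_correctness scheme above.
STEP_LABELS = ("Correct", "Partially correct", "Incorrect", "Unnecessary")


def summarize_steps(step_labels):
    """Summarize one run's per-step ratings (assumed: list of label strings)."""
    unknown = [label for label in step_labels if label not in STEP_LABELS]
    if unknown:
        raise ValueError(f"unknown labels: {unknown}")

    counts = Counter(step_labels)
    # Index of the first step rated "Incorrect", or None if there is none.
    first_incorrect = next(
        (i for i, label in enumerate(step_labels) if label == "Incorrect"),
        None,
    )
    total = len(step_labels)
    return {
        "counts": {label: counts.get(label, 0) for label in STEP_LABELS},
        "first_incorrect_step": first_incorrect,
        "unnecessary_fraction": counts["Unnecessary"] / total if total else 0.0,
    }


if __name__ == "__main__":
    run = ["Correct", "Unnecessary", "Incorrect", "Correct"]
    print(summarize_steps(run))
```

The fraction of "Unnecessary" steps gives a rough efficiency signal, and the first "Incorrect" index shows where a run started to go wrong.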

## Quality considerations

- Give annotators the task description and the repository context; a diff is meaningless without the goal.
- To label the first error in a reasoning chain, see [Process Reward Models](/docs/guides/process-reward-models).
- To watch an agent code in real time rather than review a recording, see [Live Agent Evaluation](/docs/guides/live-agent-evaluation).
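The first point above can be checked before a batch goes to annotators. The following Python sketch flags items that lack a task description or repository context. The field names `task_description` and `repo_context`, the `id` key, and the list-of-dicts input are hypothetical placeholders; adapt them to whatever your data files actually contain.

```python
# Hypothetical field names; not taken from Potato's data format.
REQUIRED_FIELDS = ("task_description", "repo_context")


def find_incomplete_items(items):
    """Return (item id, missing fields) for items lacking required context."""
    problems = []
    for index, item in enumerate(items):
        missing = [
            field
            for field in REQUIRED_FIELDS
            if not str(item.get(field) or "").strip()
        ]
        if missing:
            # Fall back to the list position when an item has no id.
            problems.append((item.get("id", index), missing))
    return problems


if __name__ == "__main__":
    batch = [
        {"id": "run-1", "task_description": "Fix add()", "repo_context": "calc.py"},
        {"id": "run-2", "task_description": "   ", "repo_context": "utils/"},
        {"id": "run-3", "task_description": "Rename helper"},
    ]
    for item_id, missing in find_incomplete_items(batch):
        print(item_id, "is missing", missing)
```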

## Further reading

- [Coding Agent Annotation feature reference](/docs/features/coding-agent-annotation)
- [Code Review Annotation](/docs/features/code-review-annotation)
- [How to Evaluate AI Agents](/docs/guides/evaluating-ai-agents)
