Coding-Agent Evaluation
How to evaluate coding agents by reviewing diffs, terminal output, and SWE-bench, Aider, or Claude Code traces in Potato's coding trace display.
A coding agent edits files, runs commands, and reads output to solve a programming task. Evaluating one is like reviewing a pull request that also includes the terminal session: you judge the code changes and the steps that produced them. Potato renders unified diffs and terminal blocks so annotators can review a coding run the way they'd review a PR.
Human review complements automated benchmarks such as SWE-bench: it catches the plausible-but-wrong patch that passes a weak test.
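To make the "plausible-but-wrong patch" concrete, here is a minimal sketch in Python. The task, both functions, and both tests are hypothetical illustrations, not taken from SWE-bench or from any real agent run. It shows a patch that looks reasonable and passes a weak test while failing the case the task was actually about.

```python
# Hypothetical task: "median() should average the two middle values
# for even-length input."

def median_patched(values):
    """A plausible-looking agent patch (hypothetical): sort, pick one middle element."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]  # wrong for even-length input


def median_reference(values):
    """What the hypothetical task actually asks for."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def weak_test(median):
    """A weak test: it only exercises odd-length input."""
    return median([3, 1, 2]) == 2


def stronger_test(median):
    """Adds the even-length case the task is about."""
    return weak_test(median) and median([4, 1, 3, 2]) == 2.5


print(weak_test(median_patched))        # True: the benchmark would accept it
print(stronger_test(median_patched))    # False: a reviewer reading the diff would object
print(stronger_test(median_reference))  # True
```

An annotator who sees the diff next to the task description can spot the missing even-length branch even though the automated check is green.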
What the annotator reviews
- Diffs: color-coded unified diffs of each file change, with line numbers and a file-tree sidebar.
- Commands and output: terminal blocks showing what the agent ran and what came back.
- Reasoning: the agent's thoughts between actions.
Potato reads coding-agent traces in several formats, including SWE-bench, Aider, and Claude Code. See Coding Agent Annotation and Code Review Annotation.
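For readers unfamiliar with the unified diff format that the diff view is built on, the sketch below produces one with Python's standard `difflib` module and tags each line by role. This is only an illustration of the format, not Potato's renderer; the file name `geometry.py`, the sample code, and the `classify` helper are assumptions made up for the example.

```python
import difflib

# Hypothetical file contents before and after an agent's edit.
before = [
    "def area(w, h):\n",
    "    return w + h\n",
]
after = [
    "def area(w, h):\n",
    "    return w * h\n",
]

diff_lines = list(
    difflib.unified_diff(
        before, after, fromfile="a/geometry.py", tofile="b/geometry.py"
    )
)


def classify(line):
    """Tag a unified-diff line by its prefix (the basis for color coding).

    Simplification: a removed line whose content itself starts with "--"
    would be mistaken for a file header here.
    """
    if line.startswith(("---", "+++")):
        return "file header"
    if line.startswith("@@"):
        return "hunk header"
    if line.startswith("+"):
        return "added"
    if line.startswith("-"):
        return "removed"
    return "context"


for line in diff_lines:
    print(f"{classify(line):12} | {line.rstrip()}")
```

The "added" and "removed" tags correspond to the lines a diff viewer typically colors green and red; the hunk header carries the line numbers.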
What to judge
- Correctness: does the change solve the task without breaking other things?
- Step quality: was each edit or command sensible, or was the agent flailing?
- Efficiency: did it take a reasonable path?
annotation_schemes:
  - annotation_type: per_turn_rating
    name: step_correctness
    description: "Judge each edit or command."
    target: agentic_steps
    rating_type: radio
    labels: ["Correct", "Partially correct", "Incorrect", "Unnecessary"]
  - annotation_type: radio
    name: overall
    description: "Does the final change solve the task?"
    labels: [Solved, Partially solved, Not solved]

Quality considerations
- Give annotators the task description and the repository context; a diff is meaningless without the goal.
- To label the first erroneous step in a reasoning chain, see Process Reward Models.
- To watch an agent code in real time rather than review a recording, see Live Agent Evaluation.
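Once per-step labels have been collected with a scheme like `step_correctness`, they can be summarised per trace. The sketch below is a hedged illustration in Python: it assumes the labels for one trace are available as a plain list of strings in step order, which is a hypothetical shape and not Potato's documented export format. The `summarize_steps` helper is likewise made up for this example.

```python
from collections import Counter

# Label set copied from the step_correctness scheme.
STEP_LABELS = ["Correct", "Partially correct", "Incorrect", "Unnecessary"]


def summarize_steps(step_labels):
    """Summarise one trace's per-step labels (assumed to be in step order)."""
    counts = Counter(step_labels)
    total = len(step_labels)
    # 0-based index of the first step labeled "Incorrect", or None if there is none.
    first_incorrect = next(
        (i for i, label in enumerate(step_labels) if label == "Incorrect"),
        None,
    )
    return {
        "counts": {label: counts.get(label, 0) for label in STEP_LABELS},
        "correct_fraction": counts["Correct"] / total if total else 0.0,
        "unnecessary_fraction": counts["Unnecessary"] / total if total else 0.0,
        "first_incorrect_step": first_incorrect,
    }


# Hypothetical labels for a five-step trace.
example = ["Correct", "Unnecessary", "Correct", "Incorrect", "Partially correct"]
summary = summarize_steps(example)
print(summary["correct_fraction"])      # 0.4
print(summary["unnecessary_fraction"])  # 0.2
print(summary["first_incorrect_step"])  # 3
```

The share of "Unnecessary" steps is a rough proxy for efficiency, and the first "Incorrect" index marks where a run started to go wrong; both are only as reliable as the annotations behind them.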