Coding-Agent Evaluation

How to evaluate coding agents by reviewing diffs, terminal output, and SWE-bench, Aider, or Claude Code traces in Potato's coding trace display.

A coding agent edits files, runs commands, and reads output to solve a programming task. Evaluating one is like reviewing a pull request that also includes the terminal session: you judge the code changes and the steps that produced them. Potato renders unified diffs and terminal blocks so annotators can review a coding run the way they'd review a PR.

Human review complements automated benchmarks such as SWE-bench: it catches the plausible-but-wrong patch that passes a weak test.
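
As a rough illustration of a plausible-but-wrong patch, the sketch below uses a hypothetical leap-year fix (the function names and tests are invented for this example, not taken from any benchmark). The patched function passes a weak test but fails once the test covers century years, which is the kind of gap a human reviewer reading the diff can spot.

```python
def is_leap_year_patched(year: int) -> bool:
    """Hypothetical agent patch: looks plausible but ignores the century rule."""
    return year % 4 == 0


def is_leap_year_reference(year: int) -> bool:
    """Reference behavior under the Gregorian calendar."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def weak_test(fn) -> bool:
    """A weak test: only checks years where both versions agree."""
    return fn(2024) is True and fn(2023) is False


def stronger_test(fn) -> bool:
    """Adds the century cases that expose the bug."""
    return weak_test(fn) and fn(1900) is False and fn(2000) is True


weak_passes = weak_test(is_leap_year_patched)
strong_passes = stronger_test(is_leap_year_patched)
print(weak_passes, strong_passes)  # True False
```

An automated harness that only runs the weak test would report this patch as solved; an annotator reading the one-line change would likely not.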

What the annotator reviews

  • Diffs: color-coded unified diffs of each file change, with line numbers and a file-tree sidebar.
  • Commands and output: terminal blocks showing what the agent ran and what came back.
  • Reasoning: the agent's thoughts between actions.
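
For readers unfamiliar with the unified diff format, the sketch below builds one with Python's standard `difflib` module and tags each line by kind, which is roughly the information a color-coded diff view relies on. It is an illustration only, not Potato's rendering code; the file name `calc.py` and the `classify` helper are hypothetical.

```python
import difflib

# Hypothetical before/after contents of a file the agent edited.
before = ["def add(a, b):\n", "    return a - b\n"]
after = ["def add(a, b):\n", "    return a + b\n"]

diff_lines = list(
    difflib.unified_diff(before, after, fromfile="a/calc.py", tofile="b/calc.py")
)


def classify(line: str) -> str:
    """Tag a unified-diff line by kind.

    Simplified: a removed line whose content starts with "--" would be
    mistaken for a file header; a real parser tracks position instead.
    """
    if line.startswith(("---", "+++")):
        return "file-header"
    if line.startswith("@@"):
        return "hunk-header"
    if line.startswith("+"):
        return "added"
    if line.startswith("-"):
        return "removed"
    return "context"


kinds = [classify(line) for line in diff_lines]
for kind, line in zip(kinds, diff_lines):
    print(f"{kind:12} {line}", end="")
```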

Potato reads coding-agent traces in several formats, including SWE-bench, Aider, and Claude Code. See Coding Agent Annotation and Code Review Annotation.

What to judge

  • Correctness: does the change solve the task without breaking other things?
  • Step quality: was each edit or command sensible, or was the agent flailing?
  • Efficiency: did the agent take a reasonable path to the solution?
yaml
annotation_schemes:
  - annotation_type: per_turn_rating
    name: step_correctness
    description: "Judge each edit or command."
    target: agentic_steps
    rating_type: radio
    labels: ["Correct", "Partially correct", "Incorrect", "Unnecessary"]
  - annotation_type: radio
    name: overall
    description: "Does the final change solve the task?"
    labels: ["Solved", "Partially solved", "Not solved"]
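
Once steps are labeled with a scheme like the one above, the per-step labels can be rolled up into run-level signals. The sketch below is a minimal example that assumes the labels for one run are already available as an ordered Python list; that input shape and the `summarize_run` helper are hypothetical, not Potato's export format.

```python
from collections import Counter

# Label set copied from the step_correctness scheme above.
STEP_LABELS = ["Correct", "Partially correct", "Incorrect", "Unnecessary"]


def summarize_run(step_labels: list[str]) -> dict:
    """Summarize one run's per-step labels (hypothetical helper)."""
    unknown = [label for label in step_labels if label not in STEP_LABELS]
    if unknown:
        raise ValueError(f"Unknown labels: {unknown}")

    counts = Counter(step_labels)
    # Index of the first step judged Incorrect, or None if there is none.
    first_incorrect = next(
        (i for i, label in enumerate(step_labels) if label == "Incorrect"), None
    )
    total = len(step_labels)
    return {
        "counts": dict(counts),
        "first_incorrect_step": first_incorrect,
        # Share of steps judged Unnecessary: a rough efficiency signal.
        "unnecessary_share": counts["Unnecessary"] / total if total else 0.0,
    }


example = ["Correct", "Unnecessary", "Correct", "Incorrect", "Correct"]
summary = summarize_run(example)
print(summary["first_incorrect_step"], summary["unnecessary_share"])  # 3 0.2
```

The first incorrect step tells you where a run went wrong, and the unnecessary share gives a crude proxy for the efficiency question; neither replaces the overall judgment of whether the final change solves the task.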

Quality considerations

  • Give annotators the task description and the repository context; a diff is meaningless without the goal.
  • To annotate the first error in a reasoning chain, see Process Reward Models.
  • To watch an agent code in real time rather than review a recording, see Live Agent Evaluation.

Further reading