
Coding-Agent Evaluation

How to evaluate coding agents by reviewing diffs, terminal output, and SWE-bench, Aider, and Claude Code traces in Potato's coding trace display.

A coding agent edits files, runs commands, and reads output to solve a programming task. Evaluating one is like reviewing a pull request that also includes the terminal session: you judge the code changes and the steps that produced them. Annotate SWE-agent, Aider, and Claude Code trajectories with Potato, a free self-hosted tool that renders diffs and terminal blocks for review.

Human review complements automated benchmarks such as SWE-bench: it catches the plausible-but-wrong patch that passes a weak test.

What does the annotator review in a coding-agent run?

Diffs: color-coded unified diffs of each file change, with line numbers and a file-tree sidebar.
Commands and output: terminal blocks showing what the agent ran and what came back.
Reasoning: the agent's thoughts between actions.

Potato reads coding-agent traces including SWE-bench, Aider, and Claude Code formats. See Coding Agent Annotation and Code Review Annotation.
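For readers unfamiliar with the unified diff format mentioned above, here is a minimal sketch that produces one with Python's standard `difflib` module. The file name `calc.py` and its contents are hypothetical, and this is not how Potato renders or parses diffs; it only illustrates what an annotator is looking at.

```python
import difflib

# Hypothetical before/after contents of one file touched by an agent.
before = ["def add(a, b):\n", "    return a - b\n"]
after = ["def add(a, b):\n", "    return a + b\n"]

# unified_diff yields header lines, a hunk marker, then context,
# removed ("-") and added ("+") lines.
diff_lines = list(
    difflib.unified_diff(before, after, fromfile="a/calc.py", tofile="b/calc.py")
)
diff_text = "".join(diff_lines)
print(diff_text)
```

Lines starting with `-` were removed and lines starting with `+` were added; unprefixed lines are unchanged context around the edit.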

What should I judge in a coding-agent run?

Correctness: does the change solve the task without breaking other things?
Step quality: was each edit or command sensible, or was the agent flailing?
Efficiency: did it take a reasonable path?

yaml

annotation_schemes:
  - annotation_type: per_turn_rating
    name: step_correctness
    description: "Judge each edit or command."
    target: agentic_steps
    rating_type: radio
    labels: ["Correct", "Partially correct", "Incorrect", "Unnecessary"]
  - annotation_type: radio
    name: overall
    description: "Does the final change solve the task?"
    labels: ["Solved", "Partially solved", "Not solved"]
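Once per-step ratings are collected, they can be condensed into a per-run summary. The sketch below assumes the ratings for one run have already been extracted into a plain list in step order; it does not read Potato's export format, and `summarize_steps` is a hypothetical helper, not part of Potato.

```python
from collections import Counter

# Labels copied from the step_correctness scheme above.
STEP_LABELS = ["Correct", "Partially correct", "Incorrect", "Unnecessary"]


def summarize_steps(step_ratings):
    """Count each label and find the first step rated Incorrect (None if absent)."""
    unknown = [r for r in step_ratings if r not in STEP_LABELS]
    if unknown:
        raise ValueError(f"unknown labels: {unknown}")
    counts = Counter(step_ratings)
    first_incorrect = next(
        (i for i, r in enumerate(step_ratings) if r == "Incorrect"), None
    )
    return {
        "counts": {label: counts.get(label, 0) for label in STEP_LABELS},
        "first_incorrect_step": first_incorrect,
    }


# Hypothetical ratings for a five-step run, in step order.
ratings = ["Correct", "Correct", "Unnecessary", "Incorrect", "Partially correct"]
summary = summarize_steps(ratings)
print(summary)
```

The count of "Unnecessary" steps gives a rough signal for efficiency, while the index of the first "Incorrect" step (zero-based here) shows where the run started to go wrong.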

How do I keep coding-agent evaluation reliable?

Give annotators the task description and the repository context; a diff is meaningless without the goal.
To mark the first erroneous step in a reasoning chain, see Process Reward Models.
To watch an agent code in real time rather than review a recording, see Live Agent Evaluation.
