
Comparing AI Agents Side by Side: Binary, Scale, and Multi-Dimension Modes

Set up pairwise agent comparison in Potato with three modes: binary preference, continuous scale, and per-dimension multi-criteria judgment with required justification.

Potato Team

Why Pairwise Comparison for Agent Evaluation

Absolute quality judgments are hard. Asking an annotator "rate this coding agent trace on a scale of 1-10" produces noisy, inconsistent data because different annotators calibrate their scales differently. Pairwise comparison sidesteps this problem: instead of rating traces in isolation, annotators see two traces side by side and judge which one is better. This comparative judgment is more natural, more consistent, and produces exactly the kind of preference data needed for training with Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF).

Pairwise comparison is the standard approach used to train reward models for language model alignment. The same methodology applies to coding agents: collect human preferences between pairs of agent trajectories, train a reward model on these preferences, and use the reward model to guide agent training or best-of-N selection at inference time.
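Best-of-N selection from a reward model reduces to picking the highest-scoring candidate. A minimal sketch, where `score_trace` stands in for your trained reward model (it is hypothetical, not part of Potato):

```python
# Sketch: best-of-N selection with a trained reward model.
# `score_trace` is a stand-in for the reward model's scoring function.
def best_of_n(task, traces, score_trace):
    """Return the trace with the highest reward-model score for a task."""
    return max(traces, key=lambda trace: score_trace(task, trace))

# Usage with a toy scorer that prefers shorter traces:
traces = [{"id": "t1", "steps": [1, 2, 3]}, {"id": "t2", "steps": [1, 2]}]
best = best_of_n("fix the bug", traces, lambda task, t: -len(t["steps"]))
# best["id"] == "t2"
```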

Potato supports three pairwise comparison modes, each designed for different evaluation needs and data budgets.

Mode 1: Binary Preference

Binary preference is the simplest and fastest comparison mode. The annotator sees two traces side by side and clicks to select which one is better. An optional tie option handles cases where both traces are equally good (or equally bad).

When to Use Binary Mode

Binary mode is best when you need large volumes of preference data quickly. It works well for training basic reward models, computing agent win rates, and building Elo rating leaderboards. The tradeoff is that it loses nuance: you know which trace is preferred but not by how much or along which dimensions.

Configuration

yaml
# config.yaml
project_name: "Agent Comparison - Binary"
port: 8000
 
data:
  source: "local"
  input_path: "./data/paired_traces.jsonl"
  data_format: "paired_coding_trace"
 
coding_agent:
  display:
    diff_style: "unified"
    syntax_highlighting: true
    terminal_theme: "dark"
    file_tree:
      enabled: true
      position: "left"
    collapsible:
      auto_collapse_thinking: true
 
comparison:
  layout: "side_by_side"         # "side_by_side" or "tabbed"
  label_a: "Agent A"
  label_b: "Agent B"
  randomize_order: true          # Randomize which trace appears on which side
  show_agent_identity: false     # Hide agent names to avoid bias
  sync_scroll: false             # Independent scrolling for each trace
 
annotation_schemes:
  - annotation_type: pairwise_comparison
    name: preference
    mode: "binary"
    question: "Which agent produced a better solution?"
    options:
      - value: "a"
        text: "Agent A is better"
        keyboard_shortcut: "1"
      - value: "b"
        text: "Agent B is better"
        keyboard_shortcut: "2"
      - value: "tie"
        text: "Tie (equally good or equally bad)"
        keyboard_shortcut: "3"
    allow_tie: true
    require_justification: false  # No text explanation needed
 
  - annotation_type: radio
    name: confidence
    label: "How confident are you?"
    options:
      - value: "high"
        text: "Very confident"
      - value: "medium"
        text: "Somewhat confident"
      - value: "low"
        text: "Not confident"
 
output:
  path: "./output/"
  format: "jsonl"
 
quality_control:
  inter_annotator_agreement: true
  overlap_percentage: 20
  attention_checks:
    enabled: true
    frequency: 10                # Insert a check every 10 instances
    type: "duplicate_reversed"   # Re-show a pair with A/B swapped
 
annotators:
  - username: "judge1"
    password: "judge_pw_1"
  - username: "judge2"
    password: "judge_pw_2"

The Annotation Workflow

The annotator sees a split-screen view. The left panel shows Trace A rendered with the full CodingTraceDisplay (diffs, terminal blocks, file reads, thinking). The right panel shows Trace B for the same task. Both traces have independent scroll positions.

Above the traces, the task description is shown so the annotator understands what both agents were trying to accomplish.

Below the traces, the annotator clicks one of three buttons: "Agent A is better", "Agent B is better", or "Tie." Because randomize_order is enabled, the actual identity of which agent is A vs B is shuffled per instance, preventing position bias.
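When order is randomized, downstream analysis has to map the displayed side back to the underlying trace. A sketch of that de-randomization step, assuming the export records which trace was shown on each side (the field names `displayed_left`, `displayed_right`, and `preference` are illustrative, not Potato's schema):

```python
# Sketch: map a displayed-side preference ("a"/"b"/"tie") back to the
# underlying trace when presentation order was randomized.
# Field names here are illustrative, not Potato's actual export schema.
def resolve_preference(record):
    """Return the id of the underlying trace the annotator preferred."""
    if record["preference"] == "tie":
        return None
    left, right = record["displayed_left"], record["displayed_right"]
    return left if record["preference"] == "a" else right

rec = {"displayed_left": "trace_b_001", "displayed_right": "trace_a_001",
       "preference": "a"}
# resolve_preference(rec) == "trace_b_001"
```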

Mode 2: Continuous Scale

Scale mode adds nuance by letting the annotator express how much better one trace is compared to the other. Instead of a binary click, the annotator uses a slider that ranges from "A much better" on the left to "B much better" on the right, with "Equal" at the center.

When to Use Scale Mode

Scale mode is valuable when you need preference strength information, not just direction. Strong preferences (slider near the extremes) indicate clear quality differences, while weak preferences (slider near the center) indicate marginal differences. Training pipelines like DPO can weight examples by preference strength, giving more influence to clear-cut comparisons.

Configuration

yaml
# config.yaml
project_name: "Agent Comparison - Scale"
port: 8000
 
data:
  source: "local"
  input_path: "./data/paired_traces.jsonl"
  data_format: "paired_coding_trace"
 
coding_agent:
  display:
    diff_style: "unified"
    syntax_highlighting: true
    terminal_theme: "dark"
    file_tree:
      enabled: true
    collapsible:
      auto_collapse_thinking: true
 
comparison:
  layout: "side_by_side"
  randomize_order: true
  show_agent_identity: false
 
annotation_schemes:
  - annotation_type: pairwise_comparison
    name: preference_scale
    mode: "scale"
    question: "Which agent produced a better solution, and by how much?"
    scale:
      points: 7                  # 7-point scale
      labels:
        1: "A is much better"
        2: "A is better"
        3: "A is slightly better"
        4: "Equal"
        5: "B is slightly better"
        6: "B is better"
        7: "B is much better"
      default: 4                 # Start at "Equal"
      show_numeric_value: true
    require_justification: true
    justification_label: "Briefly explain your rating"
    justification_min_length: 20
 
output:
  path: "./output/"
  format: "jsonl"
 
quality_control:
  inter_annotator_agreement: true
  overlap_percentage: 20
 
annotators:
  - username: "judge1"
    password: "judge_pw_1"
  - username: "judge2"
    password: "judge_pw_2"

Using a 5-Point Scale

For faster annotation with slightly less granularity, use a 5-point scale:

yaml
annotation_schemes:
  - annotation_type: pairwise_comparison
    name: preference_scale_5
    mode: "scale"
    question: "Compare the two solutions"
    scale:
      points: 5
      labels:
        1: "A is clearly better"
        2: "A is somewhat better"
        3: "About equal"
        4: "B is somewhat better"
        5: "B is clearly better"
      default: 3

Mode 3: Multi-Dimension Comparison

Multi-dimension mode is the most detailed comparison method. Instead of a single overall preference, the annotator evaluates each trace along multiple independent dimensions. Each dimension gets its own A/B/Tie judgment, and each judgment requires a text justification.

When to Use Multi-Dimension Mode

Multi-dimension mode is ideal when you need to understand not just which agent is better overall, but why. A trace might have correct code but terrible efficiency. Another might be efficient but miss edge cases. Multi-dimension comparisons produce per-dimension preference data that can train dimension-specific reward models or provide detailed feedback for targeted agent improvement.

Configuration

yaml
# config.yaml
project_name: "Agent Comparison - Multi-Dimension"
port: 8000
 
data:
  source: "local"
  input_path: "./data/paired_traces.jsonl"
  data_format: "paired_coding_trace"
 
coding_agent:
  display:
    diff_style: "unified"
    syntax_highlighting: true
    terminal_theme: "dark"
    file_tree:
      enabled: true
    collapsible:
      auto_collapse_thinking: true
 
comparison:
  layout: "side_by_side"
  randomize_order: true
  show_agent_identity: false
 
annotation_schemes:
  - annotation_type: pairwise_comparison
    name: multi_dim_comparison
    mode: "multi_dimension"
    question: "Compare the two solutions along each dimension"
    dimensions:
      - name: "correctness"
        label: "Correctness"
        description: >
          Does the solution correctly fix the issue? Are there remaining
          bugs, missed edge cases, or incorrect logic?
        options: ["A", "B", "Tie"]
        require_justification: true
        justification_placeholder: "Why is this solution more correct?"
        weight: 0.4               # Weight for computing overall preference
 
      - name: "efficiency"
        label: "Efficiency"
        description: >
          How efficient is the agent's process? Does it take unnecessary
          steps, read irrelevant files, or make redundant edits?
        options: ["A", "B", "Tie"]
        require_justification: true
        justification_placeholder: "Which agent was more efficient and why?"
        weight: 0.2
 
      - name: "code_quality"
        label: "Code Quality"
        description: >
          Is the code well-written? Consider readability, naming,
          error handling, documentation, and adherence to existing patterns.
        options: ["A", "B", "Tie"]
        require_justification: true
        justification_placeholder: "Which produces better quality code?"
        weight: 0.2
 
      - name: "communication"
        label: "Communication"
        description: >
          How well does the agent explain its reasoning? Are its thinking
          steps clear and logical? Does it identify the root cause?
        options: ["A", "B", "Tie"]
        require_justification: true
        justification_placeholder: "Which agent communicates its approach better?"
        weight: 0.1
 
      - name: "robustness"
        label: "Robustness"
        description: >
          Does the solution handle edge cases? Does the agent verify its
          changes with tests? Is the fix narrow and targeted or fragile?
        options: ["A", "B", "Tie"]
        require_justification: true
        justification_placeholder: "Which solution is more robust?"
        weight: 0.1
 
    overall_preference:
      enabled: true              # Also ask for overall preference
      question: "Overall, which solution do you prefer?"
      options: ["A", "B", "Tie"]
      require_justification: true
 
output:
  path: "./output/"
  format: "jsonl"
 
quality_control:
  inter_annotator_agreement: true
  overlap_percentage: 25         # Higher overlap for this detailed task
  minimum_time_per_instance: 120 # 2 minutes minimum for thorough review
 
annotators:
  - username: "judge1"
    password: "judge_pw_1"
  - username: "judge2"
    password: "judge_pw_2"

Preparing Paired Trace Data

All three comparison modes require paired traces as input. Each instance in the JSONL file contains two traces that attempted the same task.

Data Format

json
{
  "id": "pair_001",
  "task_description": "Fix the IndexError in process_batch() when the input list is empty",
  "repo": "myorg/myproject",
  "trace_a": {
    "agent": "claude_code",
    "model": "claude-sonnet-4-20250514",
    "structured_turns": [
      {
        "step_idx": 0,
        "type": "file_read",
        "path": "src/batch.py",
        "content": "def process_batch(items):\n    result = items[0]\n    ...",
        "start_line": 10,
        "end_line": 25
      },
      {
        "step_idx": 1,
        "type": "file_edit",
        "path": "src/batch.py",
        "diff": "--- a/src/batch.py\n+++ b/src/batch.py\n@@ -10,3 +10,5 @@\n def process_batch(items):\n+    if not items:\n+        return []\n     result = items[0]\n"
      },
      {
        "step_idx": 2,
        "type": "bash_command",
        "command": "python -m pytest tests/test_batch.py -v",
        "output": "PASSED",
        "exit_code": 0
      }
    ]
  },
  "trace_b": {
    "agent": "swe_agent",
    "model": "gpt-4o",
    "structured_turns": [
      {
        "step_idx": 0,
        "type": "bash_command",
        "command": "find . -name '*.py' | xargs grep 'process_batch'",
        "output": "src/batch.py:def process_batch(items):\ntests/test_batch.py:    process_batch([])",
        "exit_code": 0
      },
      {
        "step_idx": 1,
        "type": "file_read",
        "path": "src/batch.py",
        "content": "def process_batch(items):\n    result = items[0]\n    ...",
        "start_line": 1,
        "end_line": 50
      },
      {
        "step_idx": 2,
        "type": "file_edit",
        "path": "src/batch.py",
        "diff": "--- a/src/batch.py\n+++ b/src/batch.py\n@@ -10,3 +10,6 @@\n def process_batch(items):\n+    if items is None or len(items) == 0:\n+        logger.warning('Empty input to process_batch')\n+        return []\n     result = items[0]\n"
      },
      {
        "step_idx": 3,
        "type": "bash_command",
        "command": "python -m pytest tests/ -v",
        "output": "PASSED (12 tests)",
        "exit_code": 0
      }
    ]
  }
}
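Before launching an annotation job, it is worth sanity-checking the paired file. A minimal validation sketch that checks only the fields shown in the example above:

```python
import json

# Sketch: sanity-check one line of a paired-trace JSONL file before
# loading it into Potato. Checks only the fields shown in the example.
REQUIRED_TOP = {"id", "task_description", "trace_a", "trace_b"}

def validate_pair(line):
    pair = json.loads(line)
    missing = REQUIRED_TOP - pair.keys()
    if missing:
        raise ValueError(f"{pair.get('id', '?')}: missing keys {missing}")
    for side in ("trace_a", "trace_b"):
        if not pair[side].get("structured_turns"):
            raise ValueError(f"{pair['id']}: {side} has no structured_turns")
    return pair

line = json.dumps({"id": "pair_001", "task_description": "Fix ...",
                   "trace_a": {"structured_turns": [{"type": "file_read"}]},
                   "trace_b": {"structured_turns": [{"type": "file_edit"}]}})
pair = validate_pair(line)
```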

Generating Pairs from Individual Traces

If you have individual traces that attempted the same tasks, use the pairing utility:

bash
# Generate all possible pairs for each task
potato pair-traces \
  --input ./data/individual_traces.jsonl \
  --output ./data/paired_traces.jsonl \
  --pair_by "task_id" \
  --strategy "all_pairs"
 
# Or sample a fixed number of pairs per task
potato pair-traces \
  --input ./data/individual_traces.jsonl \
  --output ./data/paired_traces.jsonl \
  --pair_by "task_id" \
  --strategy "sample" \
  --pairs_per_task 3
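The "all_pairs" strategy is conceptually simple: group traces by task and take every unordered pair within each group. A sketch of the same logic, for building pairs yourself when the CLI is not available:

```python
import itertools
from collections import defaultdict

# Sketch of the "all_pairs" strategy: group individual traces by task_id
# and emit every unordered pair within each group.
def make_all_pairs(traces):
    by_task = defaultdict(list)
    for trace in traces:
        by_task[trace["task_id"]].append(trace)
    pairs = []
    for task_id, group in by_task.items():
        for a, b in itertools.combinations(group, 2):
            pairs.append({"task_id": task_id, "trace_a": a, "trace_b": b})
    return pairs

traces = [{"task_id": "t1", "agent": x} for x in ("A", "B", "C")]
# 3 traces for one task -> C(3, 2) = 3 pairs
```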

Exporting Comparison Data

DPO/RLHF Preference Pairs

The primary export format for pairwise comparisons is preference pairs for DPO or RLHF training:

bash
potato export \
  --format dpo_preferences \
  --project ./output/ \
  --output ./training_data/preferences.jsonl

For binary mode, the output is straightforward:

json
{
  "prompt": "Fix the IndexError in process_batch() when the input list is empty",
  "chosen": {"agent": "claude_code", "trace_id": "trace_a_001", "steps": [...]},
  "rejected": {"agent": "swe_agent", "trace_id": "trace_b_001", "steps": [...]},
  "annotator": "judge1",
  "confidence": "high"
}

For scale mode, the preference strength is included:

json
{
  "prompt": "Fix the IndexError in process_batch()",
  "chosen": {"agent": "claude_code", "trace_id": "trace_a_001"},
  "rejected": {"agent": "swe_agent", "trace_id": "trace_b_001"},
  "preference_strength": 0.83,
  "scale_value": 2,
  "justification": "Agent A found and fixed the bug in fewer steps with cleaner code"
}
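The preference_strength value above is consistent with linearly mapping the 7-point scale onto [0, 1]: scale_value 2 ("A is better") gives (7 - 2) / 6 ≈ 0.83 for the chosen trace. A sketch of that mapping, as an assumption about the export rather than documented behavior:

```python
# Sketch: one plausible mapping from a k-point scale value to a
# preference strength in [0, 1] for the *chosen* trace. With points=7 and
# scale_value=2 this reproduces the 0.83 in the example above; treat the
# formula as an assumption about the export, not documented behavior.
def preference_strength(scale_value, points=7, chosen="A"):
    toward_a = (points - scale_value) / (points - 1)  # 1.0 at "A much better"
    return round(toward_a if chosen == "A" else 1 - toward_a, 2)

# preference_strength(2, points=7, chosen="A") == 0.83
```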

For multi-dimension mode, per-dimension preferences are included:

json
{
  "prompt": "Fix the IndexError in process_batch()",
  "chosen": {"agent": "claude_code", "trace_id": "trace_a_001"},
  "rejected": {"agent": "swe_agent", "trace_id": "trace_b_001"},
  "overall_preference": "A",
  "dimensions": {
    "correctness": {"preference": "Tie", "justification": "Both correctly fix the bug"},
    "efficiency": {"preference": "A", "justification": "A solves it in 3 steps vs 4"},
    "code_quality": {"preference": "B", "justification": "B adds logging and handles None"},
    "communication": {"preference": "A", "justification": "A's reasoning is more focused"},
    "robustness": {"preference": "B", "justification": "B runs full test suite, not just one file"}
  },
  "weighted_score_a": 0.55,
  "weighted_score_b": 0.45
}
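The dimension weights from the config can be combined into aggregate scores. Below is one simple scheme: each dimension's weight goes to the preferred side, and ties split it evenly. It is illustrative only; the exported weighted_score values above may additionally factor in the overall preference, so do not expect this sketch to reproduce them exactly.

```python
# Sketch: combine per-dimension preferences into weighted scores. Each
# dimension's weight goes to the preferred side; ties split it evenly.
# Illustrative only -- Potato's exported weighted_score values may
# additionally factor in the overall preference.
def weighted_scores(dimensions, weights):
    score_a = score_b = 0.0
    for name, pref in dimensions.items():
        w = weights[name]
        if pref == "A":
            score_a += w
        elif pref == "B":
            score_b += w
        else:  # Tie: split the weight
            score_a += w / 2
            score_b += w / 2
    return score_a, score_b

weights = {"correctness": 0.4, "efficiency": 0.2, "code_quality": 0.2,
           "communication": 0.1, "robustness": 0.1}
prefs = {"correctness": "Tie", "efficiency": "A", "code_quality": "B",
         "communication": "A", "robustness": "B"}
# weighted_scores(prefs, weights) -> approximately (0.5, 0.5)
```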

Analysis: Win Rates, Elo Ratings, and Per-Dimension Breakdowns

Computing Win Rates

python
import json
from collections import defaultdict
 
with open("training_data/preferences.jsonl") as f:
    prefs = [json.loads(line) for line in f]
 
wins = defaultdict(lambda: {"wins": 0, "losses": 0, "ties": 0})
 
for pref in prefs:
    agent_chosen = pref["chosen"]["agent"]
    agent_rejected = pref["rejected"]["agent"]
 
    if agent_chosen == agent_rejected:
        continue  # Skip self-comparisons
 
    if pref.get("overall_preference") == "Tie":
        wins[agent_chosen]["ties"] += 1
        wins[agent_rejected]["ties"] += 1
    else:
        wins[agent_chosen]["wins"] += 1
        wins[agent_rejected]["losses"] += 1
 
print("Agent Win Rates:")
print("-" * 55)
for agent, record in sorted(wins.items()):
    total = record["wins"] + record["losses"] + record["ties"]
    win_rate = (record["wins"] + 0.5 * record["ties"]) / total * 100
    print(f"  {agent:<20} {win_rate:5.1f}%  "
          f"(W:{record['wins']} L:{record['losses']} T:{record['ties']})")

Computing Elo Ratings

python
import json
import math
from collections import defaultdict
 
def compute_elo(preferences, k=32, initial_rating=1500):
    """Compute Elo ratings from pairwise preferences."""
    ratings = defaultdict(lambda: initial_rating)
 
    for pref in preferences:
        agent_a = pref["chosen"]["agent"]
        agent_b = pref["rejected"]["agent"]
 
        ra = ratings[agent_a]
        rb = ratings[agent_b]
 
        # Expected scores
        ea = 1.0 / (1.0 + math.pow(10, (rb - ra) / 400))
        eb = 1.0 / (1.0 + math.pow(10, (ra - rb) / 400))
 
        overall = pref.get("overall_preference", "A")
        if overall == "Tie":
            sa, sb = 0.5, 0.5
        else:
            # "chosen" is the winner
            sa, sb = 1.0, 0.0
 
        ratings[agent_a] = ra + k * (sa - ea)
        ratings[agent_b] = rb + k * (sb - eb)
 
    return dict(ratings)
 
with open("training_data/preferences.jsonl") as f:
    prefs = [json.loads(line) for line in f]
 
ratings = compute_elo(prefs)
 
print("Elo Ratings:")
print("-" * 35)
for agent, rating in sorted(ratings.items(), key=lambda x: -x[1]):
    print(f"  {agent:<20} {rating:.0f}")

Per-Dimension Breakdowns

For multi-dimension comparisons, analyze which dimensions each agent excels at:

python
import json
from collections import defaultdict
 
with open("training_data/preferences.jsonl") as f:
    prefs = [json.loads(line) for line in f]
 
# Only process multi-dimension annotations
multi_dim = [p for p in prefs if "dimensions" in p]
 
dim_wins = defaultdict(lambda: defaultdict(lambda: {"A": 0, "B": 0, "Tie": 0}))
 
for pref in multi_dim:
    agent_a = pref["chosen"]["agent"]
    agent_b = pref["rejected"]["agent"]
    pair_key = f"{agent_a} vs {agent_b}"
 
    for dim_name, dim_data in pref["dimensions"].items():
        dim_wins[dim_name][pair_key][dim_data["preference"]] += 1
 
print("Per-Dimension Win Rates:")
print("=" * 60)
for dim_name, matchups in sorted(dim_wins.items()):
    print(f"\n  {dim_name.upper()}")
    print(f"  {'-' * 50}")
    for pair, counts in matchups.items():
        total = counts["A"] + counts["B"] + counts["Tie"]
        a_rate = (counts["A"] + 0.5 * counts["Tie"]) / total * 100
        print(f"    {pair}: A={a_rate:.0f}% B={100-a_rate:.0f}%  "
              f"(A:{counts['A']} B:{counts['B']} Tie:{counts['Tie']})")

Best Practices

When to Use Each Mode

Binary mode is the right choice when you need to collect thousands of preferences quickly, when you are building a general-purpose reward model, or when computing agent leaderboard rankings. Budget approximately 1-2 minutes per comparison.

Scale mode adds value when preference strength matters for your training pipeline. DPO with margin weighting benefits from knowing that some preferences are strong (slider at the extreme) while others are weak (slider near the center). Budget approximately 2-3 minutes per comparison.

Multi-dimension mode is worth the extra time when you need diagnostic information about agent strengths and weaknesses, when you want to train dimension-specific reward models, or when producing detailed evaluation reports for agent developers. Budget approximately 4-6 minutes per comparison.

How Many Comparisons Do You Need

For reliable win rates, collect at least 100 pairwise comparisons per agent pair. For Elo ratings with 5+ agents, 200-300 total comparisons produce stable rankings. For DPO or reward-model training, aim for 1,000+ preference pairs covering the full distribution of task difficulties.

Randomizing Presentation Order

Always enable randomize_order: true in your comparison config. Position bias (preferring whichever trace appears on the left or in the first tab) is well documented in human evaluation studies. Randomization, combined with the attention_checks.type: "duplicate_reversed" quality check, catches annotators who always select the same side.
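The duplicate_reversed check has a simple consistency rule: on the re-shown pair with sides swapped, "a" and "b" answers should flip while ties should hold. A sketch of that check (the record fields are illustrative; adapt them to your export schema):

```python
# Sketch: verify an annotator's answer on a reversed duplicate. On the
# re-shown pair with A/B swapped, "a" and "b" should flip; "tie" should
# hold. Field names are illustrative, not Potato's export schema.
def reversed_check_consistent(original, duplicate):
    flip = {"a": "b", "b": "a", "tie": "tie"}
    return duplicate["preference"] == flip[original["preference"]]

orig = {"id": "pair_007", "preference": "a"}
dup = {"check_of": "pair_007", "preference": "b"}  # sides were swapped
# reversed_check_consistent(orig, dup) == True
```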

Handling Ties

In binary mode, allow ties but track the tie rate. A tie rate above 30% suggests the agents are too similar for binary comparison and you should switch to scale or multi-dimension mode. In scale mode, ties are natural (the center point). In multi-dimension mode, ties per dimension are expected and informative.
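Tracking the tie rate is a one-liner over the exported records. A sketch, assuming each binary-mode record has a "preference" field with values "a", "b", or "tie":

```python
# Sketch: monitor the tie rate in binary-mode output. Assumes each record
# has a "preference" field with values "a", "b", or "tie".
def tie_rate(records):
    ties = sum(1 for r in records if r["preference"] == "tie")
    return ties / len(records)

records = [{"preference": p} for p in ("a", "tie", "b", "tie", "a")]
# tie_rate(records) == 0.4 -- above 0.3, so consider scale mode
```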

Hiding Agent Identity

Always set show_agent_identity: false unless you specifically need annotators to know which agent produced each trace. Knowledge of agent identity introduces bias: annotators may favor traces from agents they expect to be stronger.

Combining Modes

For the most comprehensive evaluation, run binary mode first on a large set of pairs for overall rankings, then run multi-dimension mode on a smaller, stratified subset for diagnostic analysis. The binary comparisons feed your reward model training pipeline while the multi-dimension comparisons inform targeted agent improvement.