
Comparing AI Agents Side by Side: Binary, Scale, and Multi-Dimension Modes

Set up pairwise agent comparison in Potato with three modes: binary preference, continuous scale, and per-dimension multi-criteria judgment with required justification.

Potato Team

Why Pairwise Comparison for Agent Evaluation

Absolute quality judgments are hard. Asking an annotator "rate this coding agent trace on a scale of 1-10" produces noisy, inconsistent data because different annotators calibrate their scales differently. Pairwise comparison sidesteps this problem: instead of rating traces in isolation, annotators see two traces side by side and judge which one is better. This comparative judgment is more natural, more consistent, and produces exactly the kind of preference data needed for training with Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF).

Pairwise comparison is the standard approach used to train reward models for language model alignment. The same methodology applies to coding agents: collect human preferences between pairs of agent trajectories, train a reward model on these preferences, and use the reward model to guide agent training or best-of-N selection at inference time.
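Best-of-N selection from a reward model reduces to picking the highest-scoring candidate. A minimal sketch, where `score_trace` stands in for your trained reward model (it is hypothetical, not part of Potato):

```python
# Sketch: best-of-N selection with a trained reward model.
# `score_trace` is a stand-in for the reward model's scoring function.
def best_of_n(task, traces, score_trace):
    """Return the trace with the highest reward-model score for a task."""
    return max(traces, key=lambda trace: score_trace(task, trace))

# Usage with a toy scorer that prefers shorter traces:
traces = [{"id": "t1", "steps": [1, 2, 3]}, {"id": "t2", "steps": [1, 2]}]
best = best_of_n("fix the bug", traces, lambda task, t: -len(t["steps"]))
# best["id"] == "t2"
```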

Potato supports three pairwise comparison modes, each designed for different evaluation needs and data budgets.

Mode 1: Binary Preference

Binary preference is the simplest and fastest comparison mode. The annotator sees two traces side by side and clicks to select which one is better. An optional tie option handles cases where both traces are equally good (or equally bad).

When to Use Binary Mode

Binary mode is best when you need large volumes of preference data quickly. It works well for training basic reward models, computing agent win rates, and building Elo rating leaderboards. The tradeoff is that it loses nuance: you know which trace is preferred but not by how much or along which dimensions.

Configuration

yaml
# config.yaml
project_name: "Agent Comparison - Binary"
port: 8000
 
data:
  source: "local"
  input_path: "./data/paired_traces.jsonl"
  data_format: "paired_coding_trace"
 
coding_agent:
  display:
    diff_style: "unified"
    syntax_highlighting: true
    terminal_theme: "dark"
    file_tree:
      enabled: true
      position: "left"
    collapsible:
      auto_collapse_thinking: true
 
comparison:
  layout: "side_by_side"         # "side_by_side" or "tabbed"
  label_a: "Agent A"
  label_b: "Agent B"
  randomize_order: true          # Randomize which trace appears on which side
  show_agent_identity: false     # Hide agent names to avoid bias
  sync_scroll: false             # Independent scrolling for each trace
 
annotation_schemes:
  - annotation_type: pairwise_comparison
    name: preference
    mode: "binary"
    question: "Which agent produced a better solution?"
    options:
      - value: "a"
        text: "Agent A is better"
        keyboard_shortcut: "1"
      - value: "b"
        text: "Agent B is better"
        keyboard_shortcut: "2"
      - value: "tie"
        text: "Tie (equally good or equally bad)"
        keyboard_shortcut: "3"
    allow_tie: true
    require_justification: false  # No text explanation needed
 
  - annotation_type: radio
    name: confidence
    label: "How confident are you?"
    options:
      - value: "high"
        text: "Very confident"
      - value: "medium"
        text: "Somewhat confident"
      - value: "low"
        text: "Not confident"
 
output:
  path: "./output/"
  format: "jsonl"
 
quality_control:
  inter_annotator_agreement: true
  overlap_percentage: 20
  attention_checks:
    enabled: true
    frequency: 10                # Insert a check every 10 instances
    type: "duplicate_reversed"   # Re-show a pair with A/B swapped
 
annotators:
  - username: "judge1"
    password: "judge_pw_1"
  - username: "judge2"
    password: "judge_pw_2"

The Annotation Workflow

The annotator sees a split-screen view. The left panel shows Trace A rendered with the full CodingTraceDisplay (diffs, terminal blocks, file reads, thinking). The right panel shows Trace B for the same task. Both traces have independent scroll positions.

Above the traces, the task description is shown so the annotator understands what both agents were trying to accomplish.

Below the traces, the annotator clicks one of three buttons: "Agent A is better", "Agent B is better", or "Tie." Because randomize_order is enabled, the actual identity of which agent is A vs B is shuffled per instance, preventing position bias.
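When order is randomized, downstream analysis has to map the displayed side back to the underlying trace. A sketch of that de-randomization step, assuming the export records which trace was shown on each side (the field names `displayed_left`, `displayed_right`, and `preference` are illustrative, not Potato's schema):

```python
# Sketch: map a displayed-side preference ("a"/"b"/"tie") back to the
# underlying trace when presentation order was randomized.
# Field names here are illustrative, not Potato's actual export schema.
def resolve_preference(record):
    """Return the id of the underlying trace the annotator preferred."""
    if record["preference"] == "tie":
        return None
    left, right = record["displayed_left"], record["displayed_right"]
    return left if record["preference"] == "a" else right

rec = {"displayed_left": "trace_b_001", "displayed_right": "trace_a_001",
       "preference": "a"}
# resolve_preference(rec) == "trace_b_001"
```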

Mode 2: Continuous Scale

Scale mode adds nuance by letting the annotator express how much better one trace is compared to the other. Instead of a binary click, the annotator uses a slider that ranges from "A much better" on the left to "B much better" on the right, with "Equal" at the center.

When to Use Scale Mode

Scale mode is valuable when you need preference strength information, not just direction. Strong preferences (slider near the extremes) indicate clear quality differences, while weak preferences (slider near the center) indicate marginal differences. Training pipelines like DPO can weight examples by preference strength, giving more influence to clear-cut comparisons.

Configuration

yaml
# config.yaml
project_name: "Agent Comparison - Scale"
port: 8000
 
data:
  source: "local"
  input_path: "./data/paired_traces.jsonl"
  data_format: "paired_coding_trace"
 
coding_agent:
  display:
    diff_style: "unified"
    syntax_highlighting: true
    terminal_theme: "dark"
    file_tree:
      enabled: true
    collapsible:
      auto_collapse_thinking: true
 
comparison:
  layout: "side_by_side"
  randomize_order: true
  show_agent_identity: false
 
annotation_schemes:
  - annotation_type: pairwise_comparison
    name: preference_scale
    mode: "scale"
    question: "Which agent produced a better solution, and by how much?"
    scale:
      points: 7                  # 7-point scale
      labels:
        1: "A is much better"
        2: "A is better"
        3: "A is slightly better"
        4: "Equal"
        5: "B is slightly better"
        6: "B is better"
        7: "B is much better"
      default: 4                 # Start at "Equal"
      show_numeric_value: true
    require_justification: true
    justification_label: "Briefly explain your rating"
    justification_min_length: 20
 
output:
  path: "./output/"
  format: "jsonl"
 
quality_control:
  inter_annotator_agreement: true
  overlap_percentage: 20
 
annotators:
  - username: "judge1"
    password: "judge_pw_1"
  - username: "judge2"
    password: "judge_pw_2"

Using a 5-Point Scale

For faster annotation with slightly less granularity, use a 5-point scale:

yaml
annotation_schemes:
  - annotation_type: pairwise_comparison
    name: preference_scale_5
    mode: "scale"
    question: "Compare the two solutions"
    scale:
      points: 5
      labels:
        1: "A is clearly better"
        2: "A is somewhat better"
        3: "About equal"
        4: "B is somewhat better"
        5: "B is clearly better"
      default: 3

Mode 3: Multi-Dimension Comparison

Multi-dimension mode is the most detailed comparison method. Instead of a single overall preference, the annotator evaluates each trace along multiple independent dimensions. Each dimension gets its own A/B/Tie judgment, and each judgment requires a text justification.

When to Use Multi-Dimension Mode

Multi-dimension mode is ideal when you need to understand not just which agent is better overall, but why. A trace might have correct code but terrible efficiency. Another might be efficient but miss edge cases. Multi-dimension comparisons produce per-dimension preference data that can train dimension-specific reward models or provide detailed feedback for targeted agent improvement.

Configuration

yaml
# config.yaml
project_name: "Agent Comparison - Multi-Dimension"
port: 8000
 
data:
  source: "local"
  input_path: "./data/paired_traces.jsonl"
  data_format: "paired_coding_trace"
 
coding_agent:
  display:
    diff_style: "unified"
    syntax_highlighting: true
    terminal_theme: "dark"
    file_tree:
      enabled: true
    collapsible:
      auto_collapse_thinking: true
 
comparison:
  layout: "side_by_side"
  randomize_order: true
  show_agent_identity: false
 
annotation_schemes:
  - annotation_type: pairwise_comparison
    name: multi_dim_comparison
    mode: "multi_dimension"
    question: "Compare the two solutions along each dimension"
    dimensions:
      - name: "correctness"
        label: "Correctness"
        description: >
          Does the solution correctly fix the issue? Are there remaining
          bugs, missed edge cases, or incorrect logic?
        options: ["A", "B", "Tie"]
        require_justification: true
        justification_placeholder: "Why is this solution more correct?"
        weight: 0.4               # Weight for computing overall preference
 
      - name: "efficiency"
        label: "Efficiency"
        description: >
          How efficient is the agent's process? Does it take unnecessary
          steps, read irrelevant files, or make redundant edits?
        options: ["A", "B", "Tie"]
        require_justification: true
        justification_placeholder: "Which agent was more efficient and why?"
        weight: 0.2
 
      - name: "code_quality"
        label: "Code Quality"
        description: >
          Is the code well-written? Consider readability, naming,
          error handling, documentation, and adherence to existing patterns.
        options: ["A", "B", "Tie"]
        require_justification: true
        justification_placeholder: "Which produces better quality code?"
        weight: 0.2
 
      - name: "communication"
        label: "Communication"
        description: >
          How well does the agent explain its reasoning? Are its thinking
          steps clear and logical? Does it identify the root cause?
        options: ["A", "B", "Tie"]
        require_justification: true
        justification_placeholder: "Which agent communicates its approach better?"
        weight: 0.1
 
      - name: "robustness"
        label: "Robustness"
        description: >
          Does the solution handle edge cases? Does the agent verify its
          changes with tests? Is the fix narrow and targeted or fragile?
        options: ["A", "B", "Tie"]
        require_justification: true
        justification_placeholder: "Which solution is more robust?"
        weight: 0.1
 
    overall_preference:
      enabled: true              # Also ask for overall preference
      question: "Overall, which solution do you prefer?"
      options: ["A", "B", "Tie"]
      require_justification: true
 
output:
  path: "./output/"
  format: "jsonl"
 
quality_control:
  inter_annotator_agreement: true
  overlap_percentage: 25         # Higher overlap for this detailed task
  minimum_time_per_instance: 120 # 2 minutes minimum for thorough review
 
annotators:
  - username: "judge1"
    password: "judge_pw_1"
  - username: "judge2"
    password: "judge_pw_2"

Preparing Paired Trace Data

All three comparison modes require paired traces as input. Each instance in the JSONL file contains two traces that attempted the same task.

Data Format

json
{
  "id": "pair_001",
  "task_description": "Fix the IndexError in process_batch() when the input list is empty",
  "repo": "myorg/myproject",
  "trace_a": {
    "agent": "claude_code",
    "model": "claude-sonnet-4-20250514",
    "structured_turns": [
      {
        "step_idx": 0,
        "type": "file_read",
        "path": "src/batch.py",
        "content": "def process_batch(items):\n    result = items[0]\n    ...",
        "start_line": 10,
        "end_line": 25
      },
      {
        "step_idx": 1,
        "type": "file_edit",
        "path": "src/batch.py",
        "diff": "--- a/src/batch.py\n+++ b/src/batch.py\n@@ -10,3 +10,5 @@\n def process_batch(items):\n+    if not items:\n+        return []\n     result = items[0]\n"
      },
      {
        "step_idx": 2,
        "type": "bash_command",
        "command": "python -m pytest tests/test_batch.py -v",
        "output": "PASSED",
        "exit_code": 0
      }
    ]
  },
  "trace_b": {
    "agent": "swe_agent",
    "model": "gpt-4o",
    "structured_turns": [
      {
        "step_idx": 0,
        "type": "bash_command",
        "command": "find . -name '*.py' | xargs grep 'process_batch'",
        "output": "src/batch.py:def process_batch(items):\ntests/test_batch.py:    process_batch([])",
        "exit_code": 0
      },
      {
        "step_idx": 1,
        "type": "file_read",
        "path": "src/batch.py",
        "content": "def process_batch(items):\n    result = items[0]\n    ...",
        "start_line": 1,
        "end_line": 50
      },
      {
        "step_idx": 2,
        "type": "file_edit",
        "path": "src/batch.py",
        "diff": "--- a/src/batch.py\n+++ b/src/batch.py\n@@ -10,3 +10,6 @@\n def process_batch(items):\n+    if items is None or len(items) == 0:\n+        logger.warning('Empty input to process_batch')\n+        return []\n     result = items[0]\n"
      },
      {
        "step_idx": 3,
        "type": "bash_command",
        "command": "python -m pytest tests/ -v",
        "output": "PASSED (12 tests)",
        "exit_code": 0
      }
    ]
  }
}
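Before launching an annotation job, it is worth sanity-checking the paired file. A minimal validation sketch that checks only the fields shown in the example above:

```python
import json

# Sketch: sanity-check one line of a paired-trace JSONL file before
# loading it into Potato. Checks only the fields shown in the example.
REQUIRED_TOP = {"id", "task_description", "trace_a", "trace_b"}

def validate_pair(line):
    pair = json.loads(line)
    missing = REQUIRED_TOP - pair.keys()
    if missing:
        raise ValueError(f"{pair.get('id', '?')}: missing keys {missing}")
    for side in ("trace_a", "trace_b"):
        if not pair[side].get("structured_turns"):
            raise ValueError(f"{pair['id']}: {side} has no structured_turns")
    return pair

line = json.dumps({"id": "pair_001", "task_description": "Fix ...",
                   "trace_a": {"structured_turns": [{"type": "file_read"}]},
                   "trace_b": {"structured_turns": [{"type": "file_edit"}]}})
pair = validate_pair(line)
```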

Generating Pairs from Individual Traces

If you have individual traces that attempted the same tasks, use the pairing utility:

bash
# Generate all possible pairs for each task
potato pair-traces \
  --input ./data/individual_traces.jsonl \
  --output ./data/paired_traces.jsonl \
  --pair_by "task_id" \
  --strategy "all_pairs"
 
# Or sample a fixed number of pairs per task
potato pair-traces \
  --input ./data/individual_traces.jsonl \
  --output ./data/paired_traces.jsonl \
  --pair_by "task_id" \
  --strategy "sample" \
  --pairs_per_task 3
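The "all_pairs" strategy is conceptually simple: group traces by task and take every unordered pair within each group. A sketch of the same logic, for building pairs yourself when the CLI is not available:

```python
import itertools
from collections import defaultdict

# Sketch of the "all_pairs" strategy: group individual traces by task_id
# and emit every unordered pair within each group.
def make_all_pairs(traces):
    by_task = defaultdict(list)
    for trace in traces:
        by_task[trace["task_id"]].append(trace)
    pairs = []
    for task_id, group in by_task.items():
        for a, b in itertools.combinations(group, 2):
            pairs.append({"task_id": task_id, "trace_a": a, "trace_b": b})
    return pairs

traces = [{"task_id": "t1", "agent": x} for x in ("A", "B", "C")]
# 3 traces for one task -> C(3, 2) = 3 pairs
```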

Exporting Comparison Data

DPO/RLHF Preference Pairs

The primary export format for pairwise comparisons is preference pairs for DPO or RLHF training:

bash
potato export \
  --format dpo_preferences \
  --project ./output/ \
  --output ./training_data/preferences.jsonl

For binary mode, the output is straightforward:

json
{
  "prompt": "Fix the IndexError in process_batch() when the input list is empty",
  "chosen": {"agent": "claude_code", "trace_id": "trace_a_001", "steps": [...]},
  "rejected": {"agent": "swe_agent", "trace_id": "trace_b_001", "steps": [...]},
  "annotator": "judge1",
  "confidence": "high"
}

For scale mode, the preference strength is included:

json
{
  "prompt": "Fix the IndexError in process_batch()",
  "chosen": {"agent": "claude_code", "trace_id": "trace_a_001"},
  "rejected": {"agent": "swe_agent", "trace_id": "trace_b_001"},
  "preference_strength": 0.83,
  "scale_value": 2,
  "justification": "Agent A found and fixed the bug in fewer steps with cleaner code"
}
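The preference_strength value above is consistent with linearly mapping the 7-point scale onto [0, 1]: scale_value 2 ("A is better") gives (7 - 2) / 6 ≈ 0.83 for the chosen trace. A sketch of that mapping, as an assumption about the export rather than documented behavior:

```python
# Sketch: one plausible mapping from a k-point scale value to a
# preference strength in [0, 1] for the *chosen* trace. With points=7 and
# scale_value=2 this reproduces the 0.83 in the example above; treat the
# formula as an assumption about the export, not documented behavior.
def preference_strength(scale_value, points=7, chosen="A"):
    toward_a = (points - scale_value) / (points - 1)  # 1.0 at "A much better"
    return round(toward_a if chosen == "A" else 1 - toward_a, 2)

# preference_strength(2, points=7, chosen="A") == 0.83
```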

For multi-dimension mode, per-dimension preferences are included:

json
{
  "prompt": "Fix the IndexError in process_batch()",
  "chosen": {"agent": "claude_code", "trace_id": "trace_a_001"},
  "rejected": {"agent": "swe_agent", "trace_id": "trace_b_001"},
  "overall_preference": "A",
  "dimensions": {
    "correctness": {"preference": "Tie", "justification": "Both correctly fix the bug"},
    "efficiency": {"preference": "A", "justification": "A solves it in 3 steps vs 4"},
    "code_quality": {"preference": "B", "justification": "B adds logging and handles None"},
    "communication": {"preference": "A", "justification": "A's reasoning is more focused"},
    "robustness": {"preference": "B", "justification": "B runs full test suite, not just one file"}
  },
  "weighted_score_a": 0.55,
  "weighted_score_b": 0.45
}
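The dimension weights from the config can be combined into aggregate scores. Below is one simple scheme: each dimension's weight goes to the preferred side, and ties split it evenly. It is illustrative only; the exported weighted_score values above may additionally factor in the overall preference, so do not expect this sketch to reproduce them exactly.

```python
# Sketch: combine per-dimension preferences into weighted scores. Each
# dimension's weight goes to the preferred side; ties split it evenly.
# Illustrative only -- Potato's exported weighted_score values may
# additionally factor in the overall preference.
def weighted_scores(dimensions, weights):
    score_a = score_b = 0.0
    for name, pref in dimensions.items():
        w = weights[name]
        if pref == "A":
            score_a += w
        elif pref == "B":
            score_b += w
        else:  # Tie: split the weight
            score_a += w / 2
            score_b += w / 2
    return score_a, score_b

weights = {"correctness": 0.4, "efficiency": 0.2, "code_quality": 0.2,
           "communication": 0.1, "robustness": 0.1}
prefs = {"correctness": "Tie", "efficiency": "A", "code_quality": "B",
         "communication": "A", "robustness": "B"}
# weighted_scores(prefs, weights) -> approximately (0.5, 0.5)
```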

Analysis: Win Rates, Elo Ratings, and Per-Dimension Breakdowns

Computing Win Rates

python
import json
from collections import defaultdict
 
with open("training_data/preferences.jsonl") as f:
    prefs = [json.loads(line) for line in f]
 
wins = defaultdict(lambda: {"wins": 0, "losses": 0, "ties": 0})
 
for pref in prefs:
    agent_chosen = pref["chosen"]["agent"]
    agent_rejected = pref["rejected"]["agent"]
 
    if agent_chosen == agent_rejected:
        continue  # Skip self-comparisons
 
    if pref.get("overall_preference") == "Tie":
        wins[agent_chosen]["ties"] += 1
        wins[agent_rejected]["ties"] += 1
    else:
        wins[agent_chosen]["wins"] += 1
        wins[agent_rejected]["losses"] += 1
 
print("Agent Win Rates:")
print("-" * 55)
for agent, record in sorted(wins.items()):
    total = record["wins"] + record["losses"] + record["ties"]
    win_rate = (record["wins"] + 0.5 * record["ties"]) / total * 100
    print(f"  {agent:<20} {win_rate:5.1f}%  "
          f"(W:{record['wins']} L:{record['losses']} T:{record['ties']})")

Computing Elo Ratings

python
import json
import math
from collections import defaultdict
 
def compute_elo(preferences, k=32, initial_rating=1500):
    """Compute Elo ratings from pairwise preferences."""
    ratings = defaultdict(lambda: initial_rating)
 
    for pref in preferences:
        agent_a = pref["chosen"]["agent"]
        agent_b = pref["rejected"]["agent"]
 
        ra = ratings[agent_a]
        rb = ratings[agent_b]
 
        # Expected scores
        ea = 1.0 / (1.0 + math.pow(10, (rb - ra) / 400))
        eb = 1.0 / (1.0 + math.pow(10, (ra - rb) / 400))
 
        overall = pref.get("overall_preference", "A")
        if overall == "Tie":
            sa, sb = 0.5, 0.5
        else:
            # "chosen" is the winner
            sa, sb = 1.0, 0.0
 
        ratings[agent_a] = ra + k * (sa - ea)
        ratings[agent_b] = rb + k * (sb - eb)
 
    return dict(ratings)
 
with open("training_data/preferences.jsonl") as f:
    prefs = [json.loads(line) for line in f]
 
ratings = compute_elo(prefs)
 
print("Elo Ratings:")
print("-" * 35)
for agent, rating in sorted(ratings.items(), key=lambda x: -x[1]):
    print(f"  {agent:<20} {rating:.0f}")

Per-Dimension Breakdowns

For multi-dimension comparisons, analyze which dimensions each agent excels at:

python
import json
from collections import defaultdict
 
with open("training_data/preferences.jsonl") as f:
    prefs = [json.loads(line) for line in f]
 
# Only process multi-dimension annotations
multi_dim = [p for p in prefs if "dimensions" in p]
 
dim_wins = defaultdict(lambda: defaultdict(lambda: {"A": 0, "B": 0, "Tie": 0}))
 
for pref in multi_dim:
    agent_a = pref["chosen"]["agent"]
    agent_b = pref["rejected"]["agent"]
    pair_key = f"{agent_a} vs {agent_b}"
 
    for dim_name, dim_data in pref["dimensions"].items():
        dim_wins[dim_name][pair_key][dim_data["preference"]] += 1
 
print("Per-Dimension Win Rates:")
print("=" * 60)
for dim_name, matchups in sorted(dim_wins.items()):
    print(f"\n  {dim_name.upper()}")
    print(f"  {'-' * 50}")
    for pair, counts in matchups.items():
        total = counts["A"] + counts["B"] + counts["Tie"]
        a_rate = (counts["A"] + 0.5 * counts["Tie"]) / total * 100
        print(f"    {pair}: A={a_rate:.0f}% B={100-a_rate:.0f}%  "
              f"(A:{counts['A']} B:{counts['B']} Tie:{counts['Tie']})")

Best Practices

When to Use Each Mode

Binary mode is the right choice when you need to collect thousands of preferences quickly, when you are building a general-purpose reward model, or when computing agent leaderboard rankings. Budget approximately 1-2 minutes per comparison.

Scale mode adds value when preference strength matters for your training pipeline. DPO with margin weighting benefits from knowing that some preferences are strong (slider at the extreme) while others are weak (slider near the center). Budget approximately 2-3 minutes per comparison.

Multi-dimension mode is worth the extra time when you need diagnostic information about agent strengths and weaknesses, when you want to train dimension-specific reward models, or when producing detailed evaluation reports for agent developers. Budget approximately 4-6 minutes per comparison.

How Many Comparisons Do You Need

For reliable win rates, collect at least 100 pairwise comparisons per agent pair. For Elo ratings with 5+ agents, 200-300 total comparisons produce stable rankings. For DPO or reward-model training, aim for 1,000+ preference pairs covering the full distribution of task difficulties.

Randomizing Presentation Order

Always enable randomize_order: true in your comparison config. Position bias (preferring whichever trace appears on the left or in the first tab) is well documented in human evaluation studies. Randomization, combined with the attention_checks.type: "duplicate_reversed" quality check, catches annotators who always select the same side.
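The duplicate_reversed check has a simple consistency rule: on the re-shown pair with sides swapped, "a" and "b" answers should flip while ties should hold. A sketch of that check (the record fields are illustrative; adapt them to your export schema):

```python
# Sketch: verify an annotator's answer on a reversed duplicate. On the
# re-shown pair with A/B swapped, "a" and "b" should flip; "tie" should
# hold. Field names are illustrative, not Potato's export schema.
def reversed_check_consistent(original, duplicate):
    flip = {"a": "b", "b": "a", "tie": "tie"}
    return duplicate["preference"] == flip[original["preference"]]

orig = {"id": "pair_007", "preference": "a"}
dup = {"check_of": "pair_007", "preference": "b"}  # sides were swapped
# reversed_check_consistent(orig, dup) == True
```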

Handling Ties

In binary mode, allow ties but track the tie rate. A tie rate above 30% suggests the agents are too similar for binary comparison and you should switch to scale or multi-dimension mode. In scale mode, ties are natural (the center point). In multi-dimension mode, ties per dimension are expected and informative.
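Tracking the tie rate is a one-liner over the exported records. A sketch, assuming each binary-mode record has a "preference" field with values "a", "b", or "tie":

```python
# Sketch: monitor the tie rate in binary-mode output. Assumes each record
# has a "preference" field with values "a", "b", or "tie".
def tie_rate(records):
    ties = sum(1 for r in records if r["preference"] == "tie")
    return ties / len(records)

records = [{"preference": p} for p in ("a", "tie", "b", "tie", "a")]
# tie_rate(records) == 0.4 -- above 0.3, so consider scale mode
```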

Hiding Agent Identity

Always set show_agent_identity: false unless you specifically need annotators to know which agent produced each trace. Knowledge of agent identity introduces bias: annotators may favor traces from agents they expect to be stronger.

Combining Modes

For the most comprehensive evaluation, run binary mode first on a large set of pairs for overall rankings, then run multi-dimension mode on a smaller, stratified subset for diagnostic analysis. The binary comparisons feed your reward model training pipeline while the multi-dimension comparisons inform targeted agent improvement.