# Image Comparison and Preference Tasks

Source: https://www.potatoannotator.com/blog/image-comparison-preference

A lot of image work comes down to one question: which of these is better? It shows up when you collect preference data for a generative model, when you compare two compression settings, when you run an A/B test on a design, or when you check whether a search returned the right images. People are far better at "A or B?" than at scoring a single image in isolation, which is what makes comparison tasks worth setting up properly. This tutorial covers pairwise comparison, ranking, and A/B testing.

## Basic pairwise comparison

The pairwise comparison interface presents items side by side with preference selection controls:

![Pairwise comparison interface in Potato](/images/blog/pairwise-comparison.png)

```yaml
annotation_task_name: "Image Preference"

data_files:
  - data/pairs.json

item_properties:
  id_key: pair_id
  image_a_key: image_left
  image_b_key: image_right

image:
  enabled: true
  layout: side_by_side
  display_size: medium
  enable_zoom: true
  sync_zoom: true  # Zoom both images together

annotation_schemes:
  - annotation_type: radio
    name: preference
    description: "Which image do you prefer?"
    labels:
      - Left is much better
      - Left is slightly better
      - About the same
      - Right is slightly better
      - Right is much better
    layout: horizontal
```

## Data Format

Each item in `data/pairs.json` supplies the keys that `item_properties` points to:

```json
{
  "pair_id": "pair_001",
  "image_left": "/images/model_a_output.png",
  "image_right": "/images/model_b_output.png",
  "prompt": "A sunset over mountains"
}
```
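If the image files follow a predictable naming scheme, the pair records can be generated rather than written by hand. The sketch below is only an illustration: the `build_pairs` helper, the directory names, and the zero-padded file names are assumptions, not part of Potato. Whether the data file should hold a JSON array or one object per line is not stated here, so check the data format documentation before writing the result to disk.

```python
import json


def build_pairs(prompts, left_dir="/images/model_a", right_dir="/images/model_b"):
    """Return one pair record per prompt, using the keys named in item_properties.

    The directory layout and the 001.png, 002.png naming are hypothetical.
    """
    pairs = []
    for index, prompt in enumerate(prompts, start=1):
        stem = f"{index:03d}"
        pairs.append({
            "pair_id": f"pair_{stem}",
            "image_left": f"{left_dir}/{stem}.png",
            "image_right": f"{right_dir}/{stem}.png",
            "prompt": prompt,
        })
    return pairs


pairs = build_pairs(["A sunset over mountains", "A cat sitting on a windowsill"])
print(json.dumps(pairs[0], indent=2))
```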

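## Enhanced Comparison Interface

To evaluate generated images, show the generation prompt with the pair and ask about specific criteria as well as overall preference: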

```yaml
annotation_task_name: "AI Image Generation Evaluation"

data_files:
  - data/generation_pairs.json

item_properties:
  id_key: id
  image_a_key: image_a
  image_b_key: image_b
  context_key: prompt

# Show the generation prompt
display:
  show_context: true
  context_label: "Generation Prompt"
  context_field: prompt

image:
  enabled: true
  layout: side_by_side
  gap: 20  # Pixels between images
  labels:
    left: "Image A"
    right: "Image B"

  # Interaction
  enable_zoom: true
  sync_zoom: true
  enable_pan: true
  sync_pan: true

  # Display
  max_height: 500
  background: "#1F2937"
  border_radius: 8

annotation_schemes:
  # Overall preference
  - annotation_type: radio
    name: overall_preference
    description: "Overall, which image is better?"
    labels:
      - name: A much better
        keyboard_shortcut: "1"
      - name: A slightly better
        keyboard_shortcut: "2"
      - name: Tie
        keyboard_shortcut: "3"
      - name: B slightly better
        keyboard_shortcut: "4"
      - name: B much better
        keyboard_shortcut: "5"
    required: true

  # Specific criteria
  - annotation_type: radio
    name: prompt_adherence
    description: "Which better matches the prompt?"
    labels: [A, Tie, B]

  - annotation_type: radio
    name: visual_quality
    description: "Which has better visual quality (no artifacts)?"
    labels: [A, Tie, B]

  - annotation_type: radio
    name: aesthetic_appeal
    description: "Which is more aesthetically pleasing?"
    labels: [A, Tie, B]

  - annotation_type: radio
    name: realism
    description: "Which looks more realistic?"
    labels: [A, Tie, B, N/A (neither should be realistic)]

  # Issues detection
  - annotation_type: multiselect
    name: issues_a
    description: "Issues in Image A (select all)"
    labels:
      - Distorted faces/hands
      - Text rendering issues
      - Unnatural lighting
      - Missing elements from prompt
      - Extra unwanted elements
      - Blurry or low quality
      - Color issues
      - None

  - annotation_type: multiselect
    name: issues_b
    description: "Issues in Image B (select all)"
    labels:
      - Distorted faces/hands
      - Text rendering issues
      - Unnatural lighting
      - Missing elements from prompt
      - Extra unwanted elements
      - Blurry or low quality
      - Color issues
      - None
```

## Before/After Comparison

For image enhancement, restoration, or editing:

```yaml
annotation_task_name: "Image Enhancement Evaluation"

data_files:
  - data/enhancements.json

item_properties:
  id_key: id
  image_a_key: original
  image_b_key: enhanced

image:
  layout: side_by_side
  labels:
    left: "Original"
    right: "Enhanced"

  # Slider comparison
  comparison_mode: slider  # Drag slider to reveal
  slider_position: 50  # Start at middle

annotation_schemes:
  - annotation_type: radio
    name: enhancement_quality
    description: "How well was the image enhanced?"
    labels:
      - Significantly improved
      - Slightly improved
      - No noticeable change
      - Made worse

  - annotation_type: multiselect
    name: improvements
    description: "What was improved?"
    labels:
      - Sharpness/detail
      - Color accuracy
      - Noise reduction
      - Dynamic range
      - Artifact removal
      - Nothing

  - annotation_type: multiselect
    name: problems_introduced
    description: "Any problems introduced?"
    labels:
      - Over-sharpening/halos
      - Color shift
      - Loss of detail
      - New artifacts
      - Unnatural look
      - None
```

## Ranking Multiple Images

To rank more than two images, supply a list of images per item and use the `ranking` annotation type:

```yaml
annotation_task_name: "Image Ranking"

data_files:
  - data/image_sets.json

item_properties:
  id_key: id
  image_list_key: images  # Array of image paths

image:
  layout: grid
  columns: 3
  enable_zoom: true

annotation_schemes:
  - annotation_type: ranking
    name: preference_rank
    description: "Rank images from best (1) to worst"
    source: images
    allow_ties: false

  - annotation_type: radio
    name: best_for_use
    description: "Which would you use for this purpose?"
    dynamic_labels_from: images
```

Data format:

```json
{
  "id": "set_001",
  "prompt": "A cat sitting on a windowsill",
  "images": [
    "/images/set001_a.png",
    "/images/set001_b.png",
    "/images/set001_c.png",
    "/images/set001_d.png"
  ]
}
```
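Once rankings are collected, a common first step is to average each image's position across annotators, or to expand a ranking into pairwise wins so it can be pooled with pairwise data. The sketch below assumes each ranking has been exported as a list of image paths ordered from best to worst; that shape is an assumption, so adapt it to whatever the ranking scheme actually writes.

```python
from collections import defaultdict
from itertools import combinations


def mean_ranks(rankings):
    """Average 1-based position of each image across annotators (lower is better)."""
    positions = defaultdict(list)
    for ranking in rankings:
        for position, image in enumerate(ranking, start=1):
            positions[image].append(position)
    return {image: sum(p) / len(p) for image, p in positions.items()}


def ranking_to_pairs(ranking):
    """Expand one best-to-worst ranking into (winner, loser) tuples."""
    return list(combinations(ranking, 2))


# Three hypothetical annotators ranking the same four images.
rankings = [
    ["a.png", "b.png", "c.png", "d.png"],
    ["b.png", "a.png", "c.png", "d.png"],
    ["a.png", "c.png", "b.png", "d.png"],
]
ranks = mean_ranks(rankings)
implied_pairs = ranking_to_pairs(rankings[0])
print(sorted(ranks, key=ranks.get))
```

A ranking of four images implies six pairwise outcomes, which is why ranking can be cheaper per comparison than showing every pair separately.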

## Best-Worst Scaling

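Best-worst scaling builds a ranking from repeated choices. For each set, the annotator picks only the best and the worst image, which is quicker than ordering the whole set: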

```yaml
annotation_schemes:
  - annotation_type: best_worst
    name: preference
    description: "Select the BEST and WORST images"
    source: images
    best_label: "Best"
    worst_label: "Worst"
    neither_allowed: false
```
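Best-worst judgments are often scored with a simple count: the number of times an image was chosen as best, minus the number of times it was chosen as worst, divided by the number of times it was shown. The sketch below implements that counting score. The record shape (`images`, `best`, `worst`) is hypothetical and may not match the exported annotations.

```python
from collections import Counter


def best_worst_scores(judgments):
    """Return (best - worst) / appearances per image, in the range -1 to 1."""
    best, worst, shown = Counter(), Counter(), Counter()
    for judgment in judgments:
        shown.update(judgment["images"])
        best[judgment["best"]] += 1
        worst[judgment["worst"]] += 1
    return {image: (best[image] - worst[image]) / shown[image] for image in shown}


# Three hypothetical judgments over the same set of four images.
image_set = ["a.png", "b.png", "c.png", "d.png"]
judgments = [
    {"images": image_set, "best": "a.png", "worst": "d.png"},
    {"images": image_set, "best": "b.png", "worst": "d.png"},
    {"images": image_set, "best": "a.png", "worst": "c.png"},
]
scores = best_worst_scores(judgments)
print(sorted(scores, key=scores.get, reverse=True))
```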

## A/B Testing for Design

```yaml
annotation_task_name: "Design A/B Test"

data_files:
  - data/design_variants.json

item_properties:
  id_key: id
  image_a_key: variant_a
  image_b_key: variant_b
  context_key: design_context

display:
  show_context: true
  context_label: "Design Context"

image:
  layout: side_by_side
  labels:
    left: "Design A"
    right: "Design B"
  randomize_order: true  # Prevent position bias

annotation_schemes:
  - annotation_type: radio
    name: preference
    description: "Which design do you prefer?"
    labels: [A, No preference, B]
    randomize_with_images: true  # Labels follow image randomization

  - annotation_type: likert
    name: a_appeal
    description: "Rate Design A's visual appeal"
    size: 7
    min_label: "Very unappealing"
    max_label: "Very appealing"

  - annotation_type: likert
    name: b_appeal
    description: "Rate Design B's visual appeal"
    size: 7
    min_label: "Very unappealing"
    max_label: "Very appealing"

  - annotation_type: text
    name: reasoning
    description: "Why did you choose this preference?"
    multiline: true
    required: false
```
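With the A/B votes in hand, one simple way to judge whether a preference is more than noise is an exact two-sided sign test on the A and B counts, with "No preference" answers set aside. The sketch below uses only the standard library. It assumes one independent vote per annotator and is not tied to any Potato export format.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided binomial test of wins_a vs wins_b against a 50/50 split."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(sign_test_p_value(8, 2))
print(sign_test_p_value(5, 5))
```

Eight votes to two looks decisive, yet the test gives a p-value of about 0.11, a reminder that small A/B studies need more annotators before the result can be trusted.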

## Complete Configuration

```yaml
annotation_task_name: "Generative Model Comparison - RLHF Data"

data_files:
  - data/model_outputs.json

item_properties:
  id_key: id
  image_a_key: model_a_output
  image_b_key: model_b_output
  context_key: prompt

display:
  show_context: true
  context_label: "Generation Prompt"
  context_style: "highlighted"

image:
  enabled: true
  layout: side_by_side
  gap: 24
  labels:
    left: "Output A"
    right: "Output B"

  max_height: 512
  enable_zoom: true
  sync_zoom: true
  enable_pan: true
  sync_pan: true

  background: "#111827"
  border: "1px solid #374151"
  border_radius: 8

  # Prevent position bias
  randomize_order: true

annotation_schemes:
  - annotation_type: radio
    name: overall
    description: "Which image better represents the prompt?"
    labels:
      - name: A is clearly better
        value: 2
        keyboard_shortcut: "1"
      - name: A is slightly better
        value: 1
        keyboard_shortcut: "2"
      - name: About equal
        value: 0
        keyboard_shortcut: "3"
      - name: B is slightly better
        value: -1
        keyboard_shortcut: "4"
      - name: B is clearly better
        value: -2
        keyboard_shortcut: "5"
    required: true
    preserve_with_randomization: true  # Values adjust for randomized order

  - annotation_type: likert
    name: confidence
    description: "How confident are you?"
    size: 5
    min_label: "Guessing"
    max_label: "Certain"

annotation_guidelines:
  title: "Image Comparison Guidelines"
  content: |
    ## Evaluation Criteria
    Consider these factors:
    1. **Prompt adherence**: Does it match what was asked?
    2. **Visual quality**: Are there artifacts or distortions?
    3. **Aesthetics**: Is it visually pleasing?
    4. **Realism** (if applicable): Does it look natural?

    ## Tips
    - Zoom in to check for details and artifacts
    - Consider the prompt carefully
    - Don't let one factor dominate unfairly

quality_control:
  attention_checks:
    frequency: 15
    gold_pairs:
      - image_a: "/gold/clearly_better.png"
        image_b: "/gold/clearly_worse.png"
        expected_preference: ["A is clearly better", "A is slightly better"]

output_annotation_dir: annotations/
export_annotation_format: jsonl
```
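The gold pairs configured above only help if someone looks at how each annotator did on them. The sketch below computes a per-annotator pass rate from `(annotator, chosen_label)` tuples. That input shape is an assumption made for illustration, and the expected labels are copied from `expected_preference` in the configuration.

```python
from collections import defaultdict

# Labels accepted for the gold pair, copied from expected_preference above.
EXPECTED = {"A is clearly better", "A is slightly better"}


def gold_pass_rates(gold_responses, expected=EXPECTED):
    """Fraction of gold pairs each annotator answered with an expected label."""
    passed, seen = defaultdict(int), defaultdict(int)
    for annotator, label in gold_responses:
        seen[annotator] += 1
        passed[annotator] += label in expected
    return {annotator: passed[annotator] / seen[annotator] for annotator in seen}


# Hypothetical responses to gold pairs.
gold_responses = [
    ("rater_01", "A is clearly better"),
    ("rater_01", "A is slightly better"),
    ("rater_02", "B is clearly better"),
    ("rater_02", "A is clearly better"),
]
rates = gold_pass_rates(gold_responses)
print(rates)
```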

## Output Format

```json
{
  "pair_id": "pair_001",
  "prompt": "A sunset over mountains",
  "image_a": "/images/model_a_output.png",
  "image_b": "/images/model_b_output.png",
  "display_order": ["B", "A"],
  "annotations": {
    "overall": 1,
    "confidence": 4
  },
  "annotator": "rater_01",
  "timestamp": "2024-12-25T14:30:00Z"
}
```

In this record, `display_order` shows that B appeared on the left. The `overall` value of 1 means "A is slightly better"; it has already been adjusted for display order, so positive values favor A regardless of which side A was shown on.
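Records like this can be summarized with a few lines of Python. The sketch below assumes the export holds one JSON object per line and that `annotations.overall` uses the -2 to 2 values from the configuration, with positive values favoring A. The `summarize_overall` helper is hypothetical.

```python
import json


def summarize_overall(jsonl_lines):
    """Mean preference score and A's win rate among non-tied judgments."""
    scores = [
        json.loads(line)["annotations"]["overall"]
        for line in jsonl_lines
        if line.strip()
    ]
    decided = [score for score in scores if score != 0]
    return {
        "n": len(scores),
        "mean_score": sum(scores) / len(scores) if scores else None,
        "a_win_rate": sum(s > 0 for s in decided) / len(decided) if decided else None,
    }


# Stand-in for lines read from the exported file.
sample_lines = [
    json.dumps({"annotations": {"overall": score, "confidence": 4}})
    for score in (1, -2, 0, 2)
]
summary = summarize_overall(sample_lines)
print(summary)
```

Ties are left out of the win rate but still count toward the mean score, so the two numbers answer slightly different questions.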

## Tips for comparison tasks

People lean toward the left image more often than they should, so randomize which side each option lands on. Link zoom and pan across both images, otherwise annotators end up comparing a detail on one against the whole on the other. Say plainly what "better" means for this task, since it is not obvious. Slip in a few pairs with an obvious winner as attention checks. And watch the time per comparison: if it swings wildly, the ratings probably will too.
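Position bias can also be checked after the fact. The sketch below estimates how often annotators picked whichever image was on the left, assuming records carry a `display_order` list (left first) and an `overall` score where positive values favor A, as in the output example above. A rate well above 0.5 on randomized data suggests a left-side lean.

```python
def left_choice_rate(records):
    """Share of non-tied judgments where the preferred image was shown on the left."""
    left = decided = 0
    for record in records:
        score = record["annotations"]["overall"]
        if score == 0:
            continue  # Ties say nothing about side preference.
        decided += 1
        preferred = "A" if score > 0 else "B"
        if record["display_order"][0] == preferred:
            left += 1
    return left / decided if decided else None


# Hypothetical records in the shape of the output example.
records = [
    {"display_order": ["B", "A"], "annotations": {"overall": 1}},
    {"display_order": ["A", "B"], "annotations": {"overall": 2}},
    {"display_order": ["B", "A"], "annotations": {"overall": -1}},
    {"display_order": ["A", "B"], "annotations": {"overall": 0}},
]
rate = left_choice_rate(records)
print(rate)
```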

For implementation details, see the [pairwise annotation documentation](https://github.com/davidjurgens/potato/blob/master/docs/annotation-types/comparison/pairwise_annotation.md).

## Next steps

- Set up [crowdsourcing](/blog/prolific-integration) for large-scale preference data
- Learn about [ranking analysis](/blog/inter-annotator-agreement)
- Explore [pairwise comparison](/docs/annotation-types/pairwise-comparison) documentation

