Image Comparison and Preference Tasks

Build side-by-side image comparison interfaces for preference ranking, A/B testing, and quality assessment.

By Potato Team

Image comparison is essential for training generative models, evaluating image quality, and understanding human preferences. This tutorial covers pairwise comparison, ranking, and A/B testing setups.

Use Cases

  • Generative AI: RLHF for image generation models
  • Image quality: Comparing compression, enhancement, or restoration
  • Design testing: A/B testing visual designs
  • Search ranking: Evaluating image retrieval results

Basic Pairwise Comparison

annotation_task_name: "Image Preference"
 
data_files:
  - data/pairs.json
 
item_properties:
  id_key: pair_id
  image_a_key: image_left
  image_b_key: image_right
 
image:
  enabled: true
  layout: side_by_side
  display_size: medium
  enable_zoom: true
  sync_zoom: true  # Zoom both images together
 
annotation_schemes:
  - annotation_type: radio
    name: preference
    description: "Which image do you prefer?"
    labels:
      - Left is much better
      - Left is slightly better
      - About the same
      - Right is slightly better
      - Right is much better
    layout: horizontal

Data Format

{
  "pair_id": "pair_001",
  "image_left": "/images/model_a_output.png",
  "image_right": "/images/model_b_output.png",
  "prompt": "A sunset over mountains"
}
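
If the image pairs come from two models run on the same prompts, a short script can generate the records. The following is a minimal sketch, not part of Potato itself: the helper names `build_pairs` and `write_pairs`, the directory layout, and the numbered file names are all hypothetical. It also assumes the data file holds a JSON array of records; if your setup expects one object per line, write each record on its own line instead.

```python
import json
from pathlib import Path


def build_pairs(prompts, left_dir="/images/model_a", right_dir="/images/model_b"):
    """Return one record per prompt, using the keys from the example above.

    The directory names and the 001.png naming scheme are placeholders.
    """
    pairs = []
    for index, prompt in enumerate(prompts, start=1):
        pairs.append({
            "pair_id": f"pair_{index:03d}",
            "image_left": f"{left_dir}/{index:03d}.png",
            "image_right": f"{right_dir}/{index:03d}.png",
            "prompt": prompt,
        })
    return pairs


def write_pairs(pairs, path):
    """Write the records as a JSON array (an assumption about the file layout)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(pairs, indent=2), encoding="utf-8")


# Example usage:
# write_pairs(build_pairs(["A sunset over mountains"]), "data/pairs.json")
```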

Enhanced Comparison Interface

annotation_task_name: "AI Image Generation Evaluation"
 
data_files:
  - data/generation_pairs.json
 
item_properties:
  id_key: id
  image_a_key: image_a
  image_b_key: image_b
  context_key: prompt
 
# Show the generation prompt
display:
  show_context: true
  context_label: "Generation Prompt"
  context_field: prompt
 
image:
  enabled: true
  layout: side_by_side
  gap: 20  # Pixels between images
  labels:
    left: "Image A"
    right: "Image B"
 
  # Interaction
  enable_zoom: true
  sync_zoom: true
  enable_pan: true
  sync_pan: true
 
  # Display
  max_height: 500
  background: "#1F2937"
  border_radius: 8
 
annotation_schemes:
  # Overall preference
  - annotation_type: radio
    name: overall_preference
    description: "Overall, which image is better?"
    labels:
      - name: A much better
        keyboard_shortcut: "1"
      - name: A slightly better
        keyboard_shortcut: "2"
      - name: Tie
        keyboard_shortcut: "3"
      - name: B slightly better
        keyboard_shortcut: "4"
      - name: B much better
        keyboard_shortcut: "5"
    required: true
 
  # Specific criteria
  - annotation_type: radio
    name: prompt_adherence
    description: "Which better matches the prompt?"
    labels: [A, Tie, B]
 
  - annotation_type: radio
    name: visual_quality
    description: "Which has better visual quality (no artifacts)?"
    labels: [A, Tie, B]
 
  - annotation_type: radio
    name: aesthetic_appeal
    description: "Which is more aesthetically pleasing?"
    labels: [A, Tie, B]
 
  - annotation_type: radio
    name: realism
    description: "Which looks more realistic?"
    labels: [A, Tie, B, N/A (neither should be realistic)]
 
  # Issues detection
  - annotation_type: multiselect
    name: issues_a
    description: "Issues in Image A (select all)"
    labels:
      - Distorted faces/hands
      - Text rendering issues
      - Unnatural lighting
      - Missing elements from prompt
      - Extra unwanted elements
      - Blurry or low quality
      - Color issues
      - None
 
  - annotation_type: multiselect
    name: issues_b
    description: "Issues in Image B (select all)"
    labels:
      - Distorted faces/hands
      - Text rendering issues
      - Unnatural lighting
      - Missing elements from prompt
      - Extra unwanted elements
      - Blurry or low quality
      - Color issues
      - None

Before/After Comparison

For image enhancement, restoration, or editing:

annotation_task_name: "Image Enhancement Evaluation"
 
data_files:
  - data/enhancements.json
 
item_properties:
  id_key: id
  image_a_key: original
  image_b_key: enhanced
 
image:
  layout: side_by_side
  labels:
    left: "Original"
    right: "Enhanced"
 
  # Slider comparison
  comparison_mode: slider  # Drag slider to reveal
  slider_position: 50  # Start at middle
 
annotation_schemes:
  - annotation_type: radio
    name: enhancement_quality
    description: "How well was the image enhanced?"
    labels:
      - Significantly improved
      - Slightly improved
      - No noticeable change
      - Made worse
 
  - annotation_type: multiselect
    name: improvements
    description: "What was improved?"
    labels:
      - Sharpness/detail
      - Color accuracy
      - Noise reduction
      - Dynamic range
      - Artifact removal
      - Nothing
 
  - annotation_type: multiselect
    name: problems_introduced
    description: "Any problems introduced?"
    labels:
      - Over-sharpening/halos
      - Color shift
      - Loss of detail
      - New artifacts
      - Unnatural look
      - None

Ranking Multiple Images

To rank more than two images at once:

annotation_task_name: "Image Ranking"
 
data_files:
  - data/image_sets.json
 
item_properties:
  id_key: id
  image_list_key: images  # Array of image paths
 
image:
  layout: grid
  columns: 3
  enable_zoom: true
 
annotation_schemes:
  - annotation_type: ranking
    name: preference_rank
    description: "Rank images from best (1) to worst"
    source: images
    allow_ties: false
 
  - annotation_type: radio
    name: best_for_use
    description: "Which would you use for this purpose?"
    dynamic_labels_from: images

Data format:

{
  "id": "set_001",
  "prompt": "A cat sitting on a windowsill",
  "images": [
    "/images/set001_a.png",
    "/images/set001_b.png",
    "/images/set001_c.png",
    "/images/set001_d.png"
  ]
}
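
Preference-based training usually wants pairwise data, so a full ranking is often expanded into winner/loser pairs afterwards. Here is a small sketch of that expansion. It assumes the exported ranking is a list of image paths ordered from best to worst, which may not match the actual export format; `ranking_to_pairs` is a hypothetical helper name.

```python
from itertools import combinations


def ranking_to_pairs(ranked_images):
    """Expand a best-to-worst ranking into (winner, loser) pairs.

    combinations() keeps the input order, so the first element of each
    pair always ranks above the second.
    """
    return list(combinations(ranked_images, 2))


ranking = [
    "/images/set001_c.png",  # ranked 1 (best)
    "/images/set001_a.png",
    "/images/set001_d.png",
    "/images/set001_b.png",  # ranked 4 (worst)
]
for winner, loser in ranking_to_pairs(ranking):
    print(f"{winner} > {loser}")
```

A ranking of n images yields n(n-1)/2 pairs, so four images give six comparisons from a single annotation.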

Best-Worst Scaling

Best-worst scaling derives a ranking efficiently: for each set, annotators pick only the best and the worst image.

annotation_schemes:
  - annotation_type: best_worst
    name: preference
    description: "Select the BEST and WORST images"
    source: images
    best_label: "Best"
    worst_label: "Worst"
    neither_allowed: false
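
A common way to turn best-worst choices into scores is simple counting: for each image, take the number of times it was chosen best, subtract the number of times it was chosen worst, and divide by how often it was shown. The sketch below illustrates that calculation. The record layout (`shown`, `best`, `worst`) is an assumption made for the example, not the tool's export format.

```python
from collections import Counter


def best_worst_scores(judgments):
    """Score each item as (times best - times worst) / times shown.

    Scores range from -1.0 (always worst) to 1.0 (always best).
    """
    best, worst, shown = Counter(), Counter(), Counter()
    for judgment in judgments:
        shown.update(judgment["shown"])
        best[judgment["best"]] += 1
        worst[judgment["worst"]] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Hypothetical judgments over images named a to d.
judgments = [
    {"shown": ["a", "b", "c"], "best": "a", "worst": "c"},
    {"shown": ["a", "b", "d"], "best": "b", "worst": "a"},
]
scores = best_worst_scores(judgments)
for item, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(item, score)
```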

A/B Testing for Design

annotation_task_name: "Design A/B Test"
 
data_files:
  - data/design_variants.json
 
item_properties:
  id_key: id
  image_a_key: variant_a
  image_b_key: variant_b
  context_key: design_context
 
display:
  show_context: true
  context_label: "Design Context"
 
image:
  layout: side_by_side
  labels:
    left: "Design A"
    right: "Design B"
  randomize_order: true  # Prevent position bias
 
annotation_schemes:
  - annotation_type: radio
    name: preference
    description: "Which design do you prefer?"
    labels: [A, No preference, B]
    randomize_with_images: true  # Labels follow image randomization
 
  - annotation_type: likert
    name: a_appeal
    description: "Rate Design A's visual appeal"
    size: 7
    min_label: "Very unappealing"
    max_label: "Very appealing"
 
  - annotation_type: likert
    name: b_appeal
    description: "Rate Design B's visual appeal"
    size: 7
    min_label: "Very unappealing"
    max_label: "Very appealing"
 
  - annotation_type: text
    name: reasoning
    description: "Why did you choose this preference?"
    textarea: true
    required: false

Complete Configuration

annotation_task_name: "Generative Model Comparison - RLHF Data"
 
data_files:
  - data/model_outputs.json
 
item_properties:
  id_key: id
  image_a_key: model_a_output
  image_b_key: model_b_output
  context_key: prompt
 
display:
  show_context: true
  context_label: "Generation Prompt"
  context_style: "highlighted"
 
image:
  enabled: true
  layout: side_by_side
  gap: 24
  labels:
    left: "Output A"
    right: "Output B"
 
  max_height: 512
  enable_zoom: true
  sync_zoom: true
  enable_pan: true
  sync_pan: true
 
  background: "#111827"
  border: "1px solid #374151"
  border_radius: 8
 
  # Prevent position bias
  randomize_order: true
 
annotation_schemes:
  - annotation_type: radio
    name: overall
    description: "Which image better represents the prompt?"
    labels:
      - name: A is clearly better
        value: 2
        keyboard_shortcut: "1"
      - name: A is slightly better
        value: 1
        keyboard_shortcut: "2"
      - name: About equal
        value: 0
        keyboard_shortcut: "3"
      - name: B is slightly better
        value: -1
        keyboard_shortcut: "4"
      - name: B is clearly better
        value: -2
        keyboard_shortcut: "5"
    required: true
    preserve_with_randomization: true  # Values adjust for randomized order
 
  - annotation_type: likert
    name: confidence
    description: "How confident are you?"
    size: 5
    min_label: "Guessing"
    max_label: "Certain"
 
annotation_guidelines:
  title: "Image Comparison Guidelines"
  content: |
    ## Evaluation Criteria
    Consider these factors:
    1. **Prompt adherence**: Does it match what was asked?
    2. **Visual quality**: Are there artifacts or distortions?
    3. **Aesthetics**: Is it visually pleasing?
    4. **Realism** (if applicable): Does it look natural?
 
    ## Tips
    - Zoom in to check for details and artifacts
    - Consider the prompt carefully
    - Don't let one factor dominate unfairly
 
quality_control:
  attention_checks:
    frequency: 15
    gold_pairs:
      - image_a: "/gold/clearly_better.png"
        image_b: "/gold/clearly_worse.png"
        expected_preference: ["A is clearly better", "A is slightly better"]
 
output_annotation_dir: annotations/
output_annotation_format: jsonl

Output Format

{
  "pair_id": "pair_001",
  "prompt": "A sunset over mountains",
  "image_a": "/images/model_a_output.png",
  "image_b": "/images/model_b_output.png",
  "display_order": ["B", "A"],
  "annotations": {
    "overall": 1,
    "confidence": 4
  },
  "annotator": "rater_01",
  "timestamp": "2024-12-25T14:30:00Z"
}

In this record, display_order shows that image B appeared on the left. The overall value of 1 means "A is slightly better" and is already adjusted for the display order.
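
Once annotations are collected, a few lines of Python can summarize them. This sketch assumes one record per line in a JSONL file, shaped like the example above, with `annotations.overall` taking the values from the complete configuration (positive favors A, negative favors B). The function names and the file path in the usage comment are hypothetical.

```python
import json


def load_jsonl(path):
    """Read one JSON record per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def summarize(records):
    """Return win rates and the mean preference score for A versus B."""
    scores = [record["annotations"]["overall"] for record in records]
    if not scores:
        raise ValueError("no records to summarize")
    total = len(scores)
    return {
        "n": total,
        "a_wins": sum(score > 0 for score in scores) / total,
        "ties": sum(score == 0 for score in scores) / total,
        "b_wins": sum(score < 0 for score in scores) / total,
        "mean_score": sum(scores) / total,
    }


# Example usage (the path is a placeholder):
# print(summarize(load_jsonl("annotations/rater_01.jsonl")))
```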

Tips for Comparison Tasks

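  1. Randomize order: Shuffle which image appears on the left to prevent position bias
  2. Sync controls: Linked zoom and pan let annotators inspect the same region of both images
  3. Clear criteria: Define what "better" means in the guidelines
  4. Attention checks: Include gold pairs with an obvious answer to catch inattentive annotators
  5. Time limits: Consider a time budget per comparison so effort stays consistent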
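
To check whether randomization is doing its job, you can measure how often the left-hand image wins. The sketch below assumes records shaped like the Output Format example: `display_order` lists the left image first, and `annotations.overall` is positive when A is preferred. With randomized order and no position bias, the rate should sit near 0.5. The function name is hypothetical.

```python
def left_preference_rate(records):
    """Fraction of non-tie judgments that favored the left-hand image.

    Returns None when every judgment is a tie.
    """
    left_wins = 0
    decided = 0
    for record in records:
        score = record["annotations"]["overall"]
        if score == 0:
            continue  # Ties carry no position information.
        decided += 1
        left_is_a = record["display_order"][0] == "A"
        a_preferred = score > 0
        if a_preferred == left_is_a:
            left_wins += 1
    return left_wins / decided if decided else None


# Hypothetical records for illustration.
sample = [
    {"display_order": ["B", "A"], "annotations": {"overall": 1}},   # right wins
    {"display_order": ["A", "B"], "annotations": {"overall": 2}},   # left wins
    {"display_order": ["B", "A"], "annotations": {"overall": -1}},  # left wins
    {"display_order": ["A", "B"], "annotations": {"overall": 0}},   # tie, skipped
]
print(left_preference_rate(sample))
```

A rate far from 0.5 on a large sample suggests annotators favor one side regardless of content.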

Next Steps


Full comparison documentation is available at /docs/annotation-types/pairwise-comparison.