Image Comparison and Preference Tasks

Image comparison tasks produce the preference data used to train generative models, evaluate image quality, and study human preferences. This tutorial covers pairwise comparison, ranking, and A/B testing setups.

Use Cases

Generative AI: RLHF for image generation models
Image quality: Comparing compression, enhancement, or restoration
Design testing: A/B testing visual designs
Search ranking: Evaluating image retrieval results

Basic Pairwise Comparison

yaml

annotation_task_name: "Image Preference"
 
data_files:
  - data/pairs.json
 
item_properties:
  id_key: pair_id
  image_a_key: image_left
  image_b_key: image_right
 
image:
  enabled: true
  layout: side_by_side
  display_size: medium
  enable_zoom: true
  sync_zoom: true  # Zoom both images together
 
annotation_schemes:
  - annotation_type: radio
    name: preference
    description: "Which image do you prefer?"
    labels:
      - Left is much better
      - Left is slightly better
      - About the same
      - Right is slightly better
      - Right is much better
    layout: horizontal

Data Format

json

{
  "pair_id": "pair_001",
  "image_left": "/images/model_a_output.png",
  "image_right": "/images/model_b_output.png",
  "prompt": "A sunset over mountains"
}
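
Records like the one above could be generated from two sets of model outputs. The following is a minimal sketch rather than part of the tool: the function name build_pairs, the assumption that both models write files with matching names, and the prompt lookup by file stem are all hypothetical.

```python
from pathlib import PurePosixPath


def build_pairs(files_a, files_b, prompts):
    """Pair up outputs that share a file name and attach their prompt.

    files_a, files_b: lists of image paths from the two models (hypothetical layout).
    prompts: dict mapping a file stem (e.g. "0001") to its generation prompt.
    """
    by_name_b = {PurePosixPath(p).name: p for p in files_b}
    pairs = []
    for path_a in sorted(files_a):
        name = PurePosixPath(path_a).name
        if name not in by_name_b:
            continue  # skip outputs that have no counterpart from the other model
        pairs.append({
            "pair_id": f"pair_{len(pairs) + 1:03d}",
            "image_left": path_a,
            "image_right": by_name_b[name],
            "prompt": prompts.get(PurePosixPath(path_a).stem, ""),
        })
    return pairs
```

Whether the data file should hold a JSON array or one object per line depends on the tool's loader, so the sketch stops at building the records and leaves serialization to the caller.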

Enhanced Comparison Interface

yaml

annotation_task_name: "AI Image Generation Evaluation"
 
data_files:
  - data/generation_pairs.json
 
item_properties:
  id_key: id
  image_a_key: image_a
  image_b_key: image_b
  context_key: prompt
 
# Show the generation prompt
display:
  show_context: true
  context_label: "Generation Prompt"
  context_field: prompt
 
image:
  enabled: true
  layout: side_by_side
  gap: 20  # Pixels between images
  labels:
    left: "Image A"
    right: "Image B"
 
  # Interaction
  enable_zoom: true
  sync_zoom: true
  enable_pan: true
  sync_pan: true
 
  # Display
  max_height: 500
  background: "#1F2937"
  border_radius: 8
 
annotation_schemes:
  # Overall preference
  - annotation_type: radio
    name: overall_preference
    description: "Overall, which image is better?"
    labels:
      - name: A much better
        keyboard_shortcut: "1"
      - name: A slightly better
        keyboard_shortcut: "2"
      - name: Tie
        keyboard_shortcut: "3"
      - name: B slightly better
        keyboard_shortcut: "4"
      - name: B much better
        keyboard_shortcut: "5"
    required: true
 
  # Specific criteria
  - annotation_type: radio
    name: prompt_adherence
    description: "Which better matches the prompt?"
    labels: [A, Tie, B]
 
  - annotation_type: radio
    name: visual_quality
    description: "Which has better visual quality (no artifacts)?"
    labels: [A, Tie, B]
 
  - annotation_type: radio
    name: aesthetic_appeal
    description: "Which is more aesthetically pleasing?"
    labels: [A, Tie, B]
 
  - annotation_type: radio
    name: realism
    description: "Which looks more realistic?"
    labels: [A, Tie, B, N/A (neither should be realistic)]
 
  # Issues detection
  - annotation_type: multiselect
    name: issues_a
    description: "Issues in Image A (select all)"
    labels:
      - Distorted faces/hands
      - Text rendering issues
      - Unnatural lighting
      - Missing elements from prompt
      - Extra unwanted elements
      - Blurry or low quality
      - Color issues
      - None
 
  - annotation_type: multiselect
    name: issues_b
    description: "Issues in Image B (select all)"
    labels:
      - Distorted faces/hands
      - Text rendering issues
      - Unnatural lighting
      - Missing elements from prompt
      - Extra unwanted elements
      - Blurry or low quality
      - Color issues
      - None

Before/After Comparison

For image enhancement, restoration, or editing:

yaml

annotation_task_name: "Image Enhancement Evaluation"
 
data_files:
  - data/enhancements.json
 
item_properties:
  id_key: id
  image_a_key: original
  image_b_key: enhanced
 
image:
  layout: side_by_side
  labels:
    left: "Original"
    right: "Enhanced"
 
  # Slider comparison
  comparison_mode: slider  # Drag slider to reveal
  slider_position: 50  # Start at middle
 
annotation_schemes:
  - annotation_type: radio
    name: enhancement_quality
    description: "How well was the image enhanced?"
    labels:
      - Significantly improved
      - Slightly improved
      - No noticeable change
      - Made worse
 
  - annotation_type: multiselect
    name: improvements
    description: "What was improved?"
    labels:
      - Sharpness/detail
      - Color accuracy
      - Noise reduction
      - Dynamic range
      - Artifact removal
      - Nothing
 
  - annotation_type: multiselect
    name: problems_introduced
    description: "Any problems introduced?"
    labels:
      - Over-sharpening/halos
      - Color shift
      - Loss of detail
      - New artifacts
      - Unnatural look
      - None

Ranking Multiple Images

To rank more than two images at once, give each item a list of images:

yaml

annotation_task_name: "Image Ranking"
 
data_files:
  - data/image_sets.json
 
item_properties:
  id_key: id
  image_list_key: images  # Array of image paths
 
image:
  layout: grid
  columns: 3
  enable_zoom: true
 
annotation_schemes:
  - annotation_type: ranking
    name: preference_rank
    description: "Rank images from best (1) to worst"
    source: images
    allow_ties: false
 
  - annotation_type: radio
    name: best_for_use
    description: "Which would you use for this purpose?"
    dynamic_labels_from: images

Data format:

json

{
  "id": "set_001",
  "prompt": "A cat sitting on a windowsill",
  "images": [
    "/images/set001_a.png",
    "/images/set001_b.png",
    "/images/set001_c.png",
    "/images/set001_d.png"
  ]
}
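
Rankings are often expanded into pairwise preferences before they are used as training data. The sketch below assumes the annotator's ranking has already been read into a list ordered from best to worst; the tool's actual output shape for ranking schemes is not shown in this tutorial, and ranking_to_pairs is a hypothetical helper.

```python
from itertools import combinations


def ranking_to_pairs(ranked_images):
    """Expand a best-to-worst ranking into (winner, loser) pairs.

    With ties disallowed, every earlier image is preferred over every later
    one, so n ranked images yield n * (n - 1) / 2 pairs.
    """
    # combinations() keeps input order, so the first element of each pair
    # is always the higher-ranked image.
    return list(combinations(ranked_images, 2))
```

A set of four ranked images therefore yields six preference pairs.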

Best-Worst Scaling

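Best-worst scaling asks annotators to pick only the best and the worst image from each set, which is faster than producing a full ranking. A single choice over four images already determines five of the six pairwise orderings, and repeating the choice across sets yields an overall ranking: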

yaml

annotation_schemes:
  - annotation_type: best_worst
    name: preference
    description: "Select the BEST and WORST images"
    source: images
    best_label: "Best"
    worst_label: "Worst"
    neither_allowed: false
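
Best-worst judgments are commonly aggregated with a simple counting score: the number of times an item was chosen as best, minus the number of times it was chosen as worst, divided by the number of times it was shown. The sketch below illustrates that calculation only; the record keys "shown", "best", and "worst" are hypothetical and do not reflect the tool's output format.

```python
from collections import Counter


def best_worst_scores(judgments):
    """Score items by (times best - times worst) / times shown.

    judgments: iterable of dicts with hypothetical keys
      "shown" (list of item ids), "best" (item id), "worst" (item id).
    Returns a dict mapping item id to a score in [-1, 1].
    """
    shown, best, worst = Counter(), Counter(), Counter()
    for j in judgments:
        shown.update(j["shown"])
        best[j["best"]] += 1
        worst[j["worst"]] += 1
    return {item: (best[item] - worst[item]) / n for item, n in shown.items()}
```

Sorting items by this score in descending order gives a ranking from most to least preferred.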

A/B Testing for Design

yaml

annotation_task_name: "Design A/B Test"
 
data_files:
  - data/design_variants.json
 
item_properties:
  id_key: id
  image_a_key: variant_a
  image_b_key: variant_b
  context_key: design_context
 
display:
  show_context: true
  context_label: "Design Context"
 
image:
  layout: side_by_side
  labels:
    left: "Design A"
    right: "Design B"
  randomize_order: true  # Prevent position bias
 
annotation_schemes:
  - annotation_type: radio
    name: preference
    description: "Which design do you prefer?"
    labels: [A, No preference, B]
    randomize_with_images: true  # Labels follow image randomization
 
  - annotation_type: likert
    name: a_appeal
    description: "Rate Design A's visual appeal"
    size: 7
    min_label: "Very unappealing"
    max_label: "Very appealing"
 
  - annotation_type: likert
    name: b_appeal
    description: "Rate Design B's visual appeal"
    size: 7
    min_label: "Very unappealing"
    max_label: "Very appealing"
 
  - annotation_type: text
    name: reasoning
    description: "Why did you choose this preference?"
    textarea: true
    required: false
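
Once preferences are collected, a sign test is one simple way to check whether the split between Design A and Design B differs from chance. The sketch below is an illustrative analysis step, not a feature of the tool; the function name and the decision to drop "No preference" answers before counting are assumptions.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact binomial (sign) test against a 50/50 split.

    "No preference" responses should be excluded before counting wins.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no decided votes, so no evidence either way
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double for the two-sided test and cap at 1.
    return min(1.0, 2 * tail)
```

A small p-value suggests the preference is unlikely to be chance; with few annotators, even an 8 to 2 split is not conclusive under this test.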

Complete Configuration

yaml

annotation_task_name: "Generative Model Comparison - RLHF Data"
 
data_files:
  - data/model_outputs.json
 
item_properties:
  id_key: id
  image_a_key: model_a_output
  image_b_key: model_b_output
  context_key: prompt
 
display:
  show_context: true
  context_label: "Generation Prompt"
  context_style: "highlighted"
 
image:
  enabled: true
  layout: side_by_side
  gap: 24
  labels:
    left: "Output A"
    right: "Output B"
 
  max_height: 512
  enable_zoom: true
  sync_zoom: true
  enable_pan: true
  sync_pan: true
 
  background: "#111827"
  border: "1px solid #374151"
  border_radius: 8
 
  # Prevent position bias
  randomize_order: true
 
annotation_schemes:
  - annotation_type: radio
    name: overall
    description: "Which image better represents the prompt?"
    labels:
      - name: A is clearly better
        value: 2
        keyboard_shortcut: "1"
      - name: A is slightly better
        value: 1
        keyboard_shortcut: "2"
      - name: About equal
        value: 0
        keyboard_shortcut: "3"
      - name: B is slightly better
        value: -1
        keyboard_shortcut: "4"
      - name: B is clearly better
        value: -2
        keyboard_shortcut: "5"
    required: true
    preserve_with_randomization: true  # Values adjust for randomized order
 
  - annotation_type: likert
    name: confidence
    description: "How confident are you?"
    size: 5
    min_label: "Guessing"
    max_label: "Certain"
 
annotation_guidelines:
  title: "Image Comparison Guidelines"
  content: |
    ## Evaluation Criteria
    Consider these factors:
    1. **Prompt adherence**: Does it match what was asked?
    2. **Visual quality**: Are there artifacts or distortions?
    3. **Aesthetics**: Is it visually pleasing?
    4. **Realism** (if applicable): Does it look natural?
 
    ## Tips
    - Zoom in to check for details and artifacts
    - Consider the prompt carefully
    - Don't let one factor dominate unfairly
 
quality_control:
  attention_checks:
    frequency: 15
    gold_pairs:
      - image_a: "/gold/clearly_better.png"
        image_b: "/gold/clearly_worse.png"
        expected_preference: ["A is clearly better", "A is slightly better"]
 
output_annotation_dir: annotations/
output_annotation_format: jsonl
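
Gold pairs are only useful if their results are checked per annotator. The sketch below shows one way that check could look, assuming responses have been reduced to (annotator, pair_id, chosen_label) tuples and that each gold pair has an id; both the tuple shape and the ids are hypothetical, since the config above does not assign ids to gold pairs.

```python
from collections import defaultdict


def gold_pass_rates(responses, expected_by_pair):
    """Return each annotator's share of gold pairs answered acceptably.

    responses: iterable of (annotator, pair_id, chosen_label) tuples.
    expected_by_pair: dict mapping a gold pair id to its accepted labels,
      mirroring expected_preference in the config.
    Responses to non-gold pairs are ignored.
    """
    passed = defaultdict(int)
    seen = defaultdict(int)
    for annotator, pair_id, label in responses:
        if pair_id not in expected_by_pair:
            continue
        seen[annotator] += 1
        if label in expected_by_pair[pair_id]:
            passed[annotator] += 1
    return {annotator: passed[annotator] / seen[annotator] for annotator in seen}
```

Annotators whose pass rate falls below a chosen threshold can then be reviewed or excluded.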

Output Format

json

{
  "pair_id": "pair_001",
  "prompt": "A sunset over mountains",
  "image_a": "/images/model_a_output.png",
  "image_b": "/images/model_b_output.png",
  "display_order": ["B", "A"],
  "annotations": {
    "overall": 1,
    "confidence": 4
  },
  "annotator": "rater_01",
  "timestamp": "2024-12-25T14:30:00Z"
}

In this record, display_order lists the images from left to right, so B was shown on the left. The overall value of 1 means "A is slightly better" and is already adjusted for display order.
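
Records in this shape can be aggregated with a few lines of code. The sketch below assumes one JSON object per line (matching the jsonl output format) and an overall score where positive values favor A and negative values favor B; summarize_overall is a hypothetical helper, not part of the tool.

```python
import json
from statistics import mean


def summarize_overall(jsonl_text):
    """Aggregate the "overall" score from JSONL annotation records.

    Positive scores favor A, negative scores favor B, and 0 is a tie.
    """
    scores = [
        json.loads(line)["annotations"]["overall"]
        for line in jsonl_text.splitlines()
        if line.strip()  # tolerate blank lines between records
    ]
    decided = [s for s in scores if s != 0]
    return {
        "n": len(scores),
        "mean_score": mean(scores) if scores else 0.0,
        # Win rate for A among non-tie judgments; None if every vote was a tie.
        "a_win_rate": sum(s > 0 for s in decided) / len(decided) if decided else None,
    }
```

The mean score reflects the strength of preference, while the win rate ignores how strongly annotators preferred one image.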

Tips for Comparison Tasks

Randomize order: Prevent left/right position bias
Sync controls: Linked zoom/pan helps fair comparison
Clear criteria: Define what "better" means
Attention checks: Include pairs with an obvious winner to catch inattentive annotators
Time limits: Consider a time limit per comparison so annotators work at a consistent pace
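
Position bias can also be measured after the fact. The sketch below assumes records shaped like the output example, with display_order listing the images from left to right and an order-adjusted overall score where positive favors A; left_preference_rate is a hypothetical helper.

```python
def left_preference_rate(records):
    """Share of non-tie judgments that favored the image shown on the left.

    records: dicts with "display_order" (e.g. ["B", "A"], left first) and
    "annotations" containing an order-adjusted "overall" score.
    Returns None if every judgment was a tie.
    """
    left = decided = 0
    for record in records:
        score = record["annotations"]["overall"]
        if score == 0:
            continue  # ties carry no left/right information
        decided += 1
        a_on_left = record["display_order"][0] == "A"
        # A preferred while A is on the left, or B preferred while B is on the left.
        if (score > 0) == a_on_left:
            left += 1
    return left / decided if decided else None
```

With randomized order, a rate far from 0.5 suggests annotators are favoring one side of the screen rather than the image content.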

Next Steps

Set up crowdsourcing for large-scale preference data
Learn about ranking analysis
Explore pairwise comparison documentation

The full comparison documentation is at /docs/annotation-types/pairwise-comparison.