Image Comparison and Preference Tasks
Build side-by-side image comparison interfaces for preference ranking, A/B testing, and quality assessment.
By Potato Team
Image comparison is essential for training generative models, evaluating image quality, and understanding human preferences. This tutorial covers pairwise comparison, ranking, and A/B testing setups.
Use Cases
- Generative AI: RLHF for image generation models
- Image quality: Comparing compression, enhancement, or restoration
- Design testing: A/B testing visual designs
- Search ranking: Evaluating image retrieval results
Basic Pairwise Comparison
annotation_task_name: "Image Preference"
data_files:
- data/pairs.json
item_properties:
id_key: pair_id
image_a_key: image_left
image_b_key: image_right
image:
enabled: true
layout: side_by_side
display_size: medium
enable_zoom: true
sync_zoom: true # Zoom both images together
annotation_schemes:
- annotation_type: radio
name: preference
description: "Which image do you prefer?"
labels:
- Left is much better
- Left is slightly better
- About the same
- Right is slightly better
- Right is much better
layout: horizontal
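For analysis you will usually want these labels as signed scores. A minimal Python sketch of that mapping; the scoring convention (positive means the left image won) and the record layout are our assumptions, not part of the tool:
# score_preferences.py -- map the 5-point preference labels to signed scores.
# Convention (assumed): positive = left preferred, negative = right preferred.
import json

LABEL_TO_SCORE = {
    "Left is much better": 2,
    "Left is slightly better": 1,
    "About the same": 0,
    "Right is slightly better": -1,
    "Right is much better": -2,
}

def score(annotation: dict) -> int:
    """Convert one annotation record's preference label to a signed score."""
    return LABEL_TO_SCORE[annotation["preference"]]

if __name__ == "__main__":
    example = {"pair_id": "pair_001", "preference": "Left is slightly better"}
    print(score(example))  # -> 1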
Data Format
{
"pair_id": "pair_001",
"image_left": "/images/model_a_output.png",
"image_right": "/images/model_b_output.png",
"prompt": "A sunset over mountains"
}
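If the pairs come from two models run on the same prompts, a short script can assemble the data file. A sketch, assuming outputs live in two directories with matching sorted filenames (the directory names are hypothetical):
# build_pairs.py -- assemble pairs.json from two directories of model outputs.
# Directory layout and matching-by-sorted-filename are assumptions; adapt as needed.
import json
from pathlib import Path

def build_pairs(dir_a: Path, dir_b: Path, out_path: Path) -> None:
    pairs = []
    for i, (img_a, img_b) in enumerate(zip(sorted(dir_a.glob("*.png")),
                                           sorted(dir_b.glob("*.png")))):
        pairs.append({
            "pair_id": f"pair_{i:03d}",
            "image_left": str(img_a),
            "image_right": str(img_b),
        })
    out_path.write_text(json.dumps(pairs, indent=2))

if __name__ == "__main__":
    build_pairs(Path("outputs/model_a"), Path("outputs/model_b"),
                Path("data/pairs.json"))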
annotation_task_name: "AI Image Generation Evaluation"
data_files:
- data/generation_pairs.json
item_properties:
id_key: id
image_a_key: image_a
image_b_key: image_b
context_key: prompt
# Show the generation prompt
display:
show_context: true
context_label: "Generation Prompt"
context_field: prompt
image:
enabled: true
layout: side_by_side
gap: 20 # Pixels between images
labels:
left: "Image A"
right: "Image B"
# Interaction
enable_zoom: true
sync_zoom: true
enable_pan: true
sync_pan: true
# Display
max_height: 500
background: "#1F2937"
border_radius: 8
annotation_schemes:
# Overall preference
- annotation_type: radio
name: overall_preference
description: "Overall, which image is better?"
labels:
- name: A much better
keyboard_shortcut: "1"
- name: A slightly better
keyboard_shortcut: "2"
- name: Tie
keyboard_shortcut: "3"
- name: B slightly better
keyboard_shortcut: "4"
- name: B much better
keyboard_shortcut: "5"
required: true
# Specific criteria
- annotation_type: radio
name: prompt_adherence
description: "Which better matches the prompt?"
labels: [A, Tie, B]
- annotation_type: radio
name: visual_quality
description: "Which has better visual quality (no artifacts)?"
labels: [A, Tie, B]
- annotation_type: radio
name: aesthetic_appeal
description: "Which is more aesthetically pleasing?"
labels: [A, Tie, B]
- annotation_type: radio
name: realism
description: "Which looks more realistic?"
labels: [A, Tie, B, N/A (neither should be realistic)]
# Issues detection
- annotation_type: multiselect
name: issues_a
description: "Issues in Image A (select all)"
labels:
- Distorted faces/hands
- Text rendering issues
- Unnatural lighting
- Missing elements from prompt
- Extra unwanted elements
- Blurry or low quality
- Color issues
- None
- annotation_type: multiselect
name: issues_b
description: "Issues in Image B (select all)"
labels:
- Distorted faces/hands
- Text rendering issues
- Unnatural lighting
- Missing elements from prompt
- Extra unwanted elements
- Blurry or low quality
- Color issues
- None
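Once per-criterion judgments are collected, you can compute win rates for each criterion separately. A sketch, assuming the exported annotations are one JSON object per line with the scheme names above as keys (the file name is an assumption):
# criterion_winrates.py -- per-criterion A/Tie/B win rates for the config above.
import json
from collections import Counter

CRITERIA = ["prompt_adherence", "visual_quality", "aesthetic_appeal", "realism"]

def win_rates(path: str) -> dict:
    counts = {c: Counter() for c in CRITERIA}
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            for c in CRITERIA:
                if c in record:
                    counts[c][record[c]] += 1  # "A", "Tie", or "B"
    return {c: {label: n / max(sum(ctr.values()), 1) for label, n in ctr.items()}
            for c, ctr in counts.items()}

if __name__ == "__main__":
    print(win_rates("annotations/annotated_instances.jsonl"))  # assumed path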
Before/After Comparison
For image enhancement, restoration, or editing:
annotation_task_name: "Image Enhancement Evaluation"
data_files:
- data/enhancements.json
item_properties:
id_key: id
image_a_key: original
image_b_key: enhanced
image:
layout: side_by_side
labels:
left: "Original"
right: "Enhanced"
# Slider comparison
comparison_mode: slider # Drag slider to reveal
slider_position: 50 # Start at middle
annotation_schemes:
- annotation_type: radio
name: enhancement_quality
description: "How well was the image enhanced?"
labels:
- Significantly improved
- Slightly improved
- No noticeable change
- Made worse
- annotation_type: multiselect
name: improvements
description: "What was improved?"
labels:
- Sharpness/detail
- Color accuracy
- Noise reduction
- Dynamic range
- Artifact removal
- Nothing
- annotation_type: multiselect
name: problems_introduced
description: "Any problems introduced?"
labels:
- Over-sharpening/halos
- Color shift
- Loss of detail
- New artifacts
- Unnatural look
- None
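To summarize an enhancement evaluation, tally the quality ratings against the problems annotators flagged. A sketch under the same assumed JSONL export layout:
# enhancement_summary.py -- tally enhancement quality vs. problems introduced.
import json
from collections import Counter

def summarize(path: str) -> None:
    quality, problems = Counter(), Counter()
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            quality[record["enhancement_quality"]] += 1
            problems.update(record.get("problems_introduced", []))
    total = sum(quality.values()) or 1
    improved = quality["Significantly improved"] + quality["Slightly improved"]
    print(f"Improved: {improved / total:.0%} of {total} judgments")
    print("Top problems:", problems.most_common(3))

if __name__ == "__main__":
    summarize("annotations/annotated_instances.jsonl")  # assumed path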
Ranking Multiple Images
For ranking more than 2 images:
annotation_task_name: "Image Ranking"
data_files:
- data/image_sets.json
item_properties:
id_key: id
image_list_key: images # Array of image paths
image:
layout: grid
columns: 3
enable_zoom: true
annotation_schemes:
- annotation_type: ranking
name: preference_rank
description: "Rank images from best (1) to worst"
source: images
allow_ties: false
- annotation_type: radio
name: best_for_use
description: "Which would you use for this purpose?"
dynamic_labels_from: images
Data format:
{
"id": "set_001",
"prompt": "A cat sitting on a windowsill",
"images": [
"/images/set001_a.png",
"/images/set001_b.png",
"/images/set001_c.png",
"/images/set001_d.png"
]
}
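A ranking of k images implies k*(k-1)/2 pairwise preferences, which is useful when downstream training expects (winner, loser) pairs. A small sketch of that expansion:
# rank_to_pairs.py -- expand one ranking into its implied pairwise preferences.
# A ranking of k images yields k*(k-1)/2 (winner, loser) pairs.
from itertools import combinations

def ranking_to_pairs(images: list[str], ranks: list[int]) -> list[tuple[str, str]]:
    """ranks[i] is the rank of images[i]; 1 = best. Returns (winner, loser) pairs."""
    ordered = [img for _, img in sorted(zip(ranks, images))]
    return [(winner, loser) for winner, loser in combinations(ordered, 2)]

if __name__ == "__main__":
    images = ["set001_a.png", "set001_b.png", "set001_c.png"]
    print(ranking_to_pairs(images, ranks=[2, 1, 3]))
    # -> [('set001_b.png', 'set001_a.png'), ('set001_b.png', 'set001_c.png'),
    #     ('set001_a.png', 'set001_c.png')]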
Best-Worst Scaling
Efficient ranking through repeated best-worst choices:
annotation_schemes:
- annotation_type: best_worst
name: preference
description: "Select the BEST and WORST images"
source: images
best_label: "Best"
worst_label: "Worst"
neither_allowed: false
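Best-worst judgments are commonly aggregated with count-based scores: an item's score is (times chosen best minus times chosen worst) divided by how often it appeared. A sketch, with the judgment record layout as an assumption:
# bws_scores.py -- best-worst scaling count scores: (best - worst) / appearances.
from collections import Counter

def bws_scores(judgments: list[dict]) -> dict[str, float]:
    """Each judgment (assumed shape): {"items": [...], "best": item, "worst": item}."""
    best, worst, seen = Counter(), Counter(), Counter()
    for j in judgments:
        seen.update(j["items"])
        best[j["best"]] += 1
        worst[j["worst"]] += 1
    return {item: (best[item] - worst[item]) / n for item, n in seen.items()}

if __name__ == "__main__":
    data = [
        {"items": ["a", "b", "c"], "best": "a", "worst": "c"},
        {"items": ["a", "b", "d"], "best": "b", "worst": "d"},
    ]
    print(bws_scores(data))  # a: 0.5, b: 0.5, c: -1.0, d: -1.0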
annotation_task_name: "Design A/B Test"
data_files:
- data/design_variants.json
item_properties:
id_key: id
image_a_key: variant_a
image_b_key: variant_b
context_key: design_context
display:
show_context: true
context_label: "Design Context"
image:
layout: side_by_side
labels:
left: "Design A"
right: "Design B"
randomize_order: true # Prevent position bias
annotation_schemes:
- annotation_type: radio
name: preference
description: "Which design do you prefer?"
labels: [A, No preference, B]
randomize_with_images: true # Labels follow image randomization
- annotation_type: likert
name: a_appeal
description: "Rate Design A's visual appeal"
size: 7
min_label: "Very unappealing"
max_label: "Very appealing"
- annotation_type: likert
name: b_appeal
description: "Rate Design B's visual appeal"
size: 7
min_label: "Very unappealing"
max_label: "Very appealing"
- annotation_type: text
name: reasoning
description: "Why did you choose this preference?"
textarea: true
required: false
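For a design A/B test you usually want to know whether the preference split could be chance. One standard approach (not specific to this tool) is a two-sided binomial test over the non-tie judgments, e.g. with SciPy:
# ab_significance.py -- is the A/B preference split better than chance?
# Two-sided binomial test over non-tie judgments (requires scipy >= 1.7).
from scipy.stats import binomtest

def preference_test(a_wins: int, b_wins: int) -> float:
    """P-value for the null hypothesis that A and B are equally preferred."""
    return binomtest(a_wins, a_wins + b_wins, p=0.5).pvalue

if __name__ == "__main__":
    # e.g. 70 raters preferred A, 40 preferred B (ties excluded)
    print(f"p = {preference_test(70, 40):.4f}")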
annotation_task_name: "Generative Model Comparison - RLHF Data"
data_files:
- data/model_outputs.json
item_properties:
id_key: id
image_a_key: model_a_output
image_b_key: model_b_output
context_key: prompt
display:
show_context: true
context_label: "Generation Prompt"
context_style: "highlighted"
image:
enabled: true
layout: side_by_side
gap: 24
labels:
left: "Output A"
right: "Output B"
max_height: 512
enable_zoom: true
sync_zoom: true
enable_pan: true
sync_pan: true
background: "#111827"
border: "1px solid #374151"
border_radius: 8
# Prevent position bias
randomize_order: true
annotation_schemes:
- annotation_type: radio
name: overall
description: "Which image better represents the prompt?"
labels:
- name: A is clearly better
value: 2
keyboard_shortcut: "1"
- name: A is slightly better
value: 1
keyboard_shortcut: "2"
- name: About equal
value: 0
keyboard_shortcut: "3"
- name: B is slightly better
value: -1
keyboard_shortcut: "4"
- name: B is clearly better
value: -2
keyboard_shortcut: "5"
required: true
preserve_with_randomization: true # Values adjust for randomized order
- annotation_type: likert
name: confidence
description: "How confident are you?"
size: 5
min_label: "Guessing"
max_label: "Certain"
annotation_guidelines:
title: "Image Comparison Guidelines"
content: |
## Evaluation Criteria
Consider these factors:
1. **Prompt adherence**: Does it match what was asked?
2. **Visual quality**: Are there artifacts or distortions?
3. **Aesthetics**: Is it visually pleasing?
4. **Realism** (if applicable): Does it look natural?
## Tips
- Zoom in to check for details and artifacts
- Consider the prompt carefully
- Don't let one factor dominate unfairly
quality_control:
attention_checks:
frequency: 15
gold_pairs:
- image_a: "/gold/clearly_better.png"
image_b: "/gold/clearly_worse.png"
expected_preference: ["A is clearly better", "A is slightly better"]
output_annotation_dir: annotations/
output_annotation_format: jsonl
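Gold pairs are only useful if you check them. A sketch that computes per-annotator pass rates on the attention checks, assuming the JSONL output format shown below; the gold pair IDs and field names are hypothetical:
# gold_check.py -- per-annotator pass rates on gold attention-check pairs.
import json
from collections import defaultdict

GOLD_IDS = {"gold_001"}  # hypothetical IDs of the gold pairs

def gold_pass_rates(path: str) -> dict[str, float]:
    hits, totals = defaultdict(int), defaultdict(int)
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if record["pair_id"] in GOLD_IDS:
                who = record["annotator"]
                totals[who] += 1
                # gold expects A preferred: score of +1 or +2 on the value scale above
                hits[who] += record["annotations"]["overall"] >= 1
    return {who: hits[who] / totals[who] for who in totals}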
Output Format
{
"pair_id": "pair_001",
"prompt": "A sunset over mountains",
"image_a": "/images/model_a_output.png",
"image_b": "/images/model_b_output.png",
"display_order": ["B", "A"], // B was shown on left
"annotations": {
"overall": 1, // A slightly better (adjusted for display order)
"confidence": 4
},
"annotator": "rater_01",
"timestamp": "2024-12-25T14:30:00Z"
}
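Aggregating this output into a model-level summary takes a few lines of Python. A sketch that reads the JSONL file and reports the mean signed score and win rates (the file name is an assumption):
# aggregate_results.py -- summarize signed preference scores from the JSONL output.
import json

def summarize(path: str) -> None:
    with open(path) as f:
        scores = [json.loads(line)["annotations"]["overall"] for line in f]
    n = len(scores)
    a_wins = sum(s > 0 for s in scores)
    b_wins = sum(s < 0 for s in scores)
    print(f"n={n}  mean={sum(scores)/n:+.2f}  "
          f"A wins {a_wins/n:.0%} / B wins {b_wins/n:.0%} / "
          f"ties {(n - a_wins - b_wins)/n:.0%}")

if __name__ == "__main__":
    summarize("annotations/annotated_instances.jsonl")  # assumed path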
Tips for Comparison Tasks
- Randomize order: Prevent left/right position bias
- Sync controls: Linked zoom/pan helps fair comparison
- Clear criteria: Define what "better" means
- Attention checks: Include obvious pairs
- Time limits: Consider a consistent time budget per comparison so rater effort stays comparable
Next Steps
- Set up crowdsourcing for large-scale preference data
- Learn about ranking analysis
- Explore pairwise comparison documentation
Full comparison documentation at /docs/annotation-types/pairwise-comparison.