
Pairwise Comparison

Compare pairs of items for preference and quality assessment.

Pairwise comparison presents annotators with two items and asks them to choose which is better, making it ideal for preference learning, quality assessment, and ranking tasks.

Basic Configuration

annotation_schemes:
  - annotation_type: pairwise
    name: preference
    description: "Which response is better?"
    options:
      - label: "A is better"
        value: "A"
      - label: "B is better"
        value: "B"
      - label: "Tie"
        value: "tie"

Data Format

Pairwise tasks require data with two items per instance:

{
  "id": "pair_1",
  "prompt": "What is the capital of France?",
  "response_a": "The capital of France is Paris.",
  "response_b": "Paris is the capital city of France, located in the north-central part of the country."
}

Configure the display fields:

item_a_field: response_a
item_b_field: response_b
context_field: prompt

Configuration Options

Custom Labels

- annotation_type: pairwise
  name: quality
  options:
    - label: "Response A is significantly better"
      value: "A_strong"
    - label: "Response A is slightly better"
      value: "A_weak"
    - label: "About the same"
      value: "tie"
    - label: "Response B is slightly better"
      value: "B_weak"
    - label: "Response B is significantly better"
      value: "B_strong"

Keyboard Shortcuts

- annotation_type: pairwise
  name: preference
  keyboard_shortcuts:
    A: "1"
    B: "2"
    tie: "3"

No Tie Option

Force annotators to choose one item by disabling the tie option:

- annotation_type: pairwise
  name: preference
  options:
    - label: "A is better"
      value: "A"
    - label: "B is better"
      value: "B"
  allow_tie: false

Display Configuration

Side-by-Side Layout

display:
  layout: side_by_side
  item_a_title: "Response A"
  item_b_title: "Response B"

Stacked Layout

display:
  layout: stacked
  item_a_title: "Option 1"
  item_b_title: "Option 2"

Show Context

Display shared context, such as the prompt, above the comparison:

display:
  show_context: true
  context_title: "Question"

Common Use Cases

LLM Response Preference

task_name: "Response Quality Comparison"
 
data_files:
  - path: data/comparisons.json
    item_a_field: response_a
    item_b_field: response_b
    context_field: prompt
 
annotation_schemes:
  - annotation_type: pairwise
    name: overall_preference
    description: "Which response is better overall?"
    options:
      - label: "A is much better"
        value: "A++"
      - label: "A is better"
        value: "A+"
      - label: "About equal"
        value: "="
      - label: "B is better"
        value: "B+"
      - label: "B is much better"
        value: "B++"

Translation Quality

annotation_schemes:
  - annotation_type: pairwise
    name: translation_preference
    description: "Which translation is more accurate?"
    options:
      - label: "Translation A"
        value: "A"
      - label: "Translation B"
        value: "B"
      - label: "Both equally good"
        value: "tie"
      - label: "Both equally bad"
        value: "neither"

Summary Evaluation

annotation_schemes:
  - annotation_type: pairwise
    name: summary_quality
    description: "Which summary better captures the main points?"
    options:
      - label: "Summary A"
        value: "A"
      - label: "Summary B"
        value: "B"
 
  - annotation_type: multiselect
    name: a_advantages
    description: "What makes A better? (if applicable)"
    labels:
      - More concise
      - More accurate
      - Better coverage
      - Clearer language
 
  - annotation_type: multiselect
    name: b_advantages
    description: "What makes B better? (if applicable)"
    labels:
      - More concise
      - More accurate
      - Better coverage
      - Clearer language

Multi-Aspect Comparison

Evaluate the same pair along multiple dimensions by defining several pairwise schemes:

annotation_schemes:
  - annotation_type: pairwise
    name: helpfulness
    description: "Which response is more helpful?"
    options:
      - label: "A"
        value: "A"
      - label: "Equal"
        value: "tie"
      - label: "B"
        value: "B"
 
  - annotation_type: pairwise
    name: accuracy
    description: "Which response is more accurate?"
    options:
      - label: "A"
        value: "A"
      - label: "Equal"
        value: "tie"
      - label: "B"
        value: "B"
 
  - annotation_type: pairwise
    name: safety
    description: "Which response is safer?"
    options:
      - label: "A"
        value: "A"
      - label: "Equal"
        value: "tie"
      - label: "B"
        value: "B"

Randomization

Prevent position bias by randomizing the order in which the two items are displayed:

randomize_pair_order: true

When enabled, items A and B are randomly swapped, and the actual display order is recorded in the output.

With Justification

Require annotators to explain their choices by pairing the comparison with a required text field:

annotation_schemes:
  - annotation_type: pairwise
    name: preference
    description: "Which response is better?"
    options:
      - label: "A is better"
        value: "A"
      - label: "B is better"
        value: "B"
 
  - annotation_type: text
    name: justification
    description: "Explain your choice"
    textarea: true
    required: true

Graded Preference Scale

Use a graded scale for more granular preference options:

- annotation_type: pairwise
  name: preference
  scale_type: 7point
  options:
    - label: "A is much better"
      value: -3
    - label: "A is better"
      value: -2
    - label: "A is slightly better"
      value: -1
    - label: "Equal"
      value: 0
    - label: "B is slightly better"
      value: 1
    - label: "B is better"
      value: 2
    - label: "B is much better"
      value: 3
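
With a numeric scale like this, the recorded preference is presumably the numeric value rather than a letter code. A hypothetical output record, assuming values are exported as-is:

{
  "id": "pair_1",
  "preference": -2
}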

Output Format

{
  "id": "pair_1",
  "preference": "A",
  "justification": "Response A is more concise while still being accurate.",
  "display_order": ["response_a", "response_b"]
}

With randomization enabled, display_order records the order in which the items were actually shown to the annotator.
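
For example, a record where the pair was shown in swapped order might look like the following. This is a sketch that assumes the stored preference refers to the position the annotator saw rather than the underlying field; verify against your own exports:

{
  "id": "pair_2",
  "preference": "B",
  "display_order": ["response_b", "response_a"]
}

Here position A displayed response_b and position B displayed response_a, so under that assumption the choice "B" maps back to response_a when un-swapping the data during analysis.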

Full Example: RLHF Data Collection

task_name: "AI Response Preference Collection"
 
data_files:
  - path: data/model_outputs.json
    item_a_field: model_a_response
    item_b_field: model_b_response
    context_field: user_query
 
display:
  layout: side_by_side
  show_context: true
  context_title: "User Query"
  item_a_title: "Response A"
  item_b_title: "Response B"
 
randomize_pair_order: true
 
annotation_schemes:
  - annotation_type: pairwise
    name: overall
    description: "Overall, which response is better?"
    options:
      - label: "A is significantly better"
        value: "A++"
      - label: "A is better"
        value: "A+"
      - label: "About the same"
        value: "="
      - label: "B is better"
        value: "B+"
      - label: "B is significantly better"
        value: "B++"
    keyboard_shortcuts:
      "A++": "1"
      "A+": "2"
      "=": "3"
      "B+": "4"
      "B++": "5"
 
  - annotation_type: multiselect
    name: criteria
    description: "What factors influenced your decision?"
    labels:
      - Accuracy
      - Helpfulness
      - Clarity
      - Safety
      - Completeness
 
  - annotation_type: text
    name: notes
    description: "Additional notes (optional)"
    textarea: true
    required: false

Best Practices

  1. Use clear, distinct labels - Annotators should be able to understand the options at a glance
  2. Consider tie options carefully - Sometimes forcing a choice is appropriate; sometimes a genuine tie option is needed
  3. Enable randomization - Randomizing pair order prevents position bias
  4. Add justification fields - Explanations reveal annotator reasoning and improve data quality
  5. Use keyboard shortcuts - Shortcuts speed up annotation significantly
  6. Test with your data - Make sure the display works well with your content length