
Measuring Inter-Annotator Agreement

How to calculate and interpret Cohen's Kappa, Fleiss' Kappa, and Krippendorff's Alpha for your annotation projects.

Potato Team

Inter-annotator agreement (IAA) measures how consistently different annotators label the same items. High agreement indicates reliable annotations; low agreement suggests unclear guidelines or subjective tasks.

Why Measure Agreement?

  • Validate guidelines: Low agreement → unclear instructions
  • Assess task difficulty: Some tasks are inherently subjective
  • Qualify annotators: Identify who needs more training
  • Report reliability: Required for scientific publications
  • Aggregate labels: Determine how to combine annotations

Agreement Metrics

Cohen's Kappa (2 Annotators)

For comparing two annotators on categorical data:

text
κ = (Po - Pe) / (1 - Pe)

Where:

  • Po = observed agreement
  • Pe = expected agreement by chance
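
For intuition, here is a minimal Python sketch (not part of Potato) that computes Po, Pe, and κ by hand from two annotators' label lists, and cross-checks the result against scikit-learn. The labels are hypothetical.

python
from collections import Counter

from sklearn.metrics import cohen_kappa_score  # optional cross-check

# Labels from two annotators on the same 10 items (hypothetical data)
ann1 = ["pos", "pos", "neg", "neu", "pos", "neg", "neg", "neu", "pos", "neg"]
ann2 = ["pos", "neg", "neg", "neu", "pos", "neg", "pos", "neu", "pos", "neg"]

n = len(ann1)

# Po: proportion of items where the two annotators agree
po = sum(a == b for a, b in zip(ann1, ann2)) / n

# Pe: agreement expected by chance, from each annotator's label distribution
c1, c2 = Counter(ann1), Counter(ann2)
pe = sum((c1[label] / n) * (c2[label] / n) for label in set(c1) | set(c2))

kappa = (po - pe) / (1 - pe)
print(f"Po={po:.2f}  Pe={pe:.2f}  kappa={kappa:.2f}")
print("sklearn cross-check:", cohen_kappa_score(ann1, ann2))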

Interpretation:

Kappa        Interpretation
< 0          Less than chance
0.01-0.20    Slight
0.21-0.40    Fair
0.41-0.60    Moderate
0.61-0.80    Substantial
0.81-1.00    Almost perfect

Fleiss' Kappa (3+ Annotators)

For multiple annotators on categorical data:

yaml
quality_control:
  agreement:
    metrics:
      - fleiss_kappa

Same interpretation scale as Cohen's Kappa.
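
If you want to verify Fleiss' κ outside Potato, one option is the statsmodels library. A sketch, assuming annotations are arranged as an items × annotators array of category labels (the data here is hypothetical):

python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = items, columns = annotators, values = category labels (hypothetical)
ratings = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [2, 2, 0],
    [0, 0, 0],
    [1, 2, 1],
])

# Convert to an items × categories count table, then compute Fleiss' kappa
table, _categories = aggregate_raters(ratings)
print("Fleiss' kappa:", fleiss_kappa(table, method="fleiss"))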

Krippendorff's Alpha

Krippendorff's Alpha is the most flexible of the three metrics - it works with:

  • Any number of annotators
  • Missing data
  • Various data types (nominal, ordinal, interval, ratio)

yaml
quality_control:
  agreement:
    metrics:
      - krippendorff_alpha
    alpha_level: nominal  # or ordinal, interval, ratio

Interpretation:

  • α ≥ 0.80: Reliable
  • 0.67 ≤ α < 0.80: Tentatively acceptable
  • α < 0.67: Unreliable
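
Outside Potato, the third-party krippendorff package can compute α directly, including when some annotators skipped items. A sketch with hypothetical nominal data:

python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows = annotators, columns = items; np.nan marks items an annotator skipped
reliability_data = np.array([
    [0,      1, 1, 0, 2, np.nan],
    [0,      1, 1, 0, 2, 1],
    [np.nan, 1, 1, 0, 1, 1],
])

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print("Krippendorff's alpha:", alpha)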

Configuring Agreement in Potato

Basic Setup

yaml
quality_control:
  agreement:
    enabled: true
    calculate_on_overlap: true
 
    metrics:
      - cohens_kappa
      - fleiss_kappa
      - krippendorff_alpha
 
    # Per annotation scheme
    per_scheme: true
 
    # Reporting
    report_interval: 100  # Every 100 annotations
    export_file: agreement_report.json

Overlap Configuration

yaml
quality_control:
  redundancy:
    # How many annotators per item
    annotations_per_item: 3
 
    # Minimum overlap for calculations
    min_overlap_for_agreement: 2
 
    # Sampling for agreement
    agreement_sample_size: 100  # Calculate on 100 items
    agreement_sample_method: random  # or stratified, all

Calculating Agreement

In Dashboard

Potato displays agreement metrics in the admin dashboard:

yaml
quality_control:
  dashboard:
    show_agreement: true
    agreement_chart: true
    update_frequency: 60  # seconds

Via API

bash
# Get current agreement metrics
curl http://localhost:8000/api/quality/agreement
 
# Response:
{
  "overall": {
    "fleiss_kappa": 0.72,
    "krippendorff_alpha": 0.75
  },
  "per_scheme": {
    "sentiment": {
      "fleiss_kappa": 0.78,
      "krippendorff_alpha": 0.80
    },
    "topic": {
      "fleiss_kappa": 0.65,
      "krippendorff_alpha": 0.68
    }
  },
  "sample_size": 150,
  "annotator_pairs": 10
}

Via CLI

bash
# Calculate agreement from output files
potato agreement --annotations annotation_output/ --output agreement_report.json
 
# With specific metric
potato agreement --annotations annotation_output/ --metric krippendorff --level ordinal

Agreement for Different Annotation Types

Categorical (Radio, Multiselect)

yaml
quality_control:
  agreement:
    schemes:
      sentiment:
        type: nominal
        metrics: [cohens_kappa, fleiss_kappa]
 
      urgency:
        type: ordinal  # Low < Medium < High
        metrics: [krippendorff_alpha]

Likert Scales

yaml
quality_control:
  agreement:
    schemes:
      quality_rating:
        type: ordinal
        metrics: [krippendorff_alpha, weighted_kappa]
 
        # Weighted kappa for ordinal
        weighting: linear  # or quadratic
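
Weighted κ penalizes large disagreements (e.g., 1 vs 5 on a Likert scale) more than small ones (1 vs 2). As a quick check outside Potato, scikit-learn supports both weighting schemes; the ratings below are hypothetical:

python
from sklearn.metrics import cohen_kappa_score

# 1-5 Likert ratings from two annotators on the same 8 items (hypothetical)
ann1 = [5, 4, 3, 4, 2, 1, 5, 3]
ann2 = [4, 4, 3, 5, 2, 2, 5, 2]

print("linear:   ", cohen_kappa_score(ann1, ann2, weights="linear"))
print("quadratic:", cohen_kappa_score(ann1, ann2, weights="quadratic"))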

Span Annotations

For NER, spans require special handling:

yaml
quality_control:
  agreement:
    schemes:
      entities:
        type: span
        span_matching: overlap  # or exact, token
 
        # What to compare
        compare: label_and_span  # or label_only, span_only
 
        # Overlap threshold for "match"
        overlap_threshold: 0.5
 
        metrics:
          - span_f1
          - span_precision
          - span_recall
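
Potato's matching logic is internal, but conceptually, overlap matching treats two spans as agreeing when they share the same label and their overlap ratio clears the threshold. A rough, hypothetical sketch of that idea (not Potato's actual implementation):

python
def overlap_ratio(a, b):
    """Overlap length divided by the shorter span's length (one possible convention)."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return overlap / min(a[1] - a[0], b[1] - b[0])

def spans_match(a, b, threshold=0.5):
    """Hypothetical matcher: spans are (start, end, label) and labels must agree."""
    return a[2] == b[2] and overlap_ratio(a, b) >= threshold

# Two annotators' entity spans on the same document (hypothetical)
ann1 = [(0, 5, "PER"), (10, 20, "ORG")]
ann2 = [(0, 4, "PER"), (12, 20, "ORG"), (25, 30, "LOC")]

# Treat ann1 as reference and ann2 as prediction for an F1-style agreement score
tp_pred = sum(any(spans_match(b, a) for a in ann1) for b in ann2)
tp_ref = sum(any(spans_match(a, b) for b in ann2) for a in ann1)
precision = tp_pred / len(ann2)
recall = tp_ref / len(ann1)
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")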

Rankings

yaml
quality_control:
  agreement:
    schemes:
      preference_rank:
        type: ranking
        metrics:
          - kendall_tau
          - spearman_rho
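
Kendall's τ and Spearman's ρ both compare two annotators' rank orderings of the same items. Outside Potato you can compute them with SciPy; the rankings below are hypothetical:

python
from scipy.stats import kendalltau, spearmanr

# Ranks assigned to the same 5 items by two annotators (1 = best, hypothetical)
ann1 = [1, 2, 3, 4, 5]
ann2 = [2, 1, 3, 5, 4]

tau, tau_p = kendalltau(ann1, ann2)
rho, rho_p = spearmanr(ann1, ann2)
print(f"Kendall tau={tau:.2f} (p={tau_p:.3f}), Spearman rho={rho:.2f} (p={rho_p:.3f})")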

Pairwise vs Overall Agreement

Pairwise (Each Pair)

yaml
quality_control:
  agreement:
    pairwise: true
    output_matrix: true  # Agreement matrix
 
# Output:
# annotator1 × annotator2: κ = 0.75
# annotator1 × annotator3: κ = 0.68
# annotator2 × annotator3: κ = 0.82
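
Conceptually, the pairwise report is just Cohen's κ computed for every annotator pair on the items they both labeled. A hypothetical sketch of that loop (not Potato's internal code):

python
from itertools import combinations

from sklearn.metrics import cohen_kappa_score

# annotator -> {item_id: label}; hypothetical overlapping annotations
labels = {
    "ann1": {"i1": "pos", "i2": "neg", "i3": "neu", "i4": "pos"},
    "ann2": {"i1": "pos", "i2": "neg", "i3": "pos", "i4": "pos"},
    "ann3": {"i2": "neg", "i3": "neu", "i4": "neg"},
}

for a, b in combinations(labels, 2):
    shared = sorted(set(labels[a]) & set(labels[b]))  # items both annotated
    if len(shared) < 2:
        continue  # need enough overlap for a meaningful estimate
    kappa = cohen_kappa_score([labels[a][i] for i in shared],
                              [labels[b][i] for i in shared])
    print(f"{a} × {b}: kappa = {kappa:.2f} (n={len(shared)})")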

Overall (All Annotators)

yaml
quality_control:
  agreement:
    overall: true
    metrics:
      - fleiss_kappa  # Designed for 3+ annotators
      - krippendorff_alpha

Handling Low Agreement

Identify Problem Areas

yaml
quality_control:
  agreement:
    diagnostics:
      enabled: true
 
      # Items with most disagreement
      show_disagreed_items: true
      disagreement_threshold: 0.5
 
      # Labels with most confusion
      confusion_matrix: true
 
      # Annotators with low agreement
      per_annotator_agreement: true

Actions on Low Agreement

yaml
quality_control:
  agreement:
    alerts:
      - threshold: 0.6
        action: notify
        message: "Agreement below 0.6 - review guidelines"
 
      - threshold: 0.4
        action: pause
        message: "Agreement critically low - pausing task"
 
    # Automatic guideline reminders
    show_guidelines_on_low_agreement: true
    guideline_threshold: 0.5

Complete Configuration

yaml
annotation_task_name: "Agreement-Tracked Annotation"
 
quality_control:
  # Redundancy setup
  redundancy:
    annotations_per_item: 3
    assignment_method: random
 
  # Agreement calculation
  agreement:
    enabled: true
 
    # Metrics
    metrics:
      - fleiss_kappa
      - krippendorff_alpha
 
    # Per-scheme configuration
    schemes:
      sentiment:
        type: nominal
        metrics: [fleiss_kappa, cohens_kappa]
 
      intensity:
        type: ordinal
        metrics: [krippendorff_alpha]
        alpha_level: ordinal
 
      entities:
        type: span
        span_matching: overlap
        overlap_threshold: 0.5
        metrics: [span_f1]
 
    # Calculation settings
    calculate_on_overlap: true
    min_overlap: 2
    sample_size: all  # or number
 
    # Pairwise analysis
    pairwise: true
    pairwise_output: agreement_matrix.csv
 
    # Diagnostics
    diagnostics:
      confusion_matrix: true
      disagreed_items: true
      per_annotator: true
 
    # Alerts
    alerts:
      - metric: fleiss_kappa
        threshold: 0.6
        action: notify
 
    # Reporting
    report_file: agreement_report.json
    report_interval: 50
 
  # Dashboard
  dashboard:
    show_agreement: true
    charts:
      - agreement_over_time
      - per_scheme_agreement
      - annotator_comparison

Output Report

json
{
  "timestamp": "2024-10-25T15:30:00Z",
  "sample_size": 500,
  "annotators": ["ann1", "ann2", "ann3"],
 
  "overall_agreement": {
    "fleiss_kappa": 0.72,
    "krippendorff_alpha": 0.75
  },
 
  "per_scheme": {
    "sentiment": {
      "fleiss_kappa": 0.78,
      "confusion_matrix": {
        "Positive": {"Positive": 180, "Negative": 5, "Neutral": 15},
        "Negative": {"Positive": 8, "Negative": 165, "Neutral": 12},
        "Neutral": {"Positive": 12, "Negative": 10, "Neutral": 93}
      }
    }
  },
 
  "pairwise": {
    "ann1_ann2": 0.75,
    "ann1_ann3": 0.70,
    "ann2_ann3": 0.72
  },
 
  "per_annotator": {
    "ann1": {"avg_agreement": 0.73, "items_annotated": 500},
    "ann2": {"avg_agreement": 0.74, "items_annotated": 500},
    "ann3": {"avg_agreement": 0.71, "items_annotated": 500}
  },
 
  "most_disagreed_items": [
    {"id": "item_234", "disagreement_rate": 1.0},
    {"id": "item_567", "disagreement_rate": 0.67}
  ]
}

Best Practices

  1. Calculate early: Don't wait until the end
  2. Use appropriate metrics: Nominal vs ordinal vs span
  3. Investigate low agreement: Often reveals guideline issues
  4. Report in publications: Required for scientific work
  5. Set thresholds: Define acceptable levels upfront

Next Steps


Full agreement documentation at /docs/core-concepts/user-management.