Skip to content
Guides6 min read

Measuring Inter-Annotator Agreement

Calculate and interpret Cohen's Kappa, Fleiss' Kappa, and Krippendorff's Alpha for Potato annotation projects, with Python code examples and interpretation guidelines.

Potato Team

Inter-annotator agreement (IAA) measures how consistently different annotators label the same items. When agreement is high, you can trust the labels. When it is low, the usual culprit is either an unclear guideline or a task that is just inherently subjective.

Why measure agreement?

A few reasons it earns its keep. Low agreement usually points back to instructions that need rewriting, so the number doubles as a check on your guidelines. It also tells you how hard the task really is, since some questions have no single right answer. You can spot which annotators need more training. And if you are publishing, reviewers will expect an agreement figure. Finally, the metric informs how you combine multiple annotators into a single label.

For how Potato handles this end to end, see the source documentation.

Agreement metrics

Cohen's Kappa (2 annotators)

For comparing two annotators on categorical data:

text
κ = (Po - Pe) / (1 - Pe)

Where:

  • Po = observed agreement
  • Pe = expected agreement by chance

Interpretation:

KappaInterpretation
< 0Less than chance
0.01-0.20Slight
0.21-0.40Fair
0.41-0.60Moderate
0.61-0.80Substantial
0.81-1.00Almost perfect

Fleiss' Kappa (3+ annotators)

For three or more annotators on categorical data:

yaml
quality_control:
  agreement:
    metrics:
      - fleiss_kappa

Same interpretation scale as Cohen's Kappa.

Krippendorff's Alpha

This is the most flexible of the three. It handles any number of annotators, copes with missing data, and works across nominal, ordinal, interval, and ratio data.

yaml
quality_control:
  agreement:
    metrics:
      - krippendorff_alpha
    alpha_level: nominal  # or ordinal, interval, ratio

Interpretation:

  • α ≥ 0.80: Reliable
  • 0.67 ≤ α < 0.80: Tentatively acceptable
  • α < 0.67: Unreliable

Configuring Agreement in Potato

Basic Setup

yaml
quality_control:
  agreement:
    enabled: true
    calculate_on_overlap: true
 
    metrics:
      - cohens_kappa
      - fleiss_kappa
      - krippendorff_alpha
 
    # Per annotation scheme
    per_scheme: true
 
    # Reporting
    report_interval: 100  # Every 100 annotations
    export_file: agreement_report.json

Overlap Configuration

yaml
quality_control:
  redundancy:
    # How many annotators per item
    annotations_per_item: 3
 
    # Minimum overlap for calculations
    min_overlap_for_agreement: 2
 
    # Sampling for agreement
    agreement_sample_size: 100  # Calculate on 100 items
    agreement_sample_method: random  # or stratified, all

Calculating Agreement

In Dashboard

Potato displays agreement metrics in the admin dashboard:

yaml
quality_control:
  dashboard:
    show_agreement: true
    agreement_chart: true
    update_frequency: 60  # seconds

Via API

bash
# Get current agreement metrics
curl http://localhost:8000/api/quality/agreement
 
# Response:
{
  "overall": {
    "fleiss_kappa": 0.72,
    "krippendorff_alpha": 0.75
  },
  "per_scheme": {
    "sentiment": {
      "fleiss_kappa": 0.78,
      "krippendorff_alpha": 0.80
    },
    "topic": {
      "fleiss_kappa": 0.65,
      "krippendorff_alpha": 0.68
    }
  },
  "sample_size": 150,
  "annotator_pairs": 10
}

Via CLI

bash
# Calculate agreement from output files
potato agreement --annotations annotation_output/ --output agreement_report.json
 
# With specific metric
potato agreement --annotations annotation_output/ --metric krippendorff --level ordinal

Agreement for Different Annotation Types

Categorical (Radio, Multiselect)

yaml
quality_control:
  agreement:
    schemes:
      sentiment:
        type: nominal
        metrics: [cohens_kappa, fleiss_kappa]
 
      urgency:
        type: ordinal  # Low < Medium < High
        metrics: [krippendorff_alpha]

Likert Scales

yaml
quality_control:
  agreement:
    schemes:
      quality_rating:
        type: ordinal
        metrics: [krippendorff_alpha, weighted_kappa]
 
        # Weighted kappa for ordinal
        weighting: linear  # or quadratic

Span Annotations

Spans are trickier than categorical labels, since two annotators can agree on a label but disagree on exactly where it starts and ends. NER needs special handling:

yaml
quality_control:
  agreement:
    schemes:
      entities:
        type: span
        span_matching: overlap  # or exact, token
 
        # What to compare
        compare: label_and_span  # or label_only, span_only
 
        # Overlap threshold for "match"
        overlap_threshold: 0.5
 
        metrics:
          - span_f1
          - span_precision
          - span_recall

Rankings

yaml
quality_control:
  agreement:
    schemes:
      preference_rank:
        type: ranking
        metrics:
          - kendall_tau
          - spearman_rho

Pairwise vs Overall Agreement

Pairwise (Each Pair)

yaml
quality_control:
  agreement:
    pairwise: true
    output_matrix: true  # Agreement matrix
 
# Output:
# annotator1 × annotator2: κ = 0.75
# annotator1 × annotator3: κ = 0.68
# annotator2 × annotator3: κ = 0.82

Overall (All Annotators)

yaml
quality_control:
  agreement:
    overall: true
    metrics:
      - fleiss_kappa  # Designed for 3+ annotators
      - krippendorff_alpha

Handling Low Agreement

Identify Problem Areas

yaml
quality_control:
  agreement:
    diagnostics:
      enabled: true
 
      # Items with most disagreement
      show_disagreed_items: true
      disagreement_threshold: 0.5
 
      # Labels with most confusion
      confusion_matrix: true
 
      # Annotators with low agreement
      per_annotator_agreement: true

Actions on Low Agreement

yaml
quality_control:
  agreement:
    alerts:
      - threshold: 0.6
        action: notify
        message: "Agreement below 0.6 - review guidelines"
 
      - threshold: 0.4
        action: pause
        message: "Agreement critically low - pausing task"
 
    # Automatic guideline reminders
    show_guidelines_on_low_agreement: true
    guideline_threshold: 0.5

Complete Configuration

yaml
annotation_task_name: "Agreement-Tracked Annotation"
 
quality_control:
  # Redundancy setup
  redundancy:
    annotations_per_item: 3
    assignment_method: random
 
  # Agreement calculation
  agreement:
    enabled: true
 
    # Metrics
    metrics:
      - fleiss_kappa
      - krippendorff_alpha
 
    # Per-scheme configuration
    schemes:
      sentiment:
        type: nominal
        metrics: [fleiss_kappa, cohens_kappa]
 
      intensity:
        type: ordinal
        metrics: [krippendorff_alpha]
        alpha_level: ordinal
 
      entities:
        type: span
        span_matching: overlap
        overlap_threshold: 0.5
        metrics: [span_f1]
 
    # Calculation settings
    calculate_on_overlap: true
    min_overlap: 2
    sample_size: all  # or number
 
    # Pairwise analysis
    pairwise: true
    pairwise_output: agreement_matrix.csv
 
    # Diagnostics
    diagnostics:
      confusion_matrix: true
      disagreed_items: true
      per_annotator: true
 
    # Alerts
    alerts:
      - metric: fleiss_kappa
        threshold: 0.6
        action: notify
 
    # Reporting
    report_file: agreement_report.json
    report_interval: 50
 
  # Dashboard
  dashboard:
    show_agreement: true
    charts:
      - agreement_over_time
      - per_scheme_agreement
      - annotator_comparison

Output Report

json
{
  "timestamp": "2024-10-25T15:30:00Z",
  "sample_size": 500,
  "annotators": ["ann1", "ann2", "ann3"],
 
  "overall_agreement": {
    "fleiss_kappa": 0.72,
    "krippendorff_alpha": 0.75
  },
 
  "per_scheme": {
    "sentiment": {
      "fleiss_kappa": 0.78,
      "confusion_matrix": {
        "Positive": {"Positive": 180, "Negative": 5, "Neutral": 15},
        "Negative": {"Positive": 8, "Negative": 165, "Neutral": 12},
        "Neutral": {"Positive": 12, "Negative": 10, "Neutral": 93}
      }
    }
  },
 
  "pairwise": {
    "ann1_ann2": 0.75,
    "ann1_ann3": 0.70,
    "ann2_ann3": 0.72
  },
 
  "per_annotator": {
    "ann1": {"avg_agreement": 0.73, "items_annotated": 500},
    "ann2": {"avg_agreement": 0.74, "items_annotated": 500},
    "ann3": {"avg_agreement": 0.71, "items_annotated": 500}
  },
 
  "most_disagreed_items": [
    {"id": "item_234", "disagreement_rate": 1.0},
    {"id": "item_567", "disagreement_rate": 0.67}
  ]
}

Best practices

Calculate agreement early rather than waiting until the project ends, because by then it is too late to fix the guidelines. Pick the metric that matches your data: nominal, ordinal, or span. When agreement comes back low, dig in before you blame the annotators, since the problem is often the instructions. Report the figure in any publication. And decide on an acceptable threshold before you start, not after you see the result you got.

For deeper coverage of competence estimation when annotators disagree, see the MACE documentation.

Next steps


Full agreement documentation at /docs/core-concepts/user-management.