# Measuring Inter-Annotator Agreement

Source: https://www.potatoannotator.com/blog/inter-annotator-agreement

Inter-annotator agreement (IAA) measures how consistently different annotators label the same items. High agreement is evidence that the labels are reliable, although annotators who share a bias can agree and still be wrong. When agreement is low, the usual culprit is either an unclear guideline or a task that is inherently subjective.

## Why measure agreement?

A few reasons it earns its keep. Low agreement usually points back to instructions that need rewriting, so the number doubles as a check on your guidelines. It also tells you how hard the task really is, since some questions have no single right answer. You can spot which annotators need more training. And if you are publishing, reviewers will expect an agreement figure. Finally, the metric informs how you combine multiple annotators into a single label.

For how Potato handles this end to end, see the [source documentation](https://github.com/davidjurgens/potato/blob/master/docs/workflow/quality_control.md).

## Agreement metrics

### Cohen's Kappa (2 annotators)

For comparing two annotators on categorical data:

```
κ = (Po - Pe) / (1 - Pe)
```

Where:
- Po = observed agreement
- Pe = expected agreement by chance

**Interpretation** (the Landis and Koch scale, a rule of thumb rather than a hard standard):

| Kappa | Interpretation |
|-------|---------------|
| < 0 | Less than chance |
| 0.00-0.20 | Slight |
| 0.21-0.40 | Fair |
| 0.41-0.60 | Moderate |
| 0.61-0.80 | Substantial |
| 0.81-1.00 | Almost perfect |
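
To make the formula concrete, here is a minimal Python sketch that computes Po, Pe, and κ from two lists of labels. The function name `cohens_kappa` and the sample labels are illustrative and are not part of Potato.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Po: fraction of items where both annotators chose the same label
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Pe: chance agreement implied by each annotator's own label distribution
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and this sketch returns 1.0 by convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)

ann1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "pos"]
ann2 = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg", "pos", "pos"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.583
```

The two annotators agree on 8 of 10 items (Po = 0.80), but with a 6/4 label split on both sides chance alone gives Pe = 0.52, so κ lands at 0.58, which the table above reads as moderate.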

### Fleiss' Kappa (3+ annotators)

Fleiss' kappa extends chance-corrected agreement to three or more annotators on categorical data. To report it in Potato, list it under `metrics`:

```yaml
quality_control:
  agreement:
    metrics:
      - fleiss_kappa
```

Same interpretation scale as Cohen's Kappa.
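
For readers who want to see what the metric computes, the following is a small Python sketch of the standard Fleiss' kappa calculation. It assumes every item received the same number of ratings; the function name and sample data are illustrative, not Potato's implementation.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa. `ratings` is a list of items; each item is the list of
    labels it received. Every item must have the same number of ratings."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    label_totals = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        label_totals.update(counts)
        # Share of ordered rater pairs on this item that chose the same label
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))
    p_bar = sum(per_item_agreement) / n_items
    # Chance agreement from the pooled label proportions
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in label_totals.values())
    if p_e == 1.0:
        return 1.0  # only one label was ever used; returned as 1.0 by convention
    return (p_bar - p_e) / (1 - p_e)

items = [
    ["pos", "pos", "pos"],
    ["pos", "pos", "neg"],
    ["neg", "neg", "neg"],
    ["neg", "neg", "pos"],
]
print(round(fleiss_kappa(items), 3))  # 0.333
```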

### Krippendorff's Alpha

This is the most flexible of the three. It handles any number of annotators, copes with missing data, and works across nominal, ordinal, interval, and ratio data.

```yaml
quality_control:
  agreement:
    metrics:
      - krippendorff_alpha
    alpha_level: nominal  # or ordinal, interval, ratio
```

**Interpretation:**
- α ≥ 0.80: Reliable
- 0.67 ≤ α < 0.80: Tentatively acceptable
- α < 0.67: Unreliable
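
The missing-data handling is easiest to see in code. Below is a sketch of nominal Krippendorff's alpha using the usual coincidence-matrix formulation, with `None` marking an annotation that was never made. The function name and data layout are assumptions for illustration; ordinal, interval, and ratio levels need a different distance function and are not covered here.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha. `units` is a list of items; each item is a
    list of labels, with None marking a missing annotation."""
    coincidences = Counter()
    for unit in units:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two labels cannot be paired
        # Each ordered pair of labels from different annotators counts 1/(m-1)
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)
    label_totals = Counter()
    for (a, _), weight in coincidences.items():
        label_totals[a] += weight
    n = sum(label_totals.values())
    if n == 0:
        raise ValueError("no item has two or more labels")
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        label_totals[a] * label_totals[b]
        for a in label_totals for b in label_totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label was ever used
    return 1 - observed / expected

units = [
    ["a", "a", "a"],
    ["a", "b", None],
    ["b", "b", "b"],
    ["b", "b", None],
    ["a", None, None],  # skipped: only one label
]
print(round(krippendorff_alpha_nominal(units), 3))  # 0.625
```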

## Configuring Agreement in Potato

### Basic Setup

```yaml
quality_control:
  agreement:
    enabled: true
    calculate_on_overlap: true

    metrics:
      - cohens_kappa
      - fleiss_kappa
      - krippendorff_alpha

    # Per annotation scheme
    per_scheme: true

    # Reporting
    report_interval: 100  # Every 100 annotations
    export_file: agreement_report.json
```

### Overlap Configuration

```yaml
quality_control:
  redundancy:
    # How many annotators per item
    annotations_per_item: 3

    # Minimum overlap for calculations
    min_overlap_for_agreement: 2

    # Sampling for agreement
    agreement_sample_size: 100  # Calculate on 100 items
    agreement_sample_method: random  # or stratified, all
```
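
As a rough illustration of what these settings describe, the sketch below filters items by a minimum number of annotations and then draws a random sample. It is one plausible reading of `min_overlap_for_agreement` and `agreement_sample_size`, not Potato's actual code; the function name and data layout are hypothetical, and only the `random` and `all` methods are modeled.

```python
import random

def sample_overlap_items(annotations, min_overlap=2, sample_size=100, seed=0):
    """Pick items for agreement calculation.

    `annotations` maps item_id -> {annotator_id: label}. Items with fewer than
    `min_overlap` annotations are excluded. Pass sample_size=None to keep all.
    """
    eligible = sorted(
        item_id for item_id, labels in annotations.items()
        if len(labels) >= min_overlap
    )
    if sample_size is None or sample_size >= len(eligible):
        return eligible
    # A fixed seed keeps the sample stable between runs
    return sorted(random.Random(seed).sample(eligible, sample_size))

annotations = {
    "item_1": {"ann1": "pos", "ann2": "pos", "ann3": "neg"},
    "item_2": {"ann1": "neg"},
    "item_3": {"ann2": "neg", "ann3": "neg"},
}
print(sample_overlap_items(annotations, min_overlap=2, sample_size=None))
```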

## Calculating Agreement

### In Dashboard

Potato displays agreement metrics in the admin dashboard:

```yaml
quality_control:
  dashboard:
    show_agreement: true
    agreement_chart: true
    update_frequency: 60  # seconds
```

### Via API

```bash
# Get current agreement metrics
curl http://localhost:8000/api/quality/agreement

# Response:
{
  "overall": {
    "fleiss_kappa": 0.72,
    "krippendorff_alpha": 0.75
  },
  "per_scheme": {
    "sentiment": {
      "fleiss_kappa": 0.78,
      "krippendorff_alpha": 0.80
    },
    "topic": {
      "fleiss_kappa": 0.65,
      "krippendorff_alpha": 0.68
    }
  },
  "sample_size": 150,
  "annotator_pairs": 10
}
```
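
If you want to act on this response from a script, a small Python sketch might look like the following. The endpoint URL and response shape are taken from the example above; the helper names `fetch_agreement` and `schemes_below` are hypothetical.

```python
import json
from urllib.request import urlopen

def fetch_agreement(url="http://localhost:8000/api/quality/agreement"):
    """Fetch the agreement report; assumes the server from the example is running."""
    with urlopen(url) as response:
        return json.load(response)

def schemes_below(report, metric, threshold):
    """Names of annotation schemes whose `metric` is strictly below `threshold`."""
    return sorted(
        name for name, scores in report.get("per_scheme", {}).items()
        if metric in scores and scores[metric] < threshold
    )

# Offline demonstration using the per-scheme values shown above
sample_report = {
    "per_scheme": {
        "sentiment": {"fleiss_kappa": 0.78, "krippendorff_alpha": 0.80},
        "topic": {"fleiss_kappa": 0.65, "krippendorff_alpha": 0.68},
    }
}
print(schemes_below(sample_report, "krippendorff_alpha", 0.80))  # ['topic']
```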

### Via CLI

```bash
# Calculate agreement from output files
potato agreement --annotations annotation_output/ --output agreement_report.json

# With specific metric
potato agreement --annotations annotation_output/ --metric krippendorff --level ordinal
```

## Agreement for Different Annotation Types

### Categorical (Radio, Multiselect)

```yaml
quality_control:
  agreement:
    schemes:
      sentiment:
        type: nominal
        metrics: [cohens_kappa, fleiss_kappa]

      urgency:
        type: ordinal  # Low < Medium < High
        metrics: [krippendorff_alpha]
```

### Likert Scales

```yaml
quality_control:
  agreement:
    schemes:
      quality_rating:
        type: ordinal
        metrics: [krippendorff_alpha, weighted_kappa]

        # Weighted kappa for ordinal
        weighting: linear  # or quadratic
```
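
The difference between linear and quadratic weighting is how harshly large disagreements are penalized. The sketch below implements the standard weighted Cohen's kappa for two annotators; the function name and sample ratings are illustrative and not Potato's implementation.

```python
def weighted_kappa(ratings_a, ratings_b, categories, weighting="linear"):
    """Weighted Cohen's kappa for ordinal labels.

    `categories` lists the labels in rank order, e.g. [1, 2, 3, 4, 5].
    """
    k, n = len(categories), len(ratings_a)
    if k < 2 or n == 0 or n != len(ratings_b):
        raise ValueError("need >= 2 categories and equal-length, non-empty ratings")
    index = {c: i for i, c in enumerate(categories)}
    power = 1 if weighting == "linear" else 2

    def penalty(i, j):
        # 0 for identical ratings, 1 for the two ends of the scale
        return (abs(i - j) / (k - 1)) ** power

    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    d_o = sum(penalty(i, j) * observed[i][j] for i in range(k) for j in range(k))
    d_e = sum(penalty(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if d_e == 0:
        return 1.0  # both annotators used a single identical category
    return 1 - d_o / d_e

a = [1, 2, 3, 3, 2, 1]
b = [1, 2, 3, 2, 2, 1]
print(round(weighted_kappa(a, b, [1, 2, 3], "linear"), 3))     # 0.8
print(round(weighted_kappa(a, b, [1, 2, 3], "quadratic"), 3))  # 0.857
```

Here the only disagreement is off by one step, so quadratic weighting treats it as a smaller error than linear weighting does and reports a higher score.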

### Span Annotations

Spans are trickier than categorical labels, since two annotators can agree on a label but disagree on exactly where it starts and ends. Named entity recognition (NER) and similar span tasks therefore need an explicit matching rule:

```yaml
quality_control:
  agreement:
    schemes:
      entities:
        type: span
        span_matching: overlap  # or exact, token

        # What to compare
        compare: label_and_span  # or label_only, span_only

        # Overlap threshold for "match"
        overlap_threshold: 0.5

        metrics:
          - span_f1
          - span_precision
          - span_recall
```
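
To show how overlap matching can turn into precision, recall, and F1, here is a Python sketch. It assumes the overlap score is intersection-over-union on character offsets and that each span can match at most once; Potato may define `overlap_threshold` differently, and the function names are hypothetical.

```python
def span_overlap(a, b):
    """Intersection-over-union of two (start, end, label) spans with exclusive ends."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union else 0.0

def span_f1(spans_a, spans_b, threshold=0.5, match_label=True):
    """Treat annotator A as the reference and annotator B as the prediction.

    Returns (precision, recall, f1). Matching is greedy and one-to-one.
    """
    unmatched = list(spans_a)
    hits = 0
    for b in spans_b:
        for a in unmatched:
            if match_label and a[2] != b[2]:
                continue  # same region but different label is not a match
            if span_overlap(a, b) >= threshold:
                unmatched.remove(a)
                hits += 1
                break
    precision = hits / len(spans_b) if spans_b else 0.0
    recall = hits / len(spans_a) if spans_a else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

ann_a = [(0, 5, "PER"), (10, 20, "ORG"), (30, 35, "LOC")]
ann_b = [(0, 5, "PER"), (12, 20, "ORG"), (30, 35, "ORG"), (40, 45, "LOC")]
print([round(x, 3) for x in span_f1(ann_a, ann_b)])  # [0.5, 0.667, 0.571]
```

Swapping the two annotators swaps precision and recall but leaves F1 unchanged, which is why F1 is the usual headline number for span agreement.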

### Rankings

```yaml
quality_control:
  agreement:
    schemes:
      preference_rank:
        type: ranking
        metrics:
          - kendall_tau
          - spearman_rho
```
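
Both ranking metrics are short enough to write out. The sketch below computes Kendall's tau (the tau-a variant) and Spearman's rho for two annotators who ranked the same items, assuming no tied ranks. The function names are illustrative.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a: (concordant - discordant) / total pairs. Assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        direction = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    total_pairs = len(rank_a) * (len(rank_a) - 1) / 2
    return (concordant - discordant) / total_pairs

def spearman_rho(rank_a, rank_b):
    """Spearman's rho from squared rank differences. Assumes no ties."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Position i holds the rank each annotator gave to item i
ann1_ranks = [1, 2, 3, 4]
ann2_ranks = [1, 3, 2, 4]
print(round(kendall_tau(ann1_ranks, ann2_ranks), 3))   # 0.667
print(round(spearman_rho(ann1_ranks, ann2_ranks), 3))  # 0.8
```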

## Pairwise vs Overall Agreement

### Pairwise (Each Pair)

```yaml
quality_control:
  agreement:
    pairwise: true
    output_matrix: true  # Agreement matrix

# Output:
# annotator1 × annotator2: κ = 0.75
# annotator1 × annotator3: κ = 0.68
# annotator2 × annotator3: κ = 0.82
```
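
A pairwise matrix like this can be reproduced offline by running Cohen's kappa over every pair of annotators on the items they share. The following sketch assumes annotations are stored as `{annotator: {item_id: label}}`; that layout and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def kappa(a, b):
    """Cohen's kappa for two equal-length label lists."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)

def pairwise_kappa(labels_by_annotator):
    """Kappa for each annotator pair, computed only on items both annotated."""
    scores = {}
    for first, second in combinations(sorted(labels_by_annotator), 2):
        mine, theirs = labels_by_annotator[first], labels_by_annotator[second]
        shared = sorted(set(mine) & set(theirs))
        if not shared:
            continue  # no overlap, so no score for this pair
        scores[(first, second)] = kappa(
            [mine[i] for i in shared], [theirs[i] for i in shared]
        )
    return scores

labels = {
    "annotator1": {1: "pos", 2: "neg", 3: "pos", 4: "neg"},
    "annotator2": {1: "pos", 2: "neg", 3: "neg", 4: "neg"},
    "annotator3": {2: "neg", 3: "pos", 4: "neg", 5: "pos"},
}
for pair, score in pairwise_kappa(labels).items():
    print(pair, round(score, 2))
```

A pair with one consistently low score against everyone else usually points at a single annotator, while uniformly low scores point at the guidelines.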

### Overall (All Annotators)

```yaml
quality_control:
  agreement:
    overall: true
    metrics:
      - fleiss_kappa  # Designed for 3+ annotators
      - krippendorff_alpha
```

## Handling Low Agreement

### Identify Problem Areas

```yaml
quality_control:
  agreement:
    diagnostics:
      enabled: true

      # Items with most disagreement
      show_disagreed_items: true
      disagreement_threshold: 0.5

      # Labels with most confusion
      confusion_matrix: true

      # Annotators with low agreement
      per_annotator_agreement: true
```
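
One simple way to rank items by disagreement is the share of annotator pairs that chose different labels. The sketch below uses that definition, which is an assumption about what `disagreement_threshold` is compared against rather than a description of Potato's internals; the function names are hypothetical.

```python
from itertools import combinations

def disagreement_rate(labels):
    """Fraction of annotator pairs on one item that chose different labels."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

def most_disagreed(items, threshold=0.5):
    """Items at or above `threshold`, most contested first."""
    flagged = [
        (item_id, disagreement_rate(labels))
        for item_id, labels in items.items()
    ]
    flagged = [(item_id, rate) for item_id, rate in flagged if rate >= threshold]
    return sorted(flagged, key=lambda pair: (-pair[1], pair[0]))

items = {
    "item_1": ["pos", "pos", "pos"],
    "item_234": ["pos", "neg", "neutral"],
    "item_567": ["pos", "pos", "neg"],
}
for item_id, rate in most_disagreed(items):
    print(item_id, round(rate, 2))
```

With three annotators the only possible rates are 0, 0.67, and 1.0, so a threshold of 0.5 flags every item that is not unanimous.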

### Actions on Low Agreement

```yaml
quality_control:
  agreement:
    alerts:
      - threshold: 0.6
        action: notify
        message: "Agreement below 0.6 - review guidelines"

      - threshold: 0.4
        action: pause
        message: "Agreement critically low - pausing task"

    # Automatic guideline reminders
    show_guidelines_on_low_agreement: true
    guideline_threshold: 0.5
```
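
The alert list reads naturally as a set of tiers. The sketch below shows one way such tiers could be evaluated, assuming an alert fires when the score is strictly below its threshold and that the lowest matching threshold wins. Both assumptions and the function name are illustrative; Potato's own behavior may differ.

```python
def triggered_alert(score, alerts):
    """Return the most severe alert that `score` falls below, or None."""
    matching = [alert for alert in alerts if score < alert["threshold"]]
    if not matching:
        return None
    # The lowest threshold is the most severe tier
    return min(matching, key=lambda alert: alert["threshold"])

alerts = [
    {"threshold": 0.6, "action": "notify",
     "message": "Agreement below 0.6 - review guidelines"},
    {"threshold": 0.4, "action": "pause",
     "message": "Agreement critically low - pausing task"},
]
for score in (0.72, 0.55, 0.35):
    alert = triggered_alert(score, alerts)
    print(score, alert["action"] if alert else "ok")
```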

## Complete Configuration

```yaml
annotation_task_name: "Agreement-Tracked Annotation"

quality_control:
  # Redundancy setup
  redundancy:
    annotations_per_item: 3
    assignment_method: random

  # Agreement calculation
  agreement:
    enabled: true

    # Metrics
    metrics:
      - fleiss_kappa
      - krippendorff_alpha

    # Per-scheme configuration
    schemes:
      sentiment:
        type: nominal
        metrics: [fleiss_kappa, cohens_kappa]

      intensity:
        type: ordinal
        metrics: [krippendorff_alpha]
        alpha_level: ordinal

      entities:
        type: span
        span_matching: overlap
        overlap_threshold: 0.5
        metrics: [span_f1]

    # Calculation settings
    calculate_on_overlap: true
    min_overlap: 2
    sample_size: all  # or number

    # Pairwise analysis
    pairwise: true
    pairwise_output: agreement_matrix.csv

    # Diagnostics
    diagnostics:
      confusion_matrix: true
      show_disagreed_items: true
      per_annotator_agreement: true

    # Alerts
    alerts:
      - metric: fleiss_kappa
        threshold: 0.6
        action: notify

    # Reporting
    export_file: agreement_report.json
    report_interval: 50

  # Dashboard
  dashboard:
    show_agreement: true
    charts:
      - agreement_over_time
      - per_scheme_agreement
      - annotator_comparison
```

## Output Report

```json
{
  "timestamp": "2024-10-25T15:30:00Z",
  "sample_size": 500,
  "annotators": ["ann1", "ann2", "ann3"],

  "overall_agreement": {
    "fleiss_kappa": 0.72,
    "krippendorff_alpha": 0.75
  },

  "per_scheme": {
    "sentiment": {
      "fleiss_kappa": 0.78,
      "confusion_matrix": {
        "Positive": {"Positive": 180, "Negative": 5, "Neutral": 15},
        "Negative": {"Positive": 8, "Negative": 165, "Neutral": 12},
        "Neutral": {"Positive": 12, "Negative": 10, "Neutral": 93}
      }
    }
  },

  "pairwise": {
    "ann1_ann2": 0.75,
    "ann1_ann3": 0.70,
    "ann2_ann3": 0.72
  },

  "per_annotator": {
    "ann1": {"avg_agreement": 0.73, "items_annotated": 500},
    "ann2": {"avg_agreement": 0.74, "items_annotated": 500},
    "ann3": {"avg_agreement": 0.71, "items_annotated": 500}
  },

  "most_disagreed_items": [
    {"id": "item_234", "disagreement_rate": 1.0},
    {"id": "item_567", "disagreement_rate": 0.67}
  ]
}
```
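
The confusion matrix is often the most actionable part of the report. As a sketch of how to read it programmatically, the snippet below sums the two off-diagonal cells for each label pair to find which labels are most often confused. The matrix values are copied from the report above; the function name is hypothetical.

```python
from itertools import combinations

def most_confused_pairs(confusion):
    """Label pairs sorted by how often they were confused, in either direction."""
    labels = sorted(confusion)
    totals = {
        (a, b): confusion[a].get(b, 0) + confusion[b].get(a, 0)
        for a, b in combinations(labels, 2)
    }
    return sorted(totals.items(), key=lambda item: -item[1])

confusion = {
    "Positive": {"Positive": 180, "Negative": 5, "Neutral": 15},
    "Negative": {"Positive": 8, "Negative": 165, "Neutral": 12},
    "Neutral": {"Positive": 12, "Negative": 10, "Neutral": 93},
}
for (a, b), count in most_confused_pairs(confusion):
    print(f"{a} vs {b}: {count}")
```

In this report the Neutral label is involved in both of the top confusions, which suggests the guideline for when to choose Neutral is the first thing to revisit.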

## Best practices

Calculate agreement early rather than waiting until the project ends, because by then it is too late to fix the guidelines. Pick the metric that matches your data: nominal, ordinal, or span. When agreement comes back low, dig in before you blame the annotators, since the problem is often the instructions. Report the figure in any publication. And decide on an acceptable threshold before you start, not after you see the result you got.

For deeper coverage of competence estimation when annotators disagree, see the [MACE documentation](https://github.com/davidjurgens/potato/blob/master/docs/advanced/mace.md).

## Next steps

- Improve agreement with [quality control](/blog/quality-control-strategies)
- Add [training phases](/docs/features/training-phase) for calibration
- Learn to [export data](/blog/data-format-guide) with agreement info

---

*Full agreement documentation at [/docs/core-concepts/user-management](/docs/core-concepts/user-management).*