Measuring Inter-Annotator Agreement
How to calculate and interpret Cohen's Kappa, Fleiss' Kappa, and Krippendorff's Alpha for your annotation projects.
By Potato Team
Inter-annotator agreement (IAA) measures how consistently different annotators label the same items. High agreement indicates reliable annotations; low agreement suggests unclear guidelines or subjective tasks.
Why Measure Agreement?
- Validate guidelines: Low agreement → unclear instructions
- Assess task difficulty: Some tasks are inherently subjective
- Qualify annotators: Identify who needs more training
- Report reliability: Required for scientific publications
- Aggregate labels: Determine how to combine annotations
Agreement Metrics
Cohen's Kappa (2 Annotators)
For comparing two annotators on categorical data:
κ = (Po - Pe) / (1 - Pe)
Where:
- Po = observed agreement (the proportion of items on which both annotators chose the same label)
- Pe = agreement expected by chance, computed from each annotator's label distribution
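To make the formula concrete, here is a minimal sketch (independent of Potato) that computes Po, Pe, and kappa by hand and checks the result against scikit-learn. The two label lists are invented for illustration.

# A minimal sketch: Cohen's kappa for two annotators on the same items.
from collections import Counter
from sklearn.metrics import cohen_kappa_score  # pip install scikit-learn

ann1 = ["pos", "pos", "neg", "neu", "pos", "neg", "neu", "neg"]
ann2 = ["pos", "neg", "neg", "neu", "pos", "neg", "pos", "neg"]

# Observed agreement Po: fraction of items with identical labels.
p_o = sum(a == b for a, b in zip(ann1, ann2)) / len(ann1)

# Chance agreement Pe: for each label, the product of the two annotators'
# marginal probabilities, summed over all labels.
n = len(ann1)
c1, c2 = Counter(ann1), Counter(ann2)
p_e = sum((c1[label] / n) * (c2[label] / n) for label in set(ann1) | set(ann2))

kappa = (p_o - p_e) / (1 - p_e)
print(f"by hand: {kappa:.3f}")
print(f"sklearn: {cohen_kappa_score(ann1, ann2):.3f}")  # should match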
Interpretation:
| Kappa | Interpretation |
|---|---|
| < 0 | Less than chance |
| 0.01-0.20 | Slight |
| 0.21-0.40 | Fair |
| 0.41-0.60 | Moderate |
| 0.61-0.80 | Substantial |
| 0.81-1.00 | Almost perfect |
Fleiss' Kappa (3+ Annotators)
For multiple annotators on categorical data:
quality_control:
agreement:
metrics:
- fleiss_kappa
Same interpretation scale as Cohen's Kappa.
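If you want to verify Fleiss' kappa outside Potato, one option is statsmodels (pip install statsmodels). The sketch below uses a small, invented item-by-annotator label matrix.

# A minimal sketch: Fleiss' kappa for 3+ annotators via statsmodels.
# Rows are items, columns are annotators; labels are invented.
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = [
    ["pos", "pos", "pos"],
    ["neg", "neg", "neu"],
    ["neu", "neu", "neu"],
    ["pos", "neg", "pos"],
    ["neg", "neg", "neg"],
]

# Convert the item x annotator matrix into item x category counts,
# which is the input format fleiss_kappa expects.
counts, categories = aggregate_raters(ratings)
print(f"Fleiss' kappa: {fleiss_kappa(counts):.3f}")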
Krippendorff's Alpha
The most flexible of the three metrics. It works with:
- Any number of annotators
- Missing data
- Various data types (nominal, ordinal, interval, ratio)
quality_control:
agreement:
metrics:
- krippendorff_alpha
alpha_level: nominal # or ordinal, interval, ratio
Interpretation:
- α ≥ 0.80: Reliable
- 0.67 ≤ α < 0.80: Tentatively acceptable
- α < 0.67: Unreliable
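Outside Potato, the third-party krippendorff package (pip install krippendorff) computes alpha directly and tolerates missing annotations. The sketch below uses invented numeric category codes, with np.nan marking items an annotator skipped.

# A minimal sketch: Krippendorff's alpha with missing annotations.
# Rows are annotators, columns are items; codes 0/1/2 are invented.
import numpy as np
import krippendorff

reliability_data = np.array([
    [0,      1, 2, 0, np.nan, 1],       # annotator 1
    [0,      1, 2, 0, 2,      1],       # annotator 2
    [np.nan, 1, 2, 1, 2,      np.nan],  # annotator 3 (two items skipped)
])

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"alpha = {alpha:.3f}")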
Configuring Agreement in Potato
Basic Setup
quality_control:
agreement:
enabled: true
calculate_on_overlap: true
metrics:
- cohens_kappa
- fleiss_kappa
- krippendorff_alpha
# Per annotation scheme
per_scheme: true
# Reporting
report_interval: 100 # Every 100 annotations
export_file: agreement_report.json
Overlap Configuration
quality_control:
redundancy:
# How many annotators per item
annotations_per_item: 3
# Minimum overlap for calculations
min_overlap_for_agreement: 2
# Sampling for agreement
agreement_sample_size: 100 # Calculate on 100 items
agreement_sample_method: random # or stratified, all
Calculating Agreement
In Dashboard
Potato displays agreement metrics in the admin dashboard:
quality_control:
dashboard:
show_agreement: true
agreement_chart: true
update_frequency: 60 # seconds
Via API
# Get current agreement metrics
curl http://localhost:8000/api/quality/agreement
# Response:
{
"overall": {
"fleiss_kappa": 0.72,
"krippendorff_alpha": 0.75
},
"per_scheme": {
"sentiment": {
"fleiss_kappa": 0.78,
"krippendorff_alpha": 0.80
},
"topic": {
"fleiss_kappa": 0.65,
"krippendorff_alpha": 0.68
}
},
"sample_size": 150,
"annotator_pairs": 10
}
Via CLI
# Calculate agreement from output files
potato agreement --annotations annotation_output/ --output agreement_report.json
# With specific metric
potato agreement --annotations annotation_output/ --metric krippendorff --level ordinal
Agreement for Different Annotation Types
Categorical (Radio, Multiselect)
quality_control:
agreement:
schemes:
sentiment:
type: nominal
metrics: [cohens_kappa, fleiss_kappa]
urgency:
type: ordinal # Low < Medium < High
metrics: [krippendorff_alpha]
Likert Scales
quality_control:
agreement:
schemes:
quality_rating:
type: ordinal
metrics: [krippendorff_alpha, weighted_kappa]
# Weighted kappa for ordinal
weighting: linear # or quadratic
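The idea behind weighted kappa is that disagreeing by one scale point should be penalized less than disagreeing by four. A minimal sketch with scikit-learn, using invented 1-5 quality ratings:

# A minimal sketch: weighted kappa for ordinal (Likert) ratings.
# Disagreements are penalized by how far apart the two ratings are.
from sklearn.metrics import cohen_kappa_score  # pip install scikit-learn

ann1 = [5, 4, 4, 2, 1, 3, 5, 2]   # invented 1-5 ratings
ann2 = [4, 4, 5, 2, 2, 3, 5, 1]

print(cohen_kappa_score(ann1, ann2))                       # unweighted
print(cohen_kappa_score(ann1, ann2, weights="linear"))     # linear penalty
print(cohen_kappa_score(ann1, ann2, weights="quadratic"))  # quadratic penalty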
Span Annotations
For NER, spans require special handling:
quality_control:
agreement:
schemes:
entities:
type: span
span_matching: overlap # or exact, token
# What to compare
compare: label_and_span # or label_only, span_only
# Overlap threshold for "match"
overlap_threshold: 0.5
metrics:
- span_f1
- span_precision
- span_recall
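To illustrate what overlap-based span matching means (this is a sketch, not Potato's internal implementation), the following treats two spans as a match when their character overlap, relative to the longer span, meets the threshold and the labels agree, then derives precision, recall, and F1. Treating one annotator as the reference is only for illustration; for agreement you would typically report these pairwise.

# A minimal sketch of overlap-based span matching between two annotators.
def overlap_ratio(a, b):
    """a, b are (start, end) character offsets."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return inter / max(a[1] - a[0], b[1] - b[0])

def span_f1(reference, other, threshold=0.5, compare_label=True):
    """reference, other: lists of (start, end, label); greedy 1-to-1 matching."""
    unmatched = list(other)
    tp = 0
    for r in reference:
        for o in unmatched:
            if (not compare_label or r[2] == o[2]) and \
                    overlap_ratio(r[:2], o[:2]) >= threshold:
                tp += 1
                unmatched.remove(o)
                break
    precision = tp / len(other) if other else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

ann1 = [(0, 5, "PER"), (10, 20, "ORG")]
ann2 = [(0, 6, "PER"), (12, 20, "ORG"), (25, 30, "LOC")]
print(span_f1(ann1, ann2, threshold=0.5))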
Rankings
quality_control:
agreement:
schemes:
preference_rank:
type: ranking
metrics:
- kendall_tau
- spearman_rho
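Both rank correlations are also available in SciPy if you want to check them by hand; a minimal sketch with two invented rankings of the same five candidates:

# A minimal sketch: rank correlation between two annotators' rankings.
from scipy.stats import kendalltau, spearmanr  # pip install scipy

ann1_ranks = [1, 2, 3, 4, 5]
ann2_ranks = [2, 1, 3, 5, 4]

tau, _ = kendalltau(ann1_ranks, ann2_ranks)
rho, _ = spearmanr(ann1_ranks, ann2_ranks)
print(f"Kendall's tau = {tau:.2f}, Spearman's rho = {rho:.2f}")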
Pairwise vs Overall Agreement
Pairwise (Each Pair)
quality_control:
agreement:
pairwise: true
output_matrix: true # Agreement matrix
# Output:
# annotator1 × annotator2: κ = 0.75
# annotator1 × annotator3: κ = 0.68
# annotator2 × annotator3: κ = 0.82
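A pairwise matrix like the one above can be reproduced from raw labels with scikit-learn and itertools. The annotator-to-labels mapping below is invented, and only items both annotators in a pair saw are compared.

# A minimal sketch: pairwise Cohen's kappa for every annotator pair,
# computed from an {annotator: {item_id: label}} dict (data invented).
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

labels = {
    "annotator1": {"i1": "pos", "i2": "neg", "i3": "neu", "i4": "pos"},
    "annotator2": {"i1": "pos", "i2": "neg", "i3": "pos", "i4": "pos"},
    "annotator3": {"i1": "neg", "i2": "neg", "i3": "neu", "i4": "pos"},
}

for a, b in combinations(labels, 2):
    shared = sorted(labels[a].keys() & labels[b].keys())  # items both saw
    kappa = cohen_kappa_score([labels[a][i] for i in shared],
                              [labels[b][i] for i in shared])
    print(f"{a} x {b}: kappa = {kappa:.2f}")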
Overall (All Annotators)
quality_control:
agreement:
overall: true
metrics:
- fleiss_kappa # Designed for 3+ annotators
- krippendorff_alpha
Handling Low Agreement
Identify Problem Areas
quality_control:
agreement:
diagnostics:
enabled: true
# Items with most disagreement
show_disagreed_items: true
disagreement_threshold: 0.5
# Labels with most confusion
confusion_matrix: true
# Annotators with low agreement
per_annotator_agreement: true
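To show what these diagnostics capture (again a sketch, not Potato's own code), the following computes a per-item disagreement rate and counts which label pairs are confused most often, using invented labels:

# A minimal sketch: per-item disagreement rate and label confusion counts.
from collections import Counter
from itertools import combinations

# item_id -> labels assigned by the different annotators (invented)
item_labels = {
    "item_1": ["pos", "pos", "pos"],
    "item_2": ["pos", "neg", "neu"],
    "item_3": ["neg", "neg", "neu"],
}

confusion = Counter()
for item, labels in item_labels.items():
    pairs = list(combinations(labels, 2))
    disagreements = sum(a != b for a, b in pairs)
    print(f"{item}: disagreement rate = {disagreements / len(pairs):.2f}")
    for a, b in pairs:
        if a != b:
            confusion[tuple(sorted((a, b)))] += 1

print("most confused label pairs:", confusion.most_common(3))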
Actions on Low Agreement
quality_control:
agreement:
alerts:
- threshold: 0.6
action: notify
message: "Agreement below 0.6 - review guidelines"
- threshold: 0.4
action: pause
message: "Agreement critically low - pausing task"
# Automatic guideline reminders
show_guidelines_on_low_agreement: true
guideline_threshold: 0.5
Complete Configuration
annotation_task_name: "Agreement-Tracked Annotation"
quality_control:
# Redundancy setup
redundancy:
annotations_per_item: 3
assignment_method: random
# Agreement calculation
agreement:
enabled: true
# Metrics
metrics:
- fleiss_kappa
- krippendorff_alpha
# Per-scheme configuration
schemes:
sentiment:
type: nominal
metrics: [fleiss_kappa, cohens_kappa]
intensity:
type: ordinal
metrics: [krippendorff_alpha]
alpha_level: ordinal
entities:
type: span
span_matching: overlap
overlap_threshold: 0.5
metrics: [span_f1]
# Calculation settings
calculate_on_overlap: true
min_overlap: 2
sample_size: all # or number
# Pairwise analysis
pairwise: true
pairwise_output: agreement_matrix.csv
# Diagnostics
diagnostics:
confusion_matrix: true
disagreed_items: true
per_annotator: true
# Alerts
alerts:
- metric: fleiss_kappa
threshold: 0.6
action: notify
# Reporting
report_file: agreement_report.json
report_interval: 50
# Dashboard
dashboard:
show_agreement: true
charts:
- agreement_over_time
- per_scheme_agreement
- annotator_comparison
Output Report
{
"timestamp": "2024-10-25T15:30:00Z",
"sample_size": 500,
"annotators": ["ann1", "ann2", "ann3"],
"overall_agreement": {
"fleiss_kappa": 0.72,
"krippendorff_alpha": 0.75
},
"per_scheme": {
"sentiment": {
"fleiss_kappa": 0.78,
"confusion_matrix": {
"Positive": {"Positive": 180, "Negative": 5, "Neutral": 15},
"Negative": {"Positive": 8, "Negative": 165, "Neutral": 12},
"Neutral": {"Positive": 12, "Negative": 10, "Neutral": 93}
}
}
},
"pairwise": {
"ann1_ann2": 0.75,
"ann1_ann3": 0.70,
"ann2_ann3": 0.72
},
"per_annotator": {
"ann1": {"avg_agreement": 0.73, "items_annotated": 500},
"ann2": {"avg_agreement": 0.74, "items_annotated": 500},
"ann3": {"avg_agreement": 0.71, "items_annotated": 500}
},
"most_disagreed_items": [
{"id": "item_234", "disagreement_rate": 1.0},
{"id": "item_567", "disagreement_rate": 0.67}
]
}
Best Practices
- Calculate early: Don't wait until the end
- Use appropriate metrics: Nominal vs ordinal vs span
- Investigate low agreement: Often reveals guideline issues
- Report in publications: Required for scientific work
- Set thresholds: Define acceptable levels upfront
Next Steps
- Improve agreement with quality control
- Add training phases for calibration
- Learn to export data with agreement info
Full agreement documentation at /docs/core-concepts/user-management.