Measuring Inter-Annotator Agreement
Calculate and interpret Cohen's Kappa, Fleiss' Kappa, and Krippendorff's Alpha for Potato annotation projects, with Python code examples and interpretation guidelines.
Inter-annotator agreement (IAA) measures how consistently different annotators label the same items. When agreement is high, you can trust the labels. When it is low, the usual culprit is either an unclear guideline or a task that is just inherently subjective.
Why measure agreement?
A few reasons it earns its keep. Low agreement usually points back to instructions that need rewriting, so the number doubles as a check on your guidelines. It also tells you how hard the task really is, since some questions have no single right answer. You can spot which annotators need more training. And if you are publishing, reviewers will expect an agreement figure. Finally, the metric informs how you combine multiple annotators into a single label.
For how Potato handles this end to end, see the source documentation.
Agreement metrics
Cohen's Kappa (2 annotators)
For comparing two annotators on categorical data:
κ = (Po - Pe) / (1 - Pe)
Where:
- Po = observed agreement
- Pe = expected agreement by chance
Interpretation:
| Kappa | Interpretation |
|---|---|
| < 0 | Less than chance |
| 0.00-0.20 | Slight |
| 0.21-0.40 | Fair |
| 0.41-0.60 | Moderate |
| 0.61-0.80 | Substantial |
| 0.81-1.00 | Almost perfect |
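
To make the formula concrete, here is a minimal Python sketch that computes Cohen's Kappa from two label lists and maps the result onto the scale in the table. The function names and the toy labels are illustrative assumptions, not part of Potato.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Po: share of items on which both annotators chose the same label
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Pe: agreement expected by chance, from each annotator's label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

def interpret_kappa(kappa):
    """Map a kappa value onto the interpretation labels from the table."""
    if kappa < 0:
        return "Less than chance"
    for upper, label in [(0.20, "Slight"), (0.40, "Fair"),
                         (0.60, "Moderate"), (0.80, "Substantial")]:
        if kappa <= upper:
            return label
    return "Almost perfect"

# Hypothetical toy data: 10 items, 8 agreements
ann1 = ["pos", "pos", "neg", "neg", "neu", "pos", "neg", "neu", "pos", "neg"]
ann2 = ["pos", "neg", "neg", "neg", "neu", "pos", "neg", "pos", "pos", "neg"]
kappa = cohens_kappa(ann1, ann2)   # Po = 0.80, Pe = 0.38, kappa ≈ 0.677
print(round(kappa, 3), interpret_kappa(kappa))
```

Raw agreement here is 80%, but once chance agreement is removed the score drops to roughly 0.68, which is why kappa is preferred over simple percent agreement.
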
Fleiss' Kappa (3+ annotators)
For three or more annotators on categorical data:

    quality_control:
      agreement:
        metrics:
          - fleiss_kappa

Same interpretation scale as Cohen's Kappa.
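
As a rough illustration of what the metric computes, the sketch below implements the standard Fleiss' Kappa formula in plain Python. It assumes every item was labeled by the same number of annotators; the function name and sample data are hypothetical.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """ratings: list of per-item label lists, each item rated by the same number of annotators."""
    if not ratings:
        raise ValueError("no items")
    n_items, n_raters = len(ratings), len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    counts = [Counter(item) for item in ratings]
    # P_i: share of annotator pairs that agree on item i
    p_items = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_items) / n_items
    # p_j: share of all assignments that went to category j
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return 1.0  # only one category was ever used
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical data: 4 items, 3 annotators each
items = [
    ["pos", "pos", "pos"],
    ["pos", "pos", "neg"],
    ["neg", "neg", "neg"],
    ["neg", "neu", "neu"],
]
print(round(fleiss_kappa(items), 3))  # mean P_i = 2/3, Pe = 0.375, kappa = 7/15 ≈ 0.467
```
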
Krippendorff's Alpha
This is the most flexible of the three. It handles any number of annotators, copes with missing data, and works across nominal, ordinal, interval, and ratio data.

    quality_control:
      agreement:
        metrics:
          - krippendorff_alpha
        alpha_level: nominal  # or ordinal, interval, ratio

Interpretation:
- α ≥ 0.80: Reliable
- 0.67 ≤ α < 0.80: Tentatively acceptable
- α < 0.67: Unreliable
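
The sketch below shows one way to compute Krippendorff's Alpha for nominal data using the coincidence-matrix formulation, including the handling of missing annotations. It is a from-scratch illustration under assumed names and toy data (`None` marks a missing label), not Potato's implementation; ordinal, interval, and ratio levels would need a different distance function.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """units: list of per-item label lists; None marks a missing annotation."""
    coincidences = Counter()
    for unit in units:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two labels cannot be paired
        # every ordered pair of labels from different annotators counts 1/(m-1)
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)
    # n_c: how many pairable labels fall in each category
    n_c = Counter()
    for (a, _), weight in coincidences.items():
        n_c[a] += weight
    n = sum(n_c.values())
    if n < 2:
        raise ValueError("no item has at least two annotations")
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n - 1)
    if expected == 0:
        return 1.0  # alpha is undefined without label variation; this sketch returns 1.0
    return 1 - observed / expected

# Hypothetical data: 5 items, up to 3 annotators, some labels missing
units = [
    ["a", "a", None],
    ["a", "b", "b"],
    ["b", "b", "b"],
    ["a", None, None],   # ignored: only one label
    ["b", "b", "a"],
]
alpha = krippendorff_alpha_nominal(units)
print(round(alpha, 3))  # 2/7 ≈ 0.286, well below the 0.67 cutoff
```
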
Configuring Agreement in Potato
Basic Setup

    quality_control:
      agreement:
        enabled: true
        calculate_on_overlap: true
        metrics:
          - cohens_kappa
          - fleiss_kappa
          - krippendorff_alpha
        # Per annotation scheme
        per_scheme: true
        # Reporting
        report_interval: 100  # Every 100 annotations
        export_file: agreement_report.json

Overlap Configuration
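
Conceptually, overlap settings decide which items are eligible for agreement calculation and how many of them get sampled. The following Python sketch illustrates that idea only; the function, its parameters, and the data shape are assumptions made for this example and do not describe Potato's internals.

```python
import random

def select_overlap_items(annotations, min_overlap=2, sample_size=None, seed=0):
    """annotations: dict of item_id -> dict of annotator_id -> label.

    Returns the ids of items labeled by at least `min_overlap` annotators,
    optionally reduced to a random sample of `sample_size` items.
    """
    eligible = sorted(
        item_id for item_id, labels in annotations.items()
        if len(labels) >= min_overlap
    )
    if sample_size is None or sample_size >= len(eligible):
        return eligible
    # a fixed seed keeps the sample reproducible between runs
    return sorted(random.Random(seed).sample(eligible, sample_size))

# Hypothetical annotations: item_3 has a single label and is excluded
annotations = {
    "item_1": {"ann1": "pos", "ann2": "pos", "ann3": "neg"},
    "item_2": {"ann1": "neg", "ann2": "neg"},
    "item_3": {"ann1": "neu"},
    "item_4": {"ann2": "pos", "ann3": "pos"},
}
print(select_overlap_items(annotations, min_overlap=2))
```
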

    quality_control:
      redundancy:
        # How many annotators per item
        annotations_per_item: 3
        # Minimum overlap for calculations
        min_overlap_for_agreement: 2
        # Sampling for agreement
        agreement_sample_size: 100  # Calculate on 100 items
        agreement_sample_method: random  # or stratified, all

Calculating Agreement
In Dashboard
Potato displays agreement metrics in the admin dashboard:

    quality_control:
      dashboard:
        show_agreement: true
        agreement_chart: true
        update_frequency: 60  # seconds

Via API
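
If you would rather consume the endpoint from Python than from curl, a small sketch along these lines may help. It assumes the URL and the response shape shown in this section; `fetch_agreement` and `schemes_below` are hypothetical helper names, and the sample payload is copied from the example response so the snippet runs without a server.

```python
import json
from urllib.request import urlopen

def fetch_agreement(base_url="http://localhost:8000"):
    """Fetch the agreement metrics; assumes the endpoint shown in this section."""
    with urlopen(f"{base_url}/api/quality/agreement", timeout=10) as resp:
        return json.load(resp)

def schemes_below(payload, metric, threshold):
    """Names of schemes whose `metric` is strictly below `threshold`."""
    return sorted(
        name for name, scores in payload.get("per_scheme", {}).items()
        if metric in scores and scores[metric] < threshold
    )

# Offline sample mirroring the documented response
sample = json.loads("""
{
  "overall": {"fleiss_kappa": 0.72, "krippendorff_alpha": 0.75},
  "per_scheme": {
    "sentiment": {"fleiss_kappa": 0.78, "krippendorff_alpha": 0.80},
    "topic": {"fleiss_kappa": 0.65, "krippendorff_alpha": 0.68}
  },
  "sample_size": 150,
  "annotator_pairs": 10
}
""")
print(schemes_below(sample, "krippendorff_alpha", 0.80))  # ['topic']
```
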

    # Get current agreement metrics
    curl http://localhost:8000/api/quality/agreement

    # Response:
    {
      "overall": {
        "fleiss_kappa": 0.72,
        "krippendorff_alpha": 0.75
      },
      "per_scheme": {
        "sentiment": {
          "fleiss_kappa": 0.78,
          "krippendorff_alpha": 0.80
        },
        "topic": {
          "fleiss_kappa": 0.65,
          "krippendorff_alpha": 0.68
        }
      },
      "sample_size": 150,
      "annotator_pairs": 10
    }

Via CLI

    # Calculate agreement from output files
    potato agreement --annotations annotation_output/ --output agreement_report.json

    # With specific metric
    potato agreement --annotations annotation_output/ --metric krippendorff --level ordinal

Agreement for Different Annotation Types
Categorical (Radio, Multiselect)

    quality_control:
      agreement:
        schemes:
          sentiment:
            type: nominal
            metrics: [cohens_kappa, fleiss_kappa]
          urgency:
            type: ordinal  # Low < Medium < High
            metrics: [krippendorff_alpha]

Likert Scales
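
On ordinal scales, a rating of 4 versus 5 is a smaller disagreement than 1 versus 5, and weighted kappa accounts for that. The Python sketch below shows the usual linear and quadratic weighting for two annotators; the function name and ratings are made up for illustration.

```python
def weighted_kappa(ratings_a, ratings_b, categories, weighting="linear"):
    """Weighted Cohen's kappa for two annotators on an ordered category list."""
    index = {c: i for i, c in enumerate(categories)}
    k, n = len(categories), len(ratings_a)
    if n == 0 or len(ratings_b) != n or k < 2:
        raise ValueError("need equal-length, non-empty ratings and >= 2 categories")
    power = {"linear": 1, "quadratic": 2}[weighting]
    # observed[i][j]: items rated category i by A and category j by B
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            # disagreement weight grows with the distance between categories
            w = (abs(i - j) / (k - 1)) ** power
            num += w * observed[i][j]            # observed weighted disagreement
            den += w * row[i] * col[j] / n       # disagreement expected by chance
    return 1.0 if den == 0 else 1 - num / den

# Hypothetical 5-point Likert ratings from two annotators
scale = [1, 2, 3, 4, 5]
a = [1, 2, 3, 4, 5, 3]
b = [1, 3, 3, 5, 5, 2]
print(round(weighted_kappa(a, b, scale, "linear"), 3))     # 19/28 ≈ 0.679
print(round(weighted_kappa(a, b, scale, "quadratic"), 3))  # 20/23 ≈ 0.870
```

Quadratic weights are more forgiving of off-by-one disagreements, which is why the second score is higher on the same data.
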

    quality_control:
      agreement:
        schemes:
          quality_rating:
            type: ordinal
            metrics: [krippendorff_alpha, weighted_kappa]
            # Weighted kappa for ordinal
            weighting: linear  # or quadratic

Span Annotations
Spans are trickier than categorical labels, since two annotators can agree on a label but disagree on exactly where it starts and ends. NER needs special handling:
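
To make overlap-based matching concrete, here is a minimal Python sketch that treats one annotator as the reference and the other as the prediction. It assumes overlap is measured as intersection over union of character offsets and that matching is greedy and one-to-one; Potato's actual matching rule may differ, and all names here are hypothetical.

```python
def overlap_ratio(a, b):
    """Intersection over union of two (start, end) spans, end exclusive."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

def span_agreement(spans_a, spans_b, threshold=0.5):
    """spans_*: lists of (start, end, label). Returns (precision, recall, f1)."""
    unmatched_b = list(spans_b)
    matches = 0
    for sa in spans_a:
        for sb in unmatched_b:
            # a match needs the same label and enough positional overlap
            if sa[2] == sb[2] and overlap_ratio(sa[:2], sb[:2]) >= threshold:
                unmatched_b.remove(sb)  # each span can be matched only once
                matches += 1
                break
    precision = matches / len(spans_b) if spans_b else 0.0
    recall = matches / len(spans_a) if spans_a else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical entity spans from two annotators on the same text
ann_a = [(0, 5, "PER"), (10, 20, "ORG"), (30, 35, "LOC")]
ann_b = [(0, 5, "PER"), (12, 20, "ORG"), (30, 35, "PER"), (40, 45, "LOC")]
p, r, f1 = span_agreement(ann_a, ann_b)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.5 0.667 0.571
```

The ORG spans match despite different start offsets (overlap 0.8), while the span at 30-35 fails because the labels differ. In Potato, the matching rule and threshold are set per scheme:
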

    quality_control:
      agreement:
        schemes:
          entities:
            type: span
            span_matching: overlap  # or exact, token
            # What to compare
            compare: label_and_span  # or label_only, span_only
            # Overlap threshold for "match"
            overlap_threshold: 0.5
            metrics:
              - span_f1
              - span_precision
              - span_recall

Rankings
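
For ranking tasks, agreement is usually expressed as a rank correlation. The sketch below computes Kendall's tau and Spearman's rho for two complete rankings without ties, in plain Python; the function names and the example rankings are assumptions for illustration.

```python
from itertools import combinations

def _positions(ranking):
    """Map each item to its position (0 = most preferred)."""
    pos = {item: i for i, item in enumerate(ranking)}
    if len(pos) != len(ranking):
        raise ValueError("rankings must not contain duplicates")
    return pos

def _check(pos_a, pos_b):
    if set(pos_a) != set(pos_b) or len(pos_a) < 2:
        raise ValueError("rankings must cover the same items (at least two)")

def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two tie-free rankings of the same items."""
    pos_a, pos_b = _positions(rank_a), _positions(rank_b)
    _check(pos_a, pos_b)
    score = 0
    # x precedes y in rank_a; the pair is concordant if B orders it the same way
    for x, y in combinations(rank_a, 2):
        score += 1 if pos_b[x] < pos_b[y] else -1
    n = len(rank_a)
    return score / (n * (n - 1) / 2)

def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings of the same items."""
    pos_a, pos_b = _positions(rank_a), _positions(rank_b)
    _check(pos_a, pos_b)
    n = len(rank_a)
    d2 = sum((pos_a[item] - pos_b[item]) ** 2 for item in pos_a)
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical rankings: the annotators disagree only on the order of b and c
rank1 = ["a", "b", "c", "d"]
rank2 = ["a", "c", "b", "d"]
print(round(kendall_tau(rank1, rank2), 3))   # 4/6 ≈ 0.667
print(round(spearman_rho(rank1, rank2), 3))  # 0.8
```
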

    quality_control:
      agreement:
        schemes:
          preference_rank:
            type: ranking
            metrics:
              - kendall_tau
              - spearman_rho

Pairwise vs Overall Agreement
Pairwise (Each Pair)
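
Pairwise agreement simply applies Cohen's Kappa to every pair of annotators on the items they both labeled. The Python sketch below shows the idea with a compact kappa helper; the data layout (annotator to item to label) and function names are assumptions for this example.

```python
from collections import Counter
from itertools import combinations

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two aligned, non-empty label lists."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)

def pairwise_kappa(labels_by_annotator):
    """labels_by_annotator: dict of annotator -> dict of item_id -> label."""
    result = {}
    for x, y in combinations(sorted(labels_by_annotator), 2):
        # only items that both annotators labeled can be compared
        shared = sorted(set(labels_by_annotator[x]) & set(labels_by_annotator[y]))
        if not shared:
            continue
        result[(x, y)] = cohens_kappa(
            [labels_by_annotator[x][i] for i in shared],
            [labels_by_annotator[y][i] for i in shared],
        )
    return result

# Hypothetical labels: ann2 deviates from the others on item 3
labels = {
    "ann1": {1: "pos", 2: "neg", 3: "pos", 4: "neg"},
    "ann2": {1: "pos", 2: "neg", 3: "neg", 4: "neg"},
    "ann3": {1: "pos", 2: "neg", 3: "pos", 4: "neg"},
}
for pair, kappa in pairwise_kappa(labels).items():
    print(pair, round(kappa, 2))
```

A matrix like this makes it easy to see whether low overall agreement comes from one annotator or from everyone. In Potato, pairwise output is enabled in the configuration:
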

    quality_control:
      agreement:
        pairwise: true
        output_matrix: true  # Agreement matrix

    # Output:
    # annotator1 × annotator2: κ = 0.75
    # annotator1 × annotator3: κ = 0.68
    # annotator2 × annotator3: κ = 0.82

Overall (All Annotators)

    quality_control:
      agreement:
        overall: true
        metrics:
          - fleiss_kappa  # Designed for 3+ annotators
          - krippendorff_alpha

Handling Low Agreement
Identify Problem Areas
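
A simple way to surface problem items is to score each one by the share of annotator pairs that disagree on it. That definition is an assumption (the source does not spell out how Potato computes its disagreement rate), and the helper names below are hypothetical.

```python
from itertools import combinations

def disagreement_rate(labels):
    """Share of annotator pairs that chose different labels for one item."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        raise ValueError("need at least two labels")
    return sum(a != b for a, b in pairs) / len(pairs)

def most_disagreed(items, threshold=0.5):
    """items: dict of item_id -> list of labels.

    Returns (item_id, rate) tuples at or above `threshold`, worst first.
    """
    scored = [
        (item_id, disagreement_rate(labels))
        for item_id, labels in items.items()
        if len(labels) >= 2
    ]
    flagged = [s for s in scored if s[1] >= threshold]
    return sorted(flagged, key=lambda s: (-s[1], s[0]))

# Hypothetical labels from three annotators
items = {
    "item_001": ["pos", "pos", "pos"],   # full agreement, rate 0.0
    "item_234": ["pos", "neg", "neu"],   # every pair disagrees, rate 1.0
    "item_567": ["pos", "pos", "neg"],   # two of three pairs disagree, rate ≈ 0.67
}
for item_id, rate in most_disagreed(items):
    print(item_id, round(rate, 2))
```

Reading the flagged items side by side is often the fastest way to find the guideline that needs rewriting.
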

    quality_control:
      agreement:
        diagnostics:
          enabled: true
          # Items with most disagreement
          show_disagreed_items: true
          disagreement_threshold: 0.5
          # Labels with most confusion
          confusion_matrix: true
          # Annotators with low agreement
          per_annotator_agreement: true

Actions on Low Agreement
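
Threshold-based alerts boil down to comparing the current score against a list of cutoffs. The sketch below shows one plausible resolution rule, where the most severe triggered alert wins; that rule and the function name are assumptions for illustration, and Potato may resolve overlapping alerts differently.

```python
def triggered_alert(score, alerts):
    """Return the triggered alert with the lowest threshold, or None.

    alerts: list of dicts with "threshold" and "action" keys.
    """
    hits = [a for a in alerts if score < a["threshold"]]
    # the lowest threshold that was crossed is treated as the most severe
    return min(hits, key=lambda a: a["threshold"]) if hits else None

# Thresholds mirroring the configuration in this section
alerts = [
    {"threshold": 0.6, "action": "notify"},
    {"threshold": 0.4, "action": "pause"},
]
for score in (0.72, 0.55, 0.35):
    alert = triggered_alert(score, alerts)
    print(score, alert["action"] if alert else "ok")
```
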

    quality_control:
      agreement:
        alerts:
          - threshold: 0.6
            action: notify
            message: "Agreement below 0.6 - review guidelines"
          - threshold: 0.4
            action: pause
            message: "Agreement critically low - pausing task"
        # Automatic guideline reminders
        show_guidelines_on_low_agreement: true
        guideline_threshold: 0.5

Complete Configuration

    annotation_task_name: "Agreement-Tracked Annotation"

    quality_control:
      # Redundancy setup
      redundancy:
        annotations_per_item: 3
        assignment_method: random

      # Agreement calculation
      agreement:
        enabled: true

        # Metrics
        metrics:
          - fleiss_kappa
          - krippendorff_alpha

        # Per-scheme configuration
        schemes:
          sentiment:
            type: nominal
            metrics: [fleiss_kappa, cohens_kappa]
          intensity:
            type: ordinal
            metrics: [krippendorff_alpha]
            alpha_level: ordinal
          entities:
            type: span
            span_matching: overlap
            overlap_threshold: 0.5
            metrics: [span_f1]

        # Calculation settings
        calculate_on_overlap: true
        min_overlap: 2
        sample_size: all  # or number

        # Pairwise analysis
        pairwise: true
        pairwise_output: agreement_matrix.csv

        # Diagnostics
        diagnostics:
          confusion_matrix: true
          disagreed_items: true
          per_annotator: true

        # Alerts
        alerts:
          - metric: fleiss_kappa
            threshold: 0.6
            action: notify

        # Reporting
        report_file: agreement_report.json
        report_interval: 50

      # Dashboard
      dashboard:
        show_agreement: true
        charts:
          - agreement_over_time
          - per_scheme_agreement
          - annotator_comparison

Output Report

    {
      "timestamp": "2024-10-25T15:30:00Z",
      "sample_size": 500,
      "annotators": ["ann1", "ann2", "ann3"],
      "overall_agreement": {
        "fleiss_kappa": 0.72,
        "krippendorff_alpha": 0.75
      },
      "per_scheme": {
        "sentiment": {
          "fleiss_kappa": 0.78,
          "confusion_matrix": {
            "Positive": {"Positive": 180, "Negative": 5, "Neutral": 15},
            "Negative": {"Positive": 8, "Negative": 165, "Neutral": 12},
            "Neutral": {"Positive": 12, "Negative": 10, "Neutral": 93}
          }
        }
      },
      "pairwise": {
        "ann1_ann2": 0.75,
        "ann1_ann3": 0.70,
        "ann2_ann3": 0.72
      },
      "per_annotator": {
        "ann1": {"avg_agreement": 0.73, "items_annotated": 500},
        "ann2": {"avg_agreement": 0.74, "items_annotated": 500},
        "ann3": {"avg_agreement": 0.71, "items_annotated": 500}
      },
      "most_disagreed_items": [
        {"id": "item_234", "disagreement_rate": 1.0},
        {"id": "item_567", "disagreement_rate": 0.67}
      ]
    }

Best practices
Calculate agreement early rather than waiting until the project ends, because by then it is too late to fix the guidelines. Pick the metric that matches your data: nominal, ordinal, or span. When agreement comes back low, dig in before you blame the annotators, since the problem is often the instructions. Report the figure in any publication. And decide on an acceptable threshold before you start, not after you see the result you got.
For deeper coverage of competence estimation when annotators disagree, see the MACE documentation.
Next steps
- Improve agreement with quality control
- Add training phases for calibration
- Learn to export data with agreement info
Full agreement documentation at /docs/core-concepts/user-management.