Measuring Inter-Annotator Agreement
How to calculate and interpret Cohen's Kappa, Fleiss' Kappa, and Krippendorff's Alpha for your annotation projects.
Potato Team
Inter-annotator agreement (IAA) measures how consistently different annotators label the same items. High agreement indicates reliable annotations; low agreement suggests unclear guidelines or subjective tasks.
Why Measure Agreement?
- Validate guidelines: Low agreement → unclear instructions
- Assess task difficulty: Some tasks are inherently subjective
- Qualify annotators: Identify who needs more training
- Report reliability: Required for scientific publications
- Aggregate labels: Determine how to combine annotations
Agreement Metrics
Cohen's Kappa (2 Annotators)
For comparing two annotators on categorical data:
text
κ = (Po - Pe) / (1 - Pe)
Where:
- Po = observed agreement
- Pe = expected agreement by chance
Interpretation:
| Kappa | Interpretation |
|---|---|
| < 0 | Less than chance |
| 0.01-0.20 | Slight |
| 0.21-0.40 | Fair |
| 0.41-0.60 | Moderate |
| 0.61-0.80 | Substantial |
| 0.81-1.00 | Almost perfect |
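As a minimal sketch of the formula above, the function below computes κ for two annotators whose labels are held in parallel Python lists; the labels and the helper itself are illustrative, not part of Potato's API.
python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    # Po: fraction of items both annotators labeled identically
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Pe: chance agreement, from each annotator's label distribution
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(labels_a) | set(labels_b))
    return (p_o - p_e) / (1 - p_e)

ann1 = ["pos", "pos", "neg", "neu", "pos", "neg"]
ann2 = ["pos", "neg", "neg", "neu", "pos", "pos"]
print(round(cohens_kappa(ann1, ann2), 3))  # ~0.45: "moderate" on the scale above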
Fleiss' Kappa (3+ Annotators)
For multiple annotators on categorical data:
yaml
quality_control:
  agreement:
    metrics:
      - fleiss_kappa

Same interpretation scale as Cohen's Kappa.
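If you want to spot-check the value Potato reports, a sketch using statsmodels, assuming ratings are collected as an items × annotators array (the data below is illustrative):
python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = items, columns = annotators, values = integer-encoded category labels
ratings = np.array([
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [2, 2, 1],
    [0, 1, 2],
])
# aggregate_raters converts this to an items x categories count table
table, _ = aggregate_raters(ratings)
print(round(fleiss_kappa(table), 3))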
Krippendorff's Alpha
Most flexible - works with:
- Any number of annotators
- Missing data
- Various data types (nominal, ordinal, interval, ratio)
yaml
quality_control:
  agreement:
    metrics:
      - krippendorff_alpha
    alpha_level: nominal  # or ordinal, interval, ratio

Interpretation:
- α ≥ 0.80: Reliable
- 0.67 ≤ α < 0.80: Tentatively acceptable
- α < 0.67: Unreliable
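For an offline check, the krippendorff package (pip install krippendorff) accepts a reliability matrix with missing values, which is exactly what makes alpha suitable when annotators only partially overlap. A sketch with illustrative data:
python
import numpy as np
import krippendorff

# Rows = annotators, columns = items; np.nan marks items an annotator skipped
reliability_data = np.array([
    [0,      0, 1, 1, np.nan, 2],
    [0,      0, 1, 0, 2,      2],
    [np.nan, 0, 1, 1, 2,      2],
])
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(round(alpha, 3))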
Configuring Agreement in Potato
Basic Setup
yaml
quality_control:
  agreement:
    enabled: true
    calculate_on_overlap: true
    metrics:
      - cohens_kappa
      - fleiss_kappa
      - krippendorff_alpha
    # Per annotation scheme
    per_scheme: true
    # Reporting
    report_interval: 100  # Every 100 annotations
    export_file: agreement_report.json

Overlap Configuration
yaml
quality_control:
  redundancy:
    # How many annotators per item
    annotations_per_item: 3
    # Minimum overlap for calculations
    min_overlap_for_agreement: 2
    # Sampling for agreement
    agreement_sample_size: 100  # Calculate on 100 items
    agreement_sample_method: random  # or stratified, all

Calculating Agreement
In Dashboard
Potato displays agreement metrics in the admin dashboard:
yaml
quality_control:
  dashboard:
    show_agreement: true
    agreement_chart: true
    update_frequency: 60  # seconds

Via API
bash
# Get current agreement metrics
curl http://localhost:8000/api/quality/agreement
# Response:
{
  "overall": {
    "fleiss_kappa": 0.72,
    "krippendorff_alpha": 0.75
  },
  "per_scheme": {
    "sentiment": {
      "fleiss_kappa": 0.78,
      "krippendorff_alpha": 0.80
    },
    "topic": {
      "fleiss_kappa": 0.65,
      "krippendorff_alpha": 0.68
    }
  },
  "sample_size": 150,
  "annotator_pairs": 10
}
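The same endpoint can be polled from a monitoring script. A sketch using the requests library; the 0.6 cutoff is an illustrative choice, not a Potato default:
python
import requests

resp = requests.get("http://localhost:8000/api/quality/agreement", timeout=10)
resp.raise_for_status()
report = resp.json()

# Flag any annotation scheme whose agreement has dropped below the cutoff
for scheme, scores in report["per_scheme"].items():
    kappa = scores["fleiss_kappa"]
    flag = "  <-- review guidelines" if kappa < 0.6 else ""
    print(f"{scheme}: fleiss_kappa={kappa:.2f}{flag}")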
Via CLI
bash
# Calculate agreement from output files
potato agreement --annotations annotation_output/ --output agreement_report.json
# With specific metric
potato agreement --annotations annotation_output/ --metric krippendorff --level ordinal

Agreement for Different Annotation Types
Categorical (Radio, Multiselect)
yaml
quality_control:
  agreement:
    schemes:
      sentiment:
        type: nominal
        metrics: [cohens_kappa, fleiss_kappa]
      urgency:
        type: ordinal  # Low < Medium < High
        metrics: [krippendorff_alpha]

Likert Scales
yaml
quality_control:
  agreement:
    schemes:
      quality_rating:
        type: ordinal
        metrics: [krippendorff_alpha, weighted_kappa]
        # Weighted kappa for ordinal
        weighting: linear  # or quadratic
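To see what linear versus quadratic weighting does on an ordinal scale, here is a sketch using scikit-learn's weighted Cohen's kappa (the 1-5 ratings are illustrative):
python
from sklearn.metrics import cohen_kappa_score

# Two annotators rating the same items on a 1-5 Likert scale
ann1 = [5, 4, 4, 2, 1, 3, 5, 2]
ann2 = [4, 4, 3, 2, 2, 3, 5, 1]

# Unweighted kappa treats a 1-step and a 4-step disagreement the same;
# weighted kappa penalizes larger ordinal distances more heavily
print(cohen_kappa_score(ann1, ann2))                        # nominal
print(cohen_kappa_score(ann1, ann2, weights="linear"))      # linear weights
print(cohen_kappa_score(ann1, ann2, weights="quadratic"))   # quadratic weights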
Span Annotations

For NER, spans require special handling:
yaml
quality_control:
  agreement:
    schemes:
      entities:
        type: span
        span_matching: overlap  # or exact, token
        # What to compare
        compare: label_and_span  # or label_only, span_only
        # Overlap threshold for "match"
        overlap_threshold: 0.5
        metrics:
          - span_f1
          - span_precision
          - span_recall
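As a sketch of what overlap-based span matching involves, the helper below scores two annotators' spans as (start, end, label) triples. The 0.5 cutoff mirrors overlap_threshold above; the exact matching rules (overlap measured relative to the shorter span, greedy one-to-one matching) are illustrative assumptions, not Potato's internal definition.
python
def overlap_ratio(a, b):
    """Character overlap of two (start, end) spans, relative to the shorter span."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return inter / min(a[1] - a[0], b[1] - b[0])

def span_f1(spans_a, spans_b, threshold=0.5):
    """Pairwise span F1: a pair matches if labels agree and overlap >= threshold."""
    matched, used = 0, set()
    for sa in spans_a:
        for j, sb in enumerate(spans_b):
            if j in used:
                continue
            if sa[2] == sb[2] and overlap_ratio(sa[:2], sb[:2]) >= threshold:
                matched += 1
                used.add(j)
                break
    precision = matched / len(spans_a) if spans_a else 0.0
    recall = matched / len(spans_b) if spans_b else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

ann1 = [(0, 5, "PER"), (12, 20, "ORG")]
ann2 = [(0, 6, "PER"), (13, 20, "ORG"), (25, 30, "LOC")]
print(round(span_f1(ann1, ann2), 3))  # 0.8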
Rankings
yaml
quality_control:
  agreement:
    schemes:
      preference_rank:
        type: ranking
        metrics:
          - kendall_tau
          - spearman_rho
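Both rank correlations are available in SciPy if you want to check a pair of rankings by hand (the rankings below are illustrative):
python
from scipy.stats import kendalltau, spearmanr

# Two annotators ranking the same 5 candidate responses (1 = best)
ann1 = [1, 2, 3, 4, 5]
ann2 = [2, 1, 3, 5, 4]

tau, _ = kendalltau(ann1, ann2)
rho, _ = spearmanr(ann1, ann2)
print(f"kendall_tau={tau:.2f}, spearman_rho={rho:.2f}")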
Pairwise vs Overall Agreement

Pairwise (Each Pair)
yaml
quality_control:
  agreement:
    pairwise: true
    output_matrix: true  # Agreement matrix

# Output:
# annotator1 × annotator2: κ = 0.75
# annotator1 × annotator3: κ = 0.68
# annotator2 × annotator3: κ = 0.82
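A matrix like this can be reproduced from raw labels with scikit-learn's Cohen's kappa; the annotator dictionary below is a hypothetical stand-in for your exported annotations:
python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

labels = {
    "annotator1": ["pos", "neg", "neu", "pos", "neg"],
    "annotator2": ["pos", "neg", "pos", "pos", "neg"],
    "annotator3": ["neg", "neg", "neu", "pos", "neg"],
}
# One kappa per annotator pair
for a, b in combinations(labels, 2):
    kappa = cohen_kappa_score(labels[a], labels[b])
    print(f"{a} × {b}: κ = {kappa:.2f}")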
Overall (All Annotators)
yaml
quality_control:
  agreement:
    overall: true
    metrics:
      - fleiss_kappa  # Designed for 3+ annotators
      - krippendorff_alpha

Handling Low Agreement
Identify Problem Areas
yaml
quality_control:
  agreement:
    diagnostics:
      enabled: true
      # Items with most disagreement
      show_disagreed_items: true
      disagreement_threshold: 0.5
      # Labels with most confusion
      confusion_matrix: true
      # Annotators with low agreement
      per_annotator_agreement: true
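As a sketch, the most-disagreed items can also be ranked directly from raw labels. The pairwise definition used here (share of annotator pairs that chose different labels) is an assumption, chosen because it matches the rates shown in the report format later on this page:
python
from itertools import combinations

# Hypothetical raw annotations: item id -> labels from each annotator
annotations = {
    "item_1": ["pos", "pos", "pos"],
    "item_2": ["pos", "neg", "neu"],
    "item_3": ["neg", "neg", "pos"],
}

def disagreement_rate(labels):
    """Fraction of annotator pairs that chose different labels for this item."""
    pairs = list(combinations(labels, 2))
    return sum(a != b for a, b in pairs) / len(pairs)

for item in sorted(annotations, key=lambda i: disagreement_rate(annotations[i]), reverse=True):
    print(item, round(disagreement_rate(annotations[item]), 2))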
Actions on Low Agreement
yaml
quality_control:
  agreement:
    alerts:
      - threshold: 0.6
        action: notify
        message: "Agreement below 0.6 - review guidelines"
      - threshold: 0.4
        action: pause
        message: "Agreement critically low - pausing task"
    # Automatic guideline reminders
    show_guidelines_on_low_agreement: true
    guideline_threshold: 0.5

Complete Configuration
yaml
annotation_task_name: "Agreement-Tracked Annotation"
quality_control:
  # Redundancy setup
  redundancy:
    annotations_per_item: 3
    assignment_method: random
  # Agreement calculation
  agreement:
    enabled: true
    # Metrics
    metrics:
      - fleiss_kappa
      - krippendorff_alpha
    # Per-scheme configuration
    schemes:
      sentiment:
        type: nominal
        metrics: [fleiss_kappa, cohens_kappa]
      intensity:
        type: ordinal
        metrics: [krippendorff_alpha]
        alpha_level: ordinal
      entities:
        type: span
        span_matching: overlap
        overlap_threshold: 0.5
        metrics: [span_f1]
    # Calculation settings
    calculate_on_overlap: true
    min_overlap: 2
    sample_size: all  # or number
    # Pairwise analysis
    pairwise: true
    pairwise_output: agreement_matrix.csv
    # Diagnostics
    diagnostics:
      confusion_matrix: true
      disagreed_items: true
      per_annotator: true
    # Alerts
    alerts:
      - metric: fleiss_kappa
        threshold: 0.6
        action: notify
    # Reporting
    report_file: agreement_report.json
    report_interval: 50
  # Dashboard
  dashboard:
    show_agreement: true
    charts:
      - agreement_over_time
      - per_scheme_agreement
      - annotator_comparison

Output Report
json
{
  "timestamp": "2024-10-25T15:30:00Z",
  "sample_size": 500,
  "annotators": ["ann1", "ann2", "ann3"],
  "overall_agreement": {
    "fleiss_kappa": 0.72,
    "krippendorff_alpha": 0.75
  },
  "per_scheme": {
    "sentiment": {
      "fleiss_kappa": 0.78,
      "confusion_matrix": {
        "Positive": {"Positive": 180, "Negative": 5, "Neutral": 15},
        "Negative": {"Positive": 8, "Negative": 165, "Neutral": 12},
        "Neutral": {"Positive": 12, "Negative": 10, "Neutral": 93}
      }
    }
  },
  "pairwise": {
    "ann1_ann2": 0.75,
    "ann1_ann3": 0.70,
    "ann2_ann3": 0.72
  },
  "per_annotator": {
    "ann1": {"avg_agreement": 0.73, "items_annotated": 500},
    "ann2": {"avg_agreement": 0.74, "items_annotated": 500},
    "ann3": {"avg_agreement": 0.71, "items_annotated": 500}
  },
  "most_disagreed_items": [
    {"id": "item_234", "disagreement_rate": 1.0},
    {"id": "item_567", "disagreement_rate": 0.67}
  ]
}

Best Practices
- Calculate early: Don't wait until the end
- Use appropriate metrics: Nominal vs ordinal vs span
- Investigate low agreement: Often reveals guideline issues
- Report in publications: Required for scientific work
- Set thresholds: Define acceptable levels upfront
Next Steps
- Improve agreement with quality control
- Add training phases for calibration
- Learn to export data with agreement info
Full agreement documentation at /docs/core-concepts/user-management.