Measuring Inter-Annotator Agreement
How to calculate and interpret Cohen's Kappa, Fleiss' Kappa, and Krippendorff's Alpha.
Potato Team
Inter-annotator agreement (IAA) measures how consistently different annotators label the same items. High agreement indicates reliable annotations; low agreement suggests unclear guidelines or an inherently subjective task.
Why measure agreement?
- Validate guidelines: low agreement signals ambiguous instructions
- Gauge task difficulty: some tasks are inherently subjective
- Evaluate annotators: identify who needs more training
- Report reliability: a standard requirement for academic publication
- Aggregate labels: decide how to merge annotations into final labels
Agreement metrics

Cohen's Kappa (2 annotators)
For categorical data from two annotators:

```text
κ = (Po - Pe) / (1 - Pe)
```

Where:
- Po = observed agreement
- Pe = agreement expected by chance
Interpretation:

| Kappa | Interpretation |
|---|---|
| < 0 | Less than chance agreement |
| 0.01-0.20 | Slight agreement |
| 0.21-0.40 | Fair agreement |
| 0.41-0.60 | Moderate agreement |
| 0.61-0.80 | Substantial agreement |
| 0.81-1.00 | Almost perfect agreement |
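The formula above is simple enough to compute by hand. A minimal Python sketch (illustrative only, not Potato's internal implementation):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' labels on the same items."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    # chance agreement from each annotator's label distribution
    pe = sum(ca[l] * cb[l] for l in set(ca) | set(cb)) / (n * n)
    return (po - pe) / (1 - pe)

ann1 = ["pos", "pos", "neg", "neg", "pos", "neu"]
ann2 = ["pos", "neg", "neg", "neg", "pos", "neu"]
print(round(cohens_kappa(ann1, ann2), 3))  # -> 0.739
```

Here the annotators agree on 5 of 6 items (Po ≈ 0.83), but roughly a third of that agreement is expected by chance, so kappa lands in the "substantial" band rather than near 1.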
Fleiss' Kappa (3+ annotators)
For categorical data from three or more annotators:

```yaml
quality_control:
  agreement:
    metrics:
      - fleiss_kappa
```

Interpretation follows the same scale as Cohen's Kappa.
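Fleiss' kappa applies the same chance-correction idea to any fixed number of raters per item. A minimal sketch (illustrative, not Potato's implementation), taking per-item category counts:

```python
def fleiss_kappa(ratings):
    """ratings: one dict {category: count} per item; every item must
    have the same total number of ratings n."""
    N = len(ratings)
    n = sum(ratings[0].values())                      # raters per item
    # mean per-item agreement
    p_bar = sum(
        (sum(c * c for c in r.values()) - n) / (n * (n - 1)) for r in ratings
    ) / N
    # chance agreement from marginal category proportions
    cats = set().union(*ratings)
    p = {c: sum(r.get(c, 0) for r in ratings) / (N * n) for c in cats}
    p_e = sum(v * v for v in p.values())
    return (p_bar - p_e) / (1 - p_e)

items = [
    {"pos": 3},              # all three annotators chose "pos"
    {"pos": 2, "neg": 1},
    {"neg": 3},
    {"pos": 1, "neg": 2},
]
print(round(fleiss_kappa(items), 3))  # -> 0.333
```

Note that two items with unanimous labels are not enough to lift kappa far above zero when the split items pull mean agreement down.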
Krippendorff's Alpha
The most flexible of the three metrics. It supports:
- Any number of annotators
- Missing data
- Multiple data types (nominal, ordinal, interval, ratio)

```yaml
quality_control:
  agreement:
    metrics:
      - krippendorff_alpha
    alpha_level: nominal  # or ordinal, interval, ratio
```

Interpretation:
- α ≥ 0.80: reliable
- 0.67 ≤ α < 0.80: tentatively acceptable
- α < 0.67: unreliable
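For nominal data, alpha compares observed disagreement within each item against the disagreement expected from overall label frequencies. A minimal sketch (illustrative, not Potato's implementation) that also tolerates missing annotations:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """units: one list of labels per item; annotators who skipped an
    item are simply absent, so missing data is handled naturally."""
    units = [u for u in units if len(u) >= 2]   # need >= 2 labels to pair
    n = sum(len(u) for u in units)              # total pairable labels
    # observed disagreement over ordered label pairs within each item
    d_o = sum(
        sum(a != b for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # expected disagreement from overall label frequencies
    c = Counter(label for u in units for label in u)
    d_e = sum(c[x] * c[y] for x in c for y in c if x != y) / (n * (n - 1))
    return 1.0 - d_o / d_e

units = [
    ["a", "a", "a"],
    ["a", "b"],        # one annotator skipped this item
    ["b", "b", "b"],
]
alpha = krippendorff_alpha_nominal(units)  # ≈ 0.5625, i.e. unreliable
```

The single disagreeing pair is enough to push alpha below the 0.67 threshold on this tiny sample, which illustrates why alpha should be computed over a reasonably large overlap set.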
Configuring agreement in Potato

Basic setup

```yaml
quality_control:
  agreement:
    enabled: true
    calculate_on_overlap: true
    metrics:
      - cohens_kappa
      - fleiss_kappa
      - krippendorff_alpha
    # Per annotation scheme
    per_scheme: true
    # Reporting
    report_interval: 100  # Every 100 annotations
    export_file: agreement_report.json
```

Overlap configuration
```yaml
quality_control:
  redundancy:
    # How many annotators per item
    annotations_per_item: 3
    # Minimum overlap for calculations
    min_overlap_for_agreement: 2
    # Sampling for agreement
    agreement_sample_size: 100  # Calculate on 100 items
    agreement_sample_method: random  # or stratified, all
```

Calculating agreement
In the dashboard
Potato displays agreement metrics in the admin dashboard:

```yaml
quality_control:
  dashboard:
    show_agreement: true
    agreement_chart: true
    update_frequency: 60  # seconds
```

Via the API
```bash
# Get current agreement metrics
curl http://localhost:8000/api/quality/agreement

# Response:
{
  "overall": {
    "fleiss_kappa": 0.72,
    "krippendorff_alpha": 0.75
  },
  "per_scheme": {
    "sentiment": {
      "fleiss_kappa": 0.78,
      "krippendorff_alpha": 0.80
    },
    "topic": {
      "fleiss_kappa": 0.65,
      "krippendorff_alpha": 0.68
    }
  },
  "sample_size": 150,
  "annotator_pairs": 10
}
```

Via the command line
```bash
# Calculate agreement from output files
potato agreement --annotations annotation_output/ --output agreement_report.json

# With specific metric
potato agreement --annotations annotation_output/ --metric krippendorff --level ordinal
```

Agreement for different annotation types
Categorical (single and multi-select)

```yaml
quality_control:
  agreement:
    schemes:
      sentiment:
        type: nominal
        metrics: [cohens_kappa, fleiss_kappa]
      urgency:
        type: ordinal  # Low < Medium < High
        metrics: [krippendorff_alpha]
```

Likert scales
```yaml
quality_control:
  agreement:
    schemes:
      quality_rating:
        type: ordinal
        metrics: [krippendorff_alpha, weighted_kappa]
        # Weighted kappa for ordinal
        weighting: linear  # or quadratic
```

Span annotations
For NER, spans require special handling:

```yaml
quality_control:
  agreement:
    schemes:
      entities:
        type: span
        span_matching: overlap  # or exact, token
        # What to compare
        compare: label_and_span  # or label_only, span_only
        # Overlap threshold for "match"
        overlap_threshold: 0.5
        metrics:
          - span_f1
          - span_precision
          - span_recall
```

Ranking
```yaml
quality_control:
  agreement:
    schemes:
      preference_rank:
        type: ranking
        metrics:
          - kendall_tau
          - spearman_rho
```

Pairwise vs. overall agreement
Pairwise (per pair of annotators)

```yaml
quality_control:
  agreement:
    pairwise: true
    output_matrix: true  # Agreement matrix
    # Output:
    # annotator1 × annotator2: κ = 0.75
    # annotator1 × annotator3: κ = 0.68
    # annotator2 × annotator3: κ = 0.82
```

Overall (all annotators)
```yaml
quality_control:
  agreement:
    overall: true
    metrics:
      - fleiss_kappa  # Designed for 3+ annotators
      - krippendorff_alpha
```

Handling low agreement
Identifying problem areas

```yaml
quality_control:
  agreement:
    diagnostics:
      enabled: true
      # Items with most disagreement
      show_disagreed_items: true
      disagreement_threshold: 0.5
      # Labels with most confusion
      confusion_matrix: true
      # Annotators with low agreement
      per_annotator_agreement: true
```

Actions on low agreement
```yaml
quality_control:
  agreement:
    alerts:
      - threshold: 0.6
        action: notify
        message: "Agreement below 0.6 - review guidelines"
      - threshold: 0.4
        action: pause
        message: "Agreement critically low - pausing task"
    # Automatic guideline reminders
    show_guidelines_on_low_agreement: true
    guideline_threshold: 0.5
```

Complete configuration
```yaml
annotation_task_name: "Agreement-Tracked Annotation"
quality_control:
  # Redundancy setup
  redundancy:
    annotations_per_item: 3
    assignment_method: random
  # Agreement calculation
  agreement:
    enabled: true
    # Metrics
    metrics:
      - fleiss_kappa
      - krippendorff_alpha
    # Per-scheme configuration
    schemes:
      sentiment:
        type: nominal
        metrics: [fleiss_kappa, cohens_kappa]
      intensity:
        type: ordinal
        metrics: [krippendorff_alpha]
        alpha_level: ordinal
      entities:
        type: span
        span_matching: overlap
        overlap_threshold: 0.5
        metrics: [span_f1]
    # Calculation settings
    calculate_on_overlap: true
    min_overlap: 2
    sample_size: all  # or number
    # Pairwise analysis
    pairwise: true
    pairwise_output: agreement_matrix.csv
    # Diagnostics
    diagnostics:
      confusion_matrix: true
      disagreed_items: true
      per_annotator: true
    # Alerts
    alerts:
      - metric: fleiss_kappa
        threshold: 0.6
        action: notify
    # Reporting
    report_file: agreement_report.json
    report_interval: 50
  # Dashboard
  dashboard:
    show_agreement: true
    charts:
      - agreement_over_time
      - per_scheme_agreement
      - annotator_comparison
```

Output report
```json
{
  "timestamp": "2024-10-25T15:30:00Z",
  "sample_size": 500,
  "annotators": ["ann1", "ann2", "ann3"],
  "overall_agreement": {
    "fleiss_kappa": 0.72,
    "krippendorff_alpha": 0.75
  },
  "per_scheme": {
    "sentiment": {
      "fleiss_kappa": 0.78,
      "confusion_matrix": {
        "Positive": {"Positive": 180, "Negative": 5, "Neutral": 15},
        "Negative": {"Positive": 8, "Negative": 165, "Neutral": 12},
        "Neutral": {"Positive": 12, "Negative": 10, "Neutral": 93}
      }
    }
  },
  "pairwise": {
    "ann1_ann2": 0.75,
    "ann1_ann3": 0.70,
    "ann2_ann3": 0.72
  },
  "per_annotator": {
    "ann1": {"avg_agreement": 0.73, "items_annotated": 500},
    "ann2": {"avg_agreement": 0.74, "items_annotated": 500},
    "ann3": {"avg_agreement": 0.71, "items_annotated": 500}
  },
  "most_disagreed_items": [
    {"id": "item_234", "disagreement_rate": 1.0},
    {"id": "item_567", "disagreement_rate": 0.67}
  ]
}
```

Best practices
- Calculate early: don't wait until the end of the project
- Use the right metric: nominal, ordinal, and span data need different measures
- Investigate low agreement: it usually exposes guideline problems
- Report in papers: required for academic work
- Set thresholds: define acceptable levels up front
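The pairwise numbers in the report can also be sanity-checked offline. A sketch assuming annotations have already been loaded into a `{item_id: {annotator: label}}` mapping (that mapping and the names are illustrative; the loading step from Potato's output files is up to you):

```python
from collections import Counter
from itertools import combinations

def cohens_kappa(a, b):
    """Cohen's kappa for two aligned label sequences."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[l] * cb[l] for l in set(ca) | set(cb)) / (n * n)
    return (po - pe) / (1 - pe)

def pairwise_matrix(annotations):
    """annotations: {item_id: {annotator: label}} -> {(ann_a, ann_b): kappa}"""
    annotators = sorted({a for labels in annotations.values() for a in labels})
    matrix = {}
    for a, b in combinations(annotators, 2):
        # only items both annotators labeled count toward the pair
        shared = [(v[a], v[b]) for v in annotations.values() if a in v and b in v]
        if len(shared) >= 2:
            xs, ys = zip(*shared)
            matrix[(a, b)] = cohens_kappa(xs, ys)
    return matrix

annotations = {
    "item_1": {"ann1": "pos", "ann2": "pos"},
    "item_2": {"ann1": "neg", "ann2": "pos"},
    "item_3": {"ann1": "neg", "ann2": "neg"},
}
matrix = pairwise_matrix(annotations)  # kappa for each annotator pair
```

An unusually low row in this matrix is often the quickest way to spot a single annotator who has misunderstood the guidelines.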
Next steps
For the full agreement documentation, see /docs/core-concepts/user-management.