# Quality Control

Attention checks, gold standards, and inter-annotator agreement metrics.

Potato provides comprehensive quality control features to ensure high-quality annotations, including attention checks, gold standards, pre-annotation support, and real-time agreement metrics.
## Overview
Quality control in Potato consists of four key features:
- **Attention Checks** - Verify annotator engagement with known-answer items
- **Gold Standards** - Track accuracy against expert-labeled items
- **Pre-annotation Support** - Pre-fill forms with model predictions
- **Agreement Metrics** - Calculate inter-annotator agreement in real time
## Attention Checks
Attention checks are items with known correct answers that verify annotators are paying attention and not randomly clicking.
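The injection and scoring logic implied by the options below can be sketched roughly as follows. This is an illustrative sketch, not Potato's actual implementation; the function names and item shapes are hypothetical.

```python
import random

def should_inject(item_index, frequency=None, probability=None):
    """Decide whether to show an attention check at this position.

    Mirrors the two config options: a fixed frequency (every N items)
    or an independent per-item probability.
    """
    if frequency is not None:
        return item_index > 0 and item_index % frequency == 0
    if probability is not None:
        return random.random() < probability
    return False

def check_response(check_item, response, elapsed_seconds, min_response_time=3.0):
    """Return (passed, too_fast) for one attention-check response."""
    passed = all(response.get(key) == value
                 for key, value in check_item["expected_answer"].items())
    too_fast = elapsed_seconds < min_response_time
    return passed, too_fast

# Example: item answered correctly but suspiciously quickly
check = {"id": "attn_001", "expected_answer": {"sentiment": "positive"}}
print(check_response(check, {"sentiment": "positive"}, elapsed_seconds=1.2))
# (True, True) -> correct answer, but flagged as too fast
```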
### Configuration
```yaml
attention_checks:
  enabled: true
  items_file: "attention_checks.json"

  # How often to inject attention checks
  frequency: 10       # Insert one every 10 items
  # OR
  probability: 0.1    # 10% chance per item

  # Optional: flag suspiciously fast responses
  min_response_time: 3.0  # Flag if answered in < 3 seconds

  # Failure handling
  failure_handling:
    warn_threshold: 2   # Show warning after 2 failures
    warn_message: "Please read items carefully before answering."
    block_threshold: 5  # Block user after 5 failures
    block_message: "You have been blocked due to too many incorrect responses."
```

### Attention Check Items File
```json
[
  {
    "id": "attn_001",
    "text": "Please select 'Positive' for this item to verify you are reading carefully.",
    "expected_answer": {
      "sentiment": "positive"
    }
  }
]
```

## Gold Standards
Gold standards are expert-labeled items used to measure annotator accuracy. By default, gold standards are silent - results are recorded for admin review, but annotators don't see feedback.
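For example, per-annotator accuracy against the gold labels might be computed like this. This is a sketch under assumed data shapes, not Potato's API; `gold_accuracy` is a hypothetical helper.

```python
def gold_accuracy(annotations, gold_items, evaluation_count=10, min_threshold=0.7):
    """Score one annotator's responses against gold labels.

    `annotations` maps item id -> the annotator's label dict;
    `gold_items` maps item id -> {"gold_label": ...} as in the items file.
    Returns (accuracy, meets_threshold); accuracy is None until the
    annotator has seen `evaluation_count` gold items.
    """
    scored = [iid for iid in annotations if iid in gold_items]
    if len(scored) < evaluation_count:
        return None, True  # not enough gold items seen yet
    correct = sum(annotations[iid] == gold_items[iid]["gold_label"]
                  for iid in scored)
    accuracy = correct / len(scored)
    return accuracy, accuracy >= min_threshold
```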
### Configuration
```yaml
gold_standards:
  enabled: true
  items_file: "gold_standards.json"

  # How to use gold standards
  mode: "mixed"   # Options: training, mixed, separate
  frequency: 20   # Insert one every 20 items

  # Accuracy requirements
  accuracy:
    min_threshold: 0.7    # Minimum required accuracy (70%)
    evaluation_count: 10  # Evaluate after this many gold items

  # Feedback settings (disabled by default)
  feedback:
    show_correct_answer: false
    show_explanation: false

  # Auto-promotion from high-agreement items
  auto_promote:
    enabled: true
    min_annotators: 3
    agreement_threshold: 1.0  # 1.0 = unanimous
```

### Gold Standard Items File
```json
[
  {
    "id": "gold_001",
    "text": "The service was absolutely terrible and I will never return.",
    "gold_label": {
      "sentiment": "negative"
    },
    "explanation": "Strong negative language clearly indicates negative sentiment.",
    "difficulty": "easy"
  }
]
```

### Auto-Promotion
Items can automatically become gold standards when multiple annotators agree:
```yaml
gold_standards:
  auto_promote:
    enabled: true
    min_annotators: 3         # Wait for at least 3 annotators
    agreement_threshold: 1.0  # 100% must agree (unanimous)
```

## Pre-annotation Support
Pre-annotation allows you to pre-fill annotation forms with model predictions, useful for active learning and correction workflows.
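The client-side behavior implied by the options below might look like this. This is a minimal sketch under assumed data shapes; `prefill_state` is a hypothetical helper, not part of Potato.

```python
def prefill_state(item, field="predictions", highlight_below=0.7):
    """Compute the initial form state for one item.

    Returns the predicted labels plus a flag telling the UI to
    highlight the prediction when its confidence is low.
    """
    preds = item.get(field)
    if not preds:
        return {"values": {}, "highlight": False}
    confidence = preds.get("confidence", 1.0)
    values = {k: v for k, v in preds.items() if k != "confidence"}
    return {"values": values, "highlight": confidence < highlight_below}

item = {"id": "item_001", "text": "I love this product!",
        "predictions": {"sentiment": "positive", "confidence": 0.92}}
print(prefill_state(item))
# {'values': {'sentiment': 'positive'}, 'highlight': False}
```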
### Configuration
```yaml
pre_annotation:
  enabled: true
  field: "predictions"      # Field in data containing predictions
  allow_modification: true  # Can annotators change pre-filled values?
  show_confidence: true
  highlight_low_confidence: 0.7  # Highlight predictions below this confidence
```

### Data Format
Include predictions in your data items:
```json
{
  "id": "item_001",
  "text": "I love this product!",
  "predictions": {
    "sentiment": "positive",
    "confidence": 0.92
  }
}
```

## Agreement Metrics
Real-time inter-annotator agreement metrics using Krippendorff's alpha are available in the admin dashboard.
### Configuration
```yaml
agreement_metrics:
  enabled: true
  min_overlap: 2        # Minimum annotators per item
  auto_refresh: true
  refresh_interval: 60  # Seconds between updates
```

### Interpreting Krippendorff's Alpha
| Alpha Value | Interpretation |
|---|---|
| α ≥ 0.8 | Good agreement - reliable for most purposes |
| 0.67 ≤ α < 0.8 | Tentative agreement - draw only tentative conclusions |
| 0.33 ≤ α < 0.67 | Low agreement - review annotation guidelines |
| α < 0.33 | Poor agreement - significant issues |
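As a reference point, Krippendorff's alpha for nominal labels can be computed from a coincidence matrix as sketched below. This is an illustrative implementation, not Potato's internal code; for production use, prefer a vetted library (e.g. the `krippendorff` package on PyPI).

```python
from collections import Counter

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of label lists, one per item, each holding the
    labels assigned by different annotators. Items with fewer than two
    labels are ignored (cf. min_overlap).
    """
    units = [u for u in units if len(u) >= 2]
    o = Counter()  # coincidence matrix o[(c, k)]
    for labels in units:
        m = len(labels)
        counts = Counter(labels)
        for c in counts:
            for k in counts:
                pairs = counts[c] * (counts[k] - (1 if c == k else 0))
                o[(c, k)] += pairs / (m - 1)
    n_c = Counter()  # marginal totals per label
    for (c, _k), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())
    disagree_obs = sum(v for (c, k), v in o.items() if c != k)
    disagree_exp = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k)
    if disagree_exp == 0:
        return 1.0  # only one label in use; no variation to disagree on
    return 1.0 - (n - 1) * disagree_obs / disagree_exp

# Perfect agreement on two items -> alpha = 1.0
print(krippendorff_alpha_nominal([["a", "a"], ["b", "b"]]))
```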
## Admin Dashboard Integration
View quality control metrics in the admin dashboard at `/admin`:
- **Attention Checks**: Overall pass/fail rates, per-annotator statistics
- **Gold Standards**: Per-annotator accuracy, per-item difficulty analysis
- **Agreement**: Per-schema Krippendorff's alpha with interpretation
- **Auto-Promoted Items**: List of items promoted from high agreement
## API Endpoints
### Quality Control Metrics
```
GET /admin/api/quality_control
```

Returns attention check and gold standard statistics.
### Agreement Metrics
```
GET /admin/api/agreement
```

Returns Krippendorff's alpha by schema with interpretation.
## Complete Example
```yaml
annotation_task_name: "Sentiment Analysis with Quality Control"

annotation_schemes:
  - name: sentiment
    annotation_type: radio
    labels: [positive, negative, neutral]
    description: "Select the sentiment of the text"

attention_checks:
  enabled: true
  items_file: "data/attention_checks.json"
  frequency: 15
  failure_handling:
    warn_threshold: 2
    block_threshold: 5

gold_standards:
  enabled: true
  items_file: "data/gold_standards.json"
  mode: mixed
  frequency: 25
  accuracy:
    min_threshold: 0.7
    evaluation_count: 5

agreement_metrics:
  enabled: true
  min_overlap: 2
  refresh_interval: 60
```

## Troubleshooting
### Attention checks not appearing
- Verify the `items_file` path is correct (relative to the task directory)
- Check that items have the required fields (`id`, `expected_answer`)
- Ensure `frequency` or `probability` is set
### Agreement metrics showing "No items with N+ annotators"
- Ensure items have been annotated by multiple users
- Reduce `min_overlap` if needed
- Check that annotations are being saved correctly
## Further Reading
- Training Phase - Annotator qualification
- Admin Dashboard - Monitoring metrics
- Task Assignment - Control annotation distribution
For implementation details, see the source documentation.