Quality Control for Crowdsourced Annotation
Best practices for ensuring annotation quality in crowdsourced and in-house annotation projects, including practical strategies you can implement with and beyond Potato.
Quality control separates useful annotations from noise. This guide covers proven strategies for ensuring high-quality data from crowdsourced and in-house annotation projects.
Quality Control Overview
Effective quality control combines multiple strategies:
- Attention checks: Verify annotators are engaged with the task
- Redundancy: Collect multiple annotations per item
- Agreement metrics: Measure consistency across annotators
- Training and guidelines: Ensure annotators understand the task
- Manual review: Sample and review annotation quality
Attention Checks via Surveyflow
Potato supports basic attention checks through the surveyflow system. You can insert survey pages between annotation batches that ask annotators to confirm they're paying attention.
```yaml
annotation_task_name: "Sentiment Annotation with Checks"

surveyflow:
  on: true
  order:
    - survey_instructions
    - annotation
    - survey_attention_check
    - annotation
    - survey_completion
```
Define attention check questions as a survey page:
```yaml
# In your surveyflow survey definitions
survey_attention_check:
  - question: "To confirm you're paying attention, please select 'Strongly Agree'."
    type: radio
    options:
      - Strongly Disagree
      - Disagree
      - Neutral
      - Agree
      - Strongly Agree
```
Note that Potato's built-in attention check support is limited. For more sophisticated attention checks (automatic failure detection, ejecting annotators, etc.), you'll need to implement post-processing scripts or use your crowdsourcing platform's built-in quality features.
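For example, here is a minimal post-processing sketch for flagging annotators who miss the check. The loading step is left to you, since the exact layout of Potato's output depends on your configuration; the dict format and names below are assumptions.

```python
# Post-processing sketch: flag annotators who failed the attention check.
# Assumes responses have been loaded into {annotator_id: answer_on_check_item};
# adapt the loading step to your Potato output layout.
EXPECTED_ANSWER = "Strongly Agree"

def flag_failed_attention_checks(attention_responses):
    """Return the annotator IDs whose attention-check answer was wrong."""
    return {
        annotator
        for annotator, answer in attention_responses.items()
        if answer != EXPECTED_ANSWER
    }

# Hypothetical data for two annotators
responses = {"worker_01": "Strongly Agree", "worker_02": "Neutral"}
print(flag_failed_attention_checks(responses))  # {'worker_02'}
```
Flagged annotators can then be excluded from the final dataset, have their work reviewed manually, or be handled through your crowdsourcing platform.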
Redundancy: Multiple Annotations Per Item
Collecting multiple annotations per item is one of the most reliable quality control methods. Configure this in your data setup:
```yaml
annotation_task_name: "Multi-Annotator Sentiment Task"

data_files:
  - path: data.json
    list_as_text: false

sampling: random

# Control how many annotators see each item through assignment logic.
# This is typically managed through your annotator assignment system.
```
When using crowdsourcing platforms like Prolific, you can:
- Post the same HIT multiple times to get redundant annotations
- Use different worker batches for the same data
- Implement custom assignment logic in your data pipeline (see the sketch below)
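If you implement assignment yourself, the core idea is simply to give every item to a fixed number of distinct annotators. A minimal round-robin sketch (the function name and ID formats are illustrative, not part of Potato):

```python
from itertools import cycle

def assign_items(item_ids, annotator_ids, annotations_per_item=3):
    """Assign each item to `annotations_per_item` distinct annotators, round-robin."""
    if annotations_per_item > len(annotator_ids):
        raise ValueError("Not enough annotators for the requested redundancy")
    assignments = {annotator: [] for annotator in annotator_ids}
    annotator_cycle = cycle(annotator_ids)
    for item_id in item_ids:
        chosen = set()
        while len(chosen) < annotations_per_item:
            chosen.add(next(annotator_cycle))
        for annotator in chosen:
            assignments[annotator].append(item_id)
    return assignments

# Example: 3 annotations per item, spread across 5 annotators
assignments = assign_items([f"item_{i}" for i in range(10)],
                           [f"worker_{i}" for i in range(5)])
```
Each annotator's item list can then be exported as its own data file or study batch.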
Measuring Inter-Annotator Agreement
While Potato doesn't calculate agreement metrics automatically during annotation, you should calculate them during post-processing. Common metrics include:
Cohen's Kappa (Two Annotators)
For categorical annotations with two annotators:
```python
from sklearn.metrics import cohen_kappa_score

# After collecting annotations
annotator1_labels = ["Positive", "Negative", "Positive", ...]
annotator2_labels = ["Positive", "Negative", "Neutral", ...]

kappa = cohen_kappa_score(annotator1_labels, annotator2_labels)
print(f"Cohen's Kappa: {kappa:.3f}")
```
Fleiss' Kappa (Multiple Annotators)
For three or more annotators:
```python
from statsmodels.stats.inter_rater import fleiss_kappa
import numpy as np

# Build a matrix of label counts per item.
# Each row is an item, each column is a label category.
ratings_matrix = np.array([
    [3, 0, 0],  # Item 1: 3 Positive, 0 Negative, 0 Neutral
    [2, 1, 0],  # Item 2: 2 Positive, 1 Negative, 0 Neutral
    [0, 0, 3],  # Item 3: 0 Positive, 0 Negative, 3 Neutral
    ...
])

kappa = fleiss_kappa(ratings_matrix)
print(f"Fleiss' Kappa: {kappa:.3f}")
```
Interpretation Guidelines
| Kappa Value | Interpretation |
|---|---|
| ≤ 0.20 | Poor agreement |
| 0.21 - 0.40 | Fair agreement |
| 0.41 - 0.60 | Moderate agreement |
| 0.61 - 0.80 | Substantial agreement |
| 0.81 - 1.00 | Near-perfect agreement |
Gold Standard Items
Gold standard items are pre-labeled items with known correct answers that you mix into your annotation data. This helps identify annotators who may be guessing or not paying attention.
Creating Gold Items
- Create a set of items with clear, unambiguous correct answers
- Have experts label these items
- Mix them into your regular annotation data
```json
[
  {
    "id": "gold_001",
    "text": "I absolutely love this product! Best purchase ever!",
    "is_gold": true,
    "gold_label": "Positive"
  },
  {
    "id": "gold_002",
    "text": "This is terrible. Complete waste of money. Worst experience.",
    "is_gold": true,
    "gold_label": "Negative"
  },
  {
    "id": "regular_001",
    "text": "The product arrived on time and works as expected.",
    "is_gold": false
  }
]
```
Analyzing Gold Performance
After collection, analyze how each annotator performed on gold items:
```python
import json

def calculate_gold_accuracy(annotations_file, gold_labels):
    with open(annotations_file) as f:
        annotations = json.load(f)

    annotator_scores = {}
    for item_id, item_annotations in annotations.items():
        if item_id in gold_labels:
            expected = gold_labels[item_id]
            for annotator, label in item_annotations.items():
                if annotator not in annotator_scores:
                    annotator_scores[annotator] = {'correct': 0, 'total': 0}
                annotator_scores[annotator]['total'] += 1
                if label == expected:
                    annotator_scores[annotator]['correct'] += 1

    for annotator, scores in annotator_scores.items():
        accuracy = scores['correct'] / scores['total']
        print(f"{annotator}: {accuracy:.1%} gold accuracy")

    return annotator_scores
```
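A usage sketch for the function above, assuming the annotations file maps item IDs to {annotator: label} dicts (the file name, gold labels, and threshold are illustrative):

```python
# Gold labels keyed by item ID, matching the gold items mixed into your data
gold = {"gold_001": "Positive", "gold_002": "Negative"}

scores = calculate_gold_accuracy("annotated_output.json", gold)

# Flag annotators below a chosen accuracy threshold (e.g. 80%)
flagged = [a for a, s in scores.items() if s['correct'] / s['total'] < 0.8]
print("Flagged annotators:", flagged)
```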
Time-Based Quality Indicators
Potato tracks annotation timing in the output files. Use this data to flag potentially low-quality annotations:
Analyzing Timing Data
```python
import json
from statistics import mean, stdev

def analyze_timing(annotations_file):
    with open(annotations_file) as f:
        data = json.load(f)

    times = []
    for item in data.values():
        if 'time_spent' in item:
            times.append(item['time_spent'])

    avg_time = mean(times)
    std_time = stdev(times)

    # Flag annotations that are too fast (more than 2 std below the mean)
    threshold = max(avg_time - 2 * std_time, 2)  # at least 2 seconds
    flagged = [t for t in times if t < threshold]

    print(f"Average time: {avg_time:.1f}s")
    print(f"Flagged as too fast: {len(flagged)} items")
```
Platform-Level Quality Control
When using crowdsourcing platforms, leverage their built-in quality features:
Prolific
- Use prescreening filters (approval rate, previous studies)
- Set minimum completion time requirements
- Use attention check questions in your pre-survey
- Review submissions before approving payment
MTurk
- Require a minimum HIT approval rate (>95%); see the API sketch after this list
- Use qualification tests
- Set up automatic approval/rejection based on criteria
- Block workers who fail quality checks
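As one illustration (not an official recipe), the approval-rate requirement and redundancy can be set when posting a HIT through the MTurk API with boto3; the reward, timing values, and question file below are placeholders to replace with your own:

```python
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")  # assumes AWS credentials are configured

# Built-in MTurk qualification: percentage of the worker's past assignments that were approved
approval_rate = {
    "QualificationTypeId": "000000000000000000L0",  # PercentAssignmentsApproved
    "Comparator": "GreaterThanOrEqualTo",
    "IntegerValues": [95],
    "ActionsGuarded": "Accept",
}

with open("external_question.xml") as f:  # ExternalQuestion XML pointing at your annotation server (placeholder)
    question_xml = f.read()

hit = mturk.create_hit(
    Title="Sentiment annotation",
    Description="Label the sentiment of short texts",
    Reward="0.50",
    MaxAssignments=3,                     # redundancy: three workers per HIT
    AssignmentDurationInSeconds=1800,
    LifetimeInSeconds=86400,
    Question=question_xml,
    QualificationRequirements=[approval_rate],
)
print(hit["HIT"]["HITId"])
```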
Post-Processing Quality Checks
Implement automated checks on collected data:
```python
import json

def quality_check_annotations(annotations_file):
    with open(annotations_file) as f:
        data = json.load(f)

    issues = []
    # group_by_annotator is a helper you supply that regroups the raw output
    # into {annotator_id: [item_record, ...]} for your output format.
    for annotator_id, items in group_by_annotator(data).items():
        labels = [item['label'] for item in items]

        # Check for single-label bias (always selecting the same option)
        unique_labels = set(labels)
        if len(unique_labels) == 1 and len(labels) > 10:
            issues.append(f"{annotator_id}: Only used label '{labels[0]}'")

        # Check for position bias (always selecting the first option)
        # Requires knowing the option order in your schema (see the sketch below)

        # Check for very fast submissions
        times = [item.get('time_spent', 0) for item in items]
        avg_time = sum(times) / len(times) if times else 0
        if avg_time < 3:
            issues.append(f"{annotator_id}: Average time only {avg_time:.1f}s")

    return issues
```
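The position-bias check can be filled in once you know the option order in your schema. A minimal sketch, assuming the options are stored in display order (the list and threshold below are illustrative):

```python
# Options in the order they appear in the annotation interface (illustrative)
OPTIONS_IN_DISPLAY_ORDER = ["Positive", "Negative", "Neutral"]

def has_position_bias(labels, threshold=0.9):
    """Return True if the first displayed option was chosen suspiciously often."""
    if not labels:
        return False
    first_option = OPTIONS_IN_DISPLAY_ORDER[0]
    return labels.count(first_option) / len(labels) >= threshold

# Example: an annotator who picked the first option 19 times out of 20
print(has_position_bias(["Positive"] * 19 + ["Neutral"]))  # True
```
If your interface randomizes option order per item, you would instead need the per-item display order from your logs.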
Best Practices
- Start with training: Use Potato's training phase to onboard annotators before real annotation begins
- Write clear guidelines: Ambiguous guidelines lead to disagreement that isn't about annotator quality
- Pilot first: Run a small pilot to identify issues before full deployment
- Mix check types: Combine attention checks, gold standards, and redundancy
- Calibrate thresholds: Start with lenient quality thresholds and tighten based on observed data
- Provide feedback: When possible, give annotators feedback to help them improve
- Monitor continuously: Quality can drift over time as annotators become fatigued
- Document decisions: Record how you handle edge cases and quality issues
Summary
Quality control for annotation requires a multi-layered approach:
| Strategy | Implementation | When to Check |
|---|---|---|
| Attention checks | Surveyflow surveys | During annotation |
| Gold standards | Mixed into data | Post-collection |
| Redundancy | Multiple annotators per item | Post-collection |
| Agreement metrics | Python scripts | Post-collection |
| Timing analysis | Annotation timestamps | Post-collection |
| Platform features | Prolific/MTurk settings | Before/during collection |
Most quality control analysis happens after data collection through post-processing scripts. Plan your analysis pipeline before collecting data to ensure you capture the information you need.
Next Steps
- Learn about inter-annotator agreement calculations in detail
- Set up Prolific integration for crowdsourced annotation
- Configure training phases for annotator onboarding
For more on annotation workflows, see the annotation schemes documentation.