Quality Control for Crowdsourced Annotation
Best practices for ensuring quality in annotation projects, including practical strategies you can implement with and beyond Potato.
Quality control is what separates useful annotations from noise. This guide covers strategies that hold up in practice for crowdsourced and in-house annotation projects. For the underlying features, see the quality control documentation.
Quality Control Overview
No single check is enough on its own, so most projects layer a few together:
- Attention checks: Verify annotators are engaged with the task
- Redundancy: Collect multiple annotations per item
- Agreement metrics: Measure consistency across annotators
- Training and guidelines: Ensure annotators understand the task
- Manual review: Sample and review annotation quality
Attention Checks via Surveyflow
Potato supports basic attention checks through the surveyflow system. You can insert survey pages between annotation batches that ask annotators to confirm they're paying attention.
annotation_task_name: "Sentiment Annotation with Checks"
surveyflow:
  on: true
  order:
    - survey_instructions
    - annotation
    - survey_attention_check
    - annotation
    - survey_completion
Define attention check questions as a survey page:
# In your surveyflow survey definitions
survey_attention_check:
  - question: "To confirm you're paying attention, please select 'Strongly Agree'."
    type: radio
    options:
      - Strongly Disagree
      - Disagree
      - Neutral
      - Agree
      - Strongly Agree
Potato's built-in attention check support is limited. If you want automatic failure detection, ejecting annotators, and similar logic, you'll need post-processing scripts or your crowdsourcing platform's own quality features.
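Since failure detection has to happen outside Potato, a post-processing pass can compare each annotator's answer with the expected option. The sketch below is a minimal example, not Potato's API: the `responses` mapping (annotator ID to selected option) and the function name `find_failed_attention_checks` are assumptions, so adapt the loading step to however your survey answers are actually stored.

```python
EXPECTED_ANSWER = "Strongly Agree"  # the option the check question asks for


def find_failed_attention_checks(responses, expected=EXPECTED_ANSWER):
    """Return a sorted list of annotator IDs whose answer differs from `expected`.

    `responses` is a hypothetical dict of annotator ID -> selected option
    (None if the question was skipped). This layout is an assumption.
    """
    target = expected.strip().lower()
    failed = []
    for annotator_id, answer in responses.items():
        # Normalize whitespace and case so "strongly agree " still passes.
        normalized = (answer or "").strip().lower()
        if normalized != target:
            failed.append(annotator_id)
    return sorted(failed)


responses = {
    "worker_a": "Strongly Agree",
    "worker_b": "Neutral",
    "worker_c": None,  # skipped the question
}
print(find_failed_attention_checks(responses))
```

Treating a skipped question as a failure is a design choice; relax it if your survey allows skipping.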
Redundancy: Multiple Annotations Per Item
Collecting multiple annotations per item is one of the more reliable quality control methods. Configure it in your data setup:
annotation_task_name: "Multi-Annotator Sentiment Task"
data_files:
  - path: data.json
    list_as_text: false
    sampling: random
# Control how many annotators see each item through assignment logic
# This is typically managed through your annotator assignment system
When using crowdsourcing platforms like Prolific, you can:
- Post the same study (a HIT, in MTurk terms) multiple times to get redundant annotations
- Use different worker batches for the same data
- Implement custom assignment logic in your data pipeline
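Custom assignment logic can be as simple as a round-robin pass over your annotators. The following is a minimal sketch that runs in your own data pipeline, not a Potato feature; the function name `assign_items` and the `per_item` parameter are illustrative.

```python
from itertools import cycle


def assign_items(item_ids, annotator_ids, per_item=3):
    """Assign each item to `per_item` distinct annotators, round-robin.

    Returns a dict of annotator ID -> list of item IDs.
    """
    if per_item > len(annotator_ids):
        raise ValueError("per_item cannot exceed the number of annotators")
    assignments = {annotator: [] for annotator in annotator_ids}
    pool = cycle(annotator_ids)
    for item_id in item_ids:
        # Consecutive draws from the cycle are distinct as long as
        # per_item <= len(annotator_ids), so no one sees an item twice.
        for _ in range(per_item):
            assignments[next(pool)].append(item_id)
    return assignments


result = assign_items(["a", "b", "c", "d"], ["w1", "w2", "w3", "w4"], per_item=3)
for annotator, items in result.items():
    print(annotator, items)
```

Round-robin keeps workloads within one item of each other. Shuffle `item_ids` first if you don't want annotators to see items in file order.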
Measuring Inter-Annotator Agreement
Potato doesn't calculate agreement metrics automatically during annotation, so calculate them during post-processing. Common metrics include:
Cohen's Kappa (Two Annotators)
For categorical annotations with two annotators:
from sklearn.metrics import cohen_kappa_score
# After collecting annotations: one label per item, in the same item order
# for both annotators (extend both lists with your remaining items)
annotator1_labels = ["Positive", "Negative", "Positive"]
annotator2_labels = ["Positive", "Negative", "Neutral"]
kappa = cohen_kappa_score(annotator1_labels, annotator2_labels)
print(f"Cohen's Kappa: {kappa:.3f}")
Fleiss' Kappa (Multiple Annotators)
For three or more annotators:
from statsmodels.stats.inter_rater import fleiss_kappa
import numpy as np
# Build a matrix of label counts per item
# Each row is an item, each column is a label category
# Every row must sum to the same number of annotators (3 here)
ratings_matrix = np.array([
    [3, 0, 0],  # Item 1: 3 Positive, 0 Negative, 0 Neutral
    [2, 1, 0],  # Item 2: 2 Positive, 1 Negative, 0 Neutral
    [0, 0, 3],  # Item 3: 0 Positive, 0 Negative, 3 Neutral
    # ... one row per remaining item
])
kappa = fleiss_kappa(ratings_matrix)
print(f"Fleiss' Kappa: {kappa:.3f}")
Interpretation Guidelines
| Kappa Value | Interpretation |
|---|---|
| ≤ 0.20 | Poor agreement |
| 0.21 - 0.40 | Fair agreement |
| 0.41 - 0.60 | Moderate agreement |
| 0.61 - 0.80 | Substantial agreement |
| 0.81 - 1.00 | Near-perfect agreement |
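For reports, it can help to attach the band name to each score. The sketch below maps a kappa value to the bands in the table above. The function name `interpret_kappa` is illustrative, and assigning in-between values such as 0.205 to the next band up is an assumption. The bands are conventions rather than hard cutoffs.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the interpretation bands from the table."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError(f"kappa must be between -1 and 1, got {kappa}")
    # (upper bound, label) pairs, checked in ascending order.
    bands = [
        (0.20, "Poor agreement"),
        (0.40, "Fair agreement"),
        (0.60, "Moderate agreement"),
        (0.80, "Substantial agreement"),
        (1.00, "Near-perfect agreement"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    # Not reached for valid input; kept as a safeguard.
    return "Near-perfect agreement"


for value in (0.15, 0.55, 0.85):
    print(f"{value:.2f}: {interpret_kappa(value)}")
```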
Gold Standard Items
Gold standard items are pre-labeled items with known correct answers that you mix into your data. They help you catch annotators who are guessing or not paying attention.
Creating Gold Items
- Create a set of items with clear, unambiguous correct answers
- Have experts label these items
- Mix them into your regular annotation data
[
  {
    "id": "gold_001",
    "text": "I absolutely love this product! Best purchase ever!",
    "is_gold": true,
    "gold_label": "Positive"
  },
  {
    "id": "gold_002",
    "text": "This is terrible. Complete waste of money. Worst experience.",
    "is_gold": true,
    "gold_label": "Negative"
  },
  {
    "id": "regular_001",
    "text": "The product arrived on time and works as expected.",
    "is_gold": false
  }
]
Analyzing Gold Performance
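Scoring annotators on gold items needs a lookup from item ID to expected label. A minimal sketch, assuming the data file uses the `id`, `is_gold`, and `gold_label` fields from the example above; the function names `extract_gold_labels` and `load_gold_labels` are illustrative.

```python
import json


def extract_gold_labels(items):
    """Return {item_id: gold_label} for every item marked as gold."""
    return {
        item["id"]: item["gold_label"]
        for item in items
        if item.get("is_gold")  # missing or false means a regular item
    }


def load_gold_labels(data_file):
    """Read the data file and build the gold-label lookup."""
    with open(data_file) as f:
        return extract_gold_labels(json.load(f))


items = [
    {"id": "gold_001", "text": "I absolutely love this product!",
     "is_gold": True, "gold_label": "Positive"},
    {"id": "regular_001", "text": "Works as expected.", "is_gold": False},
]
print(extract_gold_labels(items))
```

The resulting dictionary can be passed as the `gold_labels` argument when you score annotators.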
After collection, analyze how each annotator performed on gold items:
import json
def calculate_gold_accuracy(annotations_file, gold_labels):
    """Print and return per-annotator accuracy on gold items.
    Expects annotations as {item_id: {annotator: label}} and
    gold_labels as {item_id: expected_label}.
    """
    with open(annotations_file) as f:
        annotations = json.load(f)
    annotator_scores = {}
    for item_id, item_annotations in annotations.items():
        if item_id in gold_labels:
            expected = gold_labels[item_id]
            for annotator, label in item_annotations.items():
                if annotator not in annotator_scores:
                    annotator_scores[annotator] = {'correct': 0, 'total': 0}
                annotator_scores[annotator]['total'] += 1
                if label == expected:
                    annotator_scores[annotator]['correct'] += 1
    for annotator, scores in annotator_scores.items():
        accuracy = scores['correct'] / scores['total']
        print(f"{annotator}: {accuracy:.1%} gold accuracy")
    return annotator_scores
Time-Based Quality Indicators
Potato tracks annotation timing in the output files. Use this data to flag potentially low-quality annotations:
Analyzing Timing Data
import json
from statistics import mean, stdev
def analyze_timing(annotations_file):
    with open(annotations_file) as f:
        data = json.load(f)
    times = []
    for item in data.values():
        if 'time_spent' in item:
            times.append(item['time_spent'])
    # stdev() needs at least two data points
    if len(times) < 2:
        print("Not enough timing data to analyze")
        return []
    avg_time = mean(times)
    std_time = stdev(times)
    # Flag annotations more than 2 standard deviations below the mean,
    # with a floor of 2 seconds for the threshold
    threshold = max(avg_time - 2 * std_time, 2)
    flagged = [t for t in times if t < threshold]
    print(f"Average time: {avg_time:.1f}s")
    print(f"Flagged as too fast: {len(flagged)} items")
    return flagged
Platform-Level Quality Control
When using crowdsourcing platforms, leverage their built-in quality features:
Prolific
- Use prescreening filters (approval rate, previous studies)
- Set minimum completion time requirements
- Use attention check questions in your pre-survey
- Review submissions before approving payment
MTurk
- Require minimum HIT approval rate (>95%)
- Use qualification tests
- Set up automatic approval/rejection based on criteria
- Block workers who fail quality checks
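On MTurk, the approval-rate requirement can be expressed as a qualification requirement when the HIT is created. The sketch below only builds the request fragment. It assumes you create HITs with the boto3 MTurk client and pass the list as `QualificationRequirements` to `create_hit`. The system qualification ID should be checked against the current MTurk documentation, and the helper name `approval_rate_requirement` is illustrative.

```python
# System qualification ID for "percent of assignments approved".
# Assumption: verify this value against the current MTurk documentation.
PERCENT_APPROVED_QUAL_ID = "000000000000000000L0"


def approval_rate_requirement(min_percent=95):
    """Build one QualificationRequirements entry: approval rate > min_percent."""
    if not 0 <= min_percent < 100:
        raise ValueError("min_percent must be in the range 0-99")
    return {
        "QualificationTypeId": PERCENT_APPROVED_QUAL_ID,
        "Comparator": "GreaterThan",
        "IntegerValues": [min_percent],
        # Workers who don't qualify can still see the HIT but not accept it.
        "ActionsGuarded": "Accept",
    }


requirements = [approval_rate_requirement(95)]
print(requirements)
# Hypothetical usage with a boto3 MTurk client:
# client.create_hit(..., QualificationRequirements=requirements)
```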
Post-Processing Quality Checks
Implement automated checks on collected data:
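import json
from collections import defaultdict
def group_by_annotator(data):
    # Assumes `data` is a list of records, each with an 'annotator_id' key;
    # adapt this helper to your actual output format
    grouped = defaultdict(list)
    for record in data:
        grouped[record['annotator_id']].append(record)
    return grouped
def quality_check_annotations(annotations_file):
    with open(annotations_file) as f:
        data = json.load(f)
    issues = []
    for annotator_id, items in group_by_annotator(data).items():
        labels = [item['label'] for item in items]
        # Check for single-label bias (always selecting same option)
        unique_labels = set(labels)
        if len(unique_labels) == 1 and len(labels) > 10:
            issues.append(f"{annotator_id}: Only used label '{labels[0]}'")
        # Check for position bias (always selecting first option)
        # Requires knowing option order in your schema
        # Check for very fast submissions (only items that recorded a time,
        # so missing timing data is not mistaken for 0 seconds)
        times = [item['time_spent'] for item in items if 'time_spent' in item]
        if times:
            avg_time = sum(times) / len(times)
            if avg_time < 3:
                issues.append(f"{annotator_id}: Average time only {avg_time:.1f}s")
    return issues
Best Practices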
Onboard annotators with Potato's training phase before any real annotation starts. Write clear guidelines, since ambiguous ones cause disagreement that has nothing to do with annotator quality. Run a small pilot to surface problems before you deploy at scale. Mix your checks rather than relying on one: attention checks, gold standards, and redundancy each catch different things. Start with lenient thresholds and tighten them once you've seen real data. Give annotators feedback when you can. Keep monitoring, because quality drifts as people get tired. And write down how you handle edge cases so your decisions stay consistent.
For setting up the onboarding step specifically, see the training phase documentation.
Summary
Quality control for annotation works best in layers:
| Strategy | Implementation | When to Check |
|---|---|---|
| Attention checks | Surveyflow surveys | During annotation |
| Gold standards | Mixed into data | Post-collection |
| Redundancy | Multiple annotators per item | Post-collection |
| Agreement metrics | Python scripts | Post-collection |
| Timing analysis | Annotation timestamps | Post-collection |
| Platform features | Prolific/MTurk settings | Before/during collection |
Most of this analysis happens after collection, through post-processing scripts. Plan that pipeline before you collect anything, so you actually capture the fields you'll need.
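One way to confirm the pipeline captures what you need is to run a field check on pilot output before full collection. This is a minimal sketch: the record layout and the field names in `REQUIRED_FIELDS` are hypothetical, so replace them with whatever your analysis scripts actually read.

```python
# Hypothetical field names; substitute the ones your scripts depend on.
REQUIRED_FIELDS = ("annotator_id", "item_id", "label", "time_spent")


def find_missing_fields(records, required=REQUIRED_FIELDS):
    """Return {record_index: [missing field names]} for incomplete records."""
    problems = {}
    for index, record in enumerate(records):
        # A field counts as missing if it is absent or explicitly None.
        missing = [field for field in required if record.get(field) is None]
        if missing:
            problems[index] = missing
    return problems


pilot_records = [
    {"annotator_id": "w1", "item_id": "gold_001", "label": "Positive", "time_spent": 7.2},
    {"annotator_id": "w2", "item_id": "gold_001", "label": "Positive"},
]
print(find_missing_fields(pilot_records))
```

An empty result means every pilot record carries the fields the later checks rely on.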
Next Steps
- Learn about inter-annotator agreement calculations in detail
- Set up Prolific integration for crowdsourced annotation
- Configure training phases for annotator onboarding
For more on annotation workflows, see the annotation schemes documentation.