Quality Control for Crowdsourced Annotation

Quality control separates useful annotations from noise. This guide covers proven strategies for ensuring high-quality data from crowdsourced and in-house annotation projects.

Quality Control Overview

Effective quality control combines multiple strategies:

Attention checks: Verify annotators are engaged with the task
Redundancy: Collect multiple annotations per item
Agreement metrics: Measure consistency across annotators
Training and guidelines: Ensure annotators understand the task
Manual review: Sample and review annotation quality

Attention Checks via Surveyflow

Potato supports basic attention checks through the surveyflow system. You can insert survey pages between annotation batches that ask annotators to confirm they're paying attention.

yaml

annotation_task_name: "Sentiment Annotation with Checks"
 
surveyflow:
  on: true
  order:
    - survey_instructions
    - annotation
    - survey_attention_check
    - annotation
    - survey_completion

Define attention check questions as a survey page:

yaml

# In your surveyflow survey definitions
survey_attention_check:
  - question: "To confirm you're paying attention, please select 'Strongly Agree'."
    type: radio
    options:
      - Strongly Disagree
      - Disagree
      - Neutral
      - Agree
      - Strongly Agree

Note that Potato's built-in attention check support is limited. For more sophisticated attention checks (automatic failure detection, ejecting annotators, etc.), you'll need to implement post-processing scripts or use your crowdsourcing platform's built-in quality features.
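
The post-processing mentioned above could look like the following sketch. It assumes the attention check responses have already been exported as a list of records with the hypothetical fields `annotator_id` and `response`; Potato's actual output layout may differ, so treat the field names as placeholders.

```python
from collections import defaultdict

# Expected answer for the attention check question defined above.
EXPECTED_ANSWER = "Strongly Agree"

def find_failed_attention_checks(responses, expected=EXPECTED_ANSWER, max_failures=0):
    """Return a sorted list of annotators with more than `max_failures` failed checks.

    `responses` is assumed to be a list of dicts with the hypothetical keys
    'annotator_id' and 'response' (one dict per attention check shown).
    """
    failures = defaultdict(int)
    for record in responses:
        if record["response"] != expected:
            failures[record["annotator_id"]] += 1
    return sorted(a for a, n in failures.items() if n > max_failures)

example = [
    {"annotator_id": "user_a", "response": "Strongly Agree"},
    {"annotator_id": "user_b", "response": "Neutral"},
    {"annotator_id": "user_b", "response": "Strongly Agree"},
]
print(find_failed_attention_checks(example))  # ['user_b']
```

Raising `max_failures` tolerates an occasional slip, which may be fairer on long tasks.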

Redundancy: Multiple Annotations Per Item

Collecting multiple annotations per item is one of the most reliable quality control methods. Configure this in your data setup:

yaml

annotation_task_name: "Multi-Annotator Sentiment Task"
 
data_files:
  - path: data.json
    list_as_text: false
    sampling: random
 
# Control how many annotators see each item through assignment logic
# This is typically managed through your annotator assignment system

When using crowdsourcing platforms like Prolific, you can:

Post the same task multiple times (a study on Prolific, a HIT on MTurk) to get redundant annotations
Use different worker batches for the same data
Implement custom assignment logic in your data pipeline
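
Custom assignment logic can be as simple as the sketch below, which gives every item to a fixed number of distinct annotators while keeping workloads balanced. The function name `assign_items` and its parameters are illustrative and not part of Potato.

```python
import random

def assign_items(item_ids, annotator_ids, per_item=3, seed=0):
    """Assign each item to `per_item` distinct annotators, balancing workload.

    Returns {annotator_id: [item_id, ...]}. Assumes annotator_ids are unique.
    """
    if per_item > len(annotator_ids):
        raise ValueError("per_item cannot exceed the number of annotators")
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    load = {a: 0 for a in annotator_ids}
    assignments = {a: [] for a in annotator_ids}
    for item in item_ids:
        # Prefer the least-loaded annotators; break ties randomly.
        ranked = sorted(annotator_ids, key=lambda a: (load[a], rng.random()))
        for annotator in ranked[:per_item]:
            assignments[annotator].append(item)
            load[annotator] += 1
    return assignments

items = [f"item_{i}" for i in range(6)]
annotators = ["ann_a", "ann_b", "ann_c", "ann_d"]
assignments = assign_items(items, annotators, per_item=2)
for annotator, assigned in assignments.items():
    print(annotator, assigned)
```

Each annotator's list can then be written out as a separate data file or batch, depending on how your pipeline distributes work.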

Measuring Inter-Annotator Agreement

While Potato doesn't calculate agreement metrics automatically during annotation, you should calculate them during post-processing. Common metrics include:

Cohen's Kappa (Two Annotators)

For categorical annotations with two annotators:

python

from sklearn.metrics import cohen_kappa_score
 
# After collecting annotations: one label per item, in the same item order
# (short illustrative lists; replace with your own data)
annotator1_labels = ["Positive", "Negative", "Positive"]
annotator2_labels = ["Positive", "Negative", "Neutral"]
 
kappa = cohen_kappa_score(annotator1_labels, annotator2_labels)
print(f"Cohen's Kappa: {kappa:.3f}")

Fleiss' Kappa (Multiple Annotators)

For three or more annotators:

python

from statsmodels.stats.inter_rater import fleiss_kappa
import numpy as np
 
# Build a matrix of label counts per item
# Each row is an item, each column is a label category
# Every row must sum to the same number of annotators (3 here)
ratings_matrix = np.array([
    [3, 0, 0],  # Item 1: 3 Positive, 0 Negative, 0 Neutral
    [2, 1, 0],  # Item 2: 2 Positive, 1 Negative, 0 Neutral
    [0, 0, 3],  # Item 3: 0 Positive, 0 Negative, 3 Neutral
    # ... one row per remaining item
])
 
kappa = fleiss_kappa(ratings_matrix)
print(f"Fleiss' Kappa: {kappa:.3f}")
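
The count matrix that `fleiss_kappa` expects usually has to be built from raw per-item labels. The sketch below does that conversion in plain Python and also computes Fleiss' kappa by hand, which can serve as a sanity check on the library result. The helper names are illustrative, and the code assumes every item received the same number of ratings (at least two).

```python
from collections import Counter

def build_ratings_matrix(labels_per_item, categories):
    """Convert raw labels into per-item category counts.

    `labels_per_item` is a list of lists holding the labels each item received.
    """
    n_raters = len(labels_per_item[0])
    matrix = []
    for labels in labels_per_item:
        if len(labels) != n_raters:
            raise ValueError("each item needs the same number of ratings")
        counts = Counter(labels)
        matrix.append([counts.get(c, 0) for c in categories])
    return matrix

def fleiss_kappa_from_counts(matrix):
    """Fleiss' kappa from a count matrix (rows: items, columns: categories).

    Undefined (division by zero) when all ratings fall in a single category.
    """
    n_items = len(matrix)
    n_raters = sum(matrix[0])
    # Proportion of all ratings that fall in each category.
    totals = [sum(col) for col in zip(*matrix)]
    p_cat = [t / (n_items * n_raters) for t in totals]
    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in matrix
    ]
    p_bar = sum(p_items) / n_items
    p_exp = sum(p * p for p in p_cat)  # agreement expected by chance
    return (p_bar - p_exp) / (1 - p_exp)

raw_labels = [
    ["Positive", "Positive", "Positive"],
    ["Positive", "Positive", "Negative"],
    ["Neutral", "Neutral", "Neutral"],
]
matrix = build_ratings_matrix(raw_labels, ["Positive", "Negative", "Neutral"])
print(matrix)  # [[3, 0, 0], [2, 1, 0], [0, 0, 3]]
print(f"{fleiss_kappa_from_counts(matrix):.3f}")  # 0.609
```

statsmodels also ships an `aggregate_raters` function in the same module for this kind of conversion, if you prefer to stay within the library.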

Interpretation Guidelines

Kappa Value	Interpretation
≤ 0.20	Poor agreement
0.21 - 0.40	Fair agreement
0.41 - 0.60	Moderate agreement
0.61 - 0.80	Substantial agreement
0.81 - 1.00	Near-perfect agreement
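
For reports, the bands above can be applied with a small helper like this one. It is only a convenience sketch; assigning values that fall exactly on a boundary to the lower band is an assumption made here.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the interpretation bands listed above."""
    if kappa <= 0.20:
        return "Poor agreement"
    if kappa <= 0.40:
        return "Fair agreement"
    if kappa <= 0.60:
        return "Moderate agreement"
    if kappa <= 0.80:
        return "Substantial agreement"
    return "Near-perfect agreement"

print(interpret_kappa(0.55))  # Moderate agreement
```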

Gold Standard Items

Gold standard items are pre-labeled items with known correct answers that you mix into your annotation data. They help identify annotators who may be guessing or not paying attention.

Creating Gold Items

Create a set of items with clear, unambiguous correct answers
Have experts label these items
Mix them into your regular annotation data

json

[
  {
    "id": "gold_001",
    "text": "I absolutely love this product! Best purchase ever!",
    "is_gold": true,
    "gold_label": "Positive"
  },
  {
    "id": "gold_002",
    "text": "This is terrible. Complete waste of money. Worst experience.",
    "is_gold": true,
    "gold_label": "Negative"
  },
  {
    "id": "regular_001",
    "text": "The product arrived on time and works as expected.",
    "is_gold": false
  }
]
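
Before deployment, it can help to split a file like this into an answer key and an annotator-facing copy without the gold fields. The sketch below assumes the `is_gold` and `gold_label` field names from the example above; stripping them is a precaution, since whether extra fields are ever exposed depends on your setup.

```python
def split_gold(items):
    """Separate the answer key from the data shown to annotators.

    Returns (gold_labels, public_items), where gold_labels maps
    item id -> expected label and public_items omits the gold fields.
    """
    gold_labels = {
        item["id"]: item["gold_label"] for item in items if item.get("is_gold")
    }
    public_items = [
        {k: v for k, v in item.items() if k not in ("is_gold", "gold_label")}
        for item in items
    ]
    return gold_labels, public_items

items = [
    {"id": "gold_001", "text": "I absolutely love this product!",
     "is_gold": True, "gold_label": "Positive"},
    {"id": "regular_001", "text": "The product arrived on time.",
     "is_gold": False},
]
gold_labels, public_items = split_gold(items)
print(gold_labels)  # {'gold_001': 'Positive'}
```

The resulting `gold_labels` dictionary has the item-id-to-label shape that a later accuracy analysis needs.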

Analyzing Gold Performance

After collection, analyze how each annotator performed on gold items:

python

import json
 
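def calculate_gold_accuracy(annotations_file, gold_labels):
    """Print and return each annotator's score on gold items.

    Assumes the file maps item_id -> {annotator_id: label} and that
    gold_labels maps item_id -> expected label; adapt to your output format.
    """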
    with open(annotations_file) as f:
        annotations = json.load(f)
 
    annotator_scores = {}
 
    for item_id, item_annotations in annotations.items():
        if item_id in gold_labels:
            expected = gold_labels[item_id]
            for annotator, label in item_annotations.items():
                if annotator not in annotator_scores:
                    annotator_scores[annotator] = {'correct': 0, 'total': 0}
                annotator_scores[annotator]['total'] += 1
                if label == expected:
                    annotator_scores[annotator]['correct'] += 1
 
    for annotator, scores in annotator_scores.items():
        accuracy = scores['correct'] / scores['total']
        print(f"{annotator}: {accuracy:.1%} gold accuracy")
 
    return annotator_scores
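
Once per-annotator scores are available, flagging is a simple threshold check. In this sketch the cutoffs (`min_accuracy=0.8`, `min_items=5`) are illustrative assumptions, not recommended values; the input is the dictionary of `correct` and `total` counts built above.

```python
def flag_low_gold_accuracy(annotator_scores, min_accuracy=0.8, min_items=5):
    """Return annotators whose gold accuracy is below `min_accuracy`.

    Annotators with fewer than `min_items` gold items are skipped,
    because an accuracy based on a handful of items is too noisy to act on.
    """
    flagged = []
    for annotator, scores in annotator_scores.items():
        if scores["total"] < min_items:
            continue
        if scores["correct"] / scores["total"] < min_accuracy:
            flagged.append(annotator)
    return sorted(flagged)

scores = {
    "user_a": {"correct": 9, "total": 10},
    "user_b": {"correct": 5, "total": 10},
    "user_c": {"correct": 0, "total": 2},  # too few gold items to judge
}
print(flag_low_gold_accuracy(scores))  # ['user_b']
```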

Time-Based Quality Indicators

Potato tracks annotation timing in the output files. Use this data to flag potentially low-quality annotations, such as items completed implausibly fast.

Analyzing Timing Data

python

import json
from statistics import mean, stdev
 
def analyze_timing(annotations_file):
    with open(annotations_file) as f:
        data = json.load(f)
 
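    # Collect per-item times, skipping items without timing data
    times = [item['time_spent'] for item in data.values() if 'time_spent' in item]
    if len(times) < 2:
        # stdev() needs at least two data points
        print("Not enough timing data to compute a threshold")
        return []

    avg_time = mean(times)
    std_time = stdev(times)

    # Flag annotations that are too fast (more than 2 std below the mean),
    # never letting the threshold drop below 2 seconds
    threshold = max(avg_time - 2 * std_time, 2)

    flagged = [t for t in times if t < threshold]
    print(f"Average time: {avg_time:.1f}s")
    print(f"Flagged as too fast: {len(flagged)} items")
    return flagged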
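
Annotation times are often right-skewed (a few very long pauses inflate the standard deviation), in which case a mean-minus-two-standard-deviations threshold can fall below zero and only the 2-second floor does any work. A median-relative threshold is one possible alternative. The sketch below is an assumption-laden variant, with `fraction=0.25` chosen purely for illustration.

```python
from statistics import median

def flag_fast_annotations(times, fraction=0.25, floor_seconds=2.0):
    """Flag times below `fraction` of the median, with a fixed lower floor."""
    if not times:
        return []
    threshold = max(median(times) * fraction, floor_seconds)
    return [t for t in times if t < threshold]

times = [1.5, 12, 14, 15, 18, 20, 95]
print(flag_fast_annotations(times))  # [1.5]
```

Here the median is 15 seconds, so the threshold is 3.75 seconds and the single 95-second outlier has no effect on it.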

Platform-Level Quality Control

When using crowdsourcing platforms, leverage their built-in quality features:

Prolific

Use prescreening filters (approval rate, previous studies)
Set minimum completion time requirements
Use attention check questions in your pre-survey
Review submissions before approving payment

MTurk

Require minimum HIT approval rate (>95%)
Use qualification tests
Set up automatic approval/rejection based on criteria
Block workers who fail quality checks

Post-Processing Quality Checks

Implement automated checks on collected data:

python

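import json
from collections import defaultdict

def group_by_annotator(data):
    """Group annotation records by annotator.

    Assumes `data` is a list of dicts with an 'annotator_id' key;
    adapt this helper to your actual output format.
    """
    grouped = defaultdict(list)
    for record in data:
        grouped[record['annotator_id']].append(record)
    return grouped

def quality_check_annotations(annotations_file):
    with open(annotations_file) as f:
        data = json.load(f)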
 
    issues = []
 
    for annotator_id, items in group_by_annotator(data).items():
        labels = [item['label'] for item in items]
 
        # Check for single-label bias (always selecting same option)
        unique_labels = set(labels)
        if len(unique_labels) == 1 and len(labels) > 10:
            issues.append(f"{annotator_id}: Only used label '{labels[0]}'")
 
        # Check for position bias (always selecting first option)
        # Requires knowing option order in your schema
 
        # Check for very fast submissions (ignore items without timing data,
        # so missing values are not counted as 0 seconds)
        times = [item['time_spent'] for item in items if 'time_spent' in item]
        if times:
            avg_time = sum(times) / len(times)
            if avg_time < 3:
                issues.append(f"{annotator_id}: Average time only {avg_time:.1f}s")
 
    return issues
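
The position-bias check left as a comment above could be sketched as follows, given the option order from your schema. The function names and the cutoffs (`max_rate=0.9`, `min_items=10`) are illustrative assumptions. A high first-option rate is only a signal worth reviewing, since the first option may simply be the most common correct label.

```python
from collections import Counter

def first_option_rate(labels, option_order):
    """Fraction of labels equal to the first option shown in the schema."""
    if not labels:
        return 0.0
    return Counter(labels)[option_order[0]] / len(labels)

def check_position_bias(labels_by_annotator, option_order, max_rate=0.9, min_items=10):
    """Report annotators who pick the first option more often than `max_rate`."""
    issues = []
    for annotator_id, labels in labels_by_annotator.items():
        if len(labels) < min_items:
            continue  # too few annotations to judge
        rate = first_option_rate(labels, option_order)
        if rate > max_rate:
            issues.append(
                f"{annotator_id}: chose first option '{option_order[0]}' "
                f"{rate:.0%} of the time"
            )
    return issues

options = ["Positive", "Negative", "Neutral"]
labels_by_annotator = {
    "user_a": ["Positive"] * 19 + ["Negative"],
    "user_b": ["Positive", "Negative"] * 10,
}
print(check_position_bias(labels_by_annotator, options))
```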

Best Practices

Start with training: Use Potato's training phase to onboard annotators before real annotation begins
Write clear guidelines: Ambiguous guidelines lead to disagreement that isn't about annotator quality
Pilot first: Run a small pilot to identify issues before full deployment
Mix check types: Combine attention checks, gold standards, and redundancy
Calibrate thresholds: Start with lenient quality thresholds and tighten based on observed data
Provide feedback: When possible, give annotators feedback to help them improve
Monitor continuously: Quality can drift over time as annotators become fatigued
Document decisions: Record how you handle edge cases and quality issues

Summary

Quality control for annotation requires a multi-layered approach:

Strategy	Implementation	When to Check
Attention checks	Surveyflow surveys	During annotation
Gold standards	Mixed into data	Post-collection
Redundancy	Multiple annotators per item	Post-collection
Agreement metrics	Python scripts	Post-collection
Timing analysis	Annotation timestamps	Post-collection
Platform features	Prolific/MTurk settings	Before/during collection

Most quality control analysis happens after data collection through post-processing scripts. Plan your analysis pipeline before collecting data to ensure you capture the information you need.

Next Steps

Learn about inter-annotator agreement calculations in detail
Set up Prolific integration for crowdsourced annotation
Configure training phases for annotator onboarding

For more on annotation workflows, see the annotation schemes documentation.