
Quality Control for Crowdsourced Annotation

Best practices for ensuring quality in crowdsourced and in-house annotation projects, including practical strategies you can implement with and beyond Potato.

By Potato Team

Quality control separates useful annotations from noise. This guide covers proven strategies for ensuring high-quality data from crowdsourced and in-house annotation projects.

Quality Control Overview

Effective quality control combines multiple strategies:

  1. Attention checks: Verify annotators are engaged with the task
  2. Redundancy: Collect multiple annotations per item
  3. Agreement metrics: Measure consistency across annotators
  4. Training and guidelines: Ensure annotators understand the task
  5. Manual review: Sample and review annotation quality

Attention Checks via Surveyflow

Potato supports basic attention checks through the surveyflow system. You can insert survey pages between annotation batches that ask annotators to confirm they're paying attention.

annotation_task_name: "Sentiment Annotation with Checks"
 
surveyflow:
  on: true
  order:
    - survey_instructions
    - annotation
    - survey_attention_check
    - annotation
    - survey_completion

Define attention check questions as a survey page:

# In your surveyflow survey definitions
survey_attention_check:
  - question: "To confirm you're paying attention, please select 'Strongly Agree'."
    type: radio
    options:
      - Strongly Disagree
      - Disagree
      - Neutral
      - Agree
      - Strongly Agree

Note that Potato's built-in attention check support is limited. For more sophisticated attention checks (automatic failure detection, ejecting annotators, etc.), you'll need to implement post-processing scripts or use your crowdsourcing platform's built-in quality features.
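As a minimal sketch of such post-processing: the item IDs, expected answers, and output format below are assumptions for illustration, so adapt them to how your deployment actually stores responses.

import json
 
# Hypothetical mapping of attention-check item IDs to their expected answers
ATTENTION_CHECKS = {
    "survey_attention_check_1": "Strongly Agree",
}
 
def fails_attention_checks(annotation_file, max_misses=0):
    """Return True if an annotator missed more attention checks than allowed.
 
    Assumes the annotator's output is a JSON file mapping item IDs to the
    selected label; adjust the lookup to match your actual output format.
    """
    with open(annotation_file) as f:
        annotations = json.load(f)
 
    misses = sum(
        1
        for item_id, expected in ATTENTION_CHECKS.items()
        if annotations.get(item_id) != expected
    )
    return misses > max_misses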

Redundancy: Multiple Annotations Per Item

Collecting multiple annotations per item is one of the most reliable quality control methods. Configure this in your data setup:

annotation_task_name: "Multi-Annotator Sentiment Task"
 
data_files:
  - path: data.json
    list_as_text: false
    sampling: random
 
# Control how many annotators see each item through assignment logic
# This is typically managed through your annotator assignment system

When using crowdsourcing platforms like Prolific, you can:

  • Post the same study (or HIT on MTurk) multiple times to get redundant annotations
  • Use different worker batches for the same data
  • Implement custom assignment logic in your data pipeline (a sketch follows below)
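If you implement custom assignment logic yourself, a simple round-robin schedule guarantees that every item receives a fixed number of annotations. A minimal sketch, with hypothetical function and variable names:

from collections import defaultdict
 
def assign_items(item_ids, annotator_ids, annotations_per_item=3):
    """Round-robin assignment so each item is seen by `annotations_per_item` annotators."""
    assignments = defaultdict(list)  # annotator ID -> list of item IDs
    for i, item_id in enumerate(item_ids):
        for offset in range(annotations_per_item):
            annotator = annotator_ids[(i + offset) % len(annotator_ids)]
            assignments[annotator].append(item_id)
    return assignments

You can then write out each annotator's item list in whatever form your setup expects assignments to be specified.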

Measuring Inter-Annotator Agreement

While Potato doesn't calculate agreement metrics automatically during annotation, you should calculate them during post-processing. Common metrics include:

Cohen's Kappa (Two Annotators)

For categorical annotations with two annotators:

from sklearn.metrics import cohen_kappa_score
 
# After collecting annotations
annotator1_labels = ["Positive", "Negative", "Positive", ...]
annotator2_labels = ["Positive", "Negative", "Neutral", ...]
 
kappa = cohen_kappa_score(annotator1_labels, annotator2_labels)
print(f"Cohen's Kappa: {kappa:.3f}")

Fleiss' Kappa (Multiple Annotators)

For three or more annotators:

from statsmodels.stats.inter_rater import fleiss_kappa
import numpy as np
 
# Build a matrix of label counts per item
# Each row is an item, each column is a label category
ratings_matrix = np.array([
    [3, 0, 0],  # Item 1: 3 Positive, 0 Negative, 0 Neutral
    [2, 1, 0],  # Item 2: 2 Positive, 1 Negative, 0 Neutral
    [0, 0, 3],  # Item 3: 0 Positive, 0 Negative, 3 Neutral
    ...
])
 
kappa = fleiss_kappa(ratings_matrix)
print(f"Fleiss' Kappa: {kappa:.3f}")

Interpretation Guidelines

Kappa Value    Interpretation
< 0.20         Poor agreement
0.21 - 0.40    Fair agreement
0.41 - 0.60    Moderate agreement
0.61 - 0.80    Substantial agreement
0.81 - 1.00    Near-perfect agreement
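If you want to report these bands automatically, a small helper that mirrors the table above is enough (the function name is arbitrary and the boundaries follow the rows as written):

def interpret_kappa(kappa):
    """Map a kappa value to the interpretation bands listed above."""
    if kappa <= 0.20:
        return "Poor agreement"
    if kappa <= 0.40:
        return "Fair agreement"
    if kappa <= 0.60:
        return "Moderate agreement"
    if kappa <= 0.80:
        return "Substantial agreement"
    return "Near-perfect agreement"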

Gold Standard Items

Gold standard items are pre-labeled items with known correct answers that you mix into your annotation data. This helps identify annotators who may be guessing or not paying attention.

Creating Gold Items

  1. Create a set of items with clear, unambiguous correct answers
  2. Have experts label these items
  3. Mix them into your regular annotation data, for example:

[
  {
    "id": "gold_001",
    "text": "I absolutely love this product! Best purchase ever!",
    "is_gold": true,
    "gold_label": "Positive"
  },
  {
    "id": "gold_002",
    "text": "This is terrible. Complete waste of money. Worst experience.",
    "is_gold": true,
    "gold_label": "Negative"
  },
  {
    "id": "regular_001",
    "text": "The product arrived on time and works as expected.",
    "is_gold": false
  }
]

Analyzing Gold Performance

After collection, analyze how each annotator performed on gold items:

import json
 
def calculate_gold_accuracy(annotations_file, gold_labels):
    with open(annotations_file) as f:
        annotations = json.load(f)
 
    annotator_scores = {}
 
    for item_id, item_annotations in annotations.items():
        if item_id in gold_labels:
            expected = gold_labels[item_id]
            for annotator, label in item_annotations.items():
                if annotator not in annotator_scores:
                    annotator_scores[annotator] = {'correct': 0, 'total': 0}
                annotator_scores[annotator]['total'] += 1
                if label == expected:
                    annotator_scores[annotator]['correct'] += 1
 
    for annotator, scores in annotator_scores.items():
        accuracy = scores['correct'] / scores['total']
        print(f"{annotator}: {accuracy:.1%} gold accuracy")
 
    return annotator_scores
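A hypothetical invocation, reusing the gold items defined earlier (the annotations file name is just a placeholder):

gold_labels = {"gold_001": "Positive", "gold_002": "Negative"}
scores = calculate_gold_accuracy("annotations.json", gold_labels)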

Time-Based Quality Indicators

Potato tracks annotation timing in the output files. Use this data to flag potentially low-quality annotations:

Analyzing Timing Data

import json
from statistics import mean, stdev
 
def analyze_timing(annotations_file):
    with open(annotations_file) as f:
        data = json.load(f)
 
    times = []
    for item in data.values():
        if 'time_spent' in item:
            times.append(item['time_spent'])
 
    avg_time = mean(times)
    std_time = stdev(times)
 
    # Flag annotations that are too fast (< 2 std below mean)
    threshold = max(avg_time - 2 * std_time, 2)  # At least 2 seconds
 
    flagged = [t for t in times if t < threshold]
    print(f"Average time: {avg_time:.1f}s")
    print(f"Flagged as too fast: {len(flagged)} items")

Platform-Level Quality Control

When using crowdsourcing platforms, leverage their built-in quality features:

Prolific

  • Use prescreening filters (approval rate, previous studies)
  • Set minimum completion time requirements
  • Use attention check questions in your pre-survey
  • Review submissions before approving payment

MTurk

  • Require minimum HIT approval rate (>95%)
  • Use qualification tests
  • Set up automatic approval/rejection based on criteria (see the sketch after this list)
  • Block workers who fail quality checks
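Automatic approval/rejection can be scripted against the MTurk API with boto3. The decision rule below (a gold-accuracy threshold) and the feedback strings are assumptions for illustration; plug in your own quality criteria.

import boto3
 
mturk = boto3.client("mturk", region_name="us-east-1")
 
def review_assignment(assignment_id, gold_accuracy, threshold=0.8):
    """Approve or reject a submitted assignment based on gold-item accuracy."""
    if gold_accuracy >= threshold:
        mturk.approve_assignment(
            AssignmentId=assignment_id,
            RequesterFeedback="Thank you for your careful work.",
        )
    else:
        mturk.reject_assignment(
            AssignmentId=assignment_id,
            RequesterFeedback="Too many quality checks were missed.",
        )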

Post-Processing Quality Checks

Implement automated checks on collected data:

import json
from collections import defaultdict
 
def group_by_annotator(data):
    # Helper assumed here: reshape the raw output into
    # {annotator_id: [{'label': ..., 'time_spent': ...}, ...]}.
    # Adapt this to the structure your deployment actually produces.
    grouped = defaultdict(list)
    for item_annotations in data.values():
        for annotator_id, annotation in item_annotations.items():
            grouped[annotator_id].append(annotation)
    return grouped
 
def quality_check_annotations(annotations_file):
    with open(annotations_file) as f:
        data = json.load(f)
 
    issues = []
 
    for annotator_id, items in group_by_annotator(data).items():
        labels = [item['label'] for item in items]
 
        # Check for single-label bias (always selecting the same option)
        unique_labels = set(labels)
        if len(unique_labels) == 1 and len(labels) > 10:
            issues.append(f"{annotator_id}: Only used label '{labels[0]}'")
 
        # Check for position bias (always selecting the first option)
        # Requires knowing option order in your schema
 
        # Check for very fast submissions
        times = [item.get('time_spent', 0) for item in items]
        avg_time = sum(times) / len(times) if times else 0
        if avg_time < 3:
            issues.append(f"{annotator_id}: Average time only {avg_time:.1f}s")
 
    return issues
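Run it over each collected output file and review anything it flags (the file name below is a placeholder):

for issue in quality_check_annotations("annotations.json"):
    print(issue)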

Best Practices

  1. Start with training: Use Potato's training phase to onboard annotators before real annotation begins

  2. Write clear guidelines: Ambiguous guidelines lead to disagreement that isn't about annotator quality

  3. Pilot first: Run a small pilot to identify issues before full deployment

  4. Mix check types: Combine attention checks, gold standards, and redundancy

  5. Calibrate thresholds: Start with lenient quality thresholds and tighten based on observed data

  6. Provide feedback: When possible, give annotators feedback to help them improve

  7. Monitor continuously: Quality can drift over time as annotators become fatigued

  8. Document decisions: Record how you handle edge cases and quality issues

Summary

Quality control for annotation requires a multi-layered approach:

Strategy             Implementation                  When to Check
Attention checks     Surveyflow surveys              During annotation
Gold standards       Mixed into data                 Post-collection
Redundancy           Multiple annotators per item    Post-collection
Agreement metrics    Python scripts                  Post-collection
Timing analysis      Annotation timestamps           Post-collection
Platform features    Prolific/MTurk settings         Before/during collection

Most quality control analysis happens after data collection through post-processing scripts. Plan your analysis pipeline before collecting data to ensure you capture the information you need.

Next Steps


For more on annotation workflows, see the annotation schemes documentation.