# Quality Control for Crowdsourced Annotation

Source: https://www.potatoannotator.com/blog/quality-control-strategies

Quality control is what separates useful annotations from noise. This guide covers strategies that hold up in practice for crowdsourced and in-house annotation projects. For the underlying features, see the [quality control documentation](https://github.com/davidjurgens/potato/blob/master/docs/workflow/quality_control.md).

## Quality Control Overview

No single check is enough on its own, so most projects layer a few together:

1. **Attention checks**: Verify annotators are engaged with the task
2. **Redundancy**: Collect multiple annotations per item
3. **Agreement metrics**: Measure consistency across annotators
4. **Training and guidelines**: Ensure annotators understand the task
5. **Manual review**: Sample and review annotation quality

## Attention Checks via Surveyflow

Potato supports basic attention checks through the surveyflow system. You can insert survey pages between annotation batches that ask annotators to confirm they're paying attention.

```yaml
annotation_task_name: "Sentiment Annotation with Checks"

surveyflow:
  on: true
  order:
    - survey_instructions
    - annotation
    - survey_attention_check
    - annotation
    - survey_completion
```

Define attention check questions as a survey page:

```yaml
# In your surveyflow survey definitions
survey_attention_check:
  - question: "To confirm you're paying attention, please select 'Strongly Agree'."
    type: radio
    options:
      - Strongly Disagree
      - Disagree
      - Neutral
      - Agree
      - Strongly Agree
```

Potato's built-in attention check support is limited. If you want automatic failure detection, ejecting annotators, and similar logic, you'll need post-processing scripts or your crowdsourcing platform's own quality features.
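As a rough illustration of the post-processing route, the sketch below counts failed attention checks per annotator. The record layout (`annotator`, `question_id`, `answer`) and the `attention_check_1` id are hypothetical; adapt them to however your survey responses are actually exported.

```python
from collections import defaultdict


def find_failed_attention_checks(responses, expected, max_failures=0):
    """Return {annotator: failure_count} for annotators above max_failures."""
    failures = defaultdict(int)
    for record in responses:
        correct = expected.get(record["question_id"])
        if correct is None:
            continue  # not an attention-check question
        if record["answer"] != correct:
            failures[record["annotator"]] += 1
    return {a: n for a, n in failures.items() if n > max_failures}


# Hypothetical answer key and responses (field names are assumptions)
expected_answers = {"attention_check_1": "Strongly Agree"}
responses = [
    {"annotator": "worker_a", "question_id": "attention_check_1", "answer": "Strongly Agree"},
    {"annotator": "worker_b", "question_id": "attention_check_1", "answer": "Neutral"},
]

failed = find_failed_attention_checks(responses, expected_answers)
print(failed)  # {'worker_b': 1}
```

Setting `max_failures` above zero tolerates an occasional slip, which may be fairer on long tasks.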

## Redundancy: Multiple Annotations Per Item

Collecting multiple annotations per item is one of the more reliable quality control methods. Configure it in your data setup:

```yaml
annotation_task_name: "Multi-Annotator Sentiment Task"

data_files:
  - path: data.json
    list_as_text: false
    sampling: random

# Control how many annotators see each item through assignment logic
# This is typically managed through your annotator assignment system
```

When using crowdsourcing platforms like Prolific, you can:
- Post the same task multiple times to get redundant annotations
- Use different worker batches for the same data
- Implement custom assignment logic in your data pipeline
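Custom assignment logic can be as simple as the sketch below, which gives every item to a fixed number of distinct annotators while keeping workloads roughly even. The function name and parameters are illustrative only and are not part of Potato.

```python
import random


def assign_items(item_ids, annotator_ids, annotations_per_item=3, seed=0):
    """Assign each item to `annotations_per_item` distinct annotators."""
    if annotations_per_item > len(annotator_ids):
        raise ValueError("Need at least as many annotators as annotations per item")
    rng = random.Random(seed)
    load = {a: 0 for a in annotator_ids}
    assignments = {a: [] for a in annotator_ids}
    for item_id in item_ids:
        # Prefer the least-loaded annotators; break ties randomly
        candidates = sorted(annotator_ids, key=lambda a: (load[a], rng.random()))
        for annotator in candidates[:annotations_per_item]:
            assignments[annotator].append(item_id)
            load[annotator] += 1
    return assignments


items = [f"item_{i}" for i in range(6)]
annotators = ["worker_a", "worker_b", "worker_c", "worker_d"]
assignments = assign_items(items, annotators, annotations_per_item=2)
for annotator, assigned in assignments.items():
    print(annotator, assigned)
```

Each annotator's list could then be written out as a separate data file or batch, depending on how your pipeline distributes work.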

## Measuring Inter-Annotator Agreement

Potato doesn't calculate agreement metrics automatically during annotation, so compute them during post-processing. Common metrics include:

### Cohen's Kappa (Two Annotators)

For categorical annotations with two annotators:

```python
from sklearn.metrics import cohen_kappa_score

# After collecting annotations: one label per item, in the same item order
annotator1_labels = ["Positive", "Negative", "Positive"]  # ...one label per item
annotator2_labels = ["Positive", "Negative", "Neutral"]  # ...one label per item

kappa = cohen_kappa_score(annotator1_labels, annotator2_labels)
print(f"Cohen's Kappa: {kappa:.3f}")
```

### Fleiss' Kappa (Multiple Annotators)

For three or more annotators:

```python
from statsmodels.stats.inter_rater import fleiss_kappa
import numpy as np

# Build a matrix of label counts per item
# Each row is an item, each column is a label category
# Every row must sum to the same number of annotators (3 here)
ratings_matrix = np.array([
    [3, 0, 0],  # Item 1: 3 Positive, 0 Negative, 0 Neutral
    [2, 1, 0],  # Item 2: 2 Positive, 1 Negative, 0 Neutral
    [0, 0, 3],  # Item 3: 0 Positive, 0 Negative, 3 Neutral
    # ...one row per remaining item
])

kappa = fleiss_kappa(ratings_matrix)
print(f"Fleiss' Kappa: {kappa:.3f}")
```
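To make the count matrix less of a black box, here is a dependency-free sketch that builds it from raw per-item labels and computes Fleiss' kappa by hand. The helper names are made up for this example, and it assumes every item received the same number of ratings.

```python
from collections import Counter


def build_count_matrix(labels_per_item, categories):
    """Turn per-item label lists into rows of per-category counts."""
    matrix = []
    for labels in labels_per_item:
        counts = Counter(labels)
        matrix.append([counts[c] for c in categories])
    return matrix


def fleiss_kappa_from_counts(matrix):
    """Fleiss' kappa for a count matrix with equal ratings per item."""
    n_items = len(matrix)
    n_raters = sum(matrix[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in matrix):
        raise ValueError("Every item needs the same number of ratings (at least 2)")
    n_categories = len(matrix[0])
    # Share of all ratings that fell into each category
    p_cat = [sum(row[j] for row in matrix) / (n_items * n_raters)
             for j in range(n_categories)]
    # Per-item agreement: fraction of rater pairs that agree
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in matrix]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    # Undefined (division by zero) if every rating uses one single category
    return (p_observed - p_expected) / (1 - p_expected)


categories = ["Positive", "Negative", "Neutral"]
labels_per_item = [
    ["Positive", "Positive", "Positive"],
    ["Positive", "Negative", "Positive"],
    ["Neutral", "Neutral", "Neutral"],
]
matrix = build_count_matrix(labels_per_item, categories)
kappa = fleiss_kappa_from_counts(matrix)
print(matrix)          # [[3, 0, 0], [2, 1, 0], [0, 0, 3]]
print(f"{kappa:.3f}")  # 0.609
```

For real projects the library implementation is the safer choice; this version is only meant to show where the number comes from.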

### Interpretation Guidelines

| Kappa Value | Interpretation |
|-------------|----------------|
| ≤ 0.20 | Poor agreement |
| 0.21 - 0.40 | Fair agreement |
| 0.41 - 0.60 | Moderate agreement |
| 0.61 - 0.80 | Substantial agreement |
| 0.81 - 1.00 | Near-perfect agreement |

Potato supports attention checks, gold items, and inter-annotator agreement tracking to maintain annotation quality:

![Likert scale quality control in Potato](/images/blog/likert-rating.png)

## Gold Standard Items

Gold standard items are pre-labeled items with known correct answers that you mix into your data. They help you catch annotators who are guessing or not paying attention.

### Creating Gold Items

1. Create a set of items with clear, unambiguous correct answers
2. Have experts label these items
3. Mix them into your regular annotation data

```json
[
  {
    "id": "gold_001",
    "text": "I absolutely love this product! Best purchase ever!",
    "is_gold": true,
    "gold_label": "Positive"
  },
  {
    "id": "gold_002",
    "text": "This is terrible. Complete waste of money. Worst experience.",
    "is_gold": true,
    "gold_label": "Negative"
  },
  {
    "id": "regular_001",
    "text": "The product arrived on time and works as expected.",
    "is_gold": false
  }
]
```
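One possible way to handle the mixing step is sketched below: shuffle gold and regular items together so the gold items are not clustered, then pull out an answer key for later scoring. The function names are illustrative, and the field names follow the JSON example above.

```python
import random


def mix_in_gold(regular_items, gold_items, seed=0):
    """Shuffle gold items in among regular ones so they are not clustered."""
    mixed = list(regular_items) + list(gold_items)
    random.Random(seed).shuffle(mixed)
    return mixed


def extract_gold_labels(items):
    """Build the {item_id: gold_label} answer key from items marked is_gold."""
    return {item["id"]: item["gold_label"] for item in items if item.get("is_gold")}


gold_items = [
    {"id": "gold_001", "text": "I absolutely love this product! Best purchase ever!",
     "is_gold": True, "gold_label": "Positive"},
    {"id": "gold_002", "text": "This is terrible. Complete waste of money. Worst experience.",
     "is_gold": True, "gold_label": "Negative"},
]
regular_items = [
    {"id": "regular_001", "text": "The product arrived on time and works as expected.",
     "is_gold": False},
]

mixed = mix_in_gold(regular_items, gold_items)
gold_labels = extract_gold_labels(mixed)
print([item["id"] for item in mixed])
print(gold_labels)
```

The resulting `gold_labels` dictionary maps item ids to expected labels, which is a convenient shape for scoring annotators after collection.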

### Analyzing Gold Performance

After collection, analyze how each annotator performed on gold items:

```python
import json

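def calculate_gold_accuracy(annotations_file, gold_labels):
    """Print and return per-annotator accuracy on gold items.

    Expects the file to map item_id -> {annotator: label},
    and gold_labels to map item_id -> expected label.
    """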
    with open(annotations_file) as f:
        annotations = json.load(f)

    annotator_scores = {}

    for item_id, item_annotations in annotations.items():
        if item_id in gold_labels:
            expected = gold_labels[item_id]
            for annotator, label in item_annotations.items():
                if annotator not in annotator_scores:
                    annotator_scores[annotator] = {'correct': 0, 'total': 0}
                annotator_scores[annotator]['total'] += 1
                if label == expected:
                    annotator_scores[annotator]['correct'] += 1

    for annotator, scores in annotator_scores.items():
        accuracy = scores['correct'] / scores['total']
        print(f"{annotator}: {accuracy:.1%} gold accuracy")

    return annotator_scores
```

## Time-Based Quality Indicators

Potato tracks annotation timing in the output files. Use this data to flag potentially low-quality annotations:

### Analyzing Timing Data

```python
import json
from statistics import mean, stdev

def analyze_timing(annotations_file):
    with open(annotations_file) as f:
        data = json.load(f)

    times = []
    for item in data.values():
        if 'time_spent' in item:
            times.append(item['time_spent'])

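    if len(times) < 2:
        # stdev() needs at least two data points
        print("Not enough timing data to analyze")
        return []

    avg_time = mean(times)
    std_time = stdev(times)

    # Flag annotations more than 2 standard deviations faster than the mean,
    # but never use a threshold below 2 seconds
    threshold = max(avg_time - 2 * std_time, 2)

    flagged = [t for t in times if t < threshold]
    print(f"Average time: {avg_time:.1f}s")
    print(f"Flagged as too fast: {len(flagged)} items")
    return flagged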
```
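Annotation times tend to be right-skewed, since a few long pauses inflate both the mean and the standard deviation. In that case `mean - 2 * std` can drop below zero and only the 2-second floor does any work. A median-based cutoff is one alternative; the sketch below is a swapped-in heuristic, not something Potato provides, and the 25% fraction is an arbitrary starting point to tune on your own data.

```python
from statistics import median


def flag_fast_annotations(times_by_item, fraction_of_median=0.25, floor_seconds=2.0):
    """Return ids of items annotated faster than a fraction of the median time."""
    if not times_by_item:
        return []
    threshold = max(median(times_by_item.values()) * fraction_of_median, floor_seconds)
    return [item_id for item_id, t in times_by_item.items() if t < threshold]


# Hypothetical timings in seconds, keyed by item id
times_by_item = {"a": 12.0, "b": 15.0, "c": 1.5, "d": 40.0, "e": 11.0}
fast_items = flag_fast_annotations(times_by_item)
print(fast_items)  # ['c']
```

Here the median is 12 seconds, so the cutoff is 3 seconds and the 40-second outlier has no effect on it.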

## Platform-Level Quality Control

When using crowdsourcing platforms, leverage their built-in quality features:

### Prolific

- Use prescreening filters (approval rate, previous studies)
- Set minimum completion time requirements
- Use attention check questions in your pre-survey
- Review submissions before approving payment

### MTurk

- Require minimum HIT approval rate (>95%)
- Use qualification tests
- Set up automatic approval/rejection based on criteria
- Block workers who fail quality checks

## Post-Processing Quality Checks

Implement automated checks on collected data:

```python
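import json
from collections import defaultdict

def group_by_annotator(data):
    # Assumed layout: {item_id: {annotator_id: {"label": ..., "time_spent": ...}}}
    # Adjust this helper to match your actual output format
    grouped = defaultdict(list)
    for item_annotations in data.values():
        for annotator_id, annotation in item_annotations.items():
            grouped[annotator_id].append(annotation)
    return grouped

def quality_check_annotations(annotations_file):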
    with open(annotations_file) as f:
        data = json.load(f)

    issues = []

    for annotator_id, items in group_by_annotator(data).items():
        labels = [item['label'] for item in items]

        # Check for single-label bias (always selecting same option)
        unique_labels = set(labels)
        if len(unique_labels) == 1 and len(labels) > 10:
            issues.append(f"{annotator_id}: Only used label '{labels[0]}'")

        # Check for position bias (always selecting first option)
        # Requires knowing option order in your schema

        # Check for very fast submissions (skip items with no timing data)
        times = [item['time_spent'] for item in items if 'time_spent' in item]
        avg_time = sum(times) / len(times) if times else None
        if avg_time is not None and avg_time < 3:
            issues.append(f"{annotator_id}: Average time only {avg_time:.1f}s")

    return issues
```
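The position-bias check left as a comment above could look something like the following sketch. It assumes you know the order in which options were displayed (`option_order`) and that the order was the same for every item; the function name and thresholds are illustrative.

```python
def find_position_bias(labels_by_annotator, option_order, max_first_rate=0.9, min_items=10):
    """Flag annotators who pick the first displayed option suspiciously often."""
    first_option = option_order[0]
    flagged = {}
    for annotator_id, labels in labels_by_annotator.items():
        if len(labels) < min_items:
            continue  # too few items to judge
        first_rate = sum(1 for label in labels if label == first_option) / len(labels)
        if first_rate > max_first_rate:
            flagged[annotator_id] = first_rate
    return flagged


option_order = ["Positive", "Negative", "Neutral"]  # order shown in the interface
labels_by_annotator = {
    "worker_a": ["Positive"] * 12,
    "worker_b": ["Positive", "Negative", "Neutral", "Negative"] * 3,
}
biased = find_position_bias(labels_by_annotator, option_order)
print(biased)  # {'worker_a': 1.0}
```

A high first-option rate is only a warning sign. If the first option really is the dominant class in your data, compare each annotator against the overall rate before drawing conclusions.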

## Best Practices

Onboard annotators with Potato's training phase before any real annotation starts. Write clear guidelines, since ambiguous ones cause disagreement that has nothing to do with annotator quality. Run a small pilot to surface problems before you deploy at scale. Mix your checks rather than relying on one: attention checks, gold standards, and redundancy each catch different things. Start with lenient thresholds and tighten them once you've seen real data. Give annotators feedback when you can. Keep monitoring, because quality drifts as people get tired. And write down how you handle edge cases so your decisions stay consistent.

For setting up the onboarding step specifically, see the [training phase documentation](https://github.com/davidjurgens/potato/blob/master/docs/workflow/training_phase.md).

## Summary

Quality control for annotation works best in layers:

| Strategy | Implementation | When to Check |
|----------|---------------|---------------|
| Attention checks | Surveyflow surveys | During annotation |
| Gold standards | Mixed into data | Post-collection |
| Redundancy | Multiple annotators per item | Post-collection |
| Agreement metrics | Python scripts | Post-collection |
| Timing analysis | Annotation timestamps | Post-collection |
| Platform features | Prolific/MTurk settings | Before/during collection |

Most of this analysis happens after collection, through post-processing scripts. Plan that pipeline before you collect anything, so you actually capture the fields you'll need.

## Next Steps

- Learn about [inter-annotator agreement](/blog/inter-annotator-agreement) calculations in detail
- Set up [Prolific integration](/blog/prolific-integration) for crowdsourced annotation
- Configure [training phases](/docs/features/training-phase) for annotator onboarding

---

*For more on annotation workflows, see the [annotation schemes documentation](/docs/core-concepts/annotation-schemes).*