Active Learning: Annotate Smarter, Not Harder

Active learning intelligently selects which items to annotate next, focusing human effort where it matters most. This guide shows how to reduce annotation effort by up to 50% while maintaining model quality.

What is Active Learning?

Instead of randomly sampling data to annotate, active learning:

Trains a model on current annotations
Identifies items where the model is uncertain
Prioritizes those items for human annotation
Repeats the cycle, so each round of new labels improves the next selection
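
The four steps above can be sketched as a small loop. This is a toy illustration, not Potato's implementation: the `train`, `predict_proba`, and `oracle` names are hypothetical stand-ins, and the "model" is just the mean feature value of each class on one-dimensional data.

```python
import math

def train(labeled):
    """Toy 1-D 'model' (hypothetical): the mean feature value of each class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_proba(model, x):
    """Normalized exp(-distance) to each class mean, used as class probabilities."""
    weights = {y: math.exp(-abs(x - center)) for y, center in model.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

def active_learning_loop(labeled, pool, oracle, rounds):
    labeled, pool = list(labeled), list(pool)
    for _ in range(rounds):                      # 4. repeat
        if not pool:
            break
        model = train(labeled)                   # 1. train on current annotations
        # 2. confidence = probability of the most likely label
        confidence = lambda x: max(predict_proba(model, x).values())
        pick = min(pool, key=confidence)         # 3. least confident item goes first
        pool.remove(pick)
        labeled.append((pick, oracle(pick)))     # the "human" supplies the label
    return labeled

# The oracle stands in for a human annotator in this sketch.
oracle = lambda x: "pos" if x >= 5 else "neg"
seed = [(0.0, "neg"), (10.0, "pos")]
result = active_learning_loop(seed, [1.0, 4.9, 9.0, 5.2], oracle, rounds=2)
```

In this toy run the two items nearest the class boundary (4.9 and 5.2) are requested before the easy ones, which is the behavior the loop is meant to produce.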

Why Use Active Learning?

Reduce annotation cost: Label fewer items for the same model quality
Faster iteration: Get usable models sooner
Focus expertise: Direct human attention to difficult cases
Better coverage: Ensure edge cases are represented

Basic Active Learning Setup

yaml

annotation_task_name: "Active Learning Classification"
 
data_files:
  - "data/unlabeled_pool.json"
 
# Active learning configuration
active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
 
  # Sampling settings
  max_instances_to_reorder: 1000  # Number of instances to reorder by uncertainty
  random_sample_percent: 0.1  # 10% random sampling to maintain diversity
 
annotation_schemes:
  - annotation_type: radio
    name: category
    labels: [Positive, Negative, Neutral]

How Uncertainty Sampling Works

Potato's active learning uses uncertainty sampling to prioritize items where the classifier is least confident. The classifier predicts labels for unlabeled instances, and those with the lowest confidence scores are presented first for annotation.
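
One common way to turn "least confident" into a number is the least-confidence score, one minus the highest predicted class probability. The sketch below assumes that scoring rule; the guide does not state which exact formula Potato uses, and the function names and sample probabilities are made up for illustration.

```python
def least_confidence(probs):
    """Uncertainty as 1 minus the probability of the most likely class."""
    return 1.0 - max(probs)

def rank_by_uncertainty(predictions):
    """Return instance ids ordered from most to least uncertain.

    predictions maps an instance id to its list of class probabilities.
    """
    return sorted(
        predictions,
        key=lambda instance_id: least_confidence(predictions[instance_id]),
        reverse=True,
    )

# Illustrative probabilities for three classes (Positive, Negative, Neutral).
predictions = {
    "a": [0.90, 0.05, 0.05],   # confident prediction
    "b": [0.40, 0.35, 0.25],   # nearly a three-way tie
    "c": [0.60, 0.30, 0.10],
}
order = rank_by_uncertainty(predictions)
```

Here `order` places "b" first, then "c", then "a", so the near-tie would be shown to annotators before the confident prediction.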

The classifier_name field accepts any scikit-learn-compatible classifier, given by its full module path:

yaml

active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"

Other classifier options include:

sklearn.ensemble.RandomForestClassifier
sklearn.svm.SVC (with probability=True, so that it can produce probability estimates)
sklearn.naive_bayes.MultinomialNB
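
A full module path like these can be turned into a class with Python's `importlib`. The sketch below is an assumption about how such a string might be resolved, not Potato's actual loading code; `resolve_class` is a hypothetical helper, and a standard-library class is used so the example runs without scikit-learn installed.

```python
import importlib

def resolve_class(dotted_path):
    """Split 'package.module.ClassName' and import the named class."""
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

# Standard-library stand-in; with scikit-learn installed, the same call
# should also work for "sklearn.linear_model.LogisticRegression".
counter_cls = resolve_class("collections.Counter")
counts = counter_cls("aab")
```

With scikit-learn itself, `SVC(probability=True)` is the constructor argument that enables `predict_proba`, which is why that option is called out in the list above.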

Complete Configuration

yaml

annotation_task_name: "Active Learning for Sentiment"
 
data_files:
  - "data/reviews.json"
 
active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
 
  # Sampling settings
  max_instances_to_reorder: 2000  # Reorder top N by uncertainty
  random_sample_percent: 0.1  # 10% random to maintain diversity
 
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "Classify the sentiment"
    labels:
      - name: Positive
        key_value: "1"
      - name: Negative
        key_value: "2"
      - name: Neutral
        key_value: "3"
    required: true
 
annotation_guidelines:
  text: |
    ## Sentiment Classification
 
    Items are prioritized by model uncertainty.
    You may see more difficult or ambiguous cases.
 
    Focus on accuracy over speed.

Monitoring Progress

Track annotation progress through Potato's built-in logging. The system logs which instances were selected and their uncertainty scores, allowing you to monitor the active learning process.

Best Practices

Cold Start

With only a few labels, the classifier's confidence estimates are unreliable, so start with diverse random sampling by setting a higher random_sample_percent:

yaml

active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
  random_sample_percent: 0.2  # 20% random for initial diversity

Controlling Reordering Scope

Use max_instances_to_reorder to control how many instances are ranked by uncertainty. A larger value lets the classifier consider more candidates, which improves selection but takes more computation:

yaml

active_learning:
  max_instances_to_reorder: 5000  # Rank top 5000 by uncertainty
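
To see why the scope matters, here is a minimal sketch under one assumption: only the first N instances in the queue are scored and reordered, and the rest keep their original order. Potato's exact behavior may differ; `reorder_prefix` and the uncertainty values are hypothetical.

```python
def reorder_prefix(queue, uncertainty, max_instances_to_reorder):
    """Sort only the first N queued instances by uncertainty, highest first."""
    head = queue[:max_instances_to_reorder]
    tail = queue[max_instances_to_reorder:]      # left in original order
    head = sorted(head, key=lambda i: uncertainty[i], reverse=True)
    return head + tail

queue = ["a", "b", "c", "d"]
uncertainty = {"a": 0.1, "b": 0.6, "c": 0.4, "d": 0.9}   # made-up scores

small_scope = reorder_prefix(queue, uncertainty, 3)
full_scope = reorder_prefix(queue, uncertainty, 4)
```

With a scope of 3, instance "d" is never scored and stays last even though it is the most uncertain; widening the scope to 4 moves it to the front at the cost of scoring one more instance.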

Maintaining Diversity

The random_sample_percent parameter ensures some randomly sampled instances are included, preventing the model from only seeing uncertain edge cases:

yaml

active_learning:
  random_sample_percent: 0.1  # 10% random sampling
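
One way to picture the mix is a queue where each slot is filled by a random remaining instance with probability random_sample_percent, and by the most uncertain remaining instance otherwise. This is an illustrative sketch of that idea, not a description of how Potato schedules items; `build_queue` and its `seed` parameter are hypothetical.

```python
import random

def build_queue(ranked, random_sample_percent, seed=0):
    """Mix uncertainty-ranked instances with occasional random picks.

    ranked is a list of instance ids, most uncertain first.
    """
    rng = random.Random(seed)        # seeded so the order is reproducible
    remaining = list(ranked)
    queue = []
    while remaining:
        if rng.random() < random_sample_percent:
            # Random slot: any remaining instance may be chosen.
            pick = remaining.pop(rng.randrange(len(remaining)))
        else:
            # Uncertainty slot: take the most uncertain remaining instance.
            pick = remaining.pop(0)
        queue.append(pick)
    return queue

ranked = [f"item_{n}" for n in range(20)]
mixed = build_queue(ranked, random_sample_percent=0.1)
pure_uncertainty = build_queue(ranked, random_sample_percent=0.0)
```

Setting the value to 0.0 reproduces the pure uncertainty ranking, while higher values pull in more instances from elsewhere in the pool.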

Tips for Success

Start diverse: A random initial sample gives the model broad coverage of the data before uncertainty sampling takes over
Monitor accuracy: Track model performance over time
Don't over-optimize: Some random sampling maintains coverage
Handle annotator fatigue: Difficult items are tiring
Save model checkpoints: Enable rollback if needed

Next Steps

Add AI suggestions to speed up uncertain items
Set up quality control for difficult cases
Learn about crowdsourcing with active learning

Full active learning documentation is available at /docs/features/active-learning.