
Active Learning: Annotate Smarter, Not Harder

How to use uncertainty sampling to prioritize annotations and reduce the total labeling effort by up to 50%.

By Potato Team

Active learning intelligently selects which items to annotate next, focusing human effort where it matters most. This guide shows how to reduce annotation effort by up to 50% while maintaining model quality.

What is Active Learning?

Instead of randomly sampling data to annotate, active learning:

  1. Trains a model on current annotations
  2. Identifies items where the model is uncertain
  3. Prioritizes those items for human annotation
  4. Repeats, continuously improving efficiency
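The loop above can be sketched with scikit-learn. This is a minimal illustration under our own naming (the function, feature pipeline, and batch size are assumptions, not Potato's API):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def next_batch(labeled_texts, labels, unlabeled_texts, batch_size=5):
    """Return indices of the unlabeled items the model is least confident about."""
    vec = TfidfVectorizer()
    X_labeled = vec.fit_transform(labeled_texts)
    X_unlabeled = vec.transform(unlabeled_texts)

    # Step 1: train a model on the current annotations
    clf = LogisticRegression(max_iter=1000).fit(X_labeled, labels)

    # Step 2: score uncertainty as 1 - max predicted probability
    proba = clf.predict_proba(X_unlabeled)
    uncertainty = 1.0 - proba.max(axis=1)

    # Step 3: prioritize the most uncertain items for annotation
    return np.argsort(-uncertainty)[:batch_size]
```

Step 4 is the outer loop: annotate the returned batch, move those items into the labeled pool, and call the function again.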

Why Use Active Learning?

  • Reduce annotation cost: Label fewer items for the same model quality
  • Faster iteration: Get usable models sooner
  • Focus expertise: Human attention on difficult cases
  • Better coverage: Ensure edge cases are represented

Basic Active Learning Setup

annotation_task_name: "Active Learning Classification"
 
data_files:
  - "data/unlabeled_pool.json"
 
# Active learning configuration
active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
 
  # Sampling settings
  max_instances_to_reorder: 1000  # Number of instances to reorder by uncertainty
  random_sample_percent: 0.1  # 10% random sampling to maintain diversity
 
annotation_schemes:
  - annotation_type: radio
    name: category
    labels: [Positive, Negative, Neutral]

How Uncertainty Sampling Works

Potato's active learning uses uncertainty sampling to prioritize items where the classifier is least confident. The classifier predicts labels for unlabeled instances, and those with the lowest confidence scores are presented first for annotation.
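Concretely, least-confidence ranking sorts items by their top predicted class probability. A small NumPy sketch, independent of Potato's internals:

```python
import numpy as np

# Predicted class probabilities for three unlabeled items (illustrative values)
proba = np.array([
    [0.90, 0.10],  # confident
    [0.55, 0.45],  # nearly a coin flip -> annotate first
    [0.70, 0.30],
])

confidence = proba.max(axis=1)  # top class probability per item
order = np.argsort(confidence)  # least confident first
# order is [1, 2, 0]: the 0.55/0.45 item is presented first
```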

The classifier_name field specifies any scikit-learn compatible classifier using its full module path:

active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"

Other classifier options include:

  • sklearn.ensemble.RandomForestClassifier
  • sklearn.svm.SVC (with probability=True)
  • sklearn.naive_bayes.MultinomialNB
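A dotted path like these can be resolved dynamically with importlib. The loader below is a hypothetical sketch of that pattern (`load_classifier` is our name, not Potato's actual code):

```python
import importlib

def load_classifier(dotted_path, **kwargs):
    """Instantiate a class from its full module path, e.g. 'sklearn.svm.SVC'."""
    module_path, class_name = dotted_path.rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**kwargs)

# SVC needs probability=True so predict_proba is available for uncertainty scores
clf = load_classifier("sklearn.svm.SVC", probability=True)
```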

Complete Configuration

annotation_task_name: "Active Learning for Sentiment"
 
data_files:
  - "data/reviews.json"
 
active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
 
  # Sampling settings
  max_instances_to_reorder: 2000  # Reorder top N by uncertainty
  random_sample_percent: 0.1  # 10% random to maintain diversity
 
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "Classify the sentiment"
    labels:
      - name: Positive
        key_value: "1"
      - name: Negative
        key_value: "2"
      - name: Neutral
        key_value: "3"
    required: true
 
annotation_guidelines:
  text: |
    ## Sentiment Classification
 
    Items are prioritized by model uncertainty.
    You may see more difficult or ambiguous cases.
 
    Focus on accuracy over speed.

Monitoring Progress

Track annotation progress through Potato's built-in logging. The system logs which instances were selected and their uncertainty scores, allowing you to monitor the active learning process.

Best Practices

Cold Start

Start with diverse random sampling by setting a higher random_sample_percent:

active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
  random_sample_percent: 0.2  # 20% random for initial diversity

Controlling Reordering Scope

Use max_instances_to_reorder to control how many instances are ranked by uncertainty. A larger value provides better selection but requires more computation:

active_learning:
  max_instances_to_reorder: 5000  # Rank top 5000 by uncertainty
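One way to see the computation trade-off: a partial selection (NumPy's argpartition) finds the top N uncertain items without fully sorting the pool. A sketch under our own naming, not Potato's implementation:

```python
import numpy as np

def top_uncertain(uncertainty, max_instances_to_reorder):
    """Indices of the N most uncertain items, most uncertain first."""
    n = min(max_instances_to_reorder, len(uncertainty))
    # argpartition isolates the top n in linear time; only those n get sorted
    top = np.argpartition(-uncertainty, n - 1)[:n]
    return top[np.argsort(-uncertainty[top])]

# e.g. top_uncertain(np.array([0.1, 0.9, 0.5, 0.7]), 2) -> [1, 3]
```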

Maintaining Diversity

The random_sample_percent parameter ensures some randomly sampled instances are included, preventing the model from only seeing uncertain edge cases:

active_learning:
  random_sample_percent: 0.1  # 10% random sampling
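The mixing itself can be sketched as: fill most of a batch from the top of the uncertainty ranking, then draw the rest at random from the remainder (illustrative code; `build_batch` is our name, not Potato's):

```python
import numpy as np

def build_batch(ranked, batch_size, random_sample_percent, rng):
    """ranked: instance indices ordered most-uncertain first."""
    n_random = int(round(batch_size * random_sample_percent))
    n_uncertain = batch_size - n_random
    batch = list(ranked[:n_uncertain])            # top uncertain items
    remainder = np.asarray(ranked[n_uncertain:])  # everything else
    batch += list(rng.choice(remainder, size=n_random, replace=False))
    return batch
```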

Tips for Success

  1. Start diverse: Random initial sample covers edge cases
  2. Monitor accuracy: Track model performance over time
  3. Don't over-optimize: Some random sampling maintains coverage
  4. Handle annotator fatigue: Difficult items are tiring
  5. Save model checkpoints: Enable rollback if needed

Next Steps

See the full active learning documentation at /docs/features/active-learning.