Active Learning: Annotate Smarter, Not Harder

Cut annotation effort by up to 50% with Potato's active learning, which uses strategies such as uncertainty sampling, BADGE, and BALD to prioritize the most informative unlabeled examples.

Potato Team

Active learning decides which items to annotate next instead of leaving that to chance. Done well, it can cut the labeling you need by roughly half without hurting model quality. Here is how to set it up in Potato.

What is Active Learning?

Instead of randomly sampling data to annotate, active learning:

  1. Trains a model on current annotations
  2. Identifies items where the model is uncertain
  3. Prioritizes those items for human annotation
  4. Repeats, continuously improving efficiency
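
The loop above can be sketched in a few lines of plain Python. This is a toy illustration, not Potato's implementation: the one-dimensional threshold "model", the `oracle` callback standing in for a human annotator, and every function name here are assumptions made for the example.

```python
def train(labeled):
    """Toy 'model' for 1-D inputs: the midpoint between the two class means.

    Assumes `labeled` already contains at least one example of each class.
    """
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def active_learning_loop(pool, oracle, seed_items, rounds, batch_size):
    """Run a few train / rank / annotate rounds over `pool`."""
    labeled = [(x, oracle(x)) for x in seed_items]
    unlabeled = [x for x in pool if x not in seed_items]
    for _ in range(rounds):
        if not unlabeled:
            break
        # Step 1: train on the current annotations.
        threshold = train(labeled)
        # Step 2: items closest to the decision threshold are the most uncertain.
        unlabeled.sort(key=lambda x: abs(x - threshold))
        # Step 3: send the most uncertain items to the annotator (the oracle).
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        labeled.extend((x, oracle(x)) for x in batch)
        # Step 4: the next iteration retrains on the enlarged labeled set.
    return train(labeled), labeled
```

With `pool = list(range(101))`, seeds `[0, 100]`, and an oracle that labels `x >= 63` as positive, the first item queried is `50`, the point the initial model is least sure about.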

Why Use Active Learning?

The payoff is mostly about where your annotators spend their time. You label fewer items to reach the same model quality, so you get a usable model sooner. The hard, ambiguous cases get human attention, and edge cases that random sampling would skip over tend to surface earlier.

Basic Active Learning Setup

yaml
annotation_task_name: "Active Learning Classification"
 
data_files:
  - "data/unlabeled_pool.json"
 
# Active learning configuration
active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
 
  # Sampling settings
  max_instances_to_reorder: 1000  # Number of instances to reorder by uncertainty
  random_sample_percent: 0.1  # 10% random sampling to maintain diversity
 
annotation_schemes:
  - annotation_type: radio
    name: category
    labels: [Positive, Negative, Neutral]

How Uncertainty Sampling Works

Potato's active learning uses uncertainty sampling to prioritize items where the classifier is least confident. The classifier predicts labels for unlabeled instances, and those with the lowest confidence scores are presented first for annotation. For the available strategies and implementation details, see the active learning guide and the strategies reference.
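
One common way to score "least confident" is to take one minus the probability of the most likely label. The sketch below assumes that definition and a hypothetical `predictions` mapping from instance id to class probabilities; the exact score Potato computes is not specified here.

```python
def least_confidence(probabilities):
    """Uncertainty score: 1 minus the probability of the top predicted label."""
    return 1.0 - max(probabilities)


def rank_by_uncertainty(predictions):
    """Return instance ids ordered from most to least uncertain.

    `predictions` maps an instance id to its list of class probabilities,
    for example one row of a scikit-learn `predict_proba` result.
    """
    return sorted(
        predictions,
        key=lambda instance_id: least_confidence(predictions[instance_id]),
        reverse=True,
    )


predictions = {
    "review_1": [0.90, 0.05, 0.05],  # confident
    "review_2": [0.40, 0.35, 0.25],  # close call, annotate first
    "review_3": [0.60, 0.30, 0.10],
}
queue = rank_by_uncertainty(predictions)
```

Here `queue` comes out as `review_2`, `review_3`, `review_1`, so the close call reaches an annotator before the confident prediction.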

The classifier_name field accepts any scikit-learn-compatible classifier, given as its full dotted import path:

yaml
active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"

Other classifier options include:

  • sklearn.ensemble.RandomForestClassifier
  • sklearn.svm.SVC (with probability=True, so that predict_proba is available)
  • sklearn.naive_bayes.MultinomialNB
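
A dotted path like these can be turned into a class with Python's standard `importlib`. The helper below is a minimal sketch of that idea; `load_class` is a hypothetical name, and this is not necessarily how Potato resolves `classifier_name` internally.

```python
import importlib


def load_class(dotted_path):
    """Resolve 'package.module.ClassName' to the class object it names."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected a full module path, got {dotted_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# With scikit-learn installed, the same call would work for a classifier:
#   clf = load_class("sklearn.svm.SVC")(probability=True)
ordered_dict_cls = load_class("collections.OrderedDict")
```

Splitting on the last dot keeps the helper independent of any particular library, which is why a standard-library class works as a stand-in here.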

Complete Configuration

yaml
annotation_task_name: "Active Learning for Sentiment"
 
data_files:
  - "data/reviews.json"
 
active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
 
  # Sampling settings
  max_instances_to_reorder: 2000  # Reorder top N by uncertainty
  random_sample_percent: 0.1  # 10% random to maintain diversity
 
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "Classify the sentiment"
    labels:
      - name: Positive
        key_value: "1"
      - name: Negative
        key_value: "2"
      - name: Neutral
        key_value: "3"
    required: true
 
annotation_guidelines:
  text: |
    ## Sentiment Classification
 
    Items are prioritized by model uncertainty.
    You may see more difficult or ambiguous cases.
 
    Focus on accuracy over speed.

Monitoring Progress

Potato's built-in logging records which instances were selected and their uncertainty scores, so you can watch how the sampling behaves as labels accumulate.

Best Practices

Cold Start

With only a handful of labels, the first model's uncertainty estimates are unreliable, so start with more diverse random sampling by setting a higher random_sample_percent:

yaml
active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
  random_sample_percent: 0.2  # 20% random for initial diversity

Controlling Reordering Scope

Use max_instances_to_reorder to control how many instances are ranked by uncertainty. A larger value gives the sampler more to choose from but costs more compute on each pass:

yaml
active_learning:
  max_instances_to_reorder: 5000  # Rank top 5000 by uncertainty
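
To make the trade-off concrete, here is a sketch of one plausible reading of this setting: only the first N items of the queue are scored and sorted, and the rest keep their original order. The function name and the "first N of the queue" interpretation are assumptions, not confirmed Potato behavior.

```python
def reorder_head(queue, score, max_instances_to_reorder):
    """Sort the first N queue items by descending uncertainty; leave the tail alone.

    `score` is any callable that maps an item to its uncertainty.
    Only N items are scored, which is what bounds the compute per pass.
    """
    head = queue[:max_instances_to_reorder]
    tail = queue[max_instances_to_reorder:]
    return sorted(head, key=score, reverse=True) + tail


uncertainty = {"a": 0.10, "b": 0.90, "c": 0.50, "d": 0.99, "e": 0.70}
reordered = reorder_head(["a", "b", "c", "d", "e"], uncertainty.get, 3)
```

With N set to 3, item `d` stays in fourth place despite having the highest score, because it was never ranked. That is the cost of a small value.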

Maintaining Diversity

The random_sample_percent parameter ensures some randomly sampled instances are included, preventing the model from only seeing uncertain edge cases:

yaml
active_learning:
  random_sample_percent: 0.1  # 10% random sampling
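
One way to picture the mixing is as a coin flip per queue slot: with probability `random_sample_percent` the slot is filled by a random remaining item, otherwise by the next most uncertain one. The sketch below assumes that scheme; `interleave_random` and its `seed` parameter are hypothetical, and Potato may combine the two sources differently.

```python
import random


def interleave_random(ranked, random_sample_percent, seed=0):
    """Build an annotation order from an uncertainty-ranked list.

    Each slot takes a uniformly random remaining item with probability
    `random_sample_percent`, and the most uncertain remaining item otherwise.
    """
    rng = random.Random(seed)  # seeded so the order is reproducible
    remaining = list(ranked)
    order = []
    while remaining:
        if rng.random() < random_sample_percent:
            index = rng.randrange(len(remaining))
        else:
            index = 0  # head of the list is the most uncertain item left
        order.append(remaining.pop(index))
    return order
```

Setting the fraction to `0.0` returns the pure uncertainty ranking, and `1.0` degenerates to a random shuffle, so values in between trade targeting for coverage.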

Tips for Success

A few things help in practice. Begin with a random sample so the first model sees a spread of examples, then keep a slice of random sampling in the mix so the model does not overfit to uncertain edge cases. Track model accuracy as labels accumulate. A steady diet of the hardest items also wears annotators down, so watch for fatigue, and save model checkpoints in case you need to roll back.

Next Steps


See the full active learning documentation at /docs/features/active-learning.