Active Learning: Annotate Smarter, Not Harder
How to use uncertainty sampling to prioritize annotations and reduce the total labeling effort by up to 50%.
Active learning intelligently selects which items to annotate next, focusing human effort where it matters most. This guide shows how to reduce annotation effort by up to 50% while maintaining model quality.
What is Active Learning?
Instead of randomly sampling data to annotate, active learning:
- Trains a model on current annotations
- Identifies items where the model is uncertain
- Prioritizes those items for human annotation
- Repeats, continuously improving efficiency
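The loop above can be sketched in a few lines of Python. This is a toy illustration, not Potato's implementation: the one-dimensional "model" (a mean feature value per class), the `annotate` callback standing in for a human annotator, and all function names are assumptions made for the example.

```python
import math

def train(labeled):
    """Fit a toy 1-D model: the mean feature value of each class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def confidence(model, x):
    """Share of the closest class in a softmax over negative distances."""
    weights = [math.exp(-abs(x - centroid)) for centroid in model.values()]
    return max(weights) / sum(weights)

def active_learning_loop(pool, annotate, seed_items, rounds=5, batch_size=2):
    """Hypothetical loop: train, rank by confidence, annotate, repeat."""
    labeled = [(x, annotate(x)) for x in seed_items]
    unlabeled = [x for x in pool if x not in seed_items]
    for _ in range(rounds):
        if not unlabeled:
            break
        model = train(labeled)                              # train on current annotations
        unlabeled.sort(key=lambda x: confidence(model, x))  # least confident first
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        labeled.extend((x, annotate(x)) for x in batch)     # simulated human annotation
    return labeled

# Toy pool: values 0.0 to 2.0, with the true class boundary at 1.0.
pool = [i / 10 for i in range(21)]
annotate = lambda x: "Negative" if x < 1.0 else "Positive"
labeled = active_learning_loop(pool, annotate, seed_items=[0.0, 2.0])
print(labeled[2])  # (1.0, 'Positive'): the first item requested sits on the boundary
```

In this toy run the loop spends its annotation budget near the class boundary, which is where a randomly drawn sample would only land by chance.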
Why Use Active Learning?
- Reduce annotation cost: Label fewer items for the same model quality
- Faster iteration: Get usable models sooner
- Focus expertise: Human attention on difficult cases
- Better coverage: Ensure edge cases are represented
Basic Active Learning Setup
annotation_task_name: "Active Learning Classification"
data_files:
  - "data/unlabeled_pool.json"
# Active learning configuration
active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
  # Sampling settings
  max_instances_to_reorder: 1000 # Number of instances to reorder by uncertainty
  random_sample_percent: 0.1 # 10% random sampling to maintain diversity
annotation_schemes:
  - annotation_type: radio
    name: category
    labels: [Positive, Negative, Neutral]
How Uncertainty Sampling Works
Potato's active learning uses uncertainty sampling to prioritize items where the classifier is least confident. The classifier predicts labels for unlabeled instances, and those with the lowest confidence scores are presented first for annotation.
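As a rough illustration of "lowest confidence first", the sketch below ranks instances by least confidence, that is, one minus the probability of the most likely label. The source does not state the exact scoring formula Potato uses, so this definition, the function names, and the example probabilities are all assumptions.

```python
def least_confidence(probabilities):
    """Uncertainty as 1 minus the probability of the most likely label."""
    return 1.0 - max(probabilities)

def rank_by_uncertainty(predictions):
    """Return instance ids, most uncertain first.

    predictions maps an instance id to its list of class probabilities.
    """
    return sorted(
        predictions,
        key=lambda instance_id: least_confidence(predictions[instance_id]),
        reverse=True,
    )

# Made-up probabilities for the labels [Positive, Negative, Neutral].
predictions = {
    "review_1": [0.90, 0.05, 0.05],  # confident
    "review_2": [0.40, 0.35, 0.25],  # nearly undecided
    "review_3": [0.60, 0.30, 0.10],
}
print(rank_by_uncertainty(predictions))  # ['review_2', 'review_3', 'review_1']
```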
The classifier_name field specifies any scikit-learn compatible classifier using its full module path:
active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
Other classifier options include:
- sklearn.ensemble.RandomForestClassifier
- sklearn.svm.SVC (with probability=True)
- sklearn.naive_bayes.MultinomialNB
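A full module path like these can be turned into a class with Python's standard `importlib`. The sketch below shows one way such a string might be resolved. It is not taken from Potato's source, and `resolve_class` is a hypothetical helper name.

```python
import importlib

def resolve_class(dotted_path):
    """Import the module part of a dotted path and return the named attribute."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# Demonstrated with a standard-library class so the example runs anywhere.
print(resolve_class("collections.OrderedDict").__name__)  # OrderedDict

# With scikit-learn installed, the same helper would apply to the paths above:
# classifier = resolve_class("sklearn.linear_model.LogisticRegression")()
# svc = resolve_class("sklearn.svm.SVC")(probability=True)
```

The `probability=True` argument matters for `SVC` because uncertainty sampling needs class probabilities, and scikit-learn's `SVC` only provides `predict_proba` when it is constructed with that option.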
Complete Configuration
annotation_task_name: "Active Learning for Sentiment"
data_files:
  - "data/reviews.json"
active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
  # Sampling settings
  max_instances_to_reorder: 2000 # Reorder top N by uncertainty
  random_sample_percent: 0.1 # 10% random to maintain diversity
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "Classify the sentiment"
    labels:
      - name: Positive
        key_value: "1"
      - name: Negative
        key_value: "2"
      - name: Neutral
        key_value: "3"
    required: true
annotation_guidelines:
  text: |
    ## Sentiment Classification
    Items are prioritized by model uncertainty.
    You may see more difficult or ambiguous cases.
    Focus on accuracy over speed.
Monitoring Progress
Track annotation progress through Potato's built-in logging. The system logs which instances were selected and their uncertainty scores, allowing you to monitor the active learning process.
Best Practices
Cold Start
With only a few labels, the classifier's uncertainty estimates are unreliable, so start with more diverse random sampling by setting a higher random_sample_percent:
active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
  random_sample_percent: 0.2 # 20% random for initial diversity
Controlling Reordering Scope
Use max_instances_to_reorder to control how many instances are ranked by uncertainty. A larger value provides better selection but requires more computation:
active_learning:
  max_instances_to_reorder: 5000 # Rank top 5000 by uncertainty
Maintaining Diversity
The random_sample_percent parameter ensures some randomly sampled instances are included, preventing the model from only seeing uncertain edge cases:
active_learning:
  random_sample_percent: 0.1 # 10% random sampling
Tips for Success
- Start diverse: Random initial sample covers edge cases
- Monitor accuracy: Track model performance over time
- Don't over-optimize: Some random sampling maintains coverage
- Handle annotator fatigue: Difficult items are tiring
- Save model checkpoints: Enable rollback if needed
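To picture how max_instances_to_reorder and random_sample_percent might interact, here is a small sketch. How Potato actually combines the two settings is not described in this guide, so the logic below is an assumption: only the first N instances are ranked by uncertainty, each queue slot has a random_sample_percent chance of being filled by a random pick instead of the most uncertain remaining item, and instances beyond N keep their original order. `build_queue` is a hypothetical name.

```python
import random

def build_queue(scored, max_instances_to_reorder, random_sample_percent, seed=0):
    """Order instance ids for annotation.

    scored is a list of (instance_id, uncertainty) pairs in their original order.
    """
    rng = random.Random(seed)
    window = scored[:max_instances_to_reorder]
    rest = [instance_id for instance_id, _ in scored[max_instances_to_reorder:]]
    # Most uncertain first within the reordering window.
    remaining = [
        instance_id
        for instance_id, _ in sorted(window, key=lambda pair: pair[1], reverse=True)
    ]
    queue = []
    while remaining:
        if rng.random() < random_sample_percent:
            # Random pick: keeps some coverage of items the model finds easy.
            pick = remaining.pop(rng.randrange(len(remaining)))
        else:
            # Otherwise take the most uncertain remaining item.
            pick = remaining.pop(0)
        queue.append(pick)
    return queue + rest

scored = [("a", 0.1), ("b", 0.9), ("c", 0.5), ("d", 0.7), ("e", 0.3)]
# No random sampling: the first three are ranked by uncertainty, the rest untouched.
print(build_queue(scored, max_instances_to_reorder=3, random_sample_percent=0.0))
# ['b', 'c', 'a', 'd', 'e']
```

Note that "d" has the second-highest uncertainty but stays fourth, because it falls outside the reordering window. That is the trade-off a small max_instances_to_reorder makes in exchange for less computation.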
Next Steps
- Add AI suggestions to speed up uncertain items
- Set up quality control for difficult cases
- Learn about crowdsourcing with active learning
Full active learning documentation is available at /docs/features/active-learning.