# Active Learning

Use uncertainty sampling to prioritize annotation effort on the most valuable examples.

Active learning helps you annotate smarter by prioritizing the most informative examples. Instead of annotating instances in random order, you focus on the ones where the model is most uncertain.
## How It Works

Potato's active learning automatically reorders annotation instances based on machine learning predictions:

1. **Initial Collection** - Gather a minimum number of annotations
2. **Train** - Train a classifier on the existing annotations
3. **Predict** - Compute uncertainty scores for unannotated instances
4. **Reorder** - Prioritize the instances with the highest uncertainty
5. **Annotate** - Annotators label the prioritized instances
6. **Retrain** - Periodically update the model with new annotations
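The Predict and Reorder steps above can be sketched in a few lines. This is an illustrative example of least-confidence uncertainty sampling, not Potato's internal code; the function names are hypothetical.

```python
# Illustrative sketch of the Predict + Reorder steps: rank unannotated
# instances by least-confidence uncertainty (1 - highest predicted
# class probability). Names here are hypothetical, not Potato's API.

def least_confidence(probs):
    """Higher score = the model is less sure about this instance."""
    return 1.0 - max(probs)

def reorder_by_uncertainty(instance_ids, predicted_probs):
    """Return instance ids sorted most-uncertain first."""
    scored = sorted(
        zip(instance_ids, predicted_probs),
        key=lambda pair: least_confidence(pair[1]),
        reverse=True,
    )
    return [instance_id for instance_id, _ in scored]

queue = reorder_by_uncertainty(
    ["doc-1", "doc-2", "doc-3"],
    [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]],
)
print(queue)  # "doc-2" comes first: a 0.5/0.5 split is maximally uncertain
```

Annotators then receive `queue` in order, so the 0.5/0.5 instance is labeled before the confident 0.9/0.1 one.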
## Configuration

### Basic Setup

```yaml
active_learning:
  enabled: true
  schema_names:
    - sentiment                     # Which annotation schemes to use
  min_annotations_per_instance: 1
  min_instances_for_training: 20
  update_frequency: 50              # Retrain after every 50 annotations
  max_instances_to_reorder: 1000
```

### Full Configuration
```yaml
active_learning:
  enabled: true

  # Which schemas to use for training
  schema_names:
    - sentiment

  # Minimum requirements
  min_annotations_per_instance: 1
  min_instances_for_training: 20

  # Retraining frequency
  update_frequency: 50

  # How many instances to reorder
  max_instances_to_reorder: 1000

  # Classifier configuration
  classifier:
    type: LogisticRegression
    params:
      C: 1.0
      max_iter: 1000

  # Feature extraction
  vectorizer:
    type: TfidfVectorizer
    params:
      max_features: 5000
      ngram_range: [1, 2]

  # Model persistence
  model_persistence:
    enabled: true
    save_dir: "models/"
    max_saved_models: 5
```

## Supported Classifiers
| Classifier | Best For | Speed |
|---|---|---|
| `LogisticRegression` | Binary/multiclass classification | Fast |
| `RandomForestClassifier` | Complex patterns | Medium |
| `SVC` | Small datasets | Slow |
| `MultinomialNB` | Text classification | Very fast |
### Classifier Examples

```yaml
# Logistic Regression (recommended starting point)
classifier:
  type: LogisticRegression
  params:
    C: 1.0
    max_iter: 1000
```

```yaml
# Random Forest
classifier:
  type: RandomForestClassifier
  params:
    n_estimators: 100
    max_depth: 10
```

```yaml
# Support Vector Classifier
classifier:
  type: SVC
  params:
    kernel: rbf
    probability: true
```

```yaml
# Naive Bayes
classifier:
  type: MultinomialNB
  params:
    alpha: 1.0
```

## Vectorizers
| Vectorizer | Description |
|---|---|
| `TfidfVectorizer` | TF-IDF weighted features (recommended) |
| `CountVectorizer` | Simple word counts |
| `HashingVectorizer` | Memory-efficient for large vocabularies |
```yaml
# TF-IDF (recommended)
vectorizer:
  type: TfidfVectorizer
  params:
    max_features: 5000
    ngram_range: [1, 2]
    stop_words: english
```

```yaml
# Count Vectorizer
vectorizer:
  type: CountVectorizer
  params:
    max_features: 3000
    ngram_range: [1, 1]
```

```yaml
# Hashing Vectorizer (for large datasets)
vectorizer:
  type: HashingVectorizer
  params:
    n_features: 10000
```

## LLM Integration
Active learning can optionally use LLMs for enhanced instance selection:

```yaml
active_learning:
  enabled: true
  schema_names:
    - sentiment

  # LLM-based selection
  llm_integration:
    enabled: true
    endpoint_type: vllm
    base_url: http://localhost:8000/v1
    model: meta-llama/Llama-2-7b-chat-hf

    # Mock mode for testing
    mock_mode: false
```

## Multi-Schema Support
Active learning can cycle through multiple annotation schemas:

```yaml
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    labels: [Positive, Negative, Neutral]
  - annotation_type: radio
    name: topic
    labels: [Politics, Sports, Tech, Entertainment]

active_learning:
  enabled: true
  schema_names:
    - sentiment
    - topic

  # Schema-specific settings
  schema_config:
    sentiment:
      min_instances_for_training: 30
      update_frequency: 50
    topic:
      min_instances_for_training: 50
      update_frequency: 100
```

## Model Persistence
Save and reload trained models across server restarts:

```yaml
active_learning:
  enabled: true
  schema_names:
    - sentiment

  model_persistence:
    enabled: true
    save_dir: "models/"
    max_saved_models: 5   # Keep the last 5 models

    # Save to the database instead of files
    use_database: false
```

## Monitoring Progress
The admin dashboard tracks active learning metrics:
- Current model accuracy
- Training cycle count
- Uncertainty distribution
- Instances remaining
- Retraining history
Access the dashboard via `/admin` with your admin API key.
## Best Practices

### 1. Start with Random Sampling

Gather a baseline of initial annotations before the first model is trained:

```yaml
active_learning:
  enabled: true
  min_instances_for_training: 50   # Wait for 50 annotations
```

### 2. Choose Appropriate Classifiers
- `LogisticRegression`: fast, a good default for most tasks
- `RandomForestClassifier`: better for complex patterns, but slower
- `MultinomialNB`: very fast, good for simple text classification
### 3. Monitor Class Distribution
Active learning can create class imbalance, since uncertain instances tend to cluster near decision boundaries. Monitor the distribution in the admin dashboard and consider stratified sampling.
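A quick way to spot skew is to summarize the labels annotated so far. This is a generic sketch using the standard library; the helper name is hypothetical, not part of Potato.

```python
# Hypothetical helper: summarize label balance in collected annotations
# so skew introduced by active learning is easy to spot.
from collections import Counter

def label_distribution(labels):
    """Map each label to its fraction of all annotations."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

dist = label_distribution(["Positive", "Positive", "Negative", "Neutral"])
print(dist)  # {'Positive': 0.5, 'Negative': 0.25, 'Neutral': 0.25}
```

If one class drifts toward dominating the queue, mix some randomly sampled instances back in.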
### 4. Set a Reasonable Retrain Frequency

Retraining too often wastes compute without improving the queue much:

```yaml
update_frequency: 100   # Retrain every 100 annotations
```

### 5. Enable Model Persistence

Save models to avoid retraining from scratch after a restart:

```yaml
model_persistence:
  enabled: true
  save_dir: "models/"
```

## Example: Complete Configuration
```yaml
task_name: "Sentiment Analysis with Active Learning"
task_dir: "."
port: 8000

data_files:
  - "data/reviews.json"

item_properties:
  id_key: id
  text_key: text

annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "What is the sentiment?"
    labels:
      - Positive
      - Negative
      - Neutral

active_learning:
  enabled: true
  schema_names:
    - sentiment
  min_annotations_per_instance: 1
  min_instances_for_training: 30
  update_frequency: 50
  max_instances_to_reorder: 500
  classifier:
    type: LogisticRegression
    params:
      C: 1.0
      max_iter: 1000
  vectorizer:
    type: TfidfVectorizer
    params:
      max_features: 3000
      ngram_range: [1, 2]
  model_persistence:
    enabled: true
    save_dir: "models/"
    max_saved_models: 3

output_annotation_dir: "output/"
output_annotation_format: "json"
allow_all_users: true
```

## Combining with AI Support
Use both active learning and LLM assistance:

```yaml
active_learning:
  enabled: true
  schema_names:
    - sentiment
  min_instances_for_training: 30

ai_support:
  enabled: true
  endpoint_type: openai
  ai_config:
    model: gpt-4
    api_key: ${OPENAI_API_KEY}
  features:
    label_suggestions:
      enabled: true
```

This combination prioritizes uncertain instances while providing AI hints to help annotators.
## Troubleshooting

### Training Failures

- Ensure there are enough annotations (`min_instances_for_training`)
- Check the class distribution: training needs examples of every class
- Verify the data format matches the schema

### Slow Performance

- Reduce `max_instances_to_reorder`
- Increase `update_frequency`
- Use `HashingVectorizer` for large vocabularies

### Model Not Updating

- Check the `update_frequency` setting
- Verify that annotations are being saved
- Review the admin dashboard for errors