# Active Learning

Use uncertainty sampling to prioritize annotation effort on the most valuable examples.

Active learning helps you annotate smarter by prioritizing the most informative examples. Instead of annotating instances in random order, you focus on the ones where the model is most uncertain.
## How It Works

Potato's active learning automatically reorders annotation instances based on machine learning predictions:

1. **Initial Collection** - Gather a minimum number of annotations
2. **Train** - Train a classifier on the existing annotations
3. **Predict** - Compute uncertainty scores for unannotated instances
4. **Reorder** - Prioritize the instances with the highest uncertainty
5. **Annotate** - Annotators label the prioritized instances
6. **Retrain** - Periodically update the model with new annotations
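The Predict and Reorder steps above can be sketched in a few lines. This is an illustrative example of least-confidence uncertainty sampling, not Potato's internal code; the function names are hypothetical.

```python
# Illustrative sketch of the Predict + Reorder steps: rank unannotated
# instances by least-confidence uncertainty (1 - highest predicted
# class probability). Names here are hypothetical, not Potato's API.

def least_confidence(probs):
    """Higher score = the model is less sure about this instance."""
    return 1.0 - max(probs)

def reorder_by_uncertainty(instance_ids, predicted_probs):
    """Return instance ids sorted most-uncertain first."""
    scored = sorted(
        zip(instance_ids, predicted_probs),
        key=lambda pair: least_confidence(pair[1]),
        reverse=True,
    )
    return [instance_id for instance_id, _ in scored]

queue = reorder_by_uncertainty(
    ["doc-1", "doc-2", "doc-3"],
    [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]],
)
print(queue)  # "doc-2" comes first: a 0.5/0.5 split is maximally uncertain
```

Annotators then receive `queue` in order, so the 0.5/0.5 instance is labeled before the confident 0.9/0.1 one.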
## Configuration

### Basic Setup

```yaml
active_learning:
  enabled: true
  schema_names:
    - sentiment                     # Which annotation schemes to use
  min_annotations_per_instance: 1
  min_instances_for_training: 20
  update_frequency: 50              # Retrain after every 50 annotations
  max_instances_to_reorder: 1000
```

### Full Configuration
```yaml
active_learning:
  enabled: true

  # Which schemas to use for training
  schema_names:
    - sentiment

  # Minimum requirements
  min_annotations_per_instance: 1
  min_instances_for_training: 20

  # Retraining frequency
  update_frequency: 50

  # How many instances to reorder
  max_instances_to_reorder: 1000

  # Classifier configuration
  classifier:
    type: LogisticRegression
    params:
      C: 1.0
      max_iter: 1000

  # Feature extraction
  vectorizer:
    type: TfidfVectorizer
    params:
      max_features: 5000
      ngram_range: [1, 2]

  # Model persistence
  model_persistence:
    enabled: true
    save_dir: "models/"
    max_saved_models: 5
```

## Supported Classifiers
| Classifier | Best For | Speed |
|---|---|---|
| `LogisticRegression` | Binary/multiclass classification | Fast |
| `RandomForestClassifier` | Complex patterns | Medium |
| `SVC` | Small datasets | Slow |
| `MultinomialNB` | Text classification | Very fast |
### Classifier Examples

```yaml
# Logistic Regression (recommended starting point)
classifier:
  type: LogisticRegression
  params:
    C: 1.0
    max_iter: 1000
```

```yaml
# Random Forest
classifier:
  type: RandomForestClassifier
  params:
    n_estimators: 100
    max_depth: 10
```

```yaml
# Support Vector Classifier
classifier:
  type: SVC
  params:
    kernel: rbf
    probability: true
```

```yaml
# Naive Bayes
classifier:
  type: MultinomialNB
  params:
    alpha: 1.0
```

## Vectorizers
| Vectorizer | Description |
|---|---|
| `TfidfVectorizer` | TF-IDF weighted features (recommended) |
| `CountVectorizer` | Simple word counts |
| `HashingVectorizer` | Memory-efficient for large vocabularies |
```yaml
# TF-IDF (recommended)
vectorizer:
  type: TfidfVectorizer
  params:
    max_features: 5000
    ngram_range: [1, 2]
    stop_words: english
```

```yaml
# Count Vectorizer
vectorizer:
  type: CountVectorizer
  params:
    max_features: 3000
    ngram_range: [1, 1]
```

```yaml
# Hashing Vectorizer (for large datasets)
vectorizer:
  type: HashingVectorizer
  params:
    n_features: 10000
```

## LLM Integration
Active learning can optionally use LLMs for enhanced instance selection:

```yaml
active_learning:
  enabled: true
  schema_names:
    - sentiment

  # LLM-based selection
  llm_integration:
    enabled: true
    endpoint_type: vllm
    base_url: http://localhost:8000/v1
    model: meta-llama/Llama-2-7b-chat-hf

    # Mock mode for testing
    mock_mode: false
```

## Multi-Schema Support
Active learning can cycle through multiple annotation schemas:

```yaml
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    labels: [Positive, Negative, Neutral]
  - annotation_type: radio
    name: topic
    labels: [Politics, Sports, Tech, Entertainment]

active_learning:
  enabled: true
  schema_names:
    - sentiment
    - topic

  # Schema-specific settings
  schema_config:
    sentiment:
      min_instances_for_training: 30
      update_frequency: 50
    topic:
      min_instances_for_training: 50
      update_frequency: 100
```

## Model Persistence
Save and reload trained models across server restarts:

```yaml
active_learning:
  enabled: true
  schema_names:
    - sentiment

  model_persistence:
    enabled: true
    save_dir: "models/"
    max_saved_models: 5   # Keep the last 5 models

    # Save to the database instead of files
    use_database: false
```

## Monitoring Progress
The admin dashboard tracks active learning metrics:
- Current model accuracy
- Training cycle count
- Uncertainty distribution
- Instances remaining
- Retraining history
Access the dashboard via `/admin` with your admin API key.
## Best Practices

### 1. Start with Random Sampling

Gather a baseline of initial annotations before the first model is trained:

```yaml
active_learning:
  enabled: true
  min_instances_for_training: 50   # Wait for 50 annotations
```

### 2. Choose Appropriate Classifiers
- `LogisticRegression`: fast, a good default for most tasks
- `RandomForestClassifier`: better for complex patterns, but slower
- `MultinomialNB`: very fast, good for simple text classification
### 3. Monitor Class Distribution
Active learning can create class imbalance, since uncertain instances tend to cluster near decision boundaries. Monitor the distribution in the admin dashboard and consider stratified sampling.
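A quick way to spot skew is to summarize the labels annotated so far. This is a generic sketch using the standard library; the helper name is hypothetical, not part of Potato.

```python
# Hypothetical helper: summarize label balance in collected annotations
# so skew introduced by active learning is easy to spot.
from collections import Counter

def label_distribution(labels):
    """Map each label to its fraction of all annotations."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

dist = label_distribution(["Positive", "Positive", "Negative", "Neutral"])
print(dist)  # {'Positive': 0.5, 'Negative': 0.25, 'Neutral': 0.25}
```

If one class drifts toward dominating the queue, mix some randomly sampled instances back in.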
### 4. Set a Reasonable Retrain Frequency

Retraining too often wastes compute without improving the queue much:

```yaml
update_frequency: 100   # Retrain every 100 annotations
```

### 5. Enable Model Persistence

Save models to avoid retraining from scratch after a restart:

```yaml
model_persistence:
  enabled: true
  save_dir: "models/"
```

## Example: Complete Configuration
```yaml
task_name: "Sentiment Analysis with Active Learning"
task_dir: "."
port: 8000

data_files:
  - "data/reviews.json"

item_properties:
  id_key: id
  text_key: text

annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "What is the sentiment?"
    labels:
      - Positive
      - Negative
      - Neutral

active_learning:
  enabled: true
  schema_names:
    - sentiment
  min_annotations_per_instance: 1
  min_instances_for_training: 30
  update_frequency: 50
  max_instances_to_reorder: 500
  classifier:
    type: LogisticRegression
    params:
      C: 1.0
      max_iter: 1000
  vectorizer:
    type: TfidfVectorizer
    params:
      max_features: 3000
      ngram_range: [1, 2]
  model_persistence:
    enabled: true
    save_dir: "models/"
    max_saved_models: 3

output_annotation_dir: "output/"
output_annotation_format: "json"
allow_all_users: true
```

## Combining with AI Support
Use both active learning and LLM assistance:

```yaml
active_learning:
  enabled: true
  schema_names:
    - sentiment
  min_instances_for_training: 30

ai_support:
  enabled: true
  endpoint_type: openai
  ai_config:
    model: gpt-4
    api_key: ${OPENAI_API_KEY}
  features:
    label_suggestions:
      enabled: true
```

This combination prioritizes uncertain instances while providing AI hints to help annotators.
## Troubleshooting

### Training Failures

- Ensure there are enough annotations (`min_instances_for_training`)
- Check the class distribution: training needs examples of every class
- Verify the data format matches the schema

### Slow Performance

- Reduce `max_instances_to_reorder`
- Increase `update_frequency`
- Use `HashingVectorizer` for large vocabularies

### Model Not Updating

- Check the `update_frequency` setting
- Verify that annotations are being saved
- Review the admin dashboard for errors