# Active Learning

Source: https://www.potatoannotator.com/docs/features/active-learning

Active learning helps you annotate more efficiently by prioritizing the most informative examples. Instead of working through instances in random order, annotators first see the instances where the model is most uncertain.

## How It Works

Potato's active learning automatically reorders annotation instances based on machine learning predictions:

1. **Initial Collection** - Gather a minimum number of annotations
2. **Train** - Train a classifier on existing annotations
3. **Predict** - Get uncertainty scores for unannotated instances
4. **Reorder** - Prioritize instances with highest uncertainty
5. **Annotate** - Annotators label the prioritized instances
6. **Retrain** - Update the model periodically with new annotations
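
The predict and reorder steps can be sketched in a few lines of Python. This is a minimal illustration, assuming least-confidence scoring (one minus the highest predicted class probability). The function names and the example probabilities are hypothetical and are not Potato's internal API.

```python
def least_confidence(probabilities):
    """Uncertainty score: 1 minus the probability of the most likely class."""
    return 1.0 - max(probabilities)


def reorder_by_uncertainty(predictions):
    """Return instance ids sorted from most to least uncertain.

    `predictions` maps an instance id to its predicted class probabilities.
    """
    return sorted(
        predictions,
        key=lambda instance_id: least_confidence(predictions[instance_id]),
        reverse=True,
    )


# Hypothetical model output for three unannotated instances
# (classes: Positive, Negative, Neutral).
predictions = {
    "review_1": [0.90, 0.05, 0.05],  # confident
    "review_2": [0.40, 0.35, 0.25],  # very uncertain
    "review_3": [0.60, 0.30, 0.10],  # somewhat uncertain
}

queue = reorder_by_uncertainty(predictions)
print(queue)  # ['review_2', 'review_3', 'review_1']
```

The instance whose top class has the lowest probability moves to the front of the queue, so annotators label it first.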

## Configuration

### Basic Setup

```yaml
active_learning:
  enabled: true
  schema_names:
    - sentiment  # Which annotation schemes to use

  min_annotations_per_instance: 1
  min_instances_for_training: 20
  update_frequency: 50  # Retrain after every 50 annotations
  max_instances_to_reorder: 1000
```
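
To make the interaction between `min_instances_for_training` and `update_frequency` concrete, here is a rough sketch of a retraining trigger. The rule it encodes is an assumption made for illustration (Potato's actual scheduling may differ), `should_train` is a hypothetical helper, and the sketch treats one annotated instance as one annotation, as with `min_annotations_per_instance: 1`.

```python
def should_train(annotated_count, last_trained_count,
                 min_instances_for_training=20, update_frequency=50):
    """Return True when a (re)training run is due.

    Assumed rule, for illustration only: train the first model once
    `min_instances_for_training` is reached, then retrain whenever
    `update_frequency` new annotations have arrived since the last run.
    `last_trained_count` is None until a model has been trained.
    """
    if annotated_count < min_instances_for_training:
        return False
    if last_trained_count is None:
        return True
    return annotated_count - last_trained_count >= update_frequency


for annotated, last in [(10, None), (20, None), (60, 20), (70, 20)]:
    print(annotated, last, should_train(annotated, last))
# 10 None False
# 20 None True
# 60 20 False
# 70 20 True
```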

### Full Configuration

```yaml
active_learning:
  enabled: true

  # Which schemas to use for training
  schema_names:
    - sentiment

  # Minimum requirements
  min_annotations_per_instance: 1
  min_instances_for_training: 20

  # Retraining frequency
  update_frequency: 50

  # How many instances to reorder
  max_instances_to_reorder: 1000

  # Classifier configuration
  classifier_name: sklearn.linear_model.LogisticRegression
  classifier_params:
    C: 1.0
    max_iter: 1000

  # Query strategy
  query_strategy: uncertainty  # uncertainty, diversity, badge, bald, hybrid

  # Feature extraction
  vectorizer:
    type: TfidfVectorizer
    params:
      max_features: 5000
      ngram_range: [1, 2]

  # Model persistence
  model_persistence:
    enabled: true
    save_dir: "models/"
    max_saved_models: 5
```

## Query Strategies

Potato supports five query strategies for selecting the most informative instances:

| Strategy | Description |
|----------|-------------|
| `uncertainty` | Selects instances where the model is least confident (default) |
| `diversity` | Selects instances that are most different from already-annotated data |
| `badge` | Batch Active learning by Diverse Gradient Embeddings |
| `bald` | Bayesian Active Learning by Disagreement |
| `hybrid` | Ensemble combining multiple strategies |

```yaml
active_learning:
  query_strategy: uncertainty  # or diversity, badge, bald, hybrid
```
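
As an illustration of how a disagreement-based strategy such as `bald` differs from plain uncertainty, the sketch below computes the standard BALD score: the entropy of the averaged prediction minus the average entropy of the individual predictions. It assumes the predictions come from several ensemble members or stochastic passes; how Potato obtains them is not described here, and `bald_score` is a hypothetical name.

```python
import math


def entropy(probabilities):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def bald_score(member_predictions):
    """BALD score: entropy of the mean prediction minus the mean entropy.

    `member_predictions` holds one probability vector per ensemble member
    (or per stochastic forward pass) for a single instance.
    """
    n_members = len(member_predictions)
    n_classes = len(member_predictions[0])
    mean_prediction = [
        sum(member[c] for member in member_predictions) / n_members
        for c in range(n_classes)
    ]
    mean_entropy = sum(entropy(m) for m in member_predictions) / n_members
    return entropy(mean_prediction) - mean_entropy


# Members agree that the instance is ambiguous: no disagreement.
agree = [[0.5, 0.5], [0.5, 0.5]]
# Members are each confident but contradict one another: high disagreement.
disagree = [[0.9, 0.1], [0.1, 0.9]]

print(bald_score(agree) < bald_score(disagree))  # True
```

Both example instances look equally uncertain on average, but only the second one exposes disagreement between models, which is what this score rewards.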

## Supported Classifiers

Classifiers are specified using their full sklearn import path via `classifier_name`:

| Classifier | sklearn Path | Best For | Speed |
|------------|-------------|----------|-------|
| Logistic Regression | `sklearn.linear_model.LogisticRegression` | Binary/multiclass classification | Fast |
| Random Forest | `sklearn.ensemble.RandomForestClassifier` | Complex patterns | Medium |
| SVC | `sklearn.svm.SVC` | Small datasets | Slow |
| Multinomial NB | `sklearn.naive_bayes.MultinomialNB` | Text classification | Very Fast |

### Classifier Examples

```yaml
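# Alternatives: keep exactly one classifier_name/classifier_params pair,
# nested under active_learning.

# Logistic Regression (recommended starting point)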
classifier_name: sklearn.linear_model.LogisticRegression
classifier_params:
  C: 1.0
  max_iter: 1000

# Random Forest
classifier_name: sklearn.ensemble.RandomForestClassifier
classifier_params:
  n_estimators: 100
  max_depth: 10

# Support Vector Classifier
classifier_name: sklearn.svm.SVC
classifier_params:
  kernel: rbf
  probability: true

# Naive Bayes
classifier_name: sklearn.naive_bayes.MultinomialNB
classifier_params:
  alpha: 1.0
```

## Vectorizers

| Vectorizer | Description |
|------------|-------------|
| `TfidfVectorizer` | TF-IDF weighted features (recommended) |
| `CountVectorizer` | Simple word counts |
| `HashingVectorizer` | Memory-efficient for large vocabularies |

```yaml
# Alternatives: keep exactly one vectorizer block, nested under active_learning.

# TF-IDF (recommended)
vectorizer:
  type: TfidfVectorizer
  params:
    max_features: 5000
    ngram_range: [1, 2]
    stop_words: english

# Count Vectorizer
vectorizer:
  type: CountVectorizer
  params:
    max_features: 3000
    ngram_range: [1, 1]

# Hashing Vectorizer (for large datasets)
vectorizer:
  type: HashingVectorizer
  params:
    n_features: 10000
```
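
The memory advantage of a hashing vectorizer comes from the hashing trick: each token is mapped straight to a column index, so no vocabulary has to be stored. The sketch below illustrates the idea only. It uses MD5 for a stable hash, which is not the hash function scikit-learn uses, and `hash_features` is a hypothetical helper.

```python
import hashlib


def hash_features(text, n_features=16):
    """Map tokens to a fixed-length count vector without storing a vocabulary."""
    vector = [0] * n_features
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_features
        vector[index] += 1
    return vector


a = hash_features("great movie great cast")
b = hash_features("terrible movie")
print(len(a), sum(a))  # 16 4
print(len(b), sum(b))  # 16 2
```

The vector length is fixed by `n_features` regardless of how many distinct tokens appear. The trade-offs are that different tokens can collide in the same column and that columns cannot be mapped back to words.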

## LLM Integration

Active learning can optionally use LLMs for enhanced instance selection:

```yaml
active_learning:
  enabled: true
  schema_names:
    - sentiment

  # LLM-based selection
  llm_integration:
    enabled: true
    endpoint_type: vllm
    base_url: http://localhost:8000/v1
    model: meta-llama/Llama-2-7b-chat-hf

    # Mock mode for testing
    mock_mode: false
```

## Multi-Schema Support

Active learning can cycle through multiple annotation schemas:

```yaml
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    labels: [Positive, Negative, Neutral]

  - annotation_type: radio
    name: topic
    labels: [Politics, Sports, Tech, Entertainment]

active_learning:
  enabled: true
  schema_names:
    - sentiment
    - topic

  # Schema-specific settings
  schema_config:
    sentiment:
      min_instances_for_training: 30
      update_frequency: 50
    topic:
      min_instances_for_training: 50
      update_frequency: 100
```
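
One plausible reading of `schema_config` is that per-schema values override the top-level settings, with the top-level values acting as defaults. The sketch below encodes that reading; the precedence rule is an assumption, and `settings_for_schema` is a hypothetical helper rather than part of Potato.

```python
def settings_for_schema(active_learning, schema_name):
    """Return top-level settings overlaid with one schema's overrides."""
    overridable = ("min_instances_for_training", "update_frequency")
    settings = {
        key: active_learning[key] for key in overridable if key in active_learning
    }
    overrides = active_learning.get("schema_config", {}).get(schema_name, {})
    settings.update(overrides)
    return settings


active_learning = {
    "min_instances_for_training": 20,
    "update_frequency": 50,
    "schema_names": ["sentiment", "topic"],
    "schema_config": {
        "topic": {"min_instances_for_training": 50, "update_frequency": 100},
    },
}

print(settings_for_schema(active_learning, "sentiment"))
# {'min_instances_for_training': 20, 'update_frequency': 50}
print(settings_for_schema(active_learning, "topic"))
# {'min_instances_for_training': 50, 'update_frequency': 100}
```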

## Model Persistence

Save and reload trained models across server restarts:

```yaml
active_learning:
  enabled: true
  schema_names:
    - sentiment

  model_persistence:
    enabled: true
    save_dir: "models/"
    max_saved_models: 5  # Keep last 5 models

    # Save to database instead of files
    use_database: false
```
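
The effect of `max_saved_models` can be pictured as a rotation that keeps only the newest files. The sketch below is an illustration under assumptions: it supposes models are stored as `.pkl` files directly inside `save_dir` and ranks them by modification time. Neither detail is confirmed by this page, and `prune_saved_models` is a hypothetical helper.

```python
from pathlib import Path


def prune_saved_models(save_dir, max_saved_models):
    """Delete the oldest model files so at most `max_saved_models` remain.

    Returns the names of the files that were removed.
    """
    files = sorted(
        Path(save_dir).glob("*.pkl"),  # assumed file extension
        key=lambda path: path.stat().st_mtime,
        reverse=True,  # newest first
    )
    removed = files[max_saved_models:]
    for path in removed:
        path.unlink()
    return [path.name for path in removed]
```

Calling `prune_saved_models("models/", 5)` after each save would leave the five most recently written files in place.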

## Monitoring Progress

The admin dashboard tracks active learning metrics:

- Current model accuracy
- Training cycle count
- Uncertainty distribution
- Instances remaining
- Retraining history

Open the dashboard at `/admin` and authenticate with your admin API key.

## Best Practices

### 1. Start with Random Sampling

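Collect an initial batch of annotations before the first model is trained. Raise `min_instances_for_training` so that reordering starts only once enough labeled data exists: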

```yaml
active_learning:
  enabled: true
  min_instances_for_training: 50  # Train the first model after 50 annotated instances
```

### 2. Choose Appropriate Classifiers

- **LogisticRegression**: Fast, good default for most tasks
- **RandomForestClassifier**: Better for complex patterns, slower
- **MultinomialNB**: Very fast, good for simple text classification

### 3. Monitor Class Distribution

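Because active learning favors uncertain instances, the annotated set can become imbalanced across classes. Monitor the class distribution in the admin dashboard and consider stratified sampling if some classes fall behind.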
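
A quick check on exported annotations can flag classes that are falling behind. The sketch below is a minimal example, assuming the labels are available as a plain list; the 10% threshold and both function names are arbitrary, hypothetical choices.

```python
from collections import Counter


def class_distribution(labels):
    """Return each label's share of the annotations collected so far."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def underrepresented(labels, expected_labels, min_share=0.1):
    """List expected labels whose share is below `min_share`, including absent ones."""
    shares = class_distribution(labels)
    return [label for label in expected_labels if shares.get(label, 0.0) < min_share]


labels = ["Positive"] * 12 + ["Negative"] * 7 + ["Neutral"] * 1
print(underrepresented(labels, ["Positive", "Negative", "Neutral"]))  # ['Neutral']
```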

### 4. Set Reasonable Retrain Frequency

Retraining too often wastes resources. A higher `update_frequency` value means more annotations between training runs:

```yaml
update_frequency: 100  # Retrain every 100 annotations
```

### 5. Enable Model Persistence

Save models to avoid retraining from scratch on restart:

```yaml
model_persistence:
  enabled: true
  save_dir: "models/"
```

## Example: Complete Configuration

```yaml
annotation_task_name: "Sentiment Analysis with Active Learning"
task_dir: "."
port: 8000

data_files:
  - "data/reviews.json"

item_properties:
  id_key: id
  text_key: text

annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "What is the sentiment?"
    labels:
      - Positive
      - Negative
      - Neutral

active_learning:
  enabled: true
  schema_names:
    - sentiment

  min_annotations_per_instance: 1
  min_instances_for_training: 30
  update_frequency: 50
  max_instances_to_reorder: 500

  classifier_name: sklearn.linear_model.LogisticRegression
  classifier_params:
    C: 1.0
    max_iter: 1000

  query_strategy: uncertainty

  vectorizer:
    type: TfidfVectorizer
    params:
      max_features: 3000
      ngram_range: [1, 2]

  model_persistence:
    enabled: true
    save_dir: "models/"
    max_saved_models: 3

output_annotation_dir: "output/"
export_annotation_format: "json"
user_config:
  allow_all_users: true
```

## Combining with AI Support

Use both active learning and LLM assistance:

```yaml
active_learning:
  enabled: true
  schema_names:
    - sentiment
  min_instances_for_training: 30

ai_support:
  enabled: true
  endpoint_type: openai

  ai_config:
    model: gpt-4
    api_key: ${OPENAI_API_KEY}

  features:
    label_suggestions:
      enabled: true
```

This combination prioritizes uncertain instances while providing AI hints to help annotators.

## Troubleshooting

### Training Failures

- Ensure enough instances have been annotated to reach `min_instances_for_training`
- Check the class distribution: the training data needs examples of every class
- Verify that the data format matches the schema

### Slow Performance

- Reduce `max_instances_to_reorder`
- Increase the `update_frequency` value so the model retrains less often
- Use `HashingVectorizer` for large vocabularies

### Model Not Updating

- Check `update_frequency` setting
- Verify annotations are being saved
- Review admin dashboard for errors

## Further Reading

- [AI Support](/docs/features/ai-support) - LLM-assisted annotation
- [Task Assignment](/docs/features/task-assignment) - Assignment strategies
- [Admin Dashboard](/docs/features/admin-dashboard) - Monitor active learning metrics

For implementation details, see the [source documentation](https://github.com/davidjurgens/potato/blob/main/docs/active_learning_guide.md).
