# Active Learning: Annotate Smarter, Not Harder

Source: https://www.potatoannotator.com/blog/active-learning-efficiency

Active learning decides which items to annotate next instead of leaving that to chance. Done well, it can cut the labeling you need by roughly half without hurting model quality. Here is how to set it up in Potato.

## What is Active Learning?

Instead of randomly sampling data to annotate, active learning:

1. Trains a model on current annotations
2. Identifies items where the model is uncertain
3. Prioritizes those items for human annotation
4. Repeats as new labels arrive, so each round targets the items the updated model is still unsure about
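To make the loop concrete, here is a minimal sketch in plain Python. It is not Potato's implementation: the toy 1-D threshold model and the `fit_threshold`, `prob_positive`, `active_learning_loop`, and `oracle` names are all hypothetical, chosen only to show the train, score, select, repeat cycle. The `oracle` function stands in for a human annotator.

```python
import math


def fit_threshold(labeled):
    """Toy 1-D model: the decision boundary is the midpoint of the two class means.

    Assumes `labeled` contains at least one example of each class (0 and 1).
    """
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def prob_positive(x, threshold):
    """Squash the signed distance to the boundary into a pseudo-probability."""
    return 1 / (1 + math.exp(-(x - threshold)))


def active_learning_loop(pool, oracle, seed_items, rounds):
    """Run `rounds` iterations of train -> score -> select -> annotate."""
    labeled = [(x, oracle(x)) for x in seed_items]
    unlabeled = [x for x in pool if x not in seed_items]
    for _ in range(rounds):
        if not unlabeled:
            break
        # 1. Train a model on the current annotations.
        threshold = fit_threshold(labeled)
        # 2. Find the item the model is least sure about (probability nearest 0.5).
        pick = min(unlabeled, key=lambda x: abs(prob_positive(x, threshold) - 0.5))
        # 3. Send that item to the annotator (here, the oracle function).
        unlabeled.remove(pick)
        labeled.append((pick, oracle(pick)))
        # 4. Repeat with the enlarged labeled set.
    return labeled, fit_threshold(labeled)


if __name__ == "__main__":
    # Hypothetical task: the true boundary sits between 12 and 13.
    labeled, boundary = active_learning_loop(
        pool=list(range(21)),
        oracle=lambda x: int(x > 12),
        seed_items=[0, 20],
        rounds=4,
    )
    print(labeled, boundary)
```

Each round the queried item sits close to the current boundary, which is where a new label changes the model most.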

## Why Use Active Learning?

The payoff is mostly about where your annotators spend their time. You label fewer items to reach the same model quality, so you get a usable model sooner. The hard, ambiguous cases get human attention, and edge cases that random sampling would skip over tend to surface earlier.

## Basic Active Learning Setup

```yaml
annotation_task_name: "Active Learning Classification"

data_files:
  - "data/unlabeled_pool.json"

# Active learning configuration
active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"

  # Sampling settings
  max_instances_to_reorder: 1000  # Number of instances to reorder by uncertainty
  random_sample_percent: 0.1  # 10% random sampling to maintain diversity

annotation_schemes:
  - annotation_type: radio
    name: category
    labels: [Positive, Negative, Neutral]
```

## How Uncertainty Sampling Works

Potato's active learning uses uncertainty sampling to prioritize items where the classifier is least confident. The classifier predicts labels for unlabeled instances, and those with the lowest confidence scores are presented first for annotation. For the available strategies and implementation details, see the [active learning guide](https://github.com/davidjurgens/potato/blob/master/docs/ai-intelligence/active_learning_guide.md) and the [strategies reference](https://github.com/davidjurgens/potato/blob/master/docs/ai-intelligence/active_learning_strategies.md).
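As a rough illustration of "lowest confidence first", the sketch below scores each item with least-confidence uncertainty (one minus the top class probability) and sorts by it. This is one common form of uncertainty sampling; the source does not state the exact score Potato computes, and `least_confidence`, `rank_by_uncertainty`, and the sample probabilities are hypothetical.

```python
def least_confidence(probs):
    """Uncertainty = 1 - probability of the most likely class."""
    return 1.0 - max(probs)


def rank_by_uncertainty(predictions):
    """Return item ids ordered from most to least uncertain.

    `predictions` maps an item id to its list of class probabilities.
    """
    return sorted(
        predictions,
        key=lambda item_id: least_confidence(predictions[item_id]),
        reverse=True,
    )


# Made-up probabilities for three items over (Positive, Negative, Neutral).
predictions = {
    "a": [0.90, 0.05, 0.05],  # confident
    "b": [0.40, 0.35, 0.25],  # nearly a three-way tie
    "c": [0.60, 0.30, 0.10],
}
print(rank_by_uncertainty(predictions))  # ['b', 'c', 'a']
```

Item `b` comes first because its best guess only reaches 0.40, so a human label there tells the model the most.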

The `classifier_name` field specifies any scikit-learn compatible classifier using its full module path:

```yaml
active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
```

Other classifier options include:
- `sklearn.ensemble.RandomForestClassifier`
- `sklearn.svm.SVC` (needs `probability=True`, since `SVC` only exposes class probabilities when that option is set)
- `sklearn.naive_bayes.MultinomialNB`
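A full module path like this is typically resolved with a dynamic import. The sketch below shows the general pattern using Python's `importlib`; it is an assumption about how such a string can be turned into a class, not a copy of Potato's loader, and `load_class` is a hypothetical helper.

```python
import importlib


def load_class(dotted_path):
    """Resolve 'package.module.ClassName' to the class object."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected a full module path, got {dotted_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Demonstrated with a standard-library class so the snippet runs anywhere.
ordered_dict_cls = load_class("collections.OrderedDict")
print(ordered_dict_cls.__name__)  # OrderedDict
```

With scikit-learn installed, `load_class("sklearn.linear_model.LogisticRegression")` would return the classifier class in the same way. A typo in the path surfaces as an `ImportError` or `AttributeError`, which is worth checking before you start an annotation run.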

## Complete Configuration

```yaml
annotation_task_name: "Active Learning for Sentiment"

data_files:
  - "data/reviews.json"

active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"

  # Sampling settings
  max_instances_to_reorder: 2000  # Reorder top N by uncertainty
  random_sample_percent: 0.1  # 10% random to maintain diversity

annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "Classify the sentiment"
    labels:
      - name: Positive
        key_value: "1"
      - name: Negative
        key_value: "2"
      - name: Neutral
        key_value: "3"
    required: true

annotation_guidelines:
  text: |
    ## Sentiment Classification

    Items are prioritized by model uncertainty.
    You may see more difficult or ambiguous cases.

    Focus on accuracy over speed.
```

## Monitoring Progress

Potato's built-in logging records which instances were selected and their uncertainty scores, so you can watch how the sampling behaves as labels accumulate.

## Best Practices

### Cold Start

Start with diverse random sampling by setting a higher `random_sample_percent`:

```yaml
active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
  random_sample_percent: 0.2  # 20% random for initial diversity
```

### Controlling Reordering Scope

Use `max_instances_to_reorder` to control how many instances are ranked by uncertainty. A larger value gives the sampler more to choose from but costs more compute on each pass:

```yaml
active_learning:
  max_instances_to_reorder: 5000  # Rank top 5000 by uncertainty
```
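One way to picture the trade-off: only a window of the queue is scored and sorted, and everything past the window keeps its original order. The sketch below assumes that reading of the setting. The source does not spell out the exact semantics, and `reorder_head` and the sample scores are hypothetical.

```python
def reorder_head(pool_ids, uncertainty, max_instances_to_reorder):
    """Sort only the first N ids by uncertainty; leave the rest untouched.

    `uncertainty` maps an item id to a score where higher means less confident.
    """
    head = pool_ids[:max_instances_to_reorder]
    tail = pool_ids[max_instances_to_reorder:]
    head_sorted = sorted(head, key=lambda item_id: uncertainty[item_id], reverse=True)
    return head_sorted + tail


pool_ids = ["a", "b", "c", "d"]
uncertainty = {"a": 0.10, "b": 0.90, "c": 0.50, "d": 0.99}
print(reorder_head(pool_ids, uncertainty, max_instances_to_reorder=3))
# ['b', 'c', 'a', 'd']
```

Item `d` is the most uncertain overall but stays last because it falls outside the window. That is the cost of a small value; the benefit is that fewer items are scored and sorted on each pass.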

### Maintaining Diversity

The `random_sample_percent` parameter ensures some randomly sampled instances are included, preventing the model from only seeing uncertain edge cases:

```yaml
active_learning:
  random_sample_percent: 0.1  # 10% random sampling
```
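To see what the mix does to the queue, here is a small sketch. For each slot it draws a random remaining item with probability `random_sample_percent` and otherwise takes the most uncertain remaining item. How Potato interleaves the two is not described in the source, so treat `build_queue` and its `seed` argument as hypothetical.

```python
import random


def build_queue(ranked_ids, random_sample_percent=0.1, seed=0):
    """Interleave random picks into an uncertainty-ranked list.

    `ranked_ids` must already be ordered from most to least uncertain.
    """
    rng = random.Random(seed)  # seeded so the queue is reproducible
    remaining = list(ranked_ids)
    queue = []
    while remaining:
        if rng.random() < random_sample_percent:
            idx = rng.randrange(len(remaining))  # random pick for diversity
        else:
            idx = 0  # most uncertain item still waiting
        queue.append(remaining.pop(idx))
    return queue


ranked = list(range(20))  # 0 is the most uncertain item
print(build_queue(ranked, random_sample_percent=0.1))
```

With `random_sample_percent=0.0` the queue is the pure uncertainty ranking; raising it trades some targeting for broader coverage of the data.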

## Tips for Success

A few things that help in practice. Begin with a random sample so the first model sees a spread of examples, then keep a slice of random sampling in the mix so the model does not overfit to the uncertain edge cases. Track accuracy on a held-out set as you go. And remember that a steady diet of the hardest items wears annotators down, so watch for fatigue and save model checkpoints in case you need to roll back.
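Tracking accuracy can be as simple as scoring the model against a small fixed set of gold labels after each round. The sketch below is a minimal, hypothetical helper (`accuracy` and `record_round` are not part of Potato), and the labels are made up for illustration.

```python
def accuracy(predicted, gold):
    """Fraction of items where the predicted label matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one gold label")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def record_round(history, n_labeled, predicted, gold):
    """Append (number of labeled items, accuracy) so the trend can be inspected."""
    history.append((n_labeled, accuracy(predicted, gold)))
    return history


gold = ["Positive", "Negative", "Positive", "Positive"]
history = []
record_round(history, 50, ["Positive", "Negative", "Neutral", "Negative"], gold)
record_round(history, 100, ["Positive", "Negative", "Neutral", "Positive"], gold)
print(history)  # [(50, 0.5), (100, 0.75)]
```

Keep the gold set fixed and out of the annotation pool so that the numbers stay comparable from round to round.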

## Next Steps

- Add [AI suggestions](/blog/llm-integration-guide) to speed up uncertain items
- Set up [quality control](/blog/quality-control-strategies) for difficult cases
- Learn about [crowdsourcing](/blog/prolific-integration) with active learning

---

*Full active learning documentation at [/docs/features/active-learning](/docs/features/active-learning).*
