Active Learning for Annotation

What active learning is, when it helps, and the query strategies Potato supports (uncertainty, diversity, BADGE, BALD) for labeling fewer items at the same model quality.

Active learning chooses which items to annotate next so that a model can reach a given accuracy with fewer labels than random sampling would need. Instead of labeling at random, you label the items the model is expected to learn the most from. When labeling is the bottleneck, this is one of the highest-return techniques available.

See active learning for background. For the feature reference, see Active Learning.

The loop

  1. Label a small seed set.
  2. Train a quick model on what you have.
  3. Score the unlabeled pool and pick the most informative items.
  4. Annotate those, add them, retrain. Repeat.
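The four steps above can be sketched in a few lines of plain Python. This is an illustrative toy, not Potato's implementation: the nearest-centroid model, the `oracle` function standing in for a human annotator, and every function name here are assumptions made for the example.

```python
import math
import random

def oracle(x):
    """Hypothetical stand-in for the annotator: class 1 if x[0] > 0.5."""
    return int(x[0] > 0.5)

def train(pool, labeled):
    """Fit a toy nearest-centroid model: one mean point per class."""
    groups = {}
    for i, y in labeled.items():
        groups.setdefault(y, []).append(pool[i])
    return {
        y: tuple(sum(p[d] for p in pts) / len(pts) for d in range(2))
        for y, pts in groups.items()
    }

def margin(model, x):
    """Gap between the two nearest centroids; small gap = uncertain item."""
    d = sorted(math.dist(x, c) for c in model.values())
    return d[1] - d[0] if len(d) > 1 else 0.0

def active_learning_loop(pool, annotate, seed_size=4, batch_size=2, rounds=5, seed=0):
    rng = random.Random(seed)
    # 1. Label a small seed set.
    labeled = {i: annotate(pool[i]) for i in rng.sample(range(len(pool)), seed_size)}
    for _ in range(rounds):
        # 2. Train a quick model on what you have.
        model = train(pool, labeled)
        # 3. Score the unlabeled pool; the smallest margins come first.
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        unlabeled.sort(key=lambda i: margin(model, pool[i]))
        # 4. Annotate the top items and add them; the next pass retrains.
        for i in unlabeled[:batch_size]:
            labeled[i] = annotate(pool[i])
    return labeled

rng = random.Random(0)
pool = [(rng.random(), rng.random()) for _ in range(50)]
labeled = active_learning_loop(pool, oracle)
```

In a real project the `annotate` call is the human labeling step, and the quick model is whatever classifier the tool trains behind the scenes.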

The payoff is data efficiency: the model spends your annotation budget where it learns the most.

Query strategies Potato supports

  • Uncertainty sampling: pick items the model is least confident about (near the decision boundary). A simple and often effective default.
  • Diversity sampling: pick items that are different from each other, so you don't waste budget on near-duplicates.
  • BADGE: combines uncertainty and diversity using gradient embeddings.
  • BALD: Bayesian strategy that selects items expected to most reduce model uncertainty.
  • Hybrid: blends strategies.
Example configuration:

```yaml
active_learning:
  enabled: true
  schema_names: [sentiment]
  query_strategy: uncertainty   # or diversity, badge, bald, hybrid
  min_instances_for_training: 20
```
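To make the strategy names more concrete, here is a minimal sketch of the textbook scoring rules behind three of them. These are generic formulations, not a description of how Potato computes them internally, and the function names are hypothetical. BADGE is omitted because it needs gradient embeddings from a trained network.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def least_confidence(probs):
    """Uncertainty score: 1 minus the probability of the top class."""
    return 1.0 - max(probs)

def bald(mc_probs):
    """BALD score for one item from several stochastic predictions.

    mc_probs is a list of probability vectors (for example from dropout
    passes or an ensemble). The score is the entropy of the averaged
    prediction minus the average entropy of the individual predictions.
    """
    n = len(mc_probs)
    mean = [sum(col) / n for col in zip(*mc_probs)]
    return entropy(mean) - sum(entropy(p) for p in mc_probs) / n

def farthest_first(vectors, k):
    """Greedy diversity: repeatedly add the item farthest from those chosen."""
    if not vectors:
        return []
    chosen = [0]  # assumption: start from the first item
    while len(chosen) < min(k, len(vectors)):
        best = max(
            (i for i in range(len(vectors)) if i not in chosen),
            key=lambda i: min(math.dist(vectors[i], vectors[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen
```

Uncertainty scores rank items one at a time, so a batch can contain near-duplicates. The diversity routine selects a spread-out set instead, which is the gap hybrid strategies aim to close.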

When active learning helps, and when it doesn't

It helps when labels are expensive, the pool is large, and a useful model can be trained on a small seed. It helps less when:

  • The task is so easy that random labeling already saturates quickly.
  • You need an unbiased held-out test set. Active-learning-selected data is deliberately skewed, so keep your evaluation data randomly sampled.
  • Labels are cheap relative to engineering effort.
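One way to keep evaluation data unbiased is to set aside a random test split before any active-learning selection happens. The sketch below assumes items are identified by simple IDs; `split_pool` and its parameters are hypothetical names, not part of Potato.

```python
import random

def split_pool(item_ids, test_fraction=0.2, seed=0):
    """Randomly reserve a test set; only the remainder feeds active learning."""
    ids = list(item_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_test = int(len(ids) * test_fraction)
    # Returns (active-learning pool, held-out test set).
    return ids[n_test:], ids[:n_test]

al_pool, test_set = split_pool(range(100))
```

The test items are labeled like any others, but they are never scored or chosen by the query strategy, so accuracy measured on them reflects the original data distribution.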

Further reading