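Active learning intelligently selects which items to annotate next, focusing human effort where it matters most. This guide shows how to reduce annotation effort by up to 50% while maintaining model quality.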

Figure: The active learning loop. An unlabeled pool feeds a model that scores uncertainty, the most uncertain items are annotated, and the model retrains.

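What Is Active Learning?

Rather than randomly sampling data for annotation, active learning:

Trains a model on the data labeled so far
Identifies the items the model is uncertain about
Prioritizes those items for human annotation
Repeats the process, so each round of labeling is more efficient than the last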
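
The loop above can be sketched in a few lines of Python. This is a toy illustration, not Potato's implementation: the "model" is a one-dimensional threshold, the annotator is simulated by a hypothetical `oracle` function, and all names (`train`, `uncertainty`, `active_learning_loop`) are made up for this example.

```python
def train(labeled):
    """Toy 1-D model: a threshold halfway between the largest
    negative example and the smallest positive example."""
    max_neg = max(x for x, y in labeled if y == 0)
    min_pos = min(x for x, y in labeled if y == 1)
    return (max_neg + min_pos) / 2


def uncertainty(threshold, x):
    """Items closer to the decision threshold are more uncertain."""
    return -abs(x - threshold)


def active_learning_loop(pool, oracle, seed_items, rounds=10, batch_size=1):
    """Run the train / score / annotate / retrain cycle.

    `oracle` stands in for the human annotator (an assumption for this
    sketch); `seed_items` must contain at least one example of each class.
    """
    labeled = [(x, oracle(x)) for x in seed_items]
    unlabeled = [x for x in pool if x not in seed_items]
    threshold = train(labeled)  # step 1: train on current labels
    for _ in range(rounds):
        if not unlabeled:
            break
        # Step 2: rank the unlabeled pool, most uncertain first.
        unlabeled.sort(key=lambda x: uncertainty(threshold, x), reverse=True)
        # Step 3: send the most uncertain items to the annotator.
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        labeled.extend((x, oracle(x)) for x in batch)
        # Step 4: retrain and repeat.
        threshold = train(labeled)
    return threshold, labeled


pool = [i / 100 for i in range(101)]
threshold, labeled = active_learning_loop(
    pool, oracle=lambda x: int(x >= 0.6), seed_items=[0.0, 1.0]
)
print(f"{len(labeled)} labels used, estimated threshold {threshold:.3f}")
```

In this toy setting the loop homes in on the true boundary at 0.6 with a dozen labels out of 101 candidates, because each query lands between the closest known negative and positive examples.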

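Why Use Active Learning?

Lower annotation cost: reach the same model quality with fewer labels
Faster iteration: get a usable model sooner
Focused expertise: concentrate human attention on the hard cases
Better coverage: make sure edge cases are well represented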

Basic Active Learning Configuration

yaml

annotation_task_name: "Active Learning Classification"
 
data_files:
  - "data/unlabeled_pool.json"
 
# Active learning configuration
active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
 
  # Sampling settings
  max_instances_to_reorder: 1000  # Number of instances to reorder by uncertainty
  random_sample_percent: 0.1  # 10% random sampling to maintain diversity
 
annotation_schemes:
  - annotation_type: radio
    name: category
    labels: [Positive, Negative, Neutral]

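How Uncertainty Sampling Works

Potato's active learning uses uncertainty sampling to prioritize the items the classifier is least sure about. The classifier predicts labels for the unlabeled instances, and the items with the lowest confidence are shown to annotators first.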
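
As a rough illustration of "lowest confidence first", the sketch below scores each instance with the common least-confidence measure (one minus the probability of the most likely class) and sorts by it. The exact scoring formula Potato uses is not stated here, so treat this as one plausible reading; the function names and the sample probabilities are invented for the example.

```python
def least_confidence(probs):
    """Uncertainty score: 1 minus the probability of the top class."""
    return 1.0 - max(probs)


def rank_by_uncertainty(predictions):
    """Return instance ids ordered from most to least uncertain.

    `predictions` maps an instance id to its list of class probabilities.
    """
    return sorted(
        predictions,
        key=lambda instance_id: least_confidence(predictions[instance_id]),
        reverse=True,
    )


# Hypothetical classifier output for three instances
# (columns: Positive, Negative, Neutral).
predictions = {
    "a": [0.98, 0.01, 0.01],  # confident
    "b": [0.40, 0.35, 0.25],  # very unsure
    "c": [0.70, 0.20, 0.10],  # somewhat unsure
}
print(rank_by_uncertainty(predictions))  # ['b', 'c', 'a']
```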

The classifier_name field takes a full module path, so you can specify any scikit-learn-compatible classifier:

yaml

active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"

Other classifier options include:

sklearn.ensemble.RandomForestClassifier
sklearn.svm.SVC (requires probability=True)
sklearn.naive_bayes.MultinomialNB
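
To show how a dotted path such as the ones above can be turned into a class, here is a minimal sketch using Python's standard `importlib`. It is not taken from Potato's source; `load_class` is a hypothetical helper, and the demo resolves a standard-library class so that it runs without scikit-learn installed.

```python
import importlib


def load_class(dotted_path):
    """Resolve 'package.module.ClassName' to the class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Demo with a standard-library class.
cls = load_class("collections.OrderedDict")
print(cls.__name__)  # OrderedDict
```

With scikit-learn installed, the same helper would resolve `"sklearn.svm.SVC"`, and `load_class("sklearn.svm.SVC")(probability=True)` would create an SVC that exposes the class probabilities uncertainty sampling needs.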

Full Configuration

yaml

annotation_task_name: "Active Learning for Sentiment"
 
data_files:
  - "data/reviews.json"
 
active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
 
  # Sampling settings
  max_instances_to_reorder: 2000  # Reorder top N by uncertainty
  random_sample_percent: 0.1  # 10% random to maintain diversity
 
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "Classify the sentiment"
    labels:
      - name: Positive
        key_value: "1"
      - name: Negative
        key_value: "2"
      - name: Neutral
        key_value: "3"
    required: true
 
annotation_guidelines:
  text: |
    ## Sentiment Classification
 
    Items are prioritized by model uncertainty.
    You may see more difficult or ambiguous cases.
 
    Focus on accuracy over speed.

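Monitoring Progress

Track annotation progress with Potato's built-in logging. The system records which instances were selected and their uncertainty scores, so you can monitor the whole active learning process.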

Best Practices

Cold Start

Begin with a diverse random sample by setting a higher random_sample_percent:

yaml

active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
  random_sample_percent: 0.2  # 20% random for initial diversity

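Controlling the Reorder Scope

Use max_instances_to_reorder to control how many instances are ranked by uncertainty. Larger values give better selection but require more computation: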

yaml

active_learning:
  max_instances_to_reorder: 5000  # Rank top 5000 by uncertainty
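
One way to picture this trade-off is the sketch below, which assumes that only the first max_instances_to_reorder items in the queue are scored and sorted while the rest keep their original order. That reading is an assumption rather than documented behavior, and `reorder_head` is a hypothetical name.

```python
def reorder_head(instance_ids, score, max_instances_to_reorder):
    """Sort only the head of the queue by uncertainty.

    Scoring and sorting cost grows with the head size, which is why a
    larger cap means more computation per reorder.
    """
    head = instance_ids[:max_instances_to_reorder]
    tail = instance_ids[max_instances_to_reorder:]
    head = sorted(head, key=score, reverse=True)  # most uncertain first
    return head + tail


# Hypothetical uncertainty scores for four queued instances.
scores = {"a": 0.10, "b": 0.90, "c": 0.50, "d": 0.99}
print(reorder_head(["a", "b", "c", "d"], scores.get, 3))  # ['b', 'c', 'a', 'd']
```

Note that "d" is the most uncertain item overall but stays last because it falls outside the cap, which is the selection quality a smaller value gives up.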

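Maintaining Diversity

The random_sample_percent parameter ensures that some randomly sampled instances are included, so the model does not see only uncertain edge cases: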

yaml

active_learning:
  random_sample_percent: 0.1  # 10% random sampling
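
The effect of mixing random picks into an uncertainty-ranked queue can be sketched as follows. How Potato actually interleaves the two is not described here, so this is just one simple scheme under that assumption: each slot is filled with a random remaining item with probability random_sample_percent, and otherwise with the most uncertain remaining item. `build_queue` is a hypothetical name.

```python
import random


def build_queue(ranked_ids, random_sample_percent=0.1, seed=None):
    """Mix random picks into a list already sorted most-uncertain-first."""
    rng = random.Random(seed)
    remaining = list(ranked_ids)
    queue = []
    while remaining:
        if rng.random() < random_sample_percent:
            index = rng.randrange(len(remaining))  # random pick for diversity
        else:
            index = 0  # most uncertain item still in the pool
        queue.append(remaining.pop(index))
    return queue


ranked = [f"item_{i}" for i in range(10)]  # item_0 is the most uncertain
print(build_queue(ranked, random_sample_percent=0.1, seed=0))
```

With random_sample_percent set to 0 the queue is returned unchanged, and raising it toward 1 moves the order closer to a plain random shuffle.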

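Tips for Success

Start with diversity: a random initial sample helps cover edge cases
Monitor accuracy: track model performance over time
Don't over-optimize: keep some random sampling to maintain coverage
Manage annotator fatigue: difficult items are tiring to label
Save model checkpoints: so you can roll back if needed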

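Next Steps

Add AI suggestions to speed up work on uncertain items
Set up quality control for difficult cases
Learn about combining crowdsourcing with active learning

For the full active learning documentation, see /docs/features/active-learning.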