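Active Learning

Use uncertainty sampling to prioritize the most valuable samples for annotation.

Active learning helps you annotate more efficiently by prioritizing the most informative samples. Instead of annotating in random order, you focus on the instances the model is least certain about.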

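How It Works

Potato's active learning automatically reorders annotation instances based on machine learning predictions:

Initial collection - Collect a minimum number of annotations
Training - Train a classifier on the existing annotations
Prediction - Compute uncertainty scores for unannotated instances
Reordering - Move the instances with the highest uncertainty to the front
Annotation - Annotators label the prioritized instances
Retraining - Periodically update the model with new annotations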
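
The prediction and reordering steps above can be sketched in a few lines. This is a minimal illustration, not Potato's implementation: it assumes prediction entropy as the uncertainty score (the page does not say which measure Potato uses), and `reorder_by_uncertainty` and the instance ids are hypothetical names.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def reorder_by_uncertainty(predictions):
    """Return instance ids sorted from most to least uncertain.

    `predictions` maps a (hypothetical) instance id to the class
    probabilities a classifier assigned to that instance.
    """
    return sorted(predictions, key=lambda iid: entropy(predictions[iid]), reverse=True)


# Illustrative probabilities for three unannotated instances.
predictions = {
    "doc-1": [0.98, 0.01, 0.01],  # confident prediction, low entropy
    "doc-2": [0.34, 0.33, 0.33],  # near-uniform, high entropy
    "doc-3": [0.70, 0.20, 0.10],
}

queue = reorder_by_uncertainty(predictions)
print(queue)  # ['doc-2', 'doc-3', 'doc-1']
```

The near-uniform prediction lands at the front of the queue, so annotators see the instance the model is least sure about first.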

Configuration

Basic Setup

yaml

active_learning:
  enabled: true
  schema_names:
    - sentiment  # Which annotation schemes to use
 
  min_annotations_per_instance: 1
  min_instances_for_training: 20
  update_frequency: 50  # Retrain after every 50 annotations
  max_instances_to_reorder: 1000
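
One way to read how `min_instances_for_training` and `update_frequency` interact is sketched below. The function `should_retrain` is hypothetical, and the logic is an assumption based on the comments in the config above (it also treats one annotation as one annotated instance); Potato's actual scheduling may differ.

```python
def should_retrain(total_annotations, annotations_at_last_train,
                   min_instances_for_training=20, update_frequency=50):
    """Decide whether a (re)training run is due under the assumed semantics."""
    # Assumption: nothing is trained until the minimum is reached.
    if total_annotations < min_instances_for_training:
        return False
    # Assumption: the first model is trained as soon as the minimum is met.
    if annotations_at_last_train is None:
        return True
    # Afterwards, retrain once `update_frequency` new annotations have arrived.
    return total_annotations - annotations_at_last_train >= update_frequency


print(should_retrain(10, None))  # False: below the minimum
print(should_retrain(20, None))  # True: minimum reached, no model yet
print(should_retrain(60, 20))    # False: only 40 new annotations
print(should_retrain(70, 20))    # True: 50 new annotations
```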

Full Configuration

yaml

active_learning:
  enabled: true
 
  # Which schemas to use for training
  schema_names:
    - sentiment
 
  # Minimum requirements
  min_annotations_per_instance: 1
  min_instances_for_training: 20
 
  # Retraining frequency
  update_frequency: 50
 
  # How many instances to reorder
  max_instances_to_reorder: 1000
 
  # Classifier configuration
  classifier:
    type: LogisticRegression
    params:
      C: 1.0
      max_iter: 1000
 
  # Feature extraction
  vectorizer:
    type: TfidfVectorizer
    params:
      max_features: 5000
      ngram_range: [1, 2]
 
  # Model persistence
  model_persistence:
    enabled: true
    save_dir: "models/"
    max_saved_models: 5

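Supported Classifiers

Classifier	Best for	Speed
`LogisticRegression`	Binary/multiclass classification	Fast
`RandomForestClassifier`	Complex patterns	Medium
`SVC`	Small datasets	Slow
`MultinomialNB`	Text classification	Very fast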

Classifier Examples

yaml

# Logistic Regression (recommended starting point)
classifier:
  type: LogisticRegression
  params:
    C: 1.0
    max_iter: 1000
 
# Random Forest
classifier:
  type: RandomForestClassifier
  params:
    n_estimators: 100
    max_depth: 10
 
# Support Vector Classifier
classifier:
  type: SVC
  params:
    kernel: rbf
    probability: true
 
# Naive Bayes
classifier:
  type: MultinomialNB
  params:
    alpha: 1.0

Vectorizers

Vectorizer	Description
`TfidfVectorizer`	TF-IDF weighted features (recommended)
`CountVectorizer`	Simple word counts
`HashingVectorizer`	Memory-efficient option for large vocabularies

yaml

# TF-IDF (recommended)
vectorizer:
  type: TfidfVectorizer
  params:
    max_features: 5000
    ngram_range: [1, 2]
    stop_words: english
 
# Count Vectorizer
vectorizer:
  type: CountVectorizer
  params:
    max_features: 3000
    ngram_range: [1, 1]
 
# Hashing Vectorizer (for large datasets)
vectorizer:
  type: HashingVectorizer
  params:
    n_features: 10000
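
To see why a hashing vectorizer saves memory, the sketch below maps tokens straight to column indices with a hash function, so no vocabulary has to be stored. It is a simplified illustration of the hashing trick only: the function `hashed_counts` is hypothetical, it uses MD5 for determinism, and it is not how scikit-learn's `HashingVectorizer` is implemented.

```python
import hashlib


def hashed_counts(text, n_features=16):
    """Map tokens to a fixed-size count vector without storing a vocabulary."""
    vector = [0] * n_features
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # The first 4 bytes of the digest pick the column for this token.
        index = int.from_bytes(digest[:4], "big") % n_features
        vector[index] += 1
    return vector


vec = hashed_counts("great movie great cast")
print(len(vec), sum(vec))  # 16 4
```

The vector length is fixed by `n_features` regardless of how many distinct words appear. The trade-off is that unrelated tokens can collide in the same column, which is why a larger `n_features` is usually chosen for large vocabularies.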

LLM Integration

Active learning can optionally use an LLM to enhance instance selection:

yaml

active_learning:
  enabled: true
  schema_names:
    - sentiment
 
  # LLM-based selection
  llm_integration:
    enabled: true
    endpoint_type: vllm
    base_url: http://localhost:8000/v1
    model: meta-llama/Llama-2-7b-chat-hf
 
    # Mock mode for testing
    mock_mode: false

Multi-Schema Support

Active learning can cycle through multiple annotation schemes:

yaml

annotation_schemes:
  - annotation_type: radio
    name: sentiment
    labels: [Positive, Negative, Neutral]
 
  - annotation_type: radio
    name: topic
    labels: [Politics, Sports, Tech, Entertainment]
 
active_learning:
  enabled: true
  schema_names:
    - sentiment
    - topic
 
  # Schema-specific settings
  schema_config:
    sentiment:
      min_instances_for_training: 30
      update_frequency: 50
    topic:
      min_instances_for_training: 50
      update_frequency: 100
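
A plausible reading of `schema_config` is that per-schema values override the top-level `active_learning` settings. The sketch below illustrates that merge under this assumption. The function `resolve_schema_settings` is hypothetical, and the top-level defaults in the dictionary are added for illustration only.

```python
def resolve_schema_settings(active_learning, schema_name):
    """Merge top-level defaults with (assumed) per-schema overrides."""
    defaults = {
        key: value
        for key, value in active_learning.items()
        if key not in ("schema_names", "schema_config")
    }
    overrides = active_learning.get("schema_config", {}).get(schema_name, {})
    # Later keys win, so schema-specific values replace the defaults.
    return {**defaults, **overrides}


# Mirrors the YAML above, plus illustrative top-level defaults.
active_learning = {
    "enabled": True,
    "schema_names": ["sentiment", "topic"],
    "min_instances_for_training": 20,
    "update_frequency": 50,
    "max_instances_to_reorder": 1000,
    "schema_config": {
        "sentiment": {"min_instances_for_training": 30, "update_frequency": 50},
        "topic": {"min_instances_for_training": 50, "update_frequency": 100},
    },
}

topic_settings = resolve_schema_settings(active_learning, "topic")
print(topic_settings["min_instances_for_training"])  # 50
print(topic_settings["max_instances_to_reorder"])    # 1000
```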

Model Persistence

Save trained models and reload them across server restarts:

yaml

active_learning:
  enabled: true
  schema_names:
    - sentiment
 
  model_persistence:
    enabled: true
    save_dir: "models/"
    max_saved_models: 5  # Keep last 5 models
 
    # Save to database instead of files
    use_database: false
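
The effect of `max_saved_models` can be pictured as a simple rotation: once the limit is exceeded, the oldest files are removed. The sketch below is an illustration under assumptions, not Potato's code. The function `prune_saved_models`, the `.pkl` extension, and the file names are hypothetical, and files are assumed to sort chronologically by name.

```python
import tempfile
from pathlib import Path


def prune_saved_models(save_dir, max_saved_models):
    """Delete the oldest model files so at most `max_saved_models` remain."""
    files = sorted(Path(save_dir).glob("*.pkl"), key=lambda p: p.name)
    # Everything except the newest `max_saved_models` files is excess.
    excess = files[:-max_saved_models] if max_saved_models > 0 else files
    for old in excess:
        old.unlink()
    return [p.name for p in files if p.exists()]


with tempfile.TemporaryDirectory() as tmp:
    # Create seven placeholder model files, oldest first.
    for i in range(7):
        (Path(tmp) / f"sentiment_{i:04d}.pkl").touch()
    kept = prune_saved_models(tmp, max_saved_models=5)

print(kept)  # the five newest: sentiment_0002.pkl ... sentiment_0006.pkl
```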

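Monitoring Progress

The admin dashboard tracks active learning metrics:

Current model accuracy
Number of training cycles
Uncertainty distribution
Remaining instances
Retraining history

Access it at /admin using your admin API key.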

Best Practices

1. Start with Random Sampling

Collect an initial set of annotations before active learning takes effect:

yaml

active_learning:
  enabled: true
  min_instances_for_training: 50  # Wait for 50 annotations

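2. Choose an Appropriate Classifier

LogisticRegression: fast, a good default for most tasks
RandomForestClassifier: suited to complex patterns, but slower
MultinomialNB: very fast, suited to simple text classification

3. Monitor Class Distribution

Active learning can lead to class imbalance. Monitor the distribution in the admin dashboard and consider stratified sampling.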
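
If you export your annotations, a quick check like the one below can flag imbalance outside the dashboard. It is a minimal sketch: the label list is made up, and `class_distribution` and `imbalance_ratio` are hypothetical helpers, not part of Potato.

```python
from collections import Counter


def class_distribution(labels):
    """Return each label's share of all annotations."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Ratio of the most frequent to the least frequent label count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Illustrative annotations: 6 Positive, 3 Negative, 1 Neutral.
labels = ["Positive"] * 6 + ["Negative"] * 3 + ["Neutral"] * 1

distribution = class_distribution(labels)
ratio = imbalance_ratio(labels)
print(distribution)  # {'Positive': 0.6, 'Negative': 0.3, 'Neutral': 0.1}
print(ratio)         # 6.0
```

A ratio far above 1 suggests the queue is over-serving one class and that some stratified or random sampling may help.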

4. Set a Reasonable Retraining Frequency

Retraining too often wastes resources:

yaml

update_frequency: 100  # Retrain every 100 annotations

5. Enable Model Persistence

Save models to avoid retraining from scratch after a restart:

yaml

model_persistence:
  enabled: true
  save_dir: "models/"

Example: Complete Configuration

yaml

task_name: "Sentiment Analysis with Active Learning"
task_dir: "."
port: 8000
 
data_files:
  - "data/reviews.json"
 
item_properties:
  id_key: id
  text_key: text
 
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "What is the sentiment?"
    labels:
      - Positive
      - Negative
      - Neutral
 
active_learning:
  enabled: true
  schema_names:
    - sentiment
 
  min_annotations_per_instance: 1
  min_instances_for_training: 30
  update_frequency: 50
  max_instances_to_reorder: 500
 
  classifier:
    type: LogisticRegression
    params:
      C: 1.0
      max_iter: 1000
 
  vectorizer:
    type: TfidfVectorizer
    params:
      max_features: 3000
      ngram_range: [1, 2]
 
  model_persistence:
    enabled: true
    save_dir: "models/"
    max_saved_models: 3
 
output_annotation_dir: "output/"
output_annotation_format: "json"
allow_all_users: true

Combining with AI Support

Use active learning together with LLM assistance:

yaml

active_learning:
  enabled: true
  schema_names:
    - sentiment
  min_instances_for_training: 30
 
ai_support:
  enabled: true
  endpoint_type: openai
 
  ai_config:
    model: gpt-4
    api_key: ${OPENAI_API_KEY}
 
  features:
    label_suggestions:
      enabled: true

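This combination prioritizes uncertain instances while giving annotators AI suggestions to help them label.

Troubleshooting

Training Fails

Make sure there are enough annotations (min_instances_for_training)
Check the class distribution - examples are needed for every class
Verify that the data format matches the annotation scheme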
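
The three checks above can be run against exported labels before digging into server logs. The sketch below is a hypothetical helper, not a Potato API. It assumes you have a flat list of labels for one scheme plus the scheme's declared label set.

```python
def training_problems(labels, expected_labels, min_instances_for_training):
    """Return human-readable reasons why training might fail (empty if none)."""
    problems = []
    # Check 1: enough annotations collected.
    if len(labels) < min_instances_for_training:
        problems.append(
            f"only {len(labels)} annotations, need {min_instances_for_training}"
        )
    # Check 2: every declared class has at least one example.
    missing = sorted(set(expected_labels) - set(labels))
    if missing:
        problems.append(f"no examples for: {', '.join(missing)}")
    # Check 3: every label in the data is declared in the scheme.
    unknown = sorted(set(labels) - set(expected_labels))
    if unknown:
        problems.append(f"labels not in the scheme: {', '.join(unknown)}")
    return problems


# Illustrative data: too few annotations, no Neutral, one mis-cased label.
labels = ["Positive"] * 12 + ["Negative"] * 9 + ["positive"]
report = training_problems(labels, ["Positive", "Negative", "Neutral"], 30)
for line in report:
    print(line)
```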

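Slow Performance

Reduce max_instances_to_reorder
Increase update_frequency
Use HashingVectorizer for large vocabularies

Model Not Updating

Check the update_frequency setting
Verify that annotations are being saved
Look for errors in the admin dashboard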
