Solo Mode

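A single annotator works with an LLM to label an entire dataset through a 12-phase intelligent workflow.

New in v2.3.0

Traditional annotation projects require multiple annotators, inter-annotator agreement calculations, adjudication rounds, and substantial coordination overhead. For many research teams, the main bottleneck is not the annotation interface but the logistics of recruiting, training, and managing a team.

Solo Mode replaces the multi-annotator paradigm with a collaboration between a single human expert and an LLM. The human provides high-quality labels on a small, carefully selected subset. The LLM learns from those labels and proposes labels for the rest, and the human reviews only the cases where the LLM is uncertain or likely to be wrong. The 12-phase workflow orchestrates this process automatically.

In internal benchmarks, Solo Mode reached over 95% agreement with a full multi-annotator pipeline while requiring only 10-15% of the human annotation effort.

12-Phase Workflow

Solo Mode proceeds through 12 phases. The system advances automatically based on configurable thresholds, and you can also trigger transitions manually from the admin dashboard.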

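Phase 1: Seed Annotation

The human annotator labels an initial seed set. Potato uses embedding-based clustering to select diverse, representative instances that maximize coverage of the data distribution.

Default seed count: 50 instances (configurable via seed_count)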
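
To make the idea of diversity-driven seed selection concrete, here is a minimal sketch. It does not reproduce Potato's clustering; it swaps in greedy farthest-point selection, a simple stand-in that also spreads picks across the embedding space. The function name `select_diverse_seeds` and the toy 2-D vectors are illustrative assumptions.

```python
import math

def select_diverse_seeds(embeddings, seed_count):
    """Greedy farthest-point selection over a list of embedding vectors.

    Returns indices of the chosen instances. Each step picks the instance
    that is farthest from everything selected so far.
    (Hypothetical helper, not part of Potato's API.)
    """
    if not embeddings or seed_count <= 0:
        return []
    selected = [0]  # arbitrary starting point
    # Distance from each instance to its nearest selected instance.
    nearest = [math.dist(e, embeddings[0]) for e in embeddings]
    target = min(seed_count, len(embeddings))
    while len(selected) < target:
        chosen = set(selected)
        candidates = [i for i in range(len(embeddings)) if i not in chosen]
        pick = max(candidates, key=lambda i: nearest[i])
        selected.append(pick)
        for i, e in enumerate(embeddings):
            nearest[i] = min(nearest[i], math.dist(e, embeddings[pick]))
    return selected

# Toy 2-D "embeddings": two tight clusters and one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
print(select_diverse_seeds(points, 3))  # [0, 4, 2]
```

With three seeds, the sketch picks one point from each region rather than two near-duplicates from the same cluster.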

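Phase 2: Initial LLM Calibration

The LLM receives the seed annotations as few-shot examples and labels a calibration batch. Potato compares the LLM's predictions against held-out seed labels to establish a baseline accuracy.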
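
The hold-out comparison can be sketched as follows. This is an illustration under assumptions, not Potato's code: `split_seed` and `baseline_accuracy` are hypothetical names, the 0.2 default mirrors the `holdout_fraction` setting, and the "LLM" is a stub that always answers Positive.

```python
import random

def split_seed(seed_labels, holdout_fraction=0.2, rng_seed=0):
    """Split seed ids into (few_shot_ids, holdout_ids)."""
    ids = sorted(seed_labels)
    random.Random(rng_seed).shuffle(ids)
    n_holdout = max(1, round(len(ids) * holdout_fraction))
    return ids[n_holdout:], ids[:n_holdout]

def baseline_accuracy(human_labels, llm_predictions, holdout_ids):
    """Fraction of held-out seed instances where the LLM matches the human."""
    correct = sum(human_labels[i] == llm_predictions[i] for i in holdout_ids)
    return correct / len(holdout_ids)

# 50 seed labels, alternating between two classes.
seed_labels = {i: ("Positive" if i % 2 else "Negative") for i in range(50)}
few_shot_ids, holdout_ids = split_seed(seed_labels)

# Stand-in for real LLM output: a "model" that always answers Positive.
llm_predictions = {i: "Positive" for i in holdout_ids}

print(len(few_shot_ids), len(holdout_ids))  # 40 10
print(baseline_accuracy(seed_labels, llm_predictions, holdout_ids))
```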

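Phase 3: Confusion Analysis

Potato identifies systematic patterns of disagreement between the human and the LLM. It builds a confusion matrix and surfaces the most common error types (for example, "the LLM labels Neutral as Positive 40% of the time").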
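
A rough sketch of how such error types could be extracted from paired labels is shown here. The helper `top_confusions` is hypothetical, and the rate is computed as a share of each human label's instances, which is one plausible reading of the "40% of the time" example.

```python
from collections import Counter

def top_confusions(human, llm, k=3):
    """Return the k most common (human_label, llm_label) mismatches.

    Each entry is (human_label, llm_label, count, rate), where rate is the
    share of that human label's instances the LLM labeled this way.
    """
    pairs = Counter((h, l) for h, l in zip(human, llm) if h != l)
    totals = Counter(human)
    return [(h, l, n, n / totals[h]) for (h, l), n in pairs.most_common(k)]

human = ["neutral"] * 5 + ["positive"] * 3 + ["negative"] * 2
llm = ["positive", "positive", "neutral", "neutral", "neutral",
       "positive", "positive", "neutral",
       "negative", "neutral"]

for h, l, n, rate in top_confusions(human, llm, k=1):
    print(f"{h} -> {l}: {n} instances ({rate:.0%})")
# neutral -> positive: 2 instances (40%)
```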

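Phase 4: Guideline Refinement

Based on the confusion analysis, Potato generates refined annotation guidelines for the LLM. The human reviews and edits these guidelines before they are applied. This is an interactive step in which the annotator can add examples, clarify edge cases, and adjust label definitions.

Phase 5: Labeling Function Generation

Inspired by the ALCHEmist framework, Potato generates programmatic labeling functions from the existing annotations. These are simple pattern-based rules (for example, "if the text contains 'excellent' and no negation word, label it Positive") that label easy instances with high precision, reserving human and LLM effort for harder cases.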

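Phase 6: Active Labeling

The human labels additional instances chosen by active learning. Potato prioritizes instances where the LLM is most uncertain, where the labeling functions disagree, or that lie far from existing training examples in embedding space.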
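
The uncertainty criterion can be illustrated with a short sketch. It assumes the LLM returns a probability per label, which the source does not state, and it uses Shannon entropy as the uncertainty score. Potato's actual scoring may differ, and `most_uncertain` is a hypothetical name.

```python
import math

def entropy(probs):
    """Shannon entropy of a {label: probability} distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def most_uncertain(predictions, batch_size=25):
    """Return the ids of the batch_size highest-entropy predictions."""
    ranked = sorted(predictions, key=lambda i: entropy(predictions[i]), reverse=True)
    return ranked[:batch_size]

predictions = {
    "r1": {"Positive": 0.98, "Neutral": 0.01, "Negative": 0.01},
    "r2": {"Positive": 0.40, "Neutral": 0.35, "Negative": 0.25},
    "r3": {"Positive": 0.10, "Neutral": 0.80, "Negative": 0.10},
}
print(most_uncertain(predictions, batch_size=2))  # ['r2', 'r3']
```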

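Phase 7: Automated Refinement Loop

The LLM relabels the full dataset using the updated guidelines and few-shot examples. Potato compares the results against all human labels and, if accuracy is below the threshold, triggers another round of confusion analysis and guideline refinement.

Phase 8: Disagreement Exploration

The human reviews every instance where the LLM and the labeling functions disagree. These tend to be the most informative and most difficult examples, so human labels on them provide the highest marginal value.

Phase 9: Edge Case Synthesis

Potato uses the LLM to generate synthetic edge cases based on the identified confusion patterns. The human labels these synthetic examples, which are then added to the LLM's few-shot context to improve performance on the hardest cases.

Phase 10: Cascaded Confidence Escalation

The LLM assigns a confidence score to every remaining unlabeled instance. Instances are escalated to the human in order of decreasing difficulty (increasing confidence). The human keeps labeling until the quality metrics stabilize.

Phase 11: Prompt Optimization

Inspired by DSPy, Potato runs automated prompt optimization using the accumulated human labels as a validation set. It tries multiple prompt variants (instruction wording, example ordering, chain-of-thought versus direct answer) and selects the best-performing prompt.

Phase 12: Final Validation

The human performs a final review of a random sample of LLM-labeled instances. If accuracy meets the threshold, the dataset is complete. If not, the system loops back to Phase 6.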
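
The pass-or-loop-back decision in the final phase might look roughly like this. The function `final_validation` and the `review_fn` callback (standing in for the human reviewer) are assumptions for illustration. The defaults echo the `sample_size`, `min_accuracy`, and `fallback_phase` settings in the configuration reference.

```python
import random

def final_validation(llm_labels, review_fn, sample_size=100,
                     min_accuracy=0.92, fallback_phase=6, rng_seed=0):
    """Review a random sample; return (accuracy, next_phase).

    next_phase is None when the dataset passes, else the phase to return to.
    """
    ids = sorted(llm_labels)
    sample = random.Random(rng_seed).sample(ids, min(sample_size, len(ids)))
    correct = sum(review_fn(i) == llm_labels[i] for i in sample)
    accuracy = correct / len(sample)
    next_phase = None if accuracy >= min_accuracy else fallback_phase
    return accuracy, next_phase

llm_labels = {i: "Positive" for i in range(500)}

# A reviewer who agrees with every LLM label: validation passes.
print(final_validation(llm_labels, lambda i: "Positive"))  # (1.0, None)

# A reviewer who disagrees with every label: loop back to Phase 6.
print(final_validation(llm_labels, lambda i: "Negative"))  # (0.0, 6)
```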

Configuration

Quick Start

A minimal Solo Mode configuration:

yaml

task_name: "Sentiment Classification"
task_dir: "."
 
data_files:
  - "data/reviews.jsonl"
 
item_properties:
  id_key: id
  text_key: text
 
solo_mode:
  enabled: true
 
  # LLM provider
  llm:
    endpoint_type: openai
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
 
  # Basic thresholds
  seed_count: 50
  accuracy_threshold: 0.92
  confidence_threshold: 0.85
 
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    labels:
      - Positive
      - Neutral
      - Negative
 
output_annotation_dir: "output/"
output_annotation_format: "jsonl"

Full Configuration Reference

yaml

solo_mode:
  enabled: true
 
  # LLM configuration
  llm:
    endpoint_type: openai        # openai, anthropic, ollama, vllm
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
    temperature: 0.1             # low temperature for consistency
    max_tokens: 256
 
  # Phase control
  phases:
    seed:
      count: 50                  # number of seed instances
      selection: diversity        # diversity, random, or stratified
      embedding_model: "all-MiniLM-L6-v2"
 
    calibration:
      batch_size: 100
      holdout_fraction: 0.2      # fraction of seed used for validation
 
    confusion_analysis:
      min_samples: 30
      significance_threshold: 0.05
 
    guideline_refinement:
      auto_suggest: true         # LLM suggests guideline edits
      require_approval: true     # human must approve changes
 
    labeling_functions:
      enabled: true
      max_functions: 20
      min_precision: 0.90        # only keep high-precision rules
      min_coverage: 0.01         # must cover at least 1% of data
 
    active_labeling:
      batch_size: 25
      strategy: uncertainty       # uncertainty, diversity, or hybrid
      max_batches: 10
 
    refinement_loop:
      max_iterations: 3
      improvement_threshold: 0.02
 
    disagreement_exploration:
      max_instances: 200
      sort_by: confidence_gap
 
    edge_case_synthesis:
      enabled: true
      count: 50
      diversity_weight: 0.3
 
    confidence_escalation:
      escalation_budget: 200     # max instances to escalate
      batch_size: 25
      stop_when_stable: true     # stop if last batch accuracy is 100%
 
    prompt_optimization:
      enabled: true
      candidates: 10             # number of prompt variants to try
      metric: f1_macro
      search_strategy: bayesian  # bayesian, grid, or random
 
    final_validation:
      sample_size: 100
      min_accuracy: 0.92
      fallback_phase: 6          # go back to Phase 6 if validation fails
 
  # Instance prioritization across phases
  prioritization:
    pools:
      - name: uncertain
        weight: 0.30
        description: "LLM confidence below threshold"
      - name: disagreement
        weight: 0.25
        description: "LLM and labeling functions disagree"
      - name: boundary
        weight: 0.20
        description: "Near decision boundary in embedding space"
      - name: novel
        weight: 0.10
        description: "Far from all existing labeled examples"
      - name: error_pattern
        weight: 0.10
        description: "Matches known confusion patterns"
      - name: random
        weight: 0.05
        description: "Random sample for calibration"

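Key Capabilities

Confusion Analysis

After each annotation round, Potato builds a confusion matrix between the human and LLM labels. The admin dashboard shows:

Per-class precision, recall, and F1 from the LLM's perspective
The most common confusion pairs (for example, "Neutral misclassified as Positive: 23 instances")
Example instances for each confusion pair
A trend chart of improvement across refinement rounds

Access the confusion analysis from the command line: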

bash

python -m potato.solo confusion --config config.yaml

Output:

text

Confusion Analysis (Round 2)
============================
Overall Accuracy: 0.87 (target: 0.92)

Top Confusion Pairs:
  neutral -> positive:  23 instances (15.3%)
  negative -> neutral:  11 instances (7.3%)
  positive -> neutral:   5 instances (3.3%)

Per-Class Performance:
  Positive:  P=0.91  R=0.94  F1=0.92
  Neutral:   P=0.78  R=0.71  F1=0.74
  Negative:  P=0.93  R=0.88  F1=0.90
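
For readers who want to see how per-class figures like those above are derived, here is a small sketch that treats the human labels as ground truth and the LLM labels as predictions. `per_class_metrics` is a hypothetical helper, and the five-item dataset is made up for illustration.

```python
def per_class_metrics(human, llm):
    """Return {label: (precision, recall, f1)} with human labels as truth."""
    metrics = {}
    pairs = list(zip(human, llm))
    for label in sorted(set(human) | set(llm)):
        tp = sum(h == label and l == label for h, l in pairs)
        fp = sum(h != label and l == label for h, l in pairs)
        fn = sum(h == label and l != label for h, l in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

human = ["Positive", "Positive", "Neutral", "Neutral", "Negative"]
llm = ["Positive", "Positive", "Positive", "Neutral", "Negative"]

for label, (p, r, f1) in per_class_metrics(human, llm).items():
    print(f"{label}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

In this toy data the single Neutral instance labeled Positive lowers Positive precision and Neutral recall, while Negative is unaffected.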

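Automated Refinement Loop

The refinement loop iterates between LLM labeling, confusion analysis, and guideline updates. In each iteration:

The LLM labels the full dataset using the current guidelines
Potato compares the results against all available human labels
If accuracy is below the threshold, confusion analysis runs
The LLM proposes guideline edits based on the error patterns
The human reviews and approves the edits
The loop repeats (up to max_iterations times)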

yaml

solo_mode:
  llm:
    endpoint_type: anthropic
    model: "claude-sonnet-4-20250514"
    api_key: ${ANTHROPIC_API_KEY}
 
  phases:
    refinement_loop:
      max_iterations: 3
      improvement_threshold: 0.02    # stop if improvement is less than 2%
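
The stopping logic implied by `max_iterations` and `improvement_threshold` can be sketched like this. It is an assumption-laden outline, not Potato's implementation: `evaluate` and `refine` are hypothetical callbacks standing in for "LLM labels the data and is scored against human labels" and "human-approved guideline edit".

```python
def run_refinement_loop(evaluate, refine, guidelines, accuracy_threshold=0.92,
                        max_iterations=3, improvement_threshold=0.02):
    """Iterate evaluate -> refine until a stopping condition is met.

    Stops when accuracy reaches the threshold, when the gain over the
    previous iteration is below improvement_threshold, or after
    max_iterations evaluations. Returns (guidelines, accuracy_history).
    """
    history = []
    for _ in range(max_iterations):
        accuracy = evaluate(guidelines)
        history.append(accuracy)
        if accuracy >= accuracy_threshold:
            break  # target reached
        if len(history) > 1 and accuracy - history[-2] < improvement_threshold:
            break  # gains have stalled
        guidelines = refine(guidelines)
    return guidelines, history

# Simulation: guidelines are a version number; accuracy is looked up per version.
simulated_accuracy = [0.80, 0.87, 0.88, 0.95]
result = run_refinement_loop(
    evaluate=lambda version: simulated_accuracy[version],
    refine=lambda version: version + 1,
    guidelines=0,
)
print(result)  # (2, [0.8, 0.87, 0.88])
```

In the simulated run the loop stops at the third evaluation because the gain from 0.87 to 0.88 is smaller than the 2% improvement threshold.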

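Labeling Functions (Inspired by ALCHEmist)

Potato generates lightweight labeling functions from patterns observed in the human annotations. These are not LLM calls; they are fast, deterministic rules.

Example generated labeling functions: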

python

# Auto-generated labeling function 1
# Precision: 0.96, Coverage: 0.08
def lf_strong_positive_words(text):
    positive = {"excellent", "amazing", "fantastic", "outstanding", "perfect"}
    if any(w in text.lower() for w in positive):
        if not any(neg in text.lower() for neg in {"not", "never", "no"}):
            return "Positive"
    return None  # abstain
 
# Auto-generated labeling function 2
# Precision: 0.93, Coverage: 0.05
def lf_explicit_negative(text):
    negative = {"terrible", "awful", "horrible", "worst", "disgusting"}
    if any(w in text.lower() for w in negative):
        return "Negative"
    return None

Configure labeling function behavior:

yaml

solo_mode:
  phases:
    labeling_functions:
      enabled: true
      max_functions: 20
      min_precision: 0.90
      min_coverage: 0.01
      types:
        - keyword_match
        - regex_pattern
        - length_threshold
        - embedding_cluster
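
To show what the `min_precision` and `min_coverage` filters could mean in practice, here is a hedged sketch. It assumes precision is measured on human-labeled instances where the function fires and coverage is the share of all instances where it fires. `evaluate_lf`, `keep_lf`, and the inline rule are illustrative, not Potato internals.

```python
def evaluate_lf(lf, texts, human_labels):
    """Return (precision, coverage) for one labeling function.

    texts: {id: text} for the whole dataset.
    human_labels: {id: label} for the human-labeled subset.
    """
    votes = {i: lf(t) for i, t in texts.items()}
    fired = [i for i, v in votes.items() if v is not None]
    coverage = len(fired) / len(texts)
    judged = [i for i in fired if i in human_labels]
    if not judged:
        return 0.0, coverage  # no evidence, so treat precision as unproven
    precision = sum(votes[i] == human_labels[i] for i in judged) / len(judged)
    return precision, coverage

def keep_lf(lf, texts, human_labels, min_precision=0.90, min_coverage=0.01):
    """Apply the two thresholds from the configuration."""
    precision, coverage = evaluate_lf(lf, texts, human_labels)
    return precision >= min_precision and coverage >= min_coverage

def lf_excellent(text):
    return "Positive" if "excellent" in text.lower() else None

texts = {1: "An excellent phone", 2: "Terrible battery",
         3: "It is fine", 4: "Excellent value"}
human_labels = {1: "Positive", 2: "Negative"}

print(evaluate_lf(lf_excellent, texts, human_labels))  # (1.0, 0.5)
print(keep_lf(lf_excellent, texts, human_labels))      # True
```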

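Disagreement Explorer

The disagreement explorer surfaces instances where different signals conflict. For each instance, the annotator sees:

The LLM's predicted label and confidence
Labeling function votes (if any)
The nearest labeled neighbors in embedding space
The original text/content

This is the highest-value annotation activity: every label resolves a genuine ambiguity.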

yaml

solo_mode:
  phases:
    disagreement_exploration:
      max_instances: 200
      sort_by: confidence_gap     # or "lf_disagreement" or "random"
      show_llm_reasoning: true    # display LLM's chain-of-thought
      show_nearest_neighbors: 3   # show 3 nearest labeled examples
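
One simple way to detect conflicting signals is to compare the LLM label with the majority vote of the labeling functions that fired. The sketch below assumes that definition of disagreement and does not model the `sort_by` options, whose exact semantics are not described here. `find_disagreements` is a hypothetical name.

```python
from collections import Counter

def find_disagreements(llm_labels, lf_votes):
    """Return (id, llm_label, lf_majority) where the two signals differ.

    lf_votes maps id -> list of labeling-function outputs; None means abstain.
    Instances where every function abstains are skipped.
    """
    disagreements = []
    for instance_id, llm_label in llm_labels.items():
        votes = [v for v in lf_votes.get(instance_id, []) if v is not None]
        if not votes:
            continue
        majority, _ = Counter(votes).most_common(1)[0]
        if majority != llm_label:
            disagreements.append((instance_id, llm_label, majority))
    return disagreements

llm_labels = {"a": "Positive", "b": "Neutral", "c": "Negative"}
lf_votes = {
    "a": ["Positive", None],
    "b": ["Positive", "Positive", "Neutral"],
    "c": [None],
}
print(find_disagreements(llm_labels, lf_votes))  # [('b', 'Neutral', 'Positive')]
```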

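Cascaded Confidence Escalation

Once most of the dataset has been labeled by the LLM, Potato sorts all LLM-labeled instances by confidence and escalates the lowest-confidence ones to the human. This continues in batches until quality stabilizes.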

yaml

solo_mode:
  phases:
    confidence_escalation:
      escalation_budget: 200
      batch_size: 25
      stop_when_stable: true
      stability_window: 3        # stop if last 3 batches are all correct
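
The batch-and-stop behavior can be sketched as follows, assuming "stable" means the human agreed with every LLM label in the last `stability_window` batches, as the config comment suggests. The function `escalate` and the `review_fn` callback are hypothetical.

```python
def escalate(confidences, llm_labels, review_fn, escalation_budget=200,
             batch_size=25, stability_window=3):
    """Send low-confidence instances to the human in batches.

    Stops when the budget is exhausted or when stability_window consecutive
    batches contain no corrections. Returns {id: human_label}.
    """
    order = sorted(confidences, key=confidences.get)[:escalation_budget]
    reviewed = {}
    clean_streak = 0
    for start in range(0, len(order), batch_size):
        batch_clean = True
        for instance_id in order[start:start + batch_size]:
            reviewed[instance_id] = review_fn(instance_id)
            if reviewed[instance_id] != llm_labels[instance_id]:
                batch_clean = False
        clean_streak = clean_streak + 1 if batch_clean else 0
        if clean_streak >= stability_window:
            break
    return reviewed

# Toy data: 20 instances; the LLM is wrong only on the 4 least confident ones.
confidences = {i: i / 20 for i in range(20)}
llm_labels = {i: "Positive" for i in range(20)}
truth = {i: ("Negative" if i < 4 else "Positive") for i in range(20)}

reviewed = escalate(confidences, llm_labels, truth.get, batch_size=2)
print(len(reviewed))  # 10
```

Here two batches contain corrections, the next three are clean, and escalation stops after 10 of the 20 instances have been reviewed.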

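Multi-Signal Instance Prioritization

In every phase that involves human labeling, Potato uses a weighted pool system to select the most informative instances. Six pools feed a single unified priority queue: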

yaml

solo_mode:
  prioritization:
    pools:
      - name: uncertain
        weight: 0.30
      - name: disagreement
        weight: 0.25
      - name: boundary
        weight: 0.20
      - name: novel
        weight: 0.10
      - name: error_pattern
        weight: 0.10
      - name: random
        weight: 0.05

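uncertain: instances where LLM confidence is below confidence_threshold
disagreement: instances where the LLM and the labeling functions produce different labels
boundary: instances near the decision boundary in embedding space
novel: instances far from any existing labeled example
error_pattern: instances that match known confusion patterns from earlier rounds
random: a small random sample used to maintain calibration and uncover blind spots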
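
As an illustration of how weights could translate into a batch, the sketch below gives each pool a number of slots proportional to its weight and skips instances already taken by a higher-weight pool. This quota scheme is an assumption; the source does not specify how Potato merges the pools. `allocate_batch` is a hypothetical helper.

```python
from collections import Counter

def allocate_batch(pool_weights, pool_members, batch_size):
    """Fill a batch with (pool_name, instance_id) pairs by weighted quota."""
    batch, seen = [], set()
    # Visit pools from highest to lowest weight.
    for name, weight in sorted(pool_weights.items(), key=lambda kv: -kv[1]):
        quota = round(weight * batch_size)
        for instance_id in pool_members.get(name, []):
            if quota == 0 or len(batch) >= batch_size:
                break
            if instance_id in seen:
                continue  # already contributed by another pool
            seen.add(instance_id)
            batch.append((name, instance_id))
            quota -= 1
    return batch

pool_weights = {"uncertain": 0.30, "disagreement": 0.25, "boundary": 0.20,
                "novel": 0.10, "error_pattern": 0.10, "random": 0.05}
# Made-up candidate lists: ten distinct ids per pool.
pool_members = {name: [f"{name}-{k}" for k in range(10)] for name in pool_weights}

batch = allocate_batch(pool_weights, pool_members, batch_size=20)
print(Counter(name for name, _ in batch))
```

With a batch of 20, the weights above yield 6, 5, 4, 2, 2, and 1 slots respectively.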

Edge Case Synthesis

Potato uses the LLM to generate synthetic examples that target known weaknesses:

yaml

solo_mode:
  phases:
    edge_case_synthesis:
      enabled: true
      count: 50
      diversity_weight: 0.3
      confusion_pairs:            # focus on these error types
        - ["neutral", "positive"]
        - ["negative", "neutral"]

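The LLM generates examples that are ambiguous between the specified label pairs. The human labels them, and those labels are added to the few-shot context for subsequent LLM labeling rounds.

Prompt Optimization (Inspired by DSPy)

In Phase 11, Potato runs automated prompt optimization to find the best instruction format for the LLM: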

yaml

solo_mode:
  phases:
    prompt_optimization:
      enabled: true
      candidates: 10
      metric: f1_macro
      search_strategy: bayesian
      variations:
        - instruction_style      # formal vs. conversational
        - example_ordering       # random, by-class, by-difficulty
        - reasoning_mode         # direct, chain-of-thought, self-consistency
        - example_count          # 3, 5, 10, 15 few-shot examples
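
To give a feel for the size of the search space, the sketch below enumerates the variants listed in the comments above and runs a plain random search over them. It uses the `random` strategy rather than `bayesian`, and `toy_score` is a made-up placeholder for macro-F1 on the human-labeled validation set. None of these names come from Potato.

```python
import itertools
import random

VARIATIONS = {
    "instruction_style": ["formal", "conversational"],
    "example_ordering": ["random", "by-class", "by-difficulty"],
    "reasoning_mode": ["direct", "chain-of-thought", "self-consistency"],
    "example_count": [3, 5, 10, 15],
}

def build_grid(variations):
    """Every combination of the variation values, as a list of dicts."""
    keys = list(variations)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*variations.values())]

def random_search(score_fn, variations, candidates=10, rng_seed=0):
    """Score a random subset of the grid; return (best_variant, tried)."""
    grid = build_grid(variations)
    tried = random.Random(rng_seed).sample(grid, min(candidates, len(grid)))
    return max(tried, key=score_fn), tried

def toy_score(variant):
    # Placeholder scorer: NOT a real evaluation, just a deterministic stand-in.
    bonus = 0.05 if variant["reasoning_mode"] == "chain-of-thought" else 0.0
    return 0.80 + 0.005 * variant["example_count"] + bonus

best, tried = random_search(toy_score, VARIATIONS)
print(len(build_grid(VARIATIONS)), len(tried))  # 72 10
print(best)
```

Ten candidates cover only a fraction of the 72 combinations, which is one reason a guided strategy can be preferable to exhaustive search when each evaluation costs LLM calls.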

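Monitoring Progress

The admin dashboard shows Solo Mode progress in real time:

The current phase and progress within each phase
Human labels completed versus the total budget
LLM accuracy over time (per round)
Labeling function coverage and precision
A histogram of the confidence distribution
Estimated time to completion

Access it from the command line: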

bash

python -m potato.solo status --config config.yaml

text

Solo Mode Status
================
Current Phase: 6 (Active Labeling) - Batch 3/10
Human Labels: 142 / ~300 estimated total
LLM Accuracy: 0.89 (target: 0.92)
LF Coverage: 0.23 (labeling functions cover 23% of data)
Dataset Size: 10,000 instances
  - Human labeled: 142
  - LF labeled: 2,300
  - LLM labeled: 7,558
  - Unlabeled: 0

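When to Use Solo Mode vs. Traditional Multi-Annotator

Use Solo Mode when:

You have a domain expert who can provide high-quality labels
Budget or logistics do not allow hiring multiple annotators
The task has clear, well-defined categories
You need to label a large dataset (1,000+ instances)
Speed matters more than measuring inter-annotator agreement

Use traditional multi-annotator when:

You need inter-annotator agreement statistics for publication
The task is highly subjective (for example, offensiveness or humor)
You need to study patterns of annotator disagreement
Regulations require multiple independent annotators
The label space is complex or still evolving (annotation guidelines are still under development)

Hybrid approach: Use Solo Mode for the initial bulk labeling, then assign a second annotator to label a random 10-20% sample to compute agreement statistics. This gives you the efficiency of Solo Mode together with the quality assurance of multi-annotator verification.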

yaml

solo_mode:
  enabled: true
  # ... solo mode config ...
 
  # Hybrid: assign verification sample to second annotator
  verification:
    enabled: true
    sample_fraction: 0.15
    annotator: "reviewer_1"
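
Once the second annotator has labeled the verification sample, an agreement statistic such as Cohen's kappa can be computed from the two label lists. The sketch below is a plain implementation of the standard formula; the label lists are invented for illustration, and how Potato exports the verification labels is not covered here.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same instances."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)

solo = ["Positive", "Positive", "Negative", "Negative", "Positive",
        "Negative", "Positive", "Negative", "Positive", "Positive"]
reviewer = ["Positive", "Positive", "Negative", "Negative", "Positive",
            "Negative", "Positive", "Positive", "Positive", "Positive"]

print(round(cohens_kappa(solo, reviewer), 3))  # 0.783
```

The two lists agree on 9 of 10 items, but kappa is lower than 0.9 because part of that agreement would be expected by chance.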
