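Training Phase

Use practice questions to train and screen annotators before the main task.

Potato 2.0 includes an optional training phase that qualifies annotators before they begin the main annotation task. Annotators answer practice questions with known correct answers and receive immediate feedback on their performance.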

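Use Cases

- Ensure annotators understand the task
- Filter out low-quality annotators
- Provide guided practice before real annotation
- Collect baseline quality metrics
- Teach the annotation guidelines through examples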

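How It Works

1. Annotators complete a set of training questions
2. They receive immediate feedback after each answer
3. Progress is tracked against the passing criteria
4. Only annotators who pass proceed to the main task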

Configuration

Basic Setup

yaml

phases:
  training:
    enabled: true
    data_file: "data/training_data.json"
    schema_name: sentiment  # Which annotation scheme to train
 
    # Passing criteria
    passing_criteria:
      min_correct: 8  # Must get at least 8 correct
      total_questions: 10

Full Configuration

yaml

phases:
  training:
    enabled: true
    data_file: "data/training_data.json"
    schema_name: sentiment
 
    passing_criteria:
      # Different criteria options (choose one or combine)
      min_correct: 8
      require_all_correct: false
      max_mistakes: 3
      max_mistakes_per_question: 2
 
    # Allow retries
    retries:
      enabled: true
      max_retries: 3
 
    # Show explanations for incorrect answers
    show_explanations: true
 
    # Randomize question order
    randomize: true

Passing Criteria

You can set several kinds of passing criteria for the training phase:

Minimum Correct

yaml

passing_criteria:
  min_correct: 8
  total_questions: 10

Annotators must answer at least 8 of the 10 questions correctly.

Require All Correct

yaml

passing_criteria:
  require_all_correct: true

Annotators must answer every question correctly to pass.

Maximum Mistakes

yaml

passing_criteria:
  max_mistakes: 3

Annotators are disqualified after 3 total mistakes.

Maximum Mistakes per Question

yaml

passing_criteria:
  max_mistakes_per_question: 2

Annotators are disqualified after making 2 mistakes on any single question.

Combined Criteria

yaml

passing_criteria:
  min_correct: 8
  max_mistakes_per_question: 3

Annotators must answer at least 8 questions correctly and are disqualified after 3 mistakes on any single question.
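
The criteria in this section can be read as a set of checks over an annotator's answer history. The sketch below is one possible interpretation, not Potato's implementation: the `passes` function and its `history` argument are hypothetical, the keyword names simply mirror the config keys, and it assumes that reaching a mistake limit disqualifies the annotator and that a question counts as correct when its last attempt was right.

```python
from typing import Dict, List, Optional


def passes(history: Dict[str, List[bool]],
           *,
           min_correct: Optional[int] = None,
           require_all_correct: bool = False,
           max_mistakes: Optional[int] = None,
           max_mistakes_per_question: Optional[int] = None) -> bool:
    """Decide whether a finished training run meets the passing criteria.

    `history` maps a question id to the annotator's attempts in order
    (True = correct). Hypothetical helper for illustration only.
    """
    mistakes = {qid: attempts.count(False) for qid, attempts in history.items()}
    # Assumption: a question is "correct" if its final attempt was right.
    correct = sum(1 for attempts in history.values() if attempts and attempts[-1])

    # Assumption: reaching a limit (not exceeding it) disqualifies.
    if max_mistakes is not None and sum(mistakes.values()) >= max_mistakes:
        return False
    if (max_mistakes_per_question is not None
            and max(mistakes.values(), default=0) >= max_mistakes_per_question):
        return False
    if require_all_correct and correct < len(history):
        return False
    if min_correct is not None and correct < min_correct:
        return False
    return True


# Ten questions: eight right on the first try, two answered wrong once.
run = {f"train_{i}": [True] for i in range(1, 9)}
run.update({"train_9": [False], "train_10": [False]})

print(passes(run, min_correct=8))                               # True
print(passes(run, min_correct=8, max_mistakes_per_question=3))  # True
print(passes(run, require_all_correct=True))                    # False
```

Omitted criteria are skipped, so the same function covers each single-criterion case as well as combinations.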

Training Data Format

Training data must include the correct answers and may include an explanation:

json

[
  {
    "id": "train_1",
    "text": "I absolutely love this product! Best purchase ever!",
    "correct_answers": {
      "sentiment": "Positive"
    },
    "explanation": "This text expresses strong positive sentiment with words like 'love' and 'best'."
  },
  {
    "id": "train_2",
    "text": "This is the worst service I've ever experienced.",
    "correct_answers": {
      "sentiment": "Negative"
    },
    "explanation": "The words 'worst' and the overall complaint indicate negative sentiment."
  },
  {
    "id": "train_3",
    "text": "The package arrived on time.",
    "correct_answers": {
      "sentiment": "Neutral"
    },
    "explanation": "This is a factual statement without emotional indicators."
  }
]
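
Before starting the server it can help to check a training file against this format. The following is a minimal sketch rather than anything Potato ships: `validate_training_items` is a hypothetical helper, and treating `id`, `text`, and `correct_answers` as required keys is an assumption based on the example above.

```python
import json
from typing import Any, Dict, List

# Assumption: these keys are required; "explanation" is optional.
REQUIRED_KEYS = ("id", "text", "correct_answers")


def validate_training_items(items: List[Dict[str, Any]], schema_name: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        label = item.get("id", f"item #{index}")
        for key in REQUIRED_KEYS:
            if key not in item:
                problems.append(f"{label}: missing '{key}'")
        # Every item needs an answer for the schema being trained.
        if schema_name not in item.get("correct_answers", {}):
            problems.append(f"{label}: no correct answer for '{schema_name}'")
        if "id" in item:
            if item["id"] in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(item["id"])
    return problems


sample = json.loads("""
[
  {"id": "train_1", "text": "I absolutely love this product!",
   "correct_answers": {"sentiment": "Positive"},
   "explanation": "Strong positive wording."},
  {"id": "train_2", "text": "The package arrived on time.",
   "correct_answers": {"topic": "Shipping"}}
]
""")
print(validate_training_items(sample, "sentiment"))
# ["train_2: no correct answer for 'sentiment'"]
```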

Multi-Schema Training

For tasks with multiple annotation schemes:

json

{
  "id": "train_1",
  "text": "Apple announced new iPhone features yesterday.",
  "correct_answers": {
    "sentiment": "Neutral",
    "topic": "Technology"
  },
  "explanation": {
    "sentiment": "This is a factual news statement.",
    "topic": "The text discusses Apple and iPhone, which are tech topics."
  }
}
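
With several schemes per item, each scheme is naturally graded on its own, and the `explanation` field may be either a single string or a mapping keyed by scheme name. The sketch below illustrates that idea under those assumptions; `grade` is a hypothetical function, not part of Potato.

```python
from typing import Any, Dict


def grade(item: Dict[str, Any], submitted: Dict[str, str]) -> Dict[str, Dict[str, Any]]:
    """Compare submitted labels with `correct_answers`, one scheme at a time."""
    explanation = item.get("explanation", "")
    results = {}
    for schema, expected in item["correct_answers"].items():
        # A dict explanation is keyed by scheme; a plain string applies to all.
        if isinstance(explanation, dict):
            note = explanation.get(schema, "")
        else:
            note = explanation
        correct = submitted.get(schema) == expected
        results[schema] = {
            "correct": correct,
            "expected": expected,
            # Only surface the explanation when the answer was wrong.
            "explanation": "" if correct else note,
        }
    return results


item = {
    "id": "train_1",
    "text": "Apple announced new iPhone features yesterday.",
    "correct_answers": {"sentiment": "Neutral", "topic": "Technology"},
    "explanation": {
        "sentiment": "This is a factual news statement.",
        "topic": "The text discusses Apple and iPhone, which are tech topics.",
    },
}
feedback = grade(item, {"sentiment": "Neutral", "topic": "Politics"})
print(feedback["sentiment"]["correct"], feedback["topic"]["correct"])  # True False
```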

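User Experience

Training Flow

1. The user sees a "Training Phase" indicator
2. A question is displayed with the annotation form
3. The user submits an answer
4. Feedback is shown immediately:
   - Correct: green checkmark, then move on to the next question
   - Incorrect: red X, the explanation is shown, with an option to retry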

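Feedback Display

When an annotator answers incorrectly:

- The correct answer is highlighted
- The provided explanation is shown
- A retry button appears (if retries are enabled)
- Progress toward the passing criteria is displayed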

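Admin Monitoring

Track training performance in the admin dashboard:

- Completion rate
- Average number of correct answers
- Pass/fail rate
- Time spent on training
- Per-question accuracy

Access these through the admin API endpoints: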

text

GET /api/admin/training/stats
GET /api/admin/training/user/{user_id}

Example: Sentiment Analysis Training

yaml

task_name: "Sentiment Analysis"
task_dir: "."
port: 8000
 
# Main annotation data
data_files:
  - "data/reviews.json"
 
item_properties:
  id_key: id
  text_key: text
 
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "What is the sentiment of this review?"
    labels:
      - Positive
      - Negative
      - Neutral
 
# Training phase configuration
phases:
  training:
    enabled: true
    data_file: "data/training_questions.json"
    schema_name: sentiment
 
    passing_criteria:
      min_correct: 8
      total_questions: 10
      max_mistakes_per_question: 2
 
    retries:
      enabled: true
      max_retries: 3
 
    show_explanations: true
    randomize: true
 
output_annotation_dir: "output/"
output_annotation_format: "json"
allow_all_users: true

Example: NER Training

yaml

annotation_schemes:
  - annotation_type: span
    name: entities
    description: "Highlight named entities"
    labels:
      - Person
      - Organization
      - Location
      - Date
 
phases:
  training:
    enabled: true
    data_file: "data/ner_training.json"
    schema_name: entities
 
    passing_criteria:
      min_correct: 7
      total_questions: 10
 
    show_explanations: true

Training data for span annotation:

json

{
  "id": "train_1",
  "text": "Tim Cook announced that Apple will open a new store in New York on March 15.",
  "correct_answers": {
    "entities": [
      {"start": 0, "end": 8, "label": "Person"},
      {"start": 24, "end": 29, "label": "Organization"},
      {"start": 55, "end": 63, "label": "Location"},
      {"start": 67, "end": 75, "label": "Date"}
    ]
  },
  "explanation": "Tim Cook is a Person, Apple is an Organization, New York is a Location, and March 15 is a Date."
}
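
Character offsets are easy to get wrong by one when counted by hand, so it can be safer to compute them. The sketch below assumes offsets are zero-based with an exclusive `end`, which is what the `Person` span (0 to 8 for "Tim Cook") implies; `span_for` is a hypothetical helper, and it uses the first occurrence of each phrase.

```python
from typing import Any, Dict


def span_for(text: str, phrase: str, label: str) -> Dict[str, Any]:
    """Locate the first occurrence of `phrase` and return an end-exclusive span."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return {"start": start, "end": start + len(phrase), "label": label}


text = "Tim Cook announced that Apple will open a new store in New York on March 15."
entities = [
    span_for(text, "Tim Cook", "Person"),
    span_for(text, "Apple", "Organization"),
    span_for(text, "New York", "Location"),
    span_for(text, "March 15", "Date"),
]

# Round-trip check: slicing with each span must give back the phrase.
for span in entities:
    print(span, repr(text[span["start"]:span["end"]]))
```

Running the same slice over hand-written spans is a quick way to catch offsets that start on a space or cut off the last character.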

Best Practices

1. Start Simple

Begin with clear-cut examples before introducing edge cases:

json

[
  {"text": "I love this!", "correct_answers": {"sentiment": "Positive"}},
  {"text": "I hate this!", "correct_answers": {"sentiment": "Negative"}},
  {"text": "It arrived yesterday.", "correct_answers": {"sentiment": "Neutral"}}
]

2. Cover All Labels

Make sure training includes an example of every possible label:

json

[
  {"correct_answers": {"sentiment": "Positive"}},
  {"correct_answers": {"sentiment": "Negative"}},
  {"correct_answers": {"sentiment": "Neutral"}}
]
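
Coverage can be checked mechanically by counting how often each label appears as a correct answer. This is a small sketch with a hypothetical helper, `label_coverage`; the label list is assumed to be copied from the scheme's `labels` in the config.

```python
from collections import Counter
from typing import Any, Dict, List


def label_coverage(items: List[Dict[str, Any]], schema_name: str,
                   labels: List[str]) -> Dict[str, int]:
    """Count training items per label for one annotation scheme."""
    counts = Counter(
        item["correct_answers"].get(schema_name)
        for item in items
        if "correct_answers" in item
    )
    # Report every configured label, including those with no examples.
    return {label: counts.get(label, 0) for label in labels}


items = [
    {"text": "I love this!", "correct_answers": {"sentiment": "Positive"}},
    {"text": "I hate this!", "correct_answers": {"sentiment": "Negative"}},
    {"text": "Awful.", "correct_answers": {"sentiment": "Negative"}},
]
coverage = label_coverage(items, "sentiment", ["Positive", "Negative", "Neutral"])
missing = [label for label, count in coverage.items() if count == 0]
print(coverage)  # {'Positive': 1, 'Negative': 2, 'Neutral': 0}
print(missing)   # ['Neutral']
```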

3. Write Clear Explanations

Explanations should teach the annotation guidelines:

json

{
  "explanation": "While this text mentions a problem, the overall tone is constructive and the reviewer expresses satisfaction with the resolution. This makes it Positive rather than Negative."
}

4. Set Reasonable Criteria

Do not require perfection unless it is necessary:

yaml

# Too strict - may lose good annotators
passing_criteria:
  require_all_correct: true
 
# Better - allows for learning
passing_criteria:
  min_correct: 8
  total_questions: 10
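
A rough calculation shows how much stricter `require_all_correct` is. The sketch below assumes a hypothetical annotator who answers each question correctly 90% of the time, independently, with one attempt per question and no retries; under those assumptions the number of correct answers follows a binomial distribution.

```python
from math import comb


def pass_probability(accuracy: float, total_questions: int, min_correct: int) -> float:
    """P(at least `min_correct` of `total_questions` correct), binomial model."""
    return sum(
        comb(total_questions, k)
        * accuracy ** k
        * (1 - accuracy) ** (total_questions - k)
        for k in range(min_correct, total_questions + 1)
    )


all_correct = pass_probability(0.9, 10, 10)
eight_of_ten = pass_probability(0.9, 10, 8)
print(f"require_all_correct: {all_correct:.3f}")  # 0.349
print(f"min_correct: 8:      {eight_of_ten:.3f}")  # 0.930
```

In this simplified model, roughly two out of three such annotators would fail the all-correct bar, while about 93% would pass with 8 of 10.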

5. Include Edge Cases

Add tricky examples to prepare annotators:

json

{
  "text": "Not bad at all, I guess it could be worse.",
  "correct_answers": {"sentiment": "Neutral"},
  "explanation": "Despite negative words like 'not bad' and 'worse', this is actually a lukewarm endorsement - neutral rather than positive or negative."
}

Workflow Integration

Training integrates with multi-phase workflows:

yaml

phases:
  consent:
    enabled: true
    data_file: "data/consent.json"
 
  prestudy:
    enabled: true
    data_file: "data/demographics.json"
 
  instructions:
    enabled: true
    content: "data/instructions.html"
 
  training:
    enabled: true
    data_file: "data/training.json"
    schema_name: sentiment
    passing_criteria:
      min_correct: 8
 
  annotation:
    # Main task - always enabled
    enabled: true
 
  poststudy:
    enabled: true
    data_file: "data/feedback.json"

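Performance Considerations

- Training data is loaded at startup
- Progress is stored in memory per session
- Impact on main annotation performance is minimal
- Consider splitting complex training into multiple phases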
