Training Phase

Train and screen annotators with practice questions before the main task.

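Potato 2.0 includes an optional training phase that qualifies annotators before they begin the main annotation task. Annotators answer practice questions with known correct answers and receive immediate feedback on their performance.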

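Use Cases

  • Ensure annotators understand the task
  • Filter out low-quality annotators
  • Provide guided practice before real annotation
  • Collect baseline quality metrics
  • Teach the annotation guidelines through examples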

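How It Works

  1. Annotators complete a set of training questions
  2. They receive immediate feedback after each answer
  3. Progress is tracked against the passing criteria
  4. Only annotators who pass proceed to the main task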

Configuration

Basic Setup

yaml
phases:
  training:
    enabled: true
    data_file: "data/training_data.json"
    schema_name: sentiment  # Which annotation scheme to train
 
    # Passing criteria
    passing_criteria:
      min_correct: 8  # Must get at least 8 correct
      total_questions: 10

Full Configuration

yaml
phases:
  training:
    enabled: true
    data_file: "data/training_data.json"
    schema_name: sentiment
 
    passing_criteria:
      # Different criteria options (choose one or combine)
      min_correct: 8
      require_all_correct: false
      max_mistakes: 3
      max_mistakes_per_question: 2
 
    # Allow retries
    retries:
      enabled: true
      max_retries: 3
 
    # Show explanations for incorrect answers
    show_explanations: true
 
    # Randomize question order
    randomize: true

Passing Criteria

You can set several kinds of passing criteria for the training phase:

Minimum Correct

yaml
passing_criteria:
  min_correct: 8
  total_questions: 10

Annotators must answer at least 8 of the 10 questions correctly.

Require All Correct

yaml
passing_criteria:
  require_all_correct: true

Annotators must answer every question correctly to pass.

Maximum Mistakes

yaml
passing_criteria:
  max_mistakes: 3

Annotators are disqualified once they accumulate 3 mistakes in total.

Maximum Mistakes per Question

yaml
passing_criteria:
  max_mistakes_per_question: 2

Annotators are disqualified after making 2 mistakes on any single question.

Combined Criteria

yaml
passing_criteria:
  min_correct: 8
  max_mistakes_per_question: 3

Annotators must answer 8 questions correctly, and are disqualified after 3 mistakes on any single question.
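
To make the interaction between these options concrete, here is a minimal Python sketch of how such criteria could be evaluated. It is not Potato's implementation: the `PassingCriteria` class and both functions are hypothetical names, and it assumes that `correct` counts questions answered correctly and that a mistake limit disqualifies as soon as it is reached.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PassingCriteria:
    """Hypothetical mirror of the passing_criteria keys; unset options are ignored."""
    min_correct: Optional[int] = None
    require_all_correct: bool = False
    max_mistakes: Optional[int] = None
    max_mistakes_per_question: Optional[int] = None


def is_disqualified(criteria: PassingCriteria, mistakes: Dict[str, int]) -> bool:
    """True once a mistake limit is reached; mistakes maps question id to wrong attempts."""
    if criteria.max_mistakes is not None and sum(mistakes.values()) >= criteria.max_mistakes:
        return True
    limit = criteria.max_mistakes_per_question
    return limit is not None and any(count >= limit for count in mistakes.values())


def has_passed(criteria: PassingCriteria, correct: int, total_questions: int,
               mistakes: Dict[str, int]) -> bool:
    """Apply every configured criterion; all of them must hold."""
    if is_disqualified(criteria, mistakes):
        return False
    if criteria.require_all_correct and correct < total_questions:
        return False
    if criteria.min_correct is not None and correct < criteria.min_correct:
        return False
    return True


criteria = PassingCriteria(min_correct=8, max_mistakes_per_question=3)
print(has_passed(criteria, correct=9, total_questions=10, mistakes={"train_4": 1}))  # True
print(has_passed(criteria, correct=9, total_questions=10, mistakes={"train_4": 3}))  # False
```

In this sketch the combined configuration behaves as a logical AND: one question with three wrong attempts fails the annotator even though nine questions were answered correctly.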

Training Data Format

Training data must include the correct answers and may include an explanation:

json
[
  {
    "id": "train_1",
    "text": "I absolutely love this product! Best purchase ever!",
    "correct_answers": {
      "sentiment": "Positive"
    },
    "explanation": "This text expresses strong positive sentiment with words like 'love' and 'best'."
  },
  {
    "id": "train_2",
    "text": "This is the worst service I've ever experienced.",
    "correct_answers": {
      "sentiment": "Negative"
    },
    "explanation": "The words 'worst' and the overall complaint indicate negative sentiment."
  },
  {
    "id": "train_3",
    "text": "The package arrived on time.",
    "correct_answers": {
      "sentiment": "Neutral"
    },
    "explanation": "This is a factual statement without emotional indicators."
  }
]
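
Before starting the server, it can help to check a training file against the format above. The following sketch is an illustration only: `validate_training_items` is a hypothetical helper, not part of Potato, and the set of required keys (`id`, `text`, `correct_answers`) is inferred from the example rather than from a published schema.

```python
import json
from typing import List


def validate_training_items(items: List[dict], schema_name: str) -> List[str]:
    """Return human-readable problems; an empty list means no problems were found."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        item_id = item.get("id", f"<item {index}>")
        for key in ("id", "text", "correct_answers"):
            if key not in item:
                problems.append(f"{item_id}: missing '{key}'")
        if "id" in item:
            if item["id"] in seen_ids:
                problems.append(f"{item_id}: duplicate id")
            seen_ids.add(item["id"])
        answers = item.get("correct_answers")
        if isinstance(answers, dict) and schema_name not in answers:
            problems.append(f"{item_id}: no correct answer for schema '{schema_name}'")
    return problems


raw = """[
  {"id": "train_1", "text": "I love this!", "correct_answers": {"sentiment": "Positive"}},
  {"id": "train_2", "text": "It arrived yesterday."}
]"""
print(validate_training_items(json.loads(raw), "sentiment"))
# ["train_2: missing 'correct_answers'"]
```

For a file on disk, the same function can be applied to the result of `json.load` on the path configured as `data_file`.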

Multi-Schema Training

For tasks with multiple annotation schemes:

json
{
  "id": "train_1",
  "text": "Apple announced new iPhone features yesterday.",
  "correct_answers": {
    "sentiment": "Neutral",
    "topic": "Technology"
  },
  "explanation": {
    "sentiment": "This is a factual news statement.",
    "topic": "The text discusses Apple and iPhone, which are tech topics."
  }
}
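
One way to picture how a multi-schema item might be graded is sketched below. This is an assumption about the behavior, not Potato's code: `grade_item` is a hypothetical function that compares a submission schema by schema and accepts `explanation` either as a single string or as a per-schema mapping, as in the two data examples above.

```python
from typing import Dict


def grade_item(item: dict, submitted: Dict[str, str]) -> Dict[str, dict]:
    """Compare submitted labels with correct_answers, one result per schema."""
    explanation = item.get("explanation", "")
    results = {}
    for schema, expected in item["correct_answers"].items():
        # A dict explanation is looked up per schema; a plain string applies to all.
        note = explanation.get(schema, "") if isinstance(explanation, dict) else explanation
        results[schema] = {
            "correct": submitted.get(schema) == expected,
            "expected": expected,
            "explanation": note,
        }
    return results


item = {
    "id": "train_1",
    "text": "Apple announced new iPhone features yesterday.",
    "correct_answers": {"sentiment": "Neutral", "topic": "Technology"},
    "explanation": {
        "sentiment": "This is a factual news statement.",
        "topic": "The text discusses Apple and iPhone, which are tech topics.",
    },
}
for schema, outcome in grade_item(item, {"sentiment": "Neutral", "topic": "Politics"}).items():
    print(schema, outcome["correct"])
# sentiment True
# topic False
```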

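User Experience

Training Flow

  1. The user sees a "Training Phase" indicator
  2. A question is shown with the annotation form
  3. The user submits an answer
  4. Feedback appears immediately:
    • Correct: a green checkmark, then on to the next question
    • Incorrect: a red X, the explanation, and an option to retry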

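Feedback Display

When an annotator answers incorrectly:

  • The correct answer is highlighted
  • The provided explanation is shown
  • A retry button appears (if retries are enabled)
  • Progress toward the passing criteria is shown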

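Admin Monitoring

Track training performance in the admin dashboard:

  • Completion rate
  • Average number of correct answers
  • Pass/fail rate
  • Time spent in training
  • Per-question accuracy

Access them through the admin API endpoints: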

text
GET /api/admin/training/stats
GET /api/admin/training/user/{user_id}
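
The endpoints can be called from a script. The sketch below uses only the Python standard library and rests on several assumptions that this page does not confirm: the server address `http://localhost:8000`, a JSON response body, and whatever authentication headers your deployment requires (passed in via `headers`). The helper names are hypothetical.

```python
import json
from typing import Dict, Optional
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"  # assumption: adjust to your host and port


def training_stats_url(base_url: str = BASE_URL) -> str:
    """URL of the aggregate training statistics endpoint."""
    return f"{base_url.rstrip('/')}/api/admin/training/stats"


def training_user_url(user_id: str, base_url: str = BASE_URL) -> str:
    """URL of the per-user endpoint; the user id is percent-encoded."""
    return f"{base_url.rstrip('/')}/api/admin/training/user/{quote(user_id, safe='')}"


def fetch_json(url: str, headers: Optional[Dict[str, str]] = None, timeout: float = 10.0):
    """Send a GET request and decode the body as JSON (assumed response format)."""
    request = Request(url, headers=headers or {}, method="GET")
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


print(training_user_url("alice@example.com"))
# http://localhost:8000/api/admin/training/user/alice%40example.com
```

Calling `fetch_json(training_stats_url())` requires a running server; the shape of the returned data is not documented here, so inspect it before relying on specific fields.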

Example: Sentiment Analysis Training

yaml
task_name: "Sentiment Analysis"
task_dir: "."
port: 8000
 
# Main annotation data
data_files:
  - "data/reviews.json"
 
item_properties:
  id_key: id
  text_key: text
 
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "What is the sentiment of this review?"
    labels:
      - Positive
      - Negative
      - Neutral
 
# Training phase configuration
phases:
  training:
    enabled: true
    data_file: "data/training_questions.json"
    schema_name: sentiment
 
    passing_criteria:
      min_correct: 8
      total_questions: 10
      max_mistakes_per_question: 2
 
    retries:
      enabled: true
      max_retries: 3
 
    show_explanations: true
    randomize: true
 
output_annotation_dir: "output/"
output_annotation_format: "json"
allow_all_users: true

Example: NER Training

yaml
annotation_schemes:
  - annotation_type: span
    name: entities
    description: "Highlight named entities"
    labels:
      - Person
      - Organization
      - Location
      - Date
 
phases:
  training:
    enabled: true
    data_file: "data/ner_training.json"
    schema_name: entities
 
    passing_criteria:
      min_correct: 7
      total_questions: 10
 
    show_explanations: true

Training data for span annotation:

json
{
  "id": "train_1",
  "text": "Tim Cook announced that Apple will open a new store in New York on March 15.",
  "correct_answers": {
    "entities": [
      {"start": 0, "end": 8, "label": "Person"},
      {"start": 24, "end": 29, "label": "Organization"},
      {"start": 55, "end": 63, "label": "Location"},
      {"start": 67, "end": 75, "label": "Date"}
    ]
  },
  "explanation": "Tim Cook is a Person, Apple is an Organization, New York is a Location, and March 15 is a Date."
}
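
Span offsets are easy to get wrong by one character, so it is worth computing and checking them programmatically. The sketch below assumes zero-based character offsets with an exclusive `end`, which matches the `Person` span above (`0` to `8` for "Tim Cook"). The sentence and both helper functions are hypothetical illustrations.

```python
from typing import Dict, List, Tuple


def find_span(text: str, phrase: str, label: str) -> Dict[str, object]:
    """Build a span dict for the first occurrence of phrase in text."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return {"start": start, "end": start + len(phrase), "label": label}


def span_texts(text: str, spans: List[dict]) -> List[Tuple[str, str]]:
    """Return (label, covered text) pairs so offsets can be checked by eye."""
    return [(span["label"], text[span["start"]:span["end"]]) for span in spans]


text = "Ada Lovelace visited London."
spans = [find_span(text, "Ada Lovelace", "Person"), find_span(text, "London", "Location")]
print(spans[1])                 # {'start': 21, 'end': 27, 'label': 'Location'}
print(span_texts(text, spans))  # [('Person', 'Ada Lovelace'), ('Location', 'London')]
```

Running `span_texts` over a training file before use shows exactly which characters each gold span covers, including any stray leading or trailing spaces.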

Best Practices

1. Start Simple

Begin with clear-cut examples before introducing edge cases:

json
[
  {"text": "I love this!", "correct_answers": {"sentiment": "Positive"}},
  {"text": "I hate this!", "correct_answers": {"sentiment": "Negative"}},
  {"text": "It arrived yesterday.", "correct_answers": {"sentiment": "Neutral"}}
]

2. Cover All Labels

Make sure the training set includes an example of every possible label:

json
[
  {"correct_answers": {"sentiment": "Positive"}},
  {"correct_answers": {"sentiment": "Negative"}},
  {"correct_answers": {"sentiment": "Neutral"}}
]
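
Label coverage can be checked mechanically. The following sketch counts how often each label appears as a correct answer and reports the labels with no training example. The function names are hypothetical, and the label list is assumed to be copied from the `labels` of the relevant annotation scheme.

```python
from collections import Counter
from typing import List


def label_counts(items: List[dict], schema_name: str) -> Counter:
    """Count correct-answer labels for one schema across all training items."""
    return Counter(
        item["correct_answers"][schema_name]
        for item in items
        if schema_name in item.get("correct_answers", {})
    )


def missing_labels(items: List[dict], schema_name: str, labels: List[str]) -> List[str]:
    """Labels from the scheme that no training item uses as its correct answer."""
    counts = label_counts(items, schema_name)
    return [label for label in labels if counts[label] == 0]


items = [
    {"correct_answers": {"sentiment": "Positive"}},
    {"correct_answers": {"sentiment": "Negative"}},
    {"correct_answers": {"sentiment": "Positive"}},
]
print(missing_labels(items, "sentiment", ["Positive", "Negative", "Neutral"]))  # ['Neutral']
```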

3. Write Clear Explanations

Explanations should teach the annotation guidelines:

json
{
  "explanation": "While this text mentions a problem, the overall tone is constructive and the reviewer expresses satisfaction with the resolution. This makes it Positive rather than Negative."
}

4. Set Reasonable Criteria

Do not require perfection unless it is necessary:

yaml
# Too strict - may lose good annotators
passing_criteria:
  require_all_correct: true
 
# Better - allows for learning
passing_criteria:
  min_correct: 8
  total_questions: 10
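
A quick calculation shows how much stricter `require_all_correct` is. The sketch below uses the binomial distribution under simplifying assumptions that are not part of Potato: every question is answered independently, a hypothetical annotator is correct 90% of the time, and retries are ignored.

```python
from math import comb


def pass_probability(accuracy: float, total_questions: int, min_correct: int) -> float:
    """P(at least min_correct right) when each answer is independently correct."""
    return sum(
        comb(total_questions, k) * accuracy**k * (1 - accuracy) ** (total_questions - k)
        for k in range(min_correct, total_questions + 1)
    )


print(round(pass_probability(0.9, 10, 10), 3))  # 0.349
print(round(pass_probability(0.9, 10, 8), 3))   # 0.93
```

Under these assumptions, a 90%-accurate annotator passes an all-correct test of 10 questions only about a third of the time, but passes an 8-of-10 threshold about 93% of the time.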

5. Include Edge Cases

Add tricky examples to prepare annotators:

json
{
  "text": "Not bad at all, I guess it could be worse.",
  "correct_answers": {"sentiment": "Neutral"},
  "explanation": "Despite negative words like 'not bad' and 'worse', this is actually a lukewarm endorsement - neutral rather than positive or negative."
}

Workflow Integration

Training integrates with multi-phase workflows:

yaml
phases:
  consent:
    enabled: true
    data_file: "data/consent.json"
 
  prestudy:
    enabled: true
    data_file: "data/demographics.json"
 
  instructions:
    enabled: true
    content: "data/instructions.html"
 
  training:
    enabled: true
    data_file: "data/training.json"
    schema_name: sentiment
    passing_criteria:
      min_correct: 8
 
  annotation:
    # Main task - always enabled
    enabled: true
 
  poststudy:
    enabled: true
    data_file: "data/feedback.json"

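Performance Considerations

  • Training data is loaded at startup
  • Progress is stored in memory per session
  • The impact on main annotation performance is minimal
  • Consider splitting complex training into multiple phases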

Further Reading

For implementation details, see the source code documentation.