Training Phase
Train and screen annotators with practice questions before the main task.
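Potato 2.0 includes an optional training phase that helps qualify annotators before they start the main annotation task. Annotators answer practice questions that have known correct answers and get immediate feedback on their performance.
Use cases
- Ensure annotators understand the task
- Filter out low-quality annotators
- Provide guided practice before real annotation
- Collect baseline quality metrics
- Teach the annotation guidelines through examples
How it works
- Annotators complete a set of training questions
- They receive immediate feedback after each answer
- Progress is tracked against the passing criteria
- Only annotators who pass can proceed to the main task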
Configuration
Basic setup
yaml
phases:
  training:
    enabled: true
    data_file: "data/training_data.json"
    schema_name: sentiment  # Which annotation scheme to train
    # Passing criteria
    passing_criteria:
      min_correct: 8  # Must get at least 8 correct
      total_questions: 10
Full configuration
yaml
phases:
  training:
    enabled: true
    data_file: "data/training_data.json"
    schema_name: sentiment
    passing_criteria:
      # Different criteria options (choose one or combine)
      min_correct: 8
      require_all_correct: false
      max_mistakes: 3
      max_mistakes_per_question: 2
    # Allow retries
    retries:
      enabled: true
      max_retries: 3
    # Show explanations for incorrect answers
    show_explanations: true
    # Randomize question order
    randomize: true
Passing criteria
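You can set several kinds of passing criteria for the training phase:
Minimum correct
yaml
passing_criteria:
  min_correct: 8
  total_questions: 10
Annotators must answer at least 8 of the 10 questions correctly.
Require all correct
yaml
passing_criteria:
  require_all_correct: true
Annotators must answer every question correctly to pass.
Maximum mistakes
yaml
passing_criteria:
  max_mistakes: 3
Annotators are disqualified once they have made 3 mistakes in total.
Maximum mistakes per question
yaml
passing_criteria:
  max_mistakes_per_question: 2
Annotators are disqualified after making 2 mistakes on any single question.
Combined criteria
yaml
passing_criteria:
  min_correct: 8
  max_mistakes_per_question: 3
Annotators must answer 8 questions correctly and may not miss any single question more than 3 times.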
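How these criteria combine can be sketched in a few lines of Python. This is an illustrative sketch, not Potato's implementation: the `TrainingProgress` class and the `evaluate` function are hypothetical names, and the sketch assumes that a mistake limit disqualifies as soon as the count reaches the limit, that `total_questions` is the number of training items, and that a missing `min_correct` means every question must eventually be answered correctly.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingProgress:
    """Hypothetical per-annotator record of training results."""
    correct: int = 0                               # questions answered correctly
    mistakes: dict = field(default_factory=dict)   # question id -> wrong attempts

    def record(self, question_id: str, is_correct: bool) -> None:
        if is_correct:
            self.correct += 1
        else:
            self.mistakes[question_id] = self.mistakes.get(question_id, 0) + 1


def evaluate(progress: TrainingProgress, criteria: dict, total_questions: int) -> str:
    """Return 'failed', 'passed', or 'in_progress' for a passing_criteria dict."""
    total_mistakes = sum(progress.mistakes.values())
    worst_question = max(progress.mistakes.values(), default=0)

    # Disqualifying limits are checked first (assumption: reaching the limit fails).
    max_mistakes = criteria.get("max_mistakes")
    if max_mistakes is not None and total_mistakes >= max_mistakes:
        return "failed"
    per_question = criteria.get("max_mistakes_per_question")
    if per_question is not None and worst_question >= per_question:
        return "failed"

    # require_all_correct raises the target to every question.
    if criteria.get("require_all_correct"):
        needed = total_questions
    else:
        needed = criteria.get("min_correct", total_questions)
    return "passed" if progress.correct >= needed else "in_progress"
```

Checking the disqualifying limits before the pass threshold means an annotator who reaches a mistake limit fails even if `min_correct` has already been met; whether Potato orders the checks this way is not stated here.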
Training data format
Training data must include the correct answers and may include an explanation:
json
[
  {
    "id": "train_1",
    "text": "I absolutely love this product! Best purchase ever!",
    "correct_answers": {
      "sentiment": "Positive"
    },
    "explanation": "This text expresses strong positive sentiment with words like 'love' and 'best'."
  },
  {
    "id": "train_2",
    "text": "This is the worst service I've ever experienced.",
    "correct_answers": {
      "sentiment": "Negative"
    },
    "explanation": "The words 'worst' and the overall complaint indicate negative sentiment."
  },
  {
    "id": "train_3",
    "text": "The package arrived on time.",
    "correct_answers": {
      "sentiment": "Neutral"
    },
    "explanation": "This is a factual statement without emotional indicators."
  }
]
Multi-schema training
For tasks with multiple annotation schemes:
json
{
  "id": "train_1",
  "text": "Apple announced new iPhone features yesterday.",
  "correct_answers": {
    "sentiment": "Neutral",
    "topic": "Technology"
  },
  "explanation": {
    "sentiment": "This is a factual news statement.",
    "topic": "The text discusses Apple and iPhone, which are tech topics."
  }
}
User experience
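Training flow
- The user sees a "Training Phase" indicator
- A question is shown with the annotation form
- The user submits an answer
- Feedback is shown immediately:
  - Correct: green checkmark, advance to the next question
  - Incorrect: red X, the explanation, and a retry option
Feedback display
When an annotator answers incorrectly:
- The correct answer is highlighted
- The provided explanation is shown
- A retry button appears (if retries are enabled)
- Progress toward the passing criteria is shown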
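The feedback rules above can be sketched as a small function that compares a submission with an item's `correct_answers`. The name `build_feedback` and the shape of the returned dictionary are hypothetical and chosen only for illustration; Potato renders its feedback in the browser and may structure it differently. Progress toward the passing criteria is left out to keep the sketch short.

```python
def build_feedback(item, submitted, show_explanations=True,
                   retries_enabled=True, retries_left=0):
    """Compare submitted answers with an item's correct_answers.

    `item` follows the training data format: "correct_answers" maps schema
    names to expected values, and the optional "explanation" is either one
    string or a dict keyed by schema name.
    """
    expected = item["correct_answers"]
    wrong = [name for name, answer in expected.items()
             if submitted.get(name) != answer]
    if not wrong:
        return {"correct": True}

    explanation = item.get("explanation") if show_explanations else None
    if isinstance(explanation, dict):
        # Multi-schema items: keep only explanations for the schemas missed.
        explanation = {name: explanation[name]
                       for name in wrong if name in explanation}

    return {
        "correct": False,
        # Answers the interface would highlight.
        "correct_answers": {name: expected[name] for name in wrong},
        "explanation": explanation,
        "can_retry": retries_enabled and retries_left > 0,
    }
```

For an item with both `sentiment` and `topic` answers, a submission that gets only `sentiment` wrong yields feedback that covers `sentiment` alone, so the annotator is not shown explanations for answers that were already right.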
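Admin monitoring
Track training performance in the admin dashboard:
- Completion rate
- Average number of correct answers
- Pass/fail rate
- Time spent in training
- Per-question accuracy
Access these through the admin API endpoints: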
text
GET /api/admin/training/stats
GET /api/admin/training/user/{user_id}
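A minimal sketch of querying these endpoints from Python with only the standard library. Several things here are assumptions rather than documented behavior: the base URL `http://localhost:8000` (the port used in the sentiment example configuration), the idea that the endpoints return JSON, and the helper names `stats_url`, `user_url`, and `fetch_json`. Any authentication the admin API requires is not handled.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumption: local server on port 8000


def stats_url(base_url=BASE_URL):
    """URL of the aggregate training statistics endpoint."""
    return f"{base_url.rstrip('/')}/api/admin/training/stats"


def user_url(user_id, base_url=BASE_URL):
    """URL of the per-user endpoint; the user id is percent-encoded."""
    encoded = quote(user_id, safe="")
    return f"{base_url.rstrip('/')}/api/admin/training/user/{encoded}"


def fetch_json(url, timeout=10):
    """GET a URL and decode the body as JSON (assumed response format)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With a server running, `fetch_json(stats_url())` would return the decoded statistics and `fetch_json(user_url("some_user"))` the record for one annotator. Percent-encoding the user id keeps ids that contain characters such as `@` or `/` from changing the request path.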
Example: sentiment analysis training
yaml
task_name: "Sentiment Analysis"
task_dir: "."
port: 8000
# Main annotation data
data_files:
  - "data/reviews.json"
item_properties:
  id_key: id
  text_key: text
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "What is the sentiment of this review?"
    labels:
      - Positive
      - Negative
      - Neutral
# Training phase configuration
phases:
  training:
    enabled: true
    data_file: "data/training_questions.json"
    schema_name: sentiment
    passing_criteria:
      min_correct: 8
      total_questions: 10
      max_mistakes_per_question: 2
    retries:
      enabled: true
      max_retries: 3
    show_explanations: true
    randomize: true
output_annotation_dir: "output/"
output_annotation_format: "json"
allow_all_users: true
Example: NER training
yaml
annotation_schemes:
  - annotation_type: span
    name: entities
    description: "Highlight named entities"
    labels:
      - Person
      - Organization
      - Location
      - Date
phases:
  training:
    enabled: true
    data_file: "data/ner_training.json"
    schema_name: entities
    passing_criteria:
      min_correct: 7
      total_questions: 10
    show_explanations: true
Training data for span annotation:
json
{
  "id": "train_1",
  "text": "Tim Cook announced that Apple will open a new store in New York on March 15.",
  "correct_answers": {
    "entities": [
      {"start": 0, "end": 8, "label": "Person"},
      {"start": 24, "end": 29, "label": "Organization"},
      {"start": 55, "end": 63, "label": "Location"},
      {"start": 67, "end": 75, "label": "Date"}
    ]
  },
  "explanation": "Tim Cook is a Person, Apple is an Organization, New York is a Location, and March 15 is a Date."
}
Best practices
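For span-based training data, a useful additional practice is to check character offsets automatically, because an offset that is off by one makes a correct annotation look like a mistake. A minimal sketch, assuming `start` is inclusive and `end` is exclusive as in Python slicing (the convention is an assumption, not something this page states); `span_problems` is a hypothetical helper name.

```python
def span_problems(item, schema_name):
    """Return human-readable problems for the spans of one training item."""
    text = item["text"]
    problems = []
    for span in item["correct_answers"][schema_name]:
        start, end, label = span["start"], span["end"], span["label"]
        if not 0 <= start < end <= len(text):
            problems.append(f"{label}: offsets {start}-{end} out of range")
            continue
        selected = text[start:end]
        # A leading or trailing space usually means the offsets are shifted.
        if selected != selected.strip():
            problems.append(f"{label}: {selected!r} has surrounding whitespace")
    return problems
```

For the text "Ada Lovelace lived in London.", a `Location` span of 21 to 27 selects `' Londo'` and is reported, while 22 to 28 selects `'London'` and passes. The check only catches shifts that land on whitespace or run out of range, so printing `text[start:end]` for each span is still worth doing by eye.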
1. Start simple
Begin with clear-cut examples before introducing edge cases:
json
[
  {"text": "I love this!", "correct_answers": {"sentiment": "Positive"}},
  {"text": "I hate this!", "correct_answers": {"sentiment": "Negative"}},
  {"text": "It arrived yesterday.", "correct_answers": {"sentiment": "Neutral"}}
]
2. Cover all labels
Make sure training includes an example of every possible label:
json
[
  {"correct_answers": {"sentiment": "Positive"}},
  {"correct_answers": {"sentiment": "Negative"}},
  {"correct_answers": {"sentiment": "Neutral"}}
]
3. Write clear explanations
Explanations should teach the annotation guidelines:
json
{
  "explanation": "While this text mentions a problem, the overall tone is constructive and the reviewer expresses satisfaction with the resolution. This makes it Positive rather than Negative."
}
4. Set reasonable criteria
Don't require perfection unless it is necessary:
yaml
# Too strict - may lose good annotators
passing_criteria:
  require_all_correct: true
# Better - allows for learning
passing_criteria:
  min_correct: 8
  total_questions: 10
5. Include edge cases
Add tricky examples to prepare annotators:
json
{
  "text": "Not bad at all, I guess it could be worse.",
  "correct_answers": {"sentiment": "Neutral"},
  "explanation": "Despite negative words like 'not bad' and 'worse', this is actually a lukewarm endorsement - neutral rather than positive or negative."
}
Integration with workflows
Training integrates with multi-phase workflows:
yaml
phases:
  consent:
    enabled: true
    data_file: "data/consent.json"
  prestudy:
    enabled: true
    data_file: "data/demographics.json"
  instructions:
    enabled: true
    content: "data/instructions.html"
  training:
    enabled: true
    data_file: "data/training.json"
    schema_name: sentiment
    passing_criteria:
      min_correct: 8
  annotation:
    # Main task - always enabled
    enabled: true
  poststudy:
    enabled: true
    data_file: "data/feedback.json"
Performance considerations
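- Training data is loaded at startup
- Progress is stored in memory per session
- Impact on main annotation performance is minimal
- Consider splitting complex training into multiple phases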
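Because training data is loaded at startup, a malformed file is only noticed when the server starts. A pre-flight check along the following lines can catch common problems earlier. This is a sketch under stated assumptions: `check_training_data` is a hypothetical helper, the required fields (`id`, `text`, `correct_answers`) are inferred from the examples on this page, it targets single-label schemes such as `radio`, and the label-coverage check follows the "cover all labels" practice rather than any rule Potato is known to enforce.

```python
from collections import Counter


def check_training_data(items, schema_name, labels):
    """Return a list of problems found in a list of training items."""
    problems = []
    id_counts = Counter(item.get("id") for item in items)
    used_labels = set()

    for index, item in enumerate(items):
        where = f"item {index} (id={item.get('id')!r})"
        for key in ("id", "text", "correct_answers"):
            if key not in item:
                problems.append(f"{where}: missing '{key}'")
        answer = item.get("correct_answers", {}).get(schema_name)
        if answer is None:
            problems.append(f"{where}: no correct answer for '{schema_name}'")
        elif answer not in labels:
            problems.append(f"{where}: unknown label {answer!r}")
        else:
            used_labels.add(answer)

    # File-level checks: duplicate ids and labels that never appear.
    for item_id, count in id_counts.items():
        if item_id is not None and count > 1:
            problems.append(f"id {item_id!r} appears {count} times")
    for label in labels:
        if label not in used_labels:
            problems.append(f"label {label!r} has no training example")
    return problems
```

The `items` argument would typically come from `json.load` on the training file, and `labels` from the matching entry in `annotation_schemes`. An empty result means none of these particular problems were found, not that the file is guaranteed to load.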
Further reading
For implementation details, see the source code documentation.