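Multi-phase workflows
Build complex annotation workflows with surveys, training, and branching logic.
Potato 2.0 supports structured annotation workflows made up of sequential phases: informed consent, a pre-study survey, instructions, training, annotation, and post-study feedback.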
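Available phases
| Phase | Description |
|---|---|
| consent | Informed consent collection |
| prestudy | Pre-annotation survey (demographics, screening) |
| instructions | Task guidelines and information |
| training | Practice questions with feedback |
| annotation | Main annotation task (required) |
| poststudy | Post-annotation survey and feedback |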
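The ordering in the table can be made concrete with a small sketch. It assumes that each optional phase carries an `enabled` flag, that a phase missing from the config counts as disabled, and that `annotation` is always included. The helper name `active_phases` is hypothetical and not part of Potato.

```python
# Phase order as listed in the table above.
PHASE_ORDER = [
    "consent", "prestudy", "instructions",
    "training", "annotation", "poststudy",
]


def active_phases(phases_config):
    """Return the phases a participant would pass through, in order.

    Hypothetical helper (not Potato's API). `phases_config` mirrors the
    parsed `phases` mapping; phases that are absent are assumed disabled.
    """
    active = []
    for name in PHASE_ORDER:
        if name == "annotation":
            # The annotation phase is required, so it needs no flag.
            active.append(name)
        elif phases_config.get(name, {}).get("enabled", False):
            active.append(name)
    return active


config = {
    "consent": {"enabled": True, "data_file": "data/consent.json"},
    "training": {"enabled": False},
    "poststudy": {"enabled": True, "data_file": "data/feedback.json"},
}
print(active_phases(config))  # ['consent', 'annotation', 'poststudy']
```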
Basic configuration
Use the phases section in your configuration:
```yaml
phases:
  consent:
    enabled: true
    data_file: "data/consent.json"
  prestudy:
    enabled: true
    data_file: "data/demographics.json"
  instructions:
    enabled: true
    content: "data/instructions.html"
  training:
    enabled: true
    data_file: "data/training.json"
    schema_name: sentiment
    passing_criteria:
      min_correct: 8
  # annotation phase is always enabled
  poststudy:
    enabled: true
    data_file: "data/feedback.json"
```
Survey question types
Survey phases support the following question types:
Single choice (radio)
```json
{
  "name": "experience",
  "type": "radio",
  "description": "How much annotation experience do you have?",
  "labels": ["None", "Some (< 10 hours)", "Moderate", "Extensive"],
  "required": true
}
```
Checkbox / multiple choice
```json
{
  "name": "languages",
  "type": "checkbox",
  "description": "What languages do you speak fluently?",
  "labels": ["English", "Spanish", "French", "German", "Chinese", "Other"]
}
```
Text input
```json
{
  "name": "occupation",
  "type": "text",
  "description": "What is your occupation?",
  "required": true
}
```
Number input
```json
{
  "name": "years_experience",
  "type": "number",
  "description": "Years of professional experience",
  "min": 0,
  "max": 50
}
```
Likert scale
```json
{
  "name": "familiarity",
  "type": "likert",
  "description": "How familiar are you with this topic?",
  "size": 5,
  "min_label": "Not familiar",
  "max_label": "Very familiar"
}
```
Dropdown select
```json
{
  "name": "country",
  "type": "select",
  "description": "Select your country",
  "labels": ["USA", "Canada", "UK", "Germany", "France", "Other"]
}
```
Consent phase
Collect informed consent before participants begin:
```yaml
phases:
  consent:
    enabled: true
    data_file: "data/consent.json"
```
consent.json:
```json
[
  {
    "name": "consent_agreement",
    "type": "radio",
    "description": "I have read and understood the research consent form and agree to participate.",
    "labels": ["I agree", "I do not agree"],
    "right_label": "I agree",
    "required": true
  }
]
```
The right_label field specifies the answer a participant must give in order to continue.
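As an illustration of what the `right_label` gate implies, the sketch below checks a participant's answers against the consent question. The function `can_continue` and the shape of `answers` (question name mapped to the selected label) are assumptions for illustration, not Potato's actual implementation.

```python
import json

# Same question as in consent.json above.
CONSENT_JSON = """
[
  {
    "name": "consent_agreement",
    "type": "radio",
    "description": "I have read and understood the research consent form and agree to participate.",
    "labels": ["I agree", "I do not agree"],
    "right_label": "I agree",
    "required": true
  }
]
"""


def can_continue(questions, answers):
    """Return True if every question with a `right_label` received it.

    Hypothetical check: `answers` maps question names to selected labels.
    Questions without a `right_label` never block progress here.
    """
    for question in questions:
        expected = question.get("right_label")
        if expected is not None and answers.get(question["name"]) != expected:
            return False
    return True


questions = json.loads(CONSENT_JSON)
print(can_continue(questions, {"consent_agreement": "I agree"}))         # True
print(can_continue(questions, {"consent_agreement": "I do not agree"}))  # False
```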
Pre-study survey
Collect demographic or screening information:
```yaml
phases:
  prestudy:
    enabled: true
    data_file: "data/demographics.json"
```
demographics.json:
```json
[
  {
    "name": "age_range",
    "type": "radio",
    "description": "What is your age range?",
    "labels": ["18-24", "25-34", "35-44", "45-54", "55+"],
    "required": true
  },
  {
    "name": "education",
    "type": "radio",
    "description": "Highest level of education completed",
    "labels": ["High school", "Bachelor's degree", "Master's degree", "Doctoral degree", "Other"],
    "required": true
  },
  {
    "name": "english_native",
    "type": "radio",
    "description": "Is English your native language?",
    "labels": ["Yes", "No"],
    "required": true
  }
]
```
Instructions phase
Display the task instructions:
```yaml
phases:
  instructions:
    enabled: true
    content: "data/instructions.html"
```
Or use inline content:
```yaml
phases:
  instructions:
    enabled: true
    inline_content: |
      <h2>Task Instructions</h2>
      <p>In this task, you will classify the sentiment of product reviews.</p>
      <ul>
        <li><strong>Positive:</strong> Expresses satisfaction or praise</li>
        <li><strong>Negative:</strong> Expresses dissatisfaction or criticism</li>
        <li><strong>Neutral:</strong> Factual or mixed sentiment</li>
      </ul>
```
Training phase
Practice questions with feedback (see Training phase for details):
```yaml
phases:
  training:
    enabled: true
    data_file: "data/training.json"
    schema_name: sentiment
    passing_criteria:
      min_correct: 8
      total_questions: 10
    show_explanations: true
```
Post-study survey
Collect feedback after annotation:
```yaml
phases:
  poststudy:
    enabled: true
    data_file: "data/feedback.json"
```
feedback.json:
```json
[
  {
    "name": "difficulty",
    "type": "likert",
    "description": "How difficult was this task?",
    "size": 5,
    "min_label": "Very easy",
    "max_label": "Very difficult"
  },
  {
    "name": "clarity",
    "type": "likert",
    "description": "How clear were the instructions?",
    "size": 5,
    "min_label": "Very unclear",
    "max_label": "Very clear"
  },
  {
    "name": "suggestions",
    "type": "text",
    "description": "Any suggestions for improvement?",
    "textarea": true,
    "required": false
  }
]
```
Built-in templates
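Potato includes predefined label sets for common survey questions:
| Template | Labels |
|---|---|
| countries | List of countries |
| languages | Common languages |
| ethnicity | Ethnicity options |
| religion | Religion options |
Use a template in a question:
```json
{
  "name": "country",
  "type": "select",
  "description": "Select your country",
  "template": "countries"
}
```
Free-response fields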
Add an optional text input alongside a structured question:
```json
{
  "name": "topics",
  "type": "checkbox",
  "description": "Which topics interest you?",
  "labels": ["Technology", "Sports", "Politics", "Entertainment"],
  "free_response": true,
  "free_response_label": "Other (please specify)"
}
```
Page headers
Customize the header of a survey section:
```json
{
  "page_header": "Demographics Survey",
  "questions": [
    {"name": "age", "type": "radio", ...},
    {"name": "gender", "type": "radio", ...}
  ]
}
```
Complete example
```yaml
task_name: "Sentiment Analysis Study"
task_dir: "."
port: 8000

# Data configuration
data_files:
  - "data/reviews.json"
item_properties:
  id_key: id
  text_key: text

# Annotation scheme
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "What is the sentiment of this review?"
    labels:
      - Positive
      - Negative
      - Neutral
    sequential_key_binding: true

# Multi-phase workflow
phases:
  consent:
    enabled: true
    data_file: "data/consent.json"
  prestudy:
    enabled: true
    data_file: "data/demographics.json"
  instructions:
    enabled: true
    content: "data/instructions.html"
  training:
    enabled: true
    data_file: "data/training.json"
    schema_name: sentiment
    passing_criteria:
      min_correct: 8
      total_questions: 10
    retries:
      enabled: true
      max_retries: 2
    show_explanations: true
  # annotation phase is always enabled
  poststudy:
    enabled: true
    data_file: "data/feedback.json"

# Output
output_annotation_dir: "output/"
output_annotation_format: "json"

# User access
allow_all_users: true
```
Legacy configuration
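The legacy surveyflow configuration format is still supported for backward compatibility:
```yaml
surveyflow:
  enabled: true
  phases:
    - name: pre_survey
      type: survey
      questions: survey_questions.json
    - name: main_annotation
      type: annotation
```
However, we recommend that new projects migrate to the new phases format.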
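Best practices
1. Keep surveys short
Long surveys lower completion rates. Ask only the questions you need.
2. Use training for complex tasks
A training phase improves annotation quality, especially for tasks that involve subtle distinctions.
3. Set reasonable passing criteria
```yaml
# Too strict - may exclude good annotators
passing_criteria:
  require_all_correct: true

# Better - allows for learning
passing_criteria:
  min_correct: 8
  total_questions: 10
```
4. Provide clear instructions
Include examples in the instructions phase to make expectations explicit.
5. Test the full flow
Before deploying, complete the entire workflow yourself to catch problems.
6. Use required fields sparingly
Mark questions as required only when necessary; optional questions tend to get better-quality responses.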
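To see why the stricter setting in practice 3 can exclude capable annotators, here is a minimal sketch of how the two criteria might be evaluated. The function `passes_training` is hypothetical, and the exact way Potato interprets `require_all_correct` and `min_correct` is an assumption.

```python
def passes_training(num_correct, total_questions, criteria):
    """Decide whether a training attempt meets `passing_criteria`.

    Hypothetical helper (not Potato's API). `criteria` mirrors the
    `passing_criteria` mapping from the config.
    """
    if criteria.get("require_all_correct", False):
        # Strict mode: a single mistake fails the attempt.
        return num_correct == total_questions
    # Threshold mode: at least `min_correct` answers must be right.
    return num_correct >= criteria.get("min_correct", 0)


strict = {"require_all_correct": True}
lenient = {"min_correct": 8, "total_questions": 10}

# An annotator who answers 9 of 10 practice questions correctly:
print(passes_training(9, 10, strict))   # False
print(passes_training(9, 10, lenient))  # True
```

Under these assumptions, one mistake is enough to fail the strict setting, while the threshold setting leaves room for learning during training.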
Crowdsourcing integration
For Prolific or MTurk, configure a completion code:
```yaml
phases:
  poststudy:
    enabled: true
    data_file: "data/feedback.json"
    show_completion_code: true
    completion_code_format: "POTATO-{user_id}-{timestamp}"
```
See Crowdsourcing for details.
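The placeholders in `completion_code_format` can be illustrated with a short sketch. It assumes they are filled like Python `str.format` fields and that the timestamp is a Unix time in seconds; both are assumptions, and `make_completion_code` is a hypothetical helper rather than a Potato function.

```python
import time


def make_completion_code(fmt, user_id, timestamp=None):
    """Fill a completion-code template such as "POTATO-{user_id}-{timestamp}".

    Hypothetical helper: assumes str.format-style placeholders and a
    Unix timestamp in seconds when none is supplied.
    """
    if timestamp is None:
        timestamp = int(time.time())
    return fmt.format(user_id=user_id, timestamp=timestamp)


code = make_completion_code("POTATO-{user_id}-{timestamp}", "worker42", 1700000000)
print(code)  # POTATO-worker42-1700000000
```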