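Building Your First Named Entity Recognition Annotation Task
A step-by-step tutorial for creating a named entity recognition annotation task with span labels and keyboard shortcuts.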
Potato Team
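Named entity recognition (NER) is one of the most common NLP tasks. In this tutorial, you will learn how to create a complete NER annotation interface with text highlighting, keyboard shortcuts, and entity type selection.
What we are building
By the end of this tutorial, you will have an annotation interface where annotators can:
- Highlight text spans by clicking and dragging
- Assign entity types (person, organization, location, and so on)
- Use keyboard shortcuts to annotate faster
- Edit or delete existing annotations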
Prerequisites
- Potato installed (pip install potato-annotation)
- Basic knowledge of YAML
- Sample text data to annotate
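If your sample text is still a plain list of sentences, it needs to end up as one JSON object per line with `id` and `text` fields, which is the shape the data file in Step 2 uses. The following is a minimal sketch of that conversion; the helper names `write_jsonl` and `read_jsonl` are hypothetical and not part of Potato, and numbering records from 1 is an assumption made to mirror the tutorial's sample data.

```python
import json
from pathlib import Path


def write_jsonl(sentences, path):
    """Write one {"id", "text"} JSON object per line; ids are 1-based strings."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for i, text in enumerate(sentences, start=1):
            record = {"id": str(i), "text": text.strip()}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def read_jsonl(path):
    """Read the file back, rejecting empty text and missing or duplicate ids."""
    records, seen_ids = [], set()
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if not record.get("text"):
                raise ValueError(f"line {line_no}: missing or empty 'text'")
            if record.get("id") is None or record["id"] in seen_ids:
                raise ValueError(f"line {line_no}: missing or duplicate 'id'")
            seen_ids.add(record["id"])
            records.append(record)
    return records
```

Calling `write_jsonl(my_sentences, "data/sentences.json")` and then `read_jsonl("data/sentences.json")` is a quick way to confirm the file is well formed before starting the server.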
Step 1: Configure the annotation scheme
Create a config.yaml file:
```yaml
annotation_task_name: "Named Entity Recognition"

data_files:
  - data/sentences.json

item_properties:
  id_key: id
  text_key: text

# Enable span annotation
annotation_schemes:
  - annotation_type: span
    name: entities
    description: "Highlight and label named entities in the text"
    labels:
      - name: PER
        description: "Person names"
        color: "#FF6B6B"
        keyboard_shortcut: "p"
      - name: ORG
        description: "Organizations"
        color: "#4ECDC4"
        keyboard_shortcut: "o"
      - name: LOC
        description: "Locations"
        color: "#45B7D1"
        keyboard_shortcut: "l"
      - name: DATE
        description: "Dates and times"
        color: "#96CEB4"
        keyboard_shortcut: "d"
      - name: MISC
        description: "Miscellaneous entities"
        color: "#FFEAA7"
        keyboard_shortcut: "m"
    min_spans: 0  # Allow sentences with no entities
```
Step 2: Prepare the data
Create data/sentences.json with your text data, one JSON object per line:
```json
{"id": "1", "text": "Apple Inc. announced that CEO Tim Cook will visit Paris next Tuesday."}
{"id": "2", "text": "The United Nations headquarters in New York hosted delegates from Japan."}
{"id": "3", "text": "Dr. Sarah Johnson published her research at Stanford University in March 2024."}
```
Step 3: Add annotation guidelines
Help your annotators with clear guidelines:
```yaml
# Add to config.yaml
annotation_guidelines:
  title: "NER Annotation Guidelines"
  content: |
    ## Entity Types

    **PER (Person)**: Names of people, including fictional characters
    - Examples: "John Smith", "Dr. Johnson", "Batman"

    **ORG (Organization)**: Companies, institutions, agencies
    - Examples: "Apple Inc.", "United Nations", "Stanford University"

    **LOC (Location)**: Places, including countries, cities, landmarks
    - Examples: "Paris", "New York", "Mount Everest"

    **DATE**: Dates, times, and temporal expressions
    - Examples: "Tuesday", "March 2024", "next week"

    **MISC**: Other named entities not fitting above categories
    - Examples: "Nobel Prize", "iPhone", "COVID-19"

    ## Annotation Rules

    1. Include titles (Dr., Mr.) with person names
    2. For nested entities, annotate the largest meaningful span
    3. Don't include articles (the, a) in entity spans
```
Step 4: Start annotating
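Launch your NER task:
```bash
potato start config.yaml
```
Annotation workflow
- Select text: click and drag to highlight a span
- Choose an entity type: click a label button or use its keyboard shortcut
- Edit an annotation: click an existing span to modify or delete it
- Submit: press Enter or click Submit when you are done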
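Each highlight made in this workflow boils down to a character range plus a label. The sketch below illustrates that relationship with Python slicing; it assumes an exclusive end offset, which is the convention the "Apple Inc." span (0 to 10) in the Step 5 output example follows. The helper names `make_span` and `check_spans` are hypothetical and not part of Potato.

```python
def make_span(text, surface, label, search_from=0):
    """Locate `surface` in `text` and return a span dict with an exclusive end offset."""
    start = text.find(surface, search_from)
    if start == -1:
        raise ValueError(f"{surface!r} not found in text")
    return {"start": start, "end": start + len(surface), "label": label, "text": surface}


def check_spans(text, spans):
    """Return a list of problems: bad ranges, text mismatches, or overlapping spans."""
    problems = []
    ordered = sorted(spans, key=lambda s: (s["start"], s["end"]))
    for span in ordered:
        if not (0 <= span["start"] < span["end"] <= len(text)):
            problems.append(f"out of range: {span}")
        elif text[span["start"]:span["end"]] != span["text"]:
            problems.append(f"text mismatch: {span}")
    # Adjacent spans in sorted order overlap if one starts before the previous ends.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur["start"] < prev["end"]:
            problems.append(f"overlap: {prev['text']!r} and {cur['text']!r}")
    return problems
```

Running `check_spans` over saved annotations is a cheap way to catch offsets that no longer line up with the source text, for example after the text was edited or re-encoded.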
Step 5: View the output
Annotations are saved in JSONL format, one record per line. The record below is pretty-printed for readability:
```json
{
  "id": "1",
  "text": "Apple Inc. announced that CEO Tim Cook will visit Paris next Tuesday.",
  "annotations": {
    "entities": [
      {"start": 0, "end": 10, "label": "ORG", "text": "Apple Inc."},
      {"start": 30, "end": 38, "label": "PER", "text": "Tim Cook"},
      {"start": 50, "end": 55, "label": "LOC", "text": "Paris"},
      {"start": 56, "end": 68, "label": "DATE", "text": "next Tuesday"}
    ]
  }
}
```
Tips for better NER annotation
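- Consistent guidelines: clear rules reduce disagreement
- Training examples: show edge cases before annotators start
- Regular calibration: discuss difficult cases as a team
- Measure agreement: use inter-annotator agreement to identify problems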
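For span tasks, one common way to quantify agreement between two annotators is exact-match F1 over (start, end, label) triples. This is a minimal sketch of that idea, not a metric this tutorial or Potato prescribes; it assumes span dicts with `start`, `end`, and `label` keys as in the output example, and the function name `pairwise_span_f1` is hypothetical.

```python
def pairwise_span_f1(spans_a, spans_b):
    """Exact-match F1 between two annotators' spans for the same text.

    A span counts as agreed only if start, end, and label are all identical.
    """
    set_a = {(s["start"], s["end"], s["label"]) for s in spans_a}
    set_b = {(s["start"], s["end"], s["label"]) for s in spans_b}
    if not set_a and not set_b:
        return 1.0  # both annotators marked nothing: treat as full agreement
    matched = len(set_a & set_b)
    # With one annotator as "gold" and the other as "predicted",
    # F1 simplifies to 2 * matched / (|A| + |B|) and is symmetric.
    return 2 * matched / (len(set_a) + len(set_b))
```

Items with a low score are good candidates for the calibration discussions mentioned above. Exact matching is strict: two annotators who disagree only on whether to include "Dr." in a name get no credit for that span, so a relaxed overlap-based variant may be worth considering as well.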
Next steps
- Add a training phase to onboard annotators
- Set up multiple annotators for redundancy
- Export to Hugging Face format for model training
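As a rough illustration of the export step, token-classification datasets are commonly stored as a list of tokens with one BIO tag per token. The sketch below converts character-offset spans into that shape. It assumes simple whitespace tokenization (so trailing punctuation stays attached to a token) and non-overlapping spans; the function name `spans_to_bio` and the `tokens` / `ner_tags` keys are illustrative choices, not Potato's built-in export.

```python
import re


def spans_to_bio(text, spans):
    """Convert character-offset spans into whitespace tokens with BIO tags."""
    # Each token keeps its character range so it can be compared with span offsets.
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for span in sorted(spans, key=lambda s: s["start"]):
        inside = False
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token belongs to the span if their character ranges overlap.
            if tok_start < span["end"] and tok_end > span["start"]:
                tags[i] = ("I-" if inside else "B-") + span["label"]
                inside = True
    return {"tokens": [tok for tok, _, _ in tokens], "ner_tags": tags}
```

A real pipeline would usually swap in the model's own tokenizer and map tag strings to integer ids before training; both are left out here to keep the example short.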
Need help? See our span annotation documentation for more details.