Span Annotation
Highlight and label text spans for tasks such as named entity recognition.
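Span annotation lets annotators select and label part of a text. It is commonly used for named entity recognition (NER), part-of-speech tagging, and text highlighting tasks.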
Basic Configuration
annotation_schemes:
  - annotation_type: span
    name: entities
    description: "Highlight named entities in the text"
    labels:
      - PERSON
      - ORGANIZATION
      - LOCATION

Configuration Options
Entity Labels
Define the span types annotators can create:
labels:
  - PERSON
  - ORGANIZATION
  - LOCATION
  - DATE
  - EVENT

Label Colors
Customize colors so labels are visually distinct:
label_colors:
  PERSON: "#3b82f6"
  ORGANIZATION: "#10b981"
  LOCATION: "#f59e0b"
  DATE: "#8b5cf6"
  EVENT: "#ec4899"

Colors can be given in hex format (#ff0000) or RGB format (rgb(255, 0, 0)).
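The two color formats named above can be checked before a config is deployed. The following is a minimal sketch, not part of Potato: the helper names `is_valid_color` and `invalid_colors` are hypothetical, and it assumes only six-digit hex values and `rgb()` triples in the range 0-255 are wanted.

```python
import re

# Patterns for the two formats mentioned in the text: "#rrggbb" and "rgb(r, g, b)".
HEX_RE = re.compile(r"^#[0-9a-fA-F]{6}$")
RGB_RE = re.compile(r"^rgb\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)$")


def is_valid_color(value: str) -> bool:
    """Return True for a six-digit hex color or an rgb() triple with channels in 0-255."""
    if HEX_RE.match(value):
        return True
    match = RGB_RE.match(value)
    if match is None:
        return False
    return all(0 <= int(channel) <= 255 for channel in match.groups())


def invalid_colors(label_colors: dict) -> dict:
    """Return the subset of a label_colors mapping whose values fail the check."""
    return {label: color for label, color in label_colors.items()
            if not is_valid_color(color)}


if __name__ == "__main__":
    colors = {"PERSON": "#3b82f6", "DATE": "rgb(255, 0, 0)", "EVENT": "pink"}
    print(invalid_colors(colors))
```

Whether the tool also accepts other CSS color forms (three-digit hex, named colors) is not stated here, so the sketch treats them as invalid.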
Keyboard Shortcuts
Speed up annotation with key bindings:
keyboard_shortcuts:
  PERSON: "1"
  ORGANIZATION: "2"
  LOCATION: "3"
  DATE: "4"

Tooltips
Provide guidance for each label:
tooltips:
  PERSON: "Names of people, characters, or personas"
  ORGANIZATION: "Companies, agencies, institutions"
  LOCATION: "Physical locations, addresses, geographic regions"

Overlapping Spans
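Allow Overlapping
Enable spans that may overlap:
- annotation_type: span
  name: entities
  labels:
    - PERSON
    - ROLE
  allow_overlapping: true

This is useful when the same text can carry more than one label (for example, "Dr. Smith" is a PERSON and also carries a ROLE).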
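When reviewing exported annotations, it can help to list which spans actually overlap. The sketch below is a hypothetical helper (`find_overlaps` is not a Potato function). It assumes spans are dicts with `start` and `end` character offsets and that `end` is exclusive, as in the offsets shown elsewhere on this page.

```python
def find_overlaps(spans):
    """Return pairs of spans whose [start, end) character ranges intersect."""
    ordered = sorted(spans, key=lambda s: (s["start"], s["end"]))
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second["start"] >= first["end"]:
                break  # spans are sorted by start, so no later span can intersect
            overlaps.append((first, second))
    return overlaps


if __name__ == "__main__":
    # Offsets refer to the text "Dr. Smith joined Acme."
    spans = [
        {"start": 0, "end": 9, "label": "PERSON"},         # "Dr. Smith"
        {"start": 0, "end": 3, "label": "ROLE"},           # "Dr."
        {"start": 17, "end": 21, "label": "ORGANIZATION"}, # "Acme"
    ]
    for first, second in find_overlaps(spans):
        print(first["label"], "overlaps", second["label"])
```

An empty result on data collected with overlapping disabled is a quick sanity check that the setting behaved as intended.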
Disallow Overlapping (Default)
- annotation_type: span
  name: entities
  labels:
    - PERSON
    - ORGANIZATION
  allow_overlapping: false  # Default behavior

Span Selection Modes
Word-Level Selection
Select whole words only:
- annotation_type: span
  name: entities
  selection_mode: word
  labels:
    - ENTITY

Character-Level Selection
Allow selection of partial words:
- annotation_type: span
  name: entities
  selection_mode: character
  labels:
    - ENTITY

Pre-annotated Spans
Load existing annotations for review or correction:
{
  "id": "doc1",
  "text": "John Smith works at Microsoft in Seattle.",
  "spans": [
    {"start": 0, "end": 10, "label": "PERSON"},
    {"start": 20, "end": 29, "label": "ORGANIZATION"},
    {"start": 33, "end": 40, "label": "LOCATION"}
  ]
}

Configure the scheme to load the pre-annotations:
- annotation_type: span
  name: entities
  load_pre_annotations: true
  pre_annotation_field: spans

Common NER Configurations
Standard NER (4 Types)
- annotation_type: span
  name: ner
  description: "Label named entities"
  labels:
    - PER   # Person
    - ORG   # Organization
    - LOC   # Location
    - MISC  # Miscellaneous
  label_colors:
    PER: "#3b82f6"
    ORG: "#10b981"
    LOC: "#f59e0b"
    MISC: "#6b7280"
  keyboard_shortcuts:
    PER: "1"
    ORG: "2"
    LOC: "3"
    MISC: "4"

Extended NER (OntoNotes Style)
- annotation_type: span
  name: ner_extended
  labels:
    - PERSON
    - NORP  # Nationalities, religious/political groups
    - FAC   # Facilities
    - ORG
    - GPE   # Geopolitical entities
    - LOC
    - PRODUCT
    - EVENT
    - WORK_OF_ART
    - LAW
    - LANGUAGE
    - DATE
    - TIME
    - PERCENT
    - MONEY
    - QUANTITY
    - ORDINAL
    - CARDINAL

Biomedical NER
- annotation_type: span
  name: bio_ner
  labels:
    - GENE
    - PROTEIN
    - DISEASE
    - DRUG
    - SPECIES
  label_colors:
    GENE: "#22c55e"
    PROTEIN: "#3b82f6"
    DISEASE: "#ef4444"
    DRUG: "#f59e0b"
    SPECIES: "#8b5cf6"

Social Media NER
- annotation_type: span
  name: social_ner
  labels:
    - PERSON
    - ORGANIZATION
    - LOCATION
    - PRODUCT
    - CREATIVE_WORK
    - GROUP

Spans with Attributes
Add attributes to spans for richer annotation:
annotation_schemes:
  - annotation_type: span
    name: entities
    labels:
      - PERSON
      - ORGANIZATION
  - annotation_type: radio
    name: entity_type
    description: "What type of entity is this?"
    show_for_span: entities
    labels:
      - Named
      - Nominal
      - Pronominal

Multiple Span Schemes
Annotate different aspects separately:
annotation_schemes:
  # Named entities
  - annotation_type: span
    name: entities
    description: "Label named entities"
    labels:
      - PERSON
      - ORGANIZATION
      - LOCATION
  # Sentiment expressions
  - annotation_type: span
    name: sentiment_spans
    description: "Highlight sentiment expressions"
    labels:
      - POSITIVE
      - NEGATIVE
    label_colors:
      POSITIVE: "#22c55e"
      NEGATIVE: "#ef4444"

Multi-Field Span Annotation
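New in v2.1.0
A span scheme can use the target_field option to target a specific text field in multi-field data. This is useful when your data contains several text fields and you want to annotate spans in a particular one.
Configuration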
annotation_schemes:
  - annotation_type: span
    name: source_entities
    description: "Label entities in the source text"
    target_field: "source_text"
    labels:
      - PERSON
      - ORGANIZATION
  - annotation_type: span
    name: summary_entities
    description: "Label entities in the summary"
    target_field: "summary"
    labels:
      - PERSON
      - ORGANIZATION

Multi-Field Data Format
Your data should contain separate text fields:
{
  "id": "doc1",
  "source_text": "John Smith works at Microsoft in Seattle.",
  "summary": "Smith is employed by Microsoft."
}

Output Format
When target_field is used, annotations are organized by field key:
{
  "id": "doc1",
  "source_entities": {
    "source_text": [
      {"start": 0, "end": 10, "text": "John Smith", "label": "PERSON"},
      {"start": 20, "end": 29, "text": "Microsoft", "label": "ORGANIZATION"}
    ]
  },
  "summary_entities": {
    "summary": [
      {"start": 0, "end": 5, "text": "Smith", "label": "PERSON"},
      {"start": 21, "end": 30, "text": "Microsoft", "label": "ORGANIZATION"}
    ]
  }
}

For a complete working example, see project-hub/simple_examples/simple-multi-span/ in the Potato repository.
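Because each span's offsets refer to its own field, a useful post-export check is to slice every field at the recorded offsets and compare the result with the stored text. The sketch below assumes the field-keyed output layout described in this section and end-exclusive offsets. `mismatched_spans` is a hypothetical helper, not part of Potato.

```python
def mismatched_spans(item, output):
    """Return (scheme, field, span) triples whose offsets do not reproduce span["text"]."""
    problems = []
    for scheme, by_field in output.items():
        if not isinstance(by_field, dict):
            continue  # skips non-scheme entries such as "id"
        for field, spans in by_field.items():
            source = item[field]
            for span in spans:
                if source[span["start"]:span["end"]] != span["text"]:
                    problems.append((scheme, field, span))
    return problems


if __name__ == "__main__":
    item = {
        "id": "doc1",
        "source_text": "John Smith works at Microsoft in Seattle.",
        "summary": "Smith is employed by Microsoft.",
    }
    output = {
        "id": "doc1",
        "source_entities": {
            "source_text": [
                {"start": 0, "end": 10, "text": "John Smith", "label": "PERSON"},
                {"start": 20, "end": 29, "text": "Microsoft", "label": "ORGANIZATION"},
            ]
        },
        "summary_entities": {
            "summary": [
                {"start": 0, "end": 5, "text": "Smith", "label": "PERSON"},
                {"start": 21, "end": 30, "text": "Microsoft", "label": "ORGANIZATION"},
            ]
        },
    }
    print(mismatched_spans(item, output))  # prints []
```

A non-empty result usually points to an off-by-one offset or to annotations that were recorded against a different version of the text.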
Display Options
Show Labels in Spans
Show the label text inside the highlighted span:
- annotation_type: span
  name: entities
  show_label_in_span: true

Underline Style
Use an underline instead of a background highlight:
- annotation_type: span
  name: entities
  display_style: underline

Output Format
Span annotations are saved as character offsets:
{
  "id": "doc1",
  "entities": [
    {
      "start": 0,
      "end": 10,
      "text": "John Smith",
      "label": "PERSON"
    },
    {
      "start": 20,
      "end": 29,
      "text": "Microsoft",
      "label": "ORGANIZATION"
    }
  ]
}

Complete Example: NER Task
task_name: "Named Entity Recognition"
data_files:
  - path: data/documents.json
    text_field: text
annotation_schemes:
  - annotation_type: span
    name: entities
    description: "Highlight and label all named entities"
    labels:
      - PERSON
      - ORGANIZATION
      - LOCATION
      - DATE
      - MONEY
    label_colors:
      PERSON: "#3b82f6"
      ORGANIZATION: "#10b981"
      LOCATION: "#f59e0b"
      DATE: "#8b5cf6"
      MONEY: "#ec4899"
    keyboard_shortcuts:
      PERSON: "1"
      ORGANIZATION: "2"
      LOCATION: "3"
      DATE: "4"
      MONEY: "5"
    tooltips:
      PERSON: "Names of people"
      ORGANIZATION: "Companies, agencies, institutions"
      LOCATION: "Cities, countries, addresses"
      DATE: "Dates and time expressions"
      MONEY: "Monetary values"
    allow_overlapping: false
    selection_mode: word
  - annotation_type: radio
    name: difficulty
    description: "How difficult was this document to annotate?"
    labels:
      - Easy
      - Medium
      - Hard

Discontinuous Spans
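New in v2.2.0
Set the allow_discontinuous parameter to enable discontinuous text spans. Annotators can then select several non-adjacent text segments as a single span annotation, which suits discontinuous entities and split expressions.
- annotation_type: span
  name: entities
  labels:
    - PERSON
    - ORGANIZATION
  allow_discontinuous: true

Once enabled, annotators hold a modifier key while selecting additional text segments to add them to the current span. The output then contains a start/end pair for each segment.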
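Downstream code often needs the surface text of a discontinuous span. The sketch below shows one way to rebuild it from a list of start/end pairs. The exact output schema is not given here, so the plain list of `(start, end)` tuples and the helper name `discontinuous_text` are assumptions for illustration only, with `end` taken to be exclusive.

```python
def discontinuous_text(text, segments, joiner=" "):
    """Join the text of each (start, end) segment in document order."""
    ordered = sorted(segments)  # sort by start offset so output follows reading order
    return joiner.join(text[start:end] for start, end in ordered)


if __name__ == "__main__":
    sentence = "John and Mary Smith"
    # Hypothetical segments for the entity "John ... Smith".
    segments = [(14, 19), (0, 4)]
    print(discontinuous_text(sentence, segments))  # prints John Smith
```

Adapt the segment extraction to whatever field names the exported file actually uses.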
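Entity Linking Integration
New in v2.2.0
Add an entity_linking block to a span scheme to link span annotations to an external knowledge base (Wikidata, UMLS, or a custom REST API):
- annotation_type: span
  name: entities
  labels:
    - PERSON
    - ORGANIZATION
    - LOCATION
  entity_linking:
    enabled: true
    knowledge_bases:
      - name: wikidata
        type: wikidata
        language: en

With entity linking enabled, a link icon appears on each span's control bar. Clicking it opens a search modal for finding and linking the matching knowledge-base entity. See the Entity Linking documentation for details.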
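To preview which candidates a Wikidata lookup might return for a span, you can query Wikidata's public `wbsearchentities` API yourself. The sketch below only builds the request URL. It is independent of how Potato performs the lookup internally, and the helper name `wikidata_search_url` is hypothetical.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def wikidata_search_url(mention, language="en", limit=5):
    """Build a wbsearchentities request URL for the given span text."""
    params = {
        "action": "wbsearchentities",  # Wikidata entity search module
        "search": mention,             # the span text to look up
        "language": language,          # matches the `language` key in the config
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


if __name__ == "__main__":
    print(wikidata_search_url("Microsoft"))
```

Fetching the URL (for example with `urllib.request.urlopen`) returns JSON whose `search` list holds the candidate entities.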
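Best Practices
- Use distinct colors so labels are easy to tell apart
- Give each entity type a clear tooltip with examples
- Enable keyboard shortcuts to speed up annotation
- Use word-level selection unless you need character precision
- Consider pre-annotation to speed up correction workflows
- Test your overlap settings against the annotation guidelines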
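Several of the practices above can be checked mechanically before a task goes live. The sketch below lints one span scheme after it has been loaded into a Python dict (for example from the YAML config). `lint_span_scheme` is a hypothetical helper, and the rules it applies are one possible reading of the list above, not checks that Potato performs.

```python
from collections import Counter


def lint_span_scheme(scheme):
    """Return human-readable warnings for one span scheme given as a dict."""
    warnings = []
    labels = scheme.get("labels", [])
    colors = scheme.get("label_colors", {})
    shortcuts = scheme.get("keyboard_shortcuts", {})
    tooltips = scheme.get("tooltips", {})

    # Every label should have a tooltip and a shortcut.
    for label in labels:
        if label not in tooltips:
            warnings.append(f"{label}: no tooltip")
        if label not in shortcuts:
            warnings.append(f"{label}: no keyboard shortcut")

    # Colors and shortcut keys should not be reused across labels.
    for color, count in Counter(c.lower() for c in colors.values()).items():
        if count > 1:
            warnings.append(f"color {color} is shared by {count} labels")
    for key, count in Counter(shortcuts.values()).items():
        if count > 1:
            warnings.append(f"shortcut {key!r} is shared by {count} labels")
    return warnings


if __name__ == "__main__":
    scheme = {
        "labels": ["PER", "ORG"],
        "label_colors": {"PER": "#3b82f6", "ORG": "#3B82F6"},
        "keyboard_shortcuts": {"PER": "1", "ORG": "1"},
        "tooltips": {"PER": "Names of people"},
    }
    for warning in lint_span_scheme(scheme):
        print(warning)
```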
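Further Reading
- Entity Linking - link spans to knowledge bases
- Coreference Chains - group coreferent mentions
- Event Annotation - N-ary event structures with span arguments
- Span Linking - create relations between spans
- Instance Display - multi-field content display with span targets
- UI Configuration - customize span colors
- Productivity Features - keyboard shortcuts
For implementation details, see the source documentation.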