Span Annotation

Highlight and label spans of text for tasks such as named entity recognition.

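Span annotation lets annotators select and label a portion of the text. It is commonly used for named entity recognition (NER), part-of-speech tagging, and text highlighting tasks.

Basic Configuration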

yaml
annotation_schemes:
  - annotation_type: span
    name: entities
    description: "Highlight named entities in the text"
    labels:
      - PERSON
      - ORGANIZATION
      - LOCATION

Configuration Options

Entity Labels

Define the span types annotators can create:

yaml
labels:
  - PERSON
  - ORGANIZATION
  - LOCATION
  - DATE
  - EVENT

Label Colors

Customize colors so labels are visually distinct:

yaml
label_colors:
  PERSON: "#3b82f6"
  ORGANIZATION: "#10b981"
  LOCATION: "#f59e0b"
  DATE: "#8b5cf6"
  EVENT: "#ec4899"

Colors can be given in hex format (#ff0000) or RGB format (rgb(255, 0, 0)).
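
If a palette is kept in one notation and needs to be written in the other, the conversion is mechanical. The following is a minimal Python sketch; the helper name hex_to_rgb_string is hypothetical and is not part of Potato.

```python
import re

def hex_to_rgb_string(color: str) -> str:
    """Convert a '#rrggbb' color into the equivalent 'rgb(r, g, b)' string."""
    match = re.fullmatch(r"#([0-9a-fA-F]{6})", color)
    if match is None:
        raise ValueError(f"not a 6-digit hex color: {color!r}")
    digits = match.group(1)
    # Each channel is two hex digits: red, green, blue.
    r, g, b = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    return f"rgb({r}, {g}, {b})"

print(hex_to_rgb_string("#ff0000"))  # rgb(255, 0, 0)
print(hex_to_rgb_string("#3b82f6"))  # rgb(59, 130, 246)
```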

Keyboard Shortcuts

Speed up annotation with key bindings:

yaml
keyboard_shortcuts:
  PERSON: "1"
  ORGANIZATION: "2"
  LOCATION: "3"
  DATE: "4"

Tooltips

Provide guidance for each label:

yaml
tooltips:
  PERSON: "Names of people, characters, or personas"
  ORGANIZATION: "Companies, agencies, institutions"
  LOCATION: "Physical locations, addresses, geographic regions"

Overlapping Spans

Allow Overlapping

Enable spans that can overlap:

yaml
- annotation_type: span
  name: entities
  labels:
    - PERSON
    - ROLE
  allow_overlapping: true

This is useful when the same text can carry more than one label (for example, "Dr. Smith" is a PERSON and also contains a ROLE).
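
When reviewing exported data, it can help to see which spans actually overlap. The sketch below assumes spans are dictionaries with start, end, and label keys and that end is exclusive, as in the examples on this page; find_overlaps is a hypothetical helper, not a Potato function.

```python
from itertools import combinations

def find_overlaps(spans):
    """Return every pair of spans whose [start, end) ranges intersect."""
    return [
        (a, b)
        for a, b in combinations(spans, 2)
        if a["start"] < b["end"] and b["start"] < a["end"]
    ]

text = "Dr. Smith met Mayor Lee."
spans = [
    {"start": 0, "end": 9, "label": "PERSON"},   # "Dr. Smith"
    {"start": 0, "end": 3, "label": "ROLE"},     # "Dr."
    {"start": 14, "end": 19, "label": "ROLE"},   # "Mayor"
]

for a, b in find_overlaps(spans):
    print(a["label"], repr(text[a["start"]:a["end"]]),
          "overlaps", b["label"], repr(text[b["start"]:b["end"]]))
# PERSON 'Dr. Smith' overlaps ROLE 'Dr.'
```

With allow_overlapping set to false, a check like this should report no pairs.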

Disallow Overlapping (Default)

yaml
- annotation_type: span
  name: entities
  labels:
    - PERSON
    - ORGANIZATION
  allow_overlapping: false  # Default behavior

Span Selection Modes

Word-Level Selection

Select whole words only:

yaml
- annotation_type: span
  name: entities
  selection_mode: word
  labels:
    - ENTITY

Character-Level Selection

Allow selecting parts of words:

yaml
- annotation_type: span
  name: entities
  selection_mode: character
  labels:
    - ENTITY
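
To make the difference between the two modes concrete, the sketch below expands a character-level selection to whole words. It is an illustration under a simple assumption (words are separated by whitespace), not Potato's implementation, and snap_to_words is a hypothetical name.

```python
from typing import Tuple

def snap_to_words(text: str, start: int, end: int) -> Tuple[int, int]:
    """Expand [start, end) outward until both edges sit next to whitespace."""
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end

text = "John Smith works at Microsoft in Seattle."
print(text[22:26])                          # cros  (character-level selection)
start, end = snap_to_words(text, 22, 26)
print(text[start:end])                      # Microsoft  (word-level selection)
```

Because this sketch splits on whitespace only, trailing punctuation stays attached to a word ("Seattle." rather than "Seattle"); a real tokenizer would treat it separately.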

Pre-Annotated Spans

Load existing annotations for review or correction:

json
{
  "id": "doc1",
  "text": "John Smith works at Microsoft in Seattle.",
  "spans": [
    {"start": 0, "end": 10, "label": "PERSON"},
    {"start": 20, "end": 29, "label": "ORGANIZATION"},
    {"start": 33, "end": 40, "label": "LOCATION"}
  ]
}

Configure the scheme to load the pre-annotations:

yaml
- annotation_type: span
  name: entities
  load_pre_annotations: true
  pre_annotation_field: spans
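
Pre-annotations usually come from a model or an earlier annotation pass, so their offsets are worth checking before annotators see them. The sketch below validates a record in the format shown above; check_spans is a hypothetical helper, and the specific checks are assumptions about what is worth flagging.

```python
import json

record = json.loads("""
{
  "id": "doc1",
  "text": "John Smith works at Microsoft in Seattle.",
  "spans": [
    {"start": 0, "end": 10, "label": "PERSON"},
    {"start": 20, "end": 29, "label": "ORGANIZATION"},
    {"start": 33, "end": 40, "label": "LOCATION"}
  ]
}
""")

def check_spans(record, allowed_labels):
    """Return a list of problems; an empty list means the record looks sane."""
    problems = []
    text = record["text"]
    for span in record["spans"]:
        start, end, label = span["start"], span["end"], span["label"]
        if not (0 <= start < end <= len(text)):
            problems.append(f"bad offsets {start}-{end}")
        elif text[start:end] != text[start:end].strip():
            problems.append(f"span {start}-{end} includes surrounding whitespace")
        if label not in allowed_labels:
            problems.append(f"unknown label {label!r}")
    return problems

print(check_spans(record, {"PERSON", "ORGANIZATION", "LOCATION"}))  # []
print([record["text"][s["start"]:s["end"]] for s in record["spans"]])
# ['John Smith', 'Microsoft', 'Seattle']
```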

Common NER Configurations

Standard NER (4 Types)

yaml
- annotation_type: span
  name: ner
  description: "Label named entities"
  labels:
    - PER    # Person
    - ORG    # Organization
    - LOC    # Location
    - MISC   # Miscellaneous
  label_colors:
    PER: "#3b82f6"
    ORG: "#10b981"
    LOC: "#f59e0b"
    MISC: "#6b7280"
  keyboard_shortcuts:
    PER: "1"
    ORG: "2"
    LOC: "3"
    MISC: "4"

Extended NER (OntoNotes Style)

yaml
- annotation_type: span
  name: ner_extended
  labels:
    - PERSON
    - NORP        # Nationalities, religious/political groups
    - FAC         # Facilities
    - ORG
    - GPE         # Geopolitical entities
    - LOC
    - PRODUCT
    - EVENT
    - WORK_OF_ART
    - LAW
    - LANGUAGE
    - DATE
    - TIME
    - PERCENT
    - MONEY
    - QUANTITY
    - ORDINAL
    - CARDINAL

Biomedical NER

yaml
- annotation_type: span
  name: bio_ner
  labels:
    - GENE
    - PROTEIN
    - DISEASE
    - DRUG
    - SPECIES
  label_colors:
    GENE: "#22c55e"
    PROTEIN: "#3b82f6"
    DISEASE: "#ef4444"
    DRUG: "#f59e0b"
    SPECIES: "#8b5cf6"

Social Media NER

yaml
- annotation_type: span
  name: social_ner
  labels:
    - PERSON
    - ORGANIZATION
    - LOCATION
    - PRODUCT
    - CREATIVE_WORK
    - GROUP

Spans with Attributes

Add attributes to spans for richer annotation:

yaml
annotation_schemes:
  - annotation_type: span
    name: entities
    labels:
      - PERSON
      - ORGANIZATION
 
  - annotation_type: radio
    name: entity_type
    description: "What type of entity is this?"
    show_for_span: entities
    labels:
      - Named
      - Nominal
      - Pronominal

Multiple Span Schemes

Annotate different aspects separately:

yaml
annotation_schemes:
  # Named entities
  - annotation_type: span
    name: entities
    description: "Label named entities"
    labels:
      - PERSON
      - ORGANIZATION
      - LOCATION
 
  # Sentiment expressions
  - annotation_type: span
    name: sentiment_spans
    description: "Highlight sentiment expressions"
    labels:
      - POSITIVE
      - NEGATIVE
    label_colors:
      POSITIVE: "#22c55e"
      NEGATIVE: "#ef4444"

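Multi-Field Span Annotation

New in v2.1.0

Span annotation can target a specific text field in multi-field data with the target_field option. This is useful when your data contains several text fields and you want to annotate spans in one particular field.

Configuration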

yaml
annotation_schemes:
  - annotation_type: span
    name: source_entities
    description: "Label entities in the source text"
    target_field: "source_text"
    labels:
      - PERSON
      - ORGANIZATION
 
  - annotation_type: span
    name: summary_entities
    description: "Label entities in the summary"
    target_field: "summary"
    labels:
      - PERSON
      - ORGANIZATION

Multi-Field Data Format

Your data should contain the separate text fields:

json
{
  "id": "doc1",
  "source_text": "John Smith works at Microsoft in Seattle.",
  "summary": "Smith is employed by Microsoft."
}

Output Format

When target_field is used, annotations are organized by field key:

json
{
  "id": "doc1",
  "source_entities": {
    "source_text": [
      {"start": 0, "end": 10, "text": "John Smith", "label": "PERSON"},
      {"start": 20, "end": 29, "text": "Microsoft", "label": "ORGANIZATION"}
    ]
  },
  "summary_entities": {
    "summary": [
      {"start": 0, "end": 5, "text": "Smith", "label": "PERSON"},
      {"start": 21, "end": 30, "text": "Microsoft", "label": "ORGANIZATION"}
    ]
  }
}

For a complete working example, see project-hub/simple_examples/simple-multi-span/ in the Potato repository.
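
Downstream code often wants the field-keyed output as flat rows. The sketch below walks the scheme, field, and span levels of the structure shown above and looks up each surface string in the original data by offset; flatten_field_spans is a hypothetical helper, and the records are typed in by hand rather than read from a Potato output file.

```python
def flatten_field_spans(annotations, data):
    """Yield (scheme, field, label, surface text) for every field-keyed span."""
    rows = []
    for scheme, by_field in annotations.items():
        if not isinstance(by_field, dict):
            continue  # skip scalar entries such as "id"
        for field, spans in by_field.items():
            for span in spans:
                surface = data[field][span["start"]:span["end"]]
                rows.append((scheme, field, span["label"], surface))
    return rows

data = {
    "id": "doc1",
    "source_text": "John Smith works at Microsoft in Seattle.",
    "summary": "Smith is employed by Microsoft.",
}
annotations = {
    "id": "doc1",
    "source_entities": {"source_text": [
        {"start": 0, "end": 10, "label": "PERSON"},
        {"start": 20, "end": 29, "label": "ORGANIZATION"},
    ]},
    "summary_entities": {"summary": [
        {"start": 0, "end": 5, "label": "PERSON"},
        {"start": 21, "end": 30, "label": "ORGANIZATION"},
    ]},
}

for row in flatten_field_spans(annotations, data):
    print(row)
# ('source_entities', 'source_text', 'PERSON', 'John Smith')
# ('source_entities', 'source_text', 'ORGANIZATION', 'Microsoft')
# ('summary_entities', 'summary', 'PERSON', 'Smith')
# ('summary_entities', 'summary', 'ORGANIZATION', 'Microsoft')
```

Offsets are relative to the field they belong to, which is why "Microsoft" sits at different positions in source_text and summary.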

Display Options

Show Label in Span

Display the label text inside the highlighted span:

yaml
- annotation_type: span
  name: entities
  show_label_in_span: true

Underline Style

Use an underline instead of a background highlight:

yaml
- annotation_type: span
  name: entities
  display_style: underline

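Output Format

Span annotations are saved as character offsets. In the example, start is inclusive and end is exclusive, so 0 to 10 covers the ten characters of "John Smith":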

json
{
  "id": "doc1",
  "entities": [
    {
      "start": 0,
      "end": 10,
      "text": "John Smith",
      "label": "PERSON"
    },
    {
      "start": 20,
      "end": 29,
      "text": "Microsoft",
      "label": "ORGANIZATION"
    }
  ]
}
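
Many NER training pipelines expect token-level BIO tags rather than character offsets. The sketch below converts the output above, assuming whitespace tokenization and non-overlapping spans; spans_to_bio is a hypothetical helper, and a real pipeline would use the tokenizer of its own model.

```python
import re

def spans_to_bio(text, spans):
    """Tag whitespace-separated tokens with B-/I-/O labels from character spans."""
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for span in spans:
        # A token belongs to a span when it lies entirely inside [start, end).
        inside = [i for i, (_, s, e) in enumerate(tokens)
                  if s >= span["start"] and e <= span["end"]]
        for n, i in enumerate(inside):
            tags[i] = ("B-" if n == 0 else "I-") + span["label"]
    return [(tok, tag) for (tok, _, _), tag in zip(tokens, tags)]

text = "John Smith works at Microsoft in Seattle."
entities = [
    {"start": 0, "end": 10, "text": "John Smith", "label": "PERSON"},
    {"start": 20, "end": 29, "text": "Microsoft", "label": "ORGANIZATION"},
]

for token, tag in spans_to_bio(text, entities):
    print(f"{token}\t{tag}")
```

The first two tokens come out as B-PERSON and I-PERSON, "Microsoft" as B-ORGANIZATION, and everything else as O.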

Complete Example: NER Task

yaml
task_name: "Named Entity Recognition"
 
data_files:
  - path: data/documents.json
    text_field: text
 
annotation_schemes:
  - annotation_type: span
    name: entities
    description: "Highlight and label all named entities"
    labels:
      - PERSON
      - ORGANIZATION
      - LOCATION
      - DATE
      - MONEY
    label_colors:
      PERSON: "#3b82f6"
      ORGANIZATION: "#10b981"
      LOCATION: "#f59e0b"
      DATE: "#8b5cf6"
      MONEY: "#ec4899"
    keyboard_shortcuts:
      PERSON: "1"
      ORGANIZATION: "2"
      LOCATION: "3"
      DATE: "4"
      MONEY: "5"
    tooltips:
      PERSON: "Names of people"
      ORGANIZATION: "Companies, agencies, institutions"
      LOCATION: "Cities, countries, addresses"
      DATE: "Dates and time expressions"
      MONEY: "Monetary values"
    allow_overlapping: false
    selection_mode: word
 
  - annotation_type: radio
    name: difficulty
    description: "How difficult was this document to annotate?"
    labels:
      - Easy
      - Medium
      - Hard
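
A scheme with several per-label mappings is easy to get subtly wrong, for example by misspelling a label in label_colors or binding two labels to the same key. The sketch below checks a scheme that has already been parsed into a dictionary; lint_span_scheme is a hypothetical helper, and these checks are a suggestion, not rules that Potato is known to enforce.

```python
from collections import Counter

def lint_span_scheme(scheme):
    """Return a list of likely mistakes in one span scheme dictionary."""
    labels = set(scheme.get("labels", []))
    problems = []
    # Keys in per-label mappings that match no declared label are probably typos.
    for option in ("label_colors", "keyboard_shortcuts", "tooltips"):
        for key in scheme.get(option, {}):
            if key not in labels:
                problems.append(f"{option}: {key!r} is not a declared label")
    # Two labels bound to the same key would be ambiguous.
    counts = Counter(scheme.get("keyboard_shortcuts", {}).values())
    for key, n in counts.items():
        if n > 1:
            problems.append(f"keyboard_shortcuts: key {key!r} is bound {n} times")
    return problems

scheme = {
    "annotation_type": "span",
    "name": "entities",
    "labels": ["PERSON", "ORGANIZATION", "LOCATION", "DATE", "MONEY"],
    "keyboard_shortcuts": {"PERSON": "1", "ORGANIZATION": "2",
                           "LOCATION": "3", "DATE": "4", "MONEY": "5"},
}
print(lint_span_scheme(scheme))  # []

broken = dict(scheme, keyboard_shortcuts={"PERSON": "1", "ORG": "1"})
for problem in lint_span_scheme(broken):
    print(problem)
```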

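Discontinuous Spans

New in v2.2.0

Set the allow_discontinuous parameter to enable discontinuous text spans. Annotators can then select several non-adjacent text segments as a single span annotation, which suits discontinuous entities and split expressions.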

yaml
- annotation_type: span
  name: entities
  labels:
    - PERSON
    - ORGANIZATION
  allow_discontinuous: true

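When enabled, annotators hold a modifier key while selecting additional text segments to add them to the current span. The output contains multiple start/end pairs, one per segment.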
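
The exact output layout for discontinuous spans is not shown here, so the sketch below assumes one: a hypothetical segments list holding one start/end pair per segment. Check a real output file for the actual key names before relying on it. The helper join_segments is likewise hypothetical.

```python
def join_segments(text, segments, separator=" "):
    """Rebuild the surface form of a discontinuous span from its segments."""
    ordered = sorted(segments, key=lambda seg: seg["start"])
    return separator.join(text[seg["start"]:seg["end"]] for seg in ordered)

text = "Bill and Melinda Gates announced the grant."
annotation = {
    "label": "PERSON",
    # Assumed layout: one start/end pair per selected segment.
    "segments": [{"start": 0, "end": 4}, {"start": 17, "end": 22}],
}
print(join_segments(text, annotation["segments"]))  # Bill Gates
```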

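Entity Linking Integration

New in v2.2.0

Add an entity_linking configuration block to a span scheme to link span annotations to an external knowledge base (Wikidata, UMLS, or a custom REST API):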

yaml
- annotation_type: span
  name: entities
  labels:
    - PERSON
    - ORGANIZATION
    - LOCATION
  entity_linking:
    enabled: true
    knowledge_bases:
      - name: wikidata
        type: wikidata
        language: en

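With entity linking enabled, a link icon appears on each span's control bar. Clicking it opens a search modal for finding and linking a matching knowledge-base entity. See the entity linking documentation for details.

Best Practices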

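  1. Use distinct colors so labels are easy to tell apart
  2. Provide clear tooltips and examples for each entity type
  3. Enable keyboard shortcuts to speed up annotation
  4. Use word-level selection unless you need character precision
  5. Consider pre-annotation to speed up correction workflows
  6. Test your overlap settings against your annotation guidelines

Further Reading

For implementation details, see the source documentation.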