
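Build Your First Named Entity Recognition Annotation Task

A step-by-step tutorial: create a named entity recognition annotation task with span labels and keyboard shortcuts.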

Potato Team

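Named entity recognition (NER) is one of the most common NLP tasks. In this tutorial, you'll learn how to build a complete NER annotation interface with text highlighting, keyboard shortcuts, and entity type selection.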

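What We'll Build

By the end of this tutorial, you'll have an annotation interface in which annotators can:

  • Highlight text spans by clicking and dragging
  • Assign entity types (person, organization, location, etc.)
  • Use keyboard shortcuts to speed up annotation
  • Edit or delete existing annotations

Prerequisites

  • Potato installed (pip install potato-annotation)
  • Basic knowledge of YAML
  • Sample text data to annotate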

Step 1: Configure the Annotation Scheme

Create a config.yaml file:

yaml
annotation_task_name: "Named Entity Recognition"
 
data_files:
  - data/sentences.json
 
item_properties:
  id_key: id
  text_key: text
 
# Enable span annotation
annotation_schemes:
  - annotation_type: span
    name: entities
    description: "Highlight and label named entities in the text"
    labels:
      - name: PER
        description: "Person names"
        color: "#FF6B6B"
        keyboard_shortcut: "p"
      - name: ORG
        description: "Organizations"
        color: "#4ECDC4"
        keyboard_shortcut: "o"
      - name: LOC
        description: "Locations"
        color: "#45B7D1"
        keyboard_shortcut: "l"
      - name: DATE
        description: "Dates and times"
        color: "#96CEB4"
        keyboard_shortcut: "d"
      - name: MISC
        description: "Miscellaneous entities"
        color: "#FFEAA7"
        keyboard_shortcut: "m"
    min_spans: 0  # Allow sentences with no entities
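
Because each label gets its own single-key shortcut, it can be worth checking that no key is assigned twice before you launch the task. The sketch below is a hedged illustration, not part of Potato: `find_shortcut_conflicts` is a hypothetical helper, the `LABELS` list is copied by hand from the config above, and comparing keys case-insensitively is an assumption. If you prefer to read the file directly, you could load it with a YAML parser such as PyYAML's `yaml.safe_load` and pass in the scheme's `labels` list instead.

```python
from collections import defaultdict

# Labels mirrored by hand from the `entities` scheme in config.yaml.
LABELS = [
    {"name": "PER", "keyboard_shortcut": "p"},
    {"name": "ORG", "keyboard_shortcut": "o"},
    {"name": "LOC", "keyboard_shortcut": "l"},
    {"name": "DATE", "keyboard_shortcut": "d"},
    {"name": "MISC", "keyboard_shortcut": "m"},
]


def find_shortcut_conflicts(labels):
    """Return {shortcut: [label names]} for every shortcut used more than once.

    Hypothetical helper; shortcuts are compared case-insensitively (an assumption).
    """
    names_by_key = defaultdict(list)
    for label in labels:
        key = label.get("keyboard_shortcut")
        if key is not None:
            names_by_key[key.lower()].append(label["name"])
    return {key: names for key, names in names_by_key.items() if len(names) > 1}


if __name__ == "__main__":
    # An empty dict means every shortcut is unique.
    print(find_shortcut_conflicts(LABELS))
```

With the five labels above the result is an empty dict. Adding, say, a PRODUCT label that also uses "p" would be reported under that key.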

Step 2: Prepare Your Data

Create data/sentences.json with your text data, one JSON object per line:

json
{"id": "1", "text": "Apple Inc. announced that CEO Tim Cook will visit Paris next Tuesday."}
{"id": "2", "text": "The United Nations headquarters in New York hosted delegates from Japan."}
{"id": "3", "text": "Dr. Sarah Johnson published her research at Stanford University in March 2024."}
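
Before loading a larger file, a quick sanity check can catch malformed lines, missing keys, and duplicate ids. The following is a minimal sketch using only the Python standard library; `validate_records` is a hypothetical helper, and its `id_key` and `text_key` defaults simply mirror the `item_properties` values in the config from Step 1.

```python
import json


def validate_records(lines, id_key="id", text_key="text"):
    """Parse one-JSON-object-per-line input and return (records, problems)."""
    records, problems, seen_ids = [], [], set()
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {line_no}: expected a JSON object")
            continue
        missing = [key for key in (id_key, text_key) if key not in record]
        if missing:
            problems.append(f"line {line_no}: missing key(s) {missing}")
            continue
        record_id = str(record[id_key])  # compare ids as strings
        if record_id in seen_ids:
            problems.append(f"line {line_no}: duplicate id {record_id!r}")
            continue
        seen_ids.add(record_id)
        records.append(record)
    return records, problems


# Two of the sample sentences, held in memory so the sketch runs anywhere.
SAMPLE_LINES = [
    '{"id": "1", "text": "Apple Inc. announced that CEO Tim Cook will visit Paris next Tuesday."}',
    '{"id": "2", "text": "The United Nations headquarters in New York hosted delegates from Japan."}',
]

if __name__ == "__main__":
    records, problems = validate_records(SAMPLE_LINES)
    print(len(records), "valid records;", len(problems), "problems")
```

To check the real file, pass an open file handle instead of the list, for example `validate_records(open("data/sentences.json", encoding="utf-8"))`.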

Step 3: Add Annotation Guidelines

Help your annotators with clear guidelines:

yaml
# Add to config.yaml
annotation_guidelines:
  title: "NER Annotation Guidelines"
  content: |
    ## Entity Types
 
    **PER (Person)**: Names of people, including fictional characters
    - Examples: "John Smith", "Dr. Johnson", "Batman"
 
    **ORG (Organization)**: Companies, institutions, agencies
    - Examples: "Apple Inc.", "United Nations", "Stanford University"
 
    **LOC (Location)**: Places, including countries, cities, landmarks
    - Examples: "Paris", "New York", "Mount Everest"
 
    **DATE**: Dates, times, and temporal expressions
    - Examples: "Tuesday", "March 2024", "next week"
 
    **MISC**: Other named entities not fitting above categories
    - Examples: "Nobel Prize", "iPhone", "COVID-19"
 
    ## Annotation Rules
    1. Include titles (Dr., Mr.) with person names
    2. For nested entities, annotate the largest meaningful span
    3. Don't include articles (the, a) in entity spans

Step 4: Start Annotating

Launch your NER task:

bash
potato start config.yaml

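Annotation Workflow

  1. Select text: click and drag to highlight a span
  2. Choose an entity type: click a label button or press its keyboard shortcut
  3. Edit annotations: click an existing span to modify or delete it
  4. Submit: press Enter or click Submit when you're done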

Step 5: Review the Output

Annotations are saved in JSONL format, one record per line (pretty-printed here for readability):

json
{
  "id": "1",
  "text": "Apple Inc. announced that CEO Tim Cook will visit Paris next Tuesday.",
  "annotations": {
    "entities": [
      {"start": 0, "end": 10, "label": "ORG", "text": "Apple Inc."},
      {"start": 30, "end": 38, "label": "PER", "text": "Tim Cook"},
      {"start": 50, "end": 55, "label": "LOC", "text": "Paris"},
      {"start": 56, "end": 68, "label": "DATE", "text": "next Tuesday"}
    ]
  }
}
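
Character offsets are easy to get wrong, so a useful first check on the output is that each span's stored text equals the slice of the sentence it points to. The sketch below assumes the record layout shown in the example above and end-exclusive offsets (so that `text[start:end]` yields the span). `find_span_mismatches` and `make_span` are hypothetical helpers, and the offsets are computed from the sentence rather than typed by hand.

```python
def find_span_mismatches(record, scheme="entities"):
    """Return the spans whose stored text differs from text[start:end]."""
    text = record["text"]
    mismatches = []
    for span in record["annotations"][scheme]:
        actual = text[span["start"]:span["end"]]
        if actual != span["text"]:
            # Keep the original span and add what the offsets really cover.
            mismatches.append({**span, "actual": actual})
    return mismatches


def make_span(text, surface, label):
    """Build a span dict for the first occurrence of `surface` in `text`."""
    start = text.index(surface)
    return {"start": start, "end": start + len(surface), "label": label, "text": surface}


TEXT = "Apple Inc. announced that CEO Tim Cook will visit Paris next Tuesday."
RECORD = {
    "id": "1",
    "text": TEXT,
    "annotations": {
        "entities": [
            make_span(TEXT, "Apple Inc.", "ORG"),
            make_span(TEXT, "Tim Cook", "PER"),
            make_span(TEXT, "Paris", "LOC"),
            make_span(TEXT, "next Tuesday", "DATE"),
        ]
    },
}

if __name__ == "__main__":
    # An empty list means every span lines up with the sentence text.
    print(find_span_mismatches(RECORD))
```

Running the same check over every line of the output file gives a quick way to spot records whose offsets have drifted, for example after the source text was edited.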

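Tips for Better NER Annotation

  1. Consistent guidelines: clear rules reduce disagreement
  2. Training examples: show edge cases before annotators start
  3. Regular calibration: discuss difficult cases as a team
  4. Measure agreement: use inter-annotator agreement to identify problems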
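
As one way to put the agreement tip into practice, the sketch below computes exact-match pairwise F1 between two annotators' spans for the same sentence. This is only one common choice for span tasks (token-level measures such as Cohen's kappa are another), and it is not something the tutorial's tooling is claimed to provide. `pairwise_f1` is a hypothetical helper, and the span dicts are assumed to follow the `start`/`end`/`label` layout of the output in Step 5.

```python
def span_set(spans):
    """Reduce span dicts to comparable (start, end, label) tuples."""
    return {(s["start"], s["end"], s["label"]) for s in spans}


def pairwise_f1(spans_a, spans_b):
    """Exact-match F1 between two annotators' spans for the same text.

    A span counts as agreed only if start, end, and label all match.
    """
    a, b = span_set(spans_a), span_set(spans_b)
    if not a and not b:
        return 1.0  # both annotators marked nothing: full agreement
    agreed = len(a & b)
    return 2 * agreed / (len(a) + len(b))


# Illustrative annotations: the annotators disagree on the last span's boundary.
ANNOTATOR_A = [
    {"start": 0, "end": 10, "label": "ORG"},
    {"start": 30, "end": 38, "label": "PER"},
    {"start": 56, "end": 68, "label": "DATE"},
]
ANNOTATOR_B = [
    {"start": 0, "end": 10, "label": "ORG"},
    {"start": 30, "end": 38, "label": "PER"},
    {"start": 61, "end": 68, "label": "DATE"},
]

if __name__ == "__main__":
    print(round(pairwise_f1(ANNOTATOR_A, ANNOTATOR_B), 3))
```

Sentences with low scores are good candidates for the calibration discussions mentioned above, since they often point to a boundary or label rule that the guidelines leave open.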

Next Steps

Need help? See our span annotation documentation for more details.