CoNLL-2003 NER with Triage

Named entity recognition with a triage pre-annotation step, based on the CoNLL-2003 Shared Task (Tjong Kim Sang & De Meulder, CoNLL 2003). Annotators first flag whether a sentence contains entities worth annotating, then mark spans for Person, Organization, Location, and Miscellaneous entities.

Configuration File: config.yaml

# CoNLL-2003 NER with Triage
# Based on Tjong Kim Sang & De Meulder, CoNLL 2003
# Paper: https://aclanthology.org/W03-0419/
# Dataset: https://www.clips.uantwerpen.be/conll2003/ner/
#
# Two-stage annotation process:
# 1. Triage: Quickly flag whether the text contains named entities
# 2. Span annotation: Mark entity boundaries and assign types
#
# Entity types (CoNLL-2003 standard):
# - PER: Person names (e.g., "John Smith", "Dr. Johnson")
# - ORG: Organization names (e.g., "Microsoft", "United Nations")
# - LOC: Location names (e.g., "Paris", "Mount Everest")
# - MISC: Miscellaneous entities (e.g., nationalities, events, works of art)
#
# Guidelines:
# - Mark the full extent of the entity mention
# - Include titles only if they are part of the name
# - Nested entities: annotate the outermost entity

annotation_task_name: "CoNLL-2003 NER with Triage"
task_dir: "."

data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

port: 8000
server_name: localhost

annotation_schemes:
  - annotation_type: triage
    name: entity_triage
    description: "Flag whether this text contains named entities worth annotating"

  - annotation_type: span
    name: named_entities
    description: "Highlight and label all named entities in the text"
    labels:
      - "PER"
      - "ORG"
      - "LOC"
      - "MISC"
    keyboard_shortcuts:
      "PER": "1"
      "ORG": "2"
      "LOC": "3"
      "MISC": "4"
    tooltips:
      "PER": "Person names including first, last, or full names"
      "ORG": "Organization names: companies, agencies, institutions"
      "LOC": "Location names: cities, countries, geographic features"
      "MISC": "Miscellaneous: nationalities, events, languages, works of art"

annotation_instructions: |
  Annotate named entities in news text:
  1. First, use the triage tool to indicate whether the text contains entities.
  2. If entities are present, highlight each entity span and assign a type.
  3. Entity types: PER (person), ORG (organization), LOC (location), MISC (other).
  4. Mark the full span of each entity mention.

html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Text:</strong>
      <p style="font-size: 16px; line-height: 1.8; margin: 8px 0 0 0;">{{text}}</p>
    </div>
  </div>

allow_all_users: true
instances_per_annotator: 200
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
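
The guidelines in the config comments (mark the full extent, annotate only the outermost entity when mentions nest) can be illustrated with a small post-processing sketch. The Python below is not part of Potato and does not assume Potato's output format: it uses a hypothetical span representation of `(start, end, label)` character offsets with an exclusive `end`, tokenizes on whitespace, and converts spans to BIO tags. The helper names `keep_outermost` and `spans_to_bio` are illustrative.

```python
import re
from typing import List, Tuple

# Hypothetical span format: (start, end, label), character offsets, end exclusive.
Span = Tuple[int, int, str]


def keep_outermost(spans: List[Span]) -> List[Span]:
    """Drop any span fully contained in another span (nested entities)."""
    kept: List[Span] = []
    # Sort by start ascending, then by length descending, so outer spans come first.
    for start, end, label in sorted(spans, key=lambda s: (s[0], -(s[1] - s[0]))):
        if any(k_start <= start and end <= k_end for k_start, k_end, _ in kept):
            continue
        kept.append((start, end, label))
    return kept


def spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Tag whitespace-separated tokens with B-/I-/O labels from character spans."""
    outer_spans = keep_outermost(spans)
    tagged = []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for start, end, label in outer_spans:
            if tok_start < end and tok_end > start:  # token overlaps the span
                # The first overlapping token begins at or before the span start.
                tag = ("B-" if tok_start <= start else "I-") + label
                break
        tagged.append((match.group(), tag))
    return tagged


text = "The United Nations Security Council voted unanimously"
outer = "United Nations Security Council"
inner = "United Nations"
offset = text.index(outer)
spans = [
    (offset, offset + len(inner), "ORG"),  # nested mention, dropped
    (offset, offset + len(outer), "ORG"),  # outermost mention, kept
]
print(spans_to_bio(text, spans))
```

Whitespace tokenization leaves punctuation attached to the neighboring token, so a real pipeline would likely substitute the tokenizer used by the downstream model.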

Sample Data: sample-data.json

[
  {
    "id": "conll_001",
    "text": "German midfielder Michael Ballack scored twice as Bayern Munich defeated Real Madrid 3-1 in the Champions League quarter-final at the Allianz Arena on Tuesday."
  },
  {
    "id": "conll_002",
    "text": "The United Nations Security Council voted unanimously to impose new sanctions on North Korea following its latest missile test over the Sea of Japan."
  }
]

// ... and 8 more items
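
Before starting the server, it can help to check that the data file matches the `item_properties` declared in the config (`id_key: "id"`, `text_key: "text"`). The following is a minimal sketch, not a Potato feature: `validate_items` is a hypothetical helper that reports items with a missing or empty id or text, and duplicate ids. The two items shown above are inlined so the example is self-contained.

```python
import json
from typing import Any, Dict, List


def validate_items(
    items: List[Dict[str, Any]], id_key: str = "id", text_key: str = "text"
) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems: List[str] = []
    seen = set()
    for index, item in enumerate(items):
        item_id = item.get(id_key)
        if not isinstance(item_id, str) or not item_id:
            problems.append(f"item {index}: missing or empty '{id_key}'")
        elif item_id in seen:
            problems.append(f"item {index}: duplicate id '{item_id}'")
        else:
            seen.add(item_id)
        text = item.get(text_key)
        if not isinstance(text, str) or not text.strip():
            problems.append(f"item {index}: missing or empty '{text_key}'")
    return problems


# The two sample items from sample-data.json, inlined for illustration.
sample = json.loads("""
[
  {
    "id": "conll_001",
    "text": "German midfielder Michael Ballack scored twice as Bayern Munich defeated Real Madrid 3-1 in the Champions League quarter-final at the Allianz Arena on Tuesday."
  },
  {
    "id": "conll_002",
    "text": "The United Nations Security Council voted unanimously to impose new sanctions on North Korea following its latest missile test over the Sea of Japan."
  }
]
""")
print(validate_items(sample))
```

To check the real file, the inlined string could be replaced with `json.load(open("sample-data.json"))`, assuming the file holds a plain JSON array as shown above.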

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/text/named-entity-recognition/conll2003-ner-triage
potato start config.yaml

Details

Annotation Types

triage, span

Domain

NLP

Use Cases

Named Entity Recognition, Information Extraction, Text Triage

Related Designs

Aspect-Based Sentiment Analysis

Identification of aspect terms in review text with sentiment polarity classification for each aspect. Based on SemEval-2016 Task 5 (ABSA).

span, radio

BioNLP 2011 - Gene Regulation Event Extraction

Biomedical event extraction for gene regulation, based on the BioNLP 2011 Shared Task (Kim et al., ACL Workshop 2011). Annotators identify biological entities and mark regulatory events such as gene expression, transcription, and protein catabolism in scientific abstracts.

event_annotation, span

Causal Medical Claim Detection and PICO Extraction

Detection of causal claims in medical texts and extraction of PICO (Population, Intervention, Comparator, Outcome) elements. Based on SemEval-2023 Task 8 (Khetan et al.).