Named Entity Disambiguation (AIDA-CoNLL)
Named entity disambiguation and linking to the Wikidata knowledge base, based on the AIDA-CoNLL dataset. Annotators identify named entity mentions in news text, classify them by type (PER, ORG, LOC, MISC), and link each mention to its corresponding Wikidata entity by QID, handling ambiguous references and NIL entities.
Configuration file: config.yaml
# Named Entity Disambiguation (AIDA-CoNLL)
# Based on Hoffart et al., EMNLP 2011
#
# This configuration supports entity mention detection and Wikidata
# entity linking for news text from the AIDA-CoNLL dataset.
#
# Entity Types (CoNLL scheme):
# - PER: Person names (individuals, fictional characters)
# - ORG: Organizations (companies, agencies, teams, institutions)
# - LOC: Locations (countries, cities, geographic features)
# - MISC: Miscellaneous named entities (events, products, nationalities, works)
#
# Annotation Guidelines:
# 1. Highlight all named entity mentions in the text
# 2. Classify each mention as PER, ORG, LOC, or MISC
# 3. For each mention, enter the Wikidata QID (e.g., Q5284 for Bill Gates)
# 4. If the entity has no Wikidata entry, mark as "nil-entity"
# 5. If the mention is ambiguous between multiple entities, mark as "ambiguous"
# 6. Use keyboard shortcuts 1-4 for fast entity type selection
#
# Disambiguation Tips:
# - Use surrounding context to resolve ambiguity (e.g., "Washington" could be
#   a person, city, or state)
# - "Paris" in sports context likely refers to Paris Saint-Germain (ORG)
# - Consider the document topic and domain when disambiguating
# - When truly ambiguous, mark as "ambiguous" and add notes
annotation_task_name: "Named Entity Disambiguation (AIDA-CoNLL)"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
annotation_schemes:
  # Step 1: Span annotation for entity mentions
  - annotation_type: span
    name: entity_mentions
    description: "Highlight all named entity mentions in the text and classify by type."
    labels:
      - "PER"
      - "ORG"
      - "LOC"
      - "MISC"
    label_colors:
      "PER": "#ef4444"
      "ORG": "#3b82f6"
      "LOC": "#22c55e"
      "MISC": "#f59e0b"
    keyboard_shortcuts:
      "PER": "1"
      "ORG": "2"
      "LOC": "3"
      "MISC": "4"
    tooltips:
      "PER": "Person names: individuals, fictional characters (e.g., 'Barack Obama', 'Sherlock Holmes')"
      "ORG": "Organizations: companies, agencies, teams, institutions (e.g., 'Google', 'United Nations', 'FC Barcelona')"
      "LOC": "Locations: countries, cities, geographic features (e.g., 'France', 'Mount Everest', 'Amazon River')"
      "MISC": "Miscellaneous: events, products, nationalities, works of art (e.g., 'Nobel Prize', 'iPhone', 'French')"
    allow_overlapping: false
  # Step 2: Wikidata QID entry
  - annotation_type: text
    name: wikidata_qid
    description: "Enter Wikidata QID for the highlighted entity (e.g., Q5284 for Bill Gates). Leave blank if unknown."
  # Step 3: Entity status
  - annotation_type: radio
    name: entity_status
    description: "What is the linking status of this entity mention?"
    labels:
      - "linkable"
      - "nil-entity"
      - "ambiguous"
      - "not-an-entity"
    tooltips:
      "linkable": "Entity can be unambiguously linked to a Wikidata entry"
      "nil-entity": "Entity is real but has no Wikidata entry (e.g., obscure local business)"
      "ambiguous": "Entity mention is genuinely ambiguous between multiple Wikidata entries"
      "not-an-entity": "Highlighted span is not actually a named entity upon closer inspection"
  # Step 4: Disambiguation notes
  - annotation_type: text
    name: disambiguation_notes
    description: "Optional: explain your disambiguation reasoning, especially for ambiguous or difficult cases."
annotation_instructions: |
  You are annotating news text for named entity disambiguation.
  For each text passage:
  1. Highlight all named entity mentions using the span tool (use keys 1-4 for PER/ORG/LOC/MISC)
  2. Enter the Wikidata QID for the most recently highlighted entity
  3. Indicate whether the entity is linkable, NIL, ambiguous, or not actually an entity
  4. Optionally add notes explaining your disambiguation reasoning
  Pay special attention to ambiguous mentions like "Washington", "Paris", or "Jordan".
html_layout: |
  <div style="padding: 15px; font-family: Georgia, serif;">
    <div style="margin-bottom: 8px; color: #6b7280; font-size: 13px;">
      <strong>Source:</strong> {{source}}
    </div>
    <div style="font-size: 16px; line-height: 1.8; background: #f9fafb; padding: 15px; border-left: 4px solid #22c55e; border-radius: 4px;">
      {{text}}
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
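Because `wikidata_qid` is a free-text field, nothing in the configuration itself checks that an entered value looks like a QID or that it agrees with the chosen `entity_status`. The following is a minimal sketch of a post-hoc consistency check. It assumes each annotation has already been flattened into a dict with `wikidata_qid` and `entity_status` keys; that record layout and the function names are hypothetical and are not Potato's actual output format.

```python
import re

# Wikidata item identifiers are "Q" followed by a positive integer (e.g. Q5284).
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def is_valid_qid(value: str) -> bool:
    """Return True if value is a syntactically well-formed Wikidata QID."""
    return QID_PATTERN.fullmatch(value.strip()) is not None


def check_record(record: dict) -> list:
    """Return a list of problems for one (hypothetical) flattened annotation record."""
    problems = []
    qid = record.get("wikidata_qid", "").strip()
    status = record.get("entity_status", "")

    if qid and not is_valid_qid(qid):
        problems.append(f"malformed QID: {qid!r}")
    if status == "linkable" and not qid:
        problems.append("status is 'linkable' but no QID was entered")
    if status in ("nil-entity", "not-an-entity") and qid:
        problems.append(f"status is {status!r} but a QID was entered")
    return problems
```

The check is purely syntactic: it does not confirm that a QID exists on Wikidata or that it refers to the intended entity.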
Sample data: sample-data.json
[
  {
    "id": "aida_001",
    "text": "Michael Jordan announced his retirement from the Chicago Bulls in January 1999, ending an era that brought six NBA championships to the city of Chicago. The decision surprised fans across the United States.",
    "source": "Reuters"
  },
  {
    "id": "aida_002",
    "text": "The European Union imposed sanctions on Russia following the annexation of Crimea. German Chancellor Angela Merkel and French President Emmanuel Macron led the diplomatic efforts in Brussels.",
    "source": "Reuters"
  }
]
// ... and 8 more items
Get this design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/text/entity-linking/aida-conll-entity-disambiguation
potato start config.yaml
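Before launching, it can be useful to confirm that every item in the data file carries the fields the configuration reads: `id` and `text` (from `item_properties`) and `source` (the `{{source}}` placeholder in `html_layout`). The sketch below is one way to do that with the standard library only. The function names are illustrative, and the check assumes the file is a JSON array of objects, as in the sample shown on this page.

```python
import json

# Fields the configuration reads: id_key, text_key, and the {{source}}
# placeholder used in html_layout.
REQUIRED_FIELDS = ("id", "text", "source")


def validate_items(items: list) -> list:
    """Return human-readable problems; an empty list means no problems were found."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        for field in REQUIRED_FIELDS:
            value = item.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"item {index}: missing or empty {field!r}")
        item_id = item.get("id")
        if isinstance(item_id, str):
            if item_id in seen_ids:
                problems.append(f"item {index}: duplicate id {item_id!r}")
            seen_ids.add(item_id)
    return problems


def validate_file(path: str = "sample-data.json") -> list:
    """Load a JSON array of items from path and validate it."""
    with open(path, encoding="utf-8") as handle:
        return validate_items(json.load(handle))
```

Calling `validate_file()` from the design directory returns an empty list when all items pass.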
Related designs
Check-COVID: Fact-Checking COVID-19 News Claims
Fact-checking COVID-19 news claims. Annotators verify claims against evidence, identify supporting/refuting spans, and provide verdicts with explanations. Based on the Check-COVID dataset targeting misinformation during the pandemic.
Clickbait Spoiling
Classification and extraction of spoilers for clickbait posts, including spoiler type identification and span-level spoiler detection. Based on SemEval-2023 Task 5 (Hagen et al.).
MeasEval - Counts and Measurements
Extract and classify measurements, quantities, units, and measured entities from scientific text, based on SemEval-2021 Task 8 (Harper et al.). Annotators span-annotate measurement components and classify quantity types with normalized values.