SecureNLP - Malware and Security Entity Recognition

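Named entity recognition for cybersecurity text. Annotators highlight malware, attack patterns, indicators of compromise, tools, vulnerabilities, and organizations as spans, then classify each document by type (threat report, advisory, analysis, or news). Based on SemEval-2018 Task 8 (SecureNLP, Phandi et al.).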
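To make the span task concrete, here is a minimal sketch of how entity spans could be represented as character offsets over a sentence taken from this design's sample data. The `Span` tuple, the `find_spans` helper, and the mention-to-label mapping are hypothetical illustrations, not the tool's actual output format and not gold annotations from the shared task.

```python
from typing import Dict, List, NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # one of the entity labels defined in the config


def find_spans(text: str, mentions: Dict[str, str]) -> List[Span]:
    """Locate the first occurrence of each mention and pair it with its label."""
    spans = []
    for mention, label in mentions.items():
        start = text.find(mention)
        if start != -1:  # skip mentions that do not occur in the text
            spans.append(Span(start, start + len(mention), label))
    return sorted(spans)  # ordered by start offset


text = (
    "The WannaCry ransomware exploited the EternalBlue vulnerability "
    "(CVE-2017-0144) in Windows SMB protocol."
)
# Illustrative labels chosen for this sketch only.
mentions = {"WannaCry": "Malware", "CVE-2017-0144": "Vulnerability"}

spans = find_spans(text, mentions)
for span in spans:
    print(span.label, repr(text[span.start:span.end]))
```

Storing offsets rather than the raw strings keeps repeated mentions distinguishable and lets two annotators' spans be compared position by position.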

Configuration File: config.yaml

# SecureNLP - Malware and Security Entity Recognition
# Based on Phandi et al., SemEval 2018
# Paper: https://aclanthology.org/S18-1113/
# Dataset: https://competitions.codalab.org/competitions/17262
#
# This task asks annotators to identify cybersecurity-related entities
# in text and classify the document type.
#
# Entity Span Labels:
# - Malware: Names of malware, viruses, trojans, ransomware
# - Attack Pattern: Descriptions of attack techniques or methods
# - Indicator: Indicators of compromise (IoCs) such as IP addresses, file hashes, and domains
# - Tool: Security or hacking tools mentioned
# - Vulnerability: CVE identifiers or vulnerability descriptions
# - Organization: Threat actor groups or affected organizations
#
# Document Types:
# - Threat Report, Advisory, Analysis, News

annotation_task_name: "SecureNLP - Cybersecurity Entity Recognition"
task_dir: "."

data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

port: 8000
server_name: localhost

annotation_schemes:
  - annotation_type: span
    name: security_entities
    description: "Highlight all cybersecurity-related entities in the text."
    labels:
      - "Malware"
      - "Attack Pattern"
      - "Indicator"
      - "Tool"
      - "Vulnerability"
      - "Organization"

  - annotation_type: radio
    name: document_type
    description: "What type of cybersecurity document is this?"
    labels:
      - "Threat Report"
      - "Advisory"
      - "Analysis"
      - "News"
    keyboard_shortcuts:
      "Threat Report": "1"
      "Advisory": "2"
      "Analysis": "3"
      "News": "4"
    tooltips:
      "Threat Report": "Detailed report on a specific threat or campaign"
      "Advisory": "Security advisory or bulletin with recommendations"
      "Analysis": "Technical analysis of malware, vulnerabilities, or attacks"
      "News": "News article covering cybersecurity events"

annotation_instructions: |
  You will be shown a cybersecurity text. Your task is to:
  1. Highlight all cybersecurity entities (malware names, attack patterns, indicators,
     tools, vulnerabilities, and organizations).
  2. Classify the document type (threat report, advisory, analysis, or news).

html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="background: #fefce8; border: 1px solid #fde68a; border-radius: 8px; padding: 12px; margin-bottom: 12px;">
      <strong style="color: #a16207;">Source Type:</strong>
      <span style="font-size: 15px;">{{source_type}}</span>
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Text:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
  </div>

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
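
The last block of the config sets `instances_per_annotator: 50` and `annotation_per_instance: 2`. As a rough planning aid, the sketch below estimates how many annotators a dataset of a given size would need under those two settings. The `annotators_needed` function is a hypothetical helper, and it assumes each annotator labels a given item at most once. The tool's real assignment logic may differ.

```python
import math


def annotators_needed(
    num_items: int,
    annotation_per_instance: int = 2,
    instances_per_annotator: int = 50,
) -> int:
    """Estimate the minimum number of annotators for full coverage.

    Assumption: no annotator sees the same item twice, so at least
    `annotation_per_instance` distinct annotators are always required.
    """
    total_annotations = num_items * annotation_per_instance
    by_capacity = math.ceil(total_annotations / instances_per_annotator)
    return max(by_capacity, annotation_per_instance)


# The bundled sample file has 10 items: 20 annotations fit within one
# annotator's quota, but two distinct annotators are still needed.
print(annotators_needed(10))

# A hypothetical 500-item dataset needs 1000 annotations in total.
print(annotators_needed(500))
```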

Sample Data: sample-data.json

[
  {
    "id": "cyberner_001",
    "text": "The WannaCry ransomware exploited the EternalBlue vulnerability (CVE-2017-0144) in Windows SMB protocol. The attack affected over 200,000 computers in 150 countries, with Lazarus Group identified as the likely threat actor.",
    "source_type": "Threat Report"
  },
  {
    "id": "cyberner_002",
    "text": "A critical buffer overflow vulnerability (CVE-2018-4878) has been discovered in Adobe Flash Player. Users are advised to update to version 28.0.0.161 or later immediately. The vulnerability is being actively exploited in the wild.",
    "source_type": "Advisory"
  }
]

// ... and 8 more items
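
Each item must carry the keys named in `item_properties` (`id`, `text`) plus every field that `html_layout` references through a `{{...}}` placeholder (here `source_type`). The following is a minimal sketch of a pre-flight check one might run before starting the server. The `placeholders` and `validate_items` helpers are hypothetical and not part of the annotation tool, and the inline items use shortened texts for illustration.

```python
import json
import re

# Keys taken from item_properties in the config (id_key, text_key).
REQUIRED_KEYS = {"id", "text"}


def placeholders(html_layout: str) -> set:
    """Return the field names used as {{name}} placeholders in a layout."""
    return set(re.findall(r"\{\{\s*(\w+)\s*\}\}", html_layout))


def validate_items(items: list, required: set) -> list:
    """Report missing keys and duplicate ids; an empty list means no problems."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        missing = sorted(required - set(item))
        if missing:
            problems.append(f"item {index}: missing {missing}")
        item_id = item.get("id")
        if item_id in seen_ids:
            problems.append(f"item {index}: duplicate id {item_id!r}")
        seen_ids.add(item_id)
    return problems


# Reduced stand-ins for the layout and data file shown above.
html_layout = "<span>{{source_type}}</span><p>{{text}}</p>"
items = json.loads("""
[
  {"id": "cyberner_001", "text": "The WannaCry ransomware ...", "source_type": "Threat Report"},
  {"id": "cyberner_002", "text": "A critical vulnerability ...", "source_type": "Advisory"}
]
""")

required = REQUIRED_KEYS | placeholders(html_layout)
print(validate_items(items, required))
```

In practice the items would be read from `sample-data.json` with `json.load` and the layout string taken from the parsed config.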

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/semeval/2018/task08-cybersecurity-ner
potato start config.yaml

Details

Annotation Types

span, radio

Domain

SemEval, NLP, Cybersecurity, Named Entity Recognition

Use Cases

Cybersecurity NER, Threat Intelligence, Information Extraction

Related Designs

Aspect-Based Sentiment Analysis

Identification of aspect terms in review text with sentiment polarity classification for each aspect. Based on SemEval-2016 Task 5 (ABSA).

span, radio

Causal Medical Claim Detection and PICO Extraction

Detection of causal claims in medical texts and extraction of PICO (Population, Intervention, Comparator, Outcome) elements. Based on SemEval-2023 Task 8 (Khetan et al.).

span, radio

Character Identification on Multiparty Dialogues

Identification and linking of character mentions in TV show dialogue, combining span annotation with entity resolution for the main cast of Friends. Based on SemEval-2018 Task 4.

span, radio
