Social Determinants of Health (SDOH) Extraction

Event-based extraction of social determinants of health (SDOH) from clinical notes, based on the n2c2 2022 Track 2 shared task and the SHAC corpus. Annotators mark substance use (alcohol, drug, tobacco), employment, and living status events, each with status and temporal attributes.

Configuration File: config.yaml

# Social Determinants of Health (SDOH) Extraction
# Based on n2c2 2022 Track 2 / SHAC Corpus
# Paper: https://pubmed.ncbi.nlm.nih.gov/36795066/
# Task: https://n2c2.dbmi.hms.harvard.edu/2022-track-2
#
# Event-based annotation schema:
# - Each SDOH event has a TRIGGER (event type) and ARGUMENTS (attributes)
# - Event types: Alcohol, Drug, Tobacco, Employment, LivingStatus
# - Arguments characterize status, extent, temporality, and type
#
# Example: "Patient quit smoking 5 years ago"
#   Trigger: Tobacco (anchored to "smoking")
#   Arguments: StatusTime=past (anchored to "quit"), History ("5 years ago")

annotation_task_name: "Social Determinants of Health (SDOH) Extraction"
task_dir: "."

data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

annotation_schemes:
  # Step 1: Identify SDOH event triggers
  - annotation_type: span
    name: sdoh_trigger
    description: "Highlight the trigger word/phrase that indicates an SDOH event"
    labels:
      - Alcohol
      - Drug
      - Tobacco
      - Employment
      - LivingStatus
    label_colors:
      Alcohol: "#ef4444"
      Drug: "#f97316"
      Tobacco: "#eab308"
      Employment: "#22c55e"
      LivingStatus: "#3b82f6"
    tooltips:
      Alcohol: "Mentions of alcohol use, drinking, or alcohol-related behaviors"
      Drug: "Mentions of illicit drug use, substance abuse, or recreational drug use"
      Tobacco: "Mentions of smoking, tobacco use, vaping, or nicotine"
      Employment: "Mentions of job status, work, occupation, unemployment, or retirement"
      LivingStatus: "Mentions of housing, living situation, homelessness, or living arrangements"
    allow_overlapping: false

  # Step 2: Status of the SDOH event
  - annotation_type: radio
    name: status
    description: "What is the status of this SDOH factor?"
    labels:
      - "current"
      - "past"
      - "none"
      - "unknown"
    keyboard_shortcuts:
      "current": "c"
      "past": "p"
      "none": "n"
      "unknown": "u"
    tooltips:
      "current": "Patient currently has this status (e.g., currently smokes, currently employed)"
      "past": "Patient had this status in the past (e.g., former smoker, previously unemployed)"
      "none": "Patient does not have this status (e.g., never smoked, denies alcohol use)"
      "unknown": "Status cannot be determined from the text"

  # Step 3: Additional arguments for substance use
  - annotation_type: multiselect
    name: substance_attributes
    description: "For substance use events, select applicable attributes"
    labels:
      - "Amount mentioned"
      - "Frequency mentioned"
      - "Duration mentioned"
      - "Type/Method specified"
      - "Quit attempt mentioned"
    tooltips:
      "Amount mentioned": "Text specifies quantity (e.g., '2 drinks/day', 'pack of cigarettes')"
      "Frequency mentioned": "Text specifies how often (e.g., 'daily', 'occasionally', 'weekends')"
      "Duration mentioned": "Text specifies time period (e.g., '10 years', 'since college')"
      "Type/Method specified": "Text specifies substance type or method (e.g., 'beer', 'marijuana', 'vaping')"
      "Quit attempt mentioned": "Text mentions attempt to quit or cessation efforts"

  # Step 4: Living status type (for LivingStatus events)
  - annotation_type: radio
    name: living_type
    description: "For LivingStatus events, what is the living arrangement?"
    labels:
      - "alone"
      - "with_family"
      - "with_others"
      - "homeless"
      - "institution"
      - "not_specified"
    tooltips:
      "alone": "Patient lives alone"
      "with_family": "Patient lives with family members (spouse, children, parents)"
      "with_others": "Patient lives with non-family (roommates, friends)"
      "homeless": "Patient is homeless, in shelter, or has unstable housing"
      "institution": "Patient lives in facility (nursing home, assisted living, group home)"
      "not_specified": "Living arrangement not specified in text"

  # Step 5: Employment type (for Employment events)
  - annotation_type: radio
    name: employment_type
    description: "For Employment events, what is the employment status?"
    labels:
      - "employed"
      - "unemployed"
      - "retired"
      - "disabled"
      - "student"
      - "homemaker"
      - "not_specified"
    tooltips:
      "employed": "Patient is currently working"
      "unemployed": "Patient is not working and seeking employment"
      "retired": "Patient has retired from work"
      "disabled": "Patient is unable to work due to disability"
      "student": "Patient is a student"
      "homemaker": "Patient works in the home/is a caregiver"
      "not_specified": "Employment type not specified"

allow_all_users: true
instances_per_annotator: 100
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
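
The scheme descriptions above imply a conditional flow: `substance_attributes` is meant for Alcohol, Drug, and Tobacco triggers, `living_type` for LivingStatus triggers, and `employment_type` for Employment triggers. A minimal Python sketch of that mapping follows. The `SdohEvent` class and the `applicable_schemes` helper are hypothetical names introduced for illustration; they are not part of Potato or the SHAC tooling, and the config itself does not enforce this routing.

```python
from dataclasses import dataclass, field

# Trigger labels taken from the sdoh_trigger scheme in config.yaml.
SUBSTANCE_TYPES = {"Alcohol", "Drug", "Tobacco"}
ALL_TYPES = SUBSTANCE_TYPES | {"Employment", "LivingStatus"}
# Labels taken from the status scheme in config.yaml.
STATUS_LABELS = {"current", "past", "none", "unknown"}


@dataclass
class SdohEvent:
    """Hypothetical container for one annotated event (not a Potato type)."""
    event_type: str      # one of ALL_TYPES
    trigger_text: str    # the highlighted trigger span
    start: int           # character offset of the trigger in the note
    status: str          # one of STATUS_LABELS
    attributes: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject labels that the config does not define.
        if self.event_type not in ALL_TYPES:
            raise ValueError(f"unknown event type: {self.event_type}")
        if self.status not in STATUS_LABELS:
            raise ValueError(f"unknown status: {self.status}")


def applicable_schemes(event_type: str) -> list[str]:
    """Return the scheme names from config.yaml that apply to an event type."""
    schemes = ["sdoh_trigger", "status"]
    if event_type in SUBSTANCE_TYPES:
        schemes.append("substance_attributes")
    elif event_type == "LivingStatus":
        schemes.append("living_type")
    elif event_type == "Employment":
        schemes.append("employment_type")
    return schemes


# The example sentence from the config header.
note = "Patient quit smoking 5 years ago"
event = SdohEvent(
    event_type="Tobacco",
    trigger_text="smoking",
    start=note.index("smoking"),
    status="past",
)
print(event.start, applicable_schemes(event.event_type))
```

Keeping the routing in one function makes it easy to check exported annotations for attributes attached to the wrong event type, for example a `living_type` value on a Tobacco event.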

Sample Data: sample-data.json

[
  {
    "id": "sdoh_001",
    "text": "Social History: Patient is a 45-year-old male who reports smoking 1 pack per day for the past 20 years. He denies alcohol use. Currently employed as a construction worker. Lives with wife and two children."
  },
  {
    "id": "sdoh_002",
    "text": "The patient is a former smoker, quit 5 years ago after 30 pack-year history. She drinks 1-2 glasses of wine on weekends. Retired teacher. Lives alone in an apartment."
  }
]

// ... and 6 more items
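
Each item needs the two keys named under `item_properties` in the config (`id` and `text`). A quick pre-flight check before starting the server might look like the sketch below. `check_items` is a hypothetical helper, not a Potato function, and the inline data abbreviates the two records shown above instead of reading sample-data.json from disk.

```python
import json

ID_KEY = "id"      # item_properties.id_key in config.yaml
TEXT_KEY = "text"  # item_properties.text_key in config.yaml


def check_items(items: list[dict]) -> list[str]:
    """Return a list of problems: missing keys, duplicate ids, empty text."""
    problems = []
    seen = set()
    for i, item in enumerate(items):
        # Both configured keys must be present on every item.
        for key in (ID_KEY, TEXT_KEY):
            if key not in item:
                problems.append(f"item {i}: missing '{key}'")
        # Ids should be unique so annotations can be joined back to items.
        item_id = item.get(ID_KEY)
        if item_id is not None:
            if item_id in seen:
                problems.append(f"item {i}: duplicate id '{item_id}'")
            seen.add(item_id)
        # Whitespace-only text gives annotators nothing to label.
        if TEXT_KEY in item and not str(item[TEXT_KEY]).strip():
            problems.append(f"item {i}: empty text")
    return problems


# Abbreviated copies of the two sample records (text truncated for brevity).
raw = """
[
  {"id": "sdoh_001", "text": "Social History: Patient is a 45-year-old male ..."},
  {"id": "sdoh_002", "text": "The patient is a former smoker, quit 5 years ago ..."}
]
"""
items = json.loads(raw)
print(check_items(items))  # an empty list means no problems were found
```

To run the same check against the real file, replace the inline string with `json.load(open("sample-data.json"))`, assuming the file is a plain JSON array as shown.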

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/text/information-extraction/sdoh-extraction
potato start config.yaml

Details

Annotation Types

span, radio, multiselect

Domain

Clinical NLP, Healthcare

Use Cases

Information Extraction, Clinical Documentation, Public Health

Related Designs

Food Hazard Detection

Food safety hazard detection task requiring annotators to identify hazards, products, and risk levels in food incident reports, and classify the type of contamination. Based on SemEval-2025 Task 9.

span, radio

HateXplain - Explainable Hate Speech Detection

Multi-task hate speech annotation with classification (hate/offensive/normal), target community identification, and rationale span highlighting. Based on the HateXplain benchmark (Mathew et al., AAAI 2021) - the first dataset covering classification, target identification, and rationale extraction.

radio, multiselect

MediTOD Medical Dialogue Annotation

Medical history-taking dialogue annotation based on the MediTOD dataset. Annotators label dialogue acts, identify medical entities (symptoms, conditions, medications, tests), and assess doctor-patient communication quality across multi-turn clinical conversations.

radio, span