Social Determinants of Health (SDOH) Extraction
Event-based extraction of social determinants of health from clinical notes, based on the n2c2 2022 Track 2 shared task and the SHAC corpus. Annotators mark substance use (alcohol, drug, tobacco), employment, and living status events, each with temporal and status attributes.
Configuration File: config.yaml
# Social Determinants of Health (SDOH) Extraction
# Based on n2c2 2022 Track 2 / SHAC Corpus
# Paper: https://pubmed.ncbi.nlm.nih.gov/36795066/
# Task: https://n2c2.dbmi.hms.harvard.edu/2022-track-2
#
# Event-based annotation schema:
# - Each SDOH event has a TRIGGER (event type) and ARGUMENTS (attributes)
# - Event types: Alcohol, Drug, Tobacco, Employment, LivingStatus
# - Arguments characterize status, extent, temporality, and type
#
# Example: "Patient quit smoking 5 years ago"
# Trigger: Tobacco (anchored to "smoking")
# Arguments: Status=past, StatusTime=past ("quit", "5 years ago")
annotation_task_name: "Social Determinants of Health (SDOH) Extraction"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
annotation_schemes:
  # Step 1: Identify SDOH event triggers
  - annotation_type: span
    name: sdoh_trigger
    description: "Highlight the trigger word/phrase that indicates an SDOH event"
    labels:
      - Alcohol
      - Drug
      - Tobacco
      - Employment
      - LivingStatus
    label_colors:
      Alcohol: "#ef4444"
      Drug: "#f97316"
      Tobacco: "#eab308"
      Employment: "#22c55e"
      LivingStatus: "#3b82f6"
    tooltips:
      Alcohol: "Mentions of alcohol use, drinking, or alcohol-related behaviors"
      Drug: "Mentions of illicit drug use, substance abuse, or recreational drug use"
      Tobacco: "Mentions of smoking, tobacco use, vaping, or nicotine"
      Employment: "Mentions of job status, work, occupation, unemployment, or retirement"
      LivingStatus: "Mentions of housing, living situation, homelessness, or living arrangements"
    allow_overlapping: false
  # Step 2: Status of the SDOH event
  - annotation_type: radio
    name: status
    description: "What is the status of this SDOH factor?"
    labels:
      - "current"
      - "past"
      - "none"
      - "unknown"
    keyboard_shortcuts:
      "current": "c"
      "past": "p"
      "none": "n"
      "unknown": "u"
    tooltips:
      "current": "Patient currently has this status (e.g., currently smokes, currently employed)"
      "past": "Patient had this status in the past (e.g., former smoker, previously unemployed)"
      "none": "Patient does not have this status (e.g., never smoked, denies alcohol use)"
      "unknown": "Status cannot be determined from the text"
  # Step 3: Additional arguments for substance use
  - annotation_type: multiselect
    name: substance_attributes
    description: "For substance use events, select applicable attributes"
    labels:
      - "Amount mentioned"
      - "Frequency mentioned"
      - "Duration mentioned"
      - "Type/Method specified"
      - "Quit attempt mentioned"
    tooltips:
      "Amount mentioned": "Text specifies quantity (e.g., '2 drinks/day', 'pack of cigarettes')"
      "Frequency mentioned": "Text specifies how often (e.g., 'daily', 'occasionally', 'weekends')"
      "Duration mentioned": "Text specifies time period (e.g., '10 years', 'since college')"
      "Type/Method specified": "Text specifies substance type or method (e.g., 'beer', 'marijuana', 'vaping')"
      "Quit attempt mentioned": "Text mentions attempt to quit or cessation efforts"
  # Step 4: Living status type (for LivingStatus events)
  - annotation_type: radio
    name: living_type
    description: "For LivingStatus events, what is the living arrangement?"
    labels:
      - "alone"
      - "with_family"
      - "with_others"
      - "homeless"
      - "institution"
      - "not_specified"
    tooltips:
      "alone": "Patient lives alone"
      "with_family": "Patient lives with family members (spouse, children, parents)"
      "with_others": "Patient lives with non-family (roommates, friends)"
      "homeless": "Patient is homeless, in shelter, or has unstable housing"
      "institution": "Patient lives in facility (nursing home, assisted living, group home)"
      "not_specified": "Living arrangement not specified in text"
  # Step 5: Employment type (for Employment events)
  - annotation_type: radio
    name: employment_type
    description: "For Employment events, what is the employment status?"
    labels:
      - "employed"
      - "unemployed"
      - "retired"
      - "disabled"
      - "student"
      - "homemaker"
      - "not_specified"
    tooltips:
      "employed": "Patient is currently working"
      "unemployed": "Patient is not working and seeking employment"
      "retired": "Patient has retired from work"
      "disabled": "Patient is unable to work due to disability"
      "student": "Patient is a student"
      "homemaker": "Patient works in the home/is a caregiver"
      "not_specified": "Employment type not specified"
allow_all_users: true
instances_per_annotator: 100
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
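The annotation schemes in this configuration imply a few dependencies between fields: substance attributes are meant for Alcohol, Drug, and Tobacco events, living_type for LivingStatus events, and employment_type for Employment events. The following Python sketch shows one way such an event could be modelled and checked after export. The label sets are copied from the configuration, but the SdohEvent class, its field names, and the validate function are hypothetical illustrations, not part of Potato or of the n2c2/SHAC tooling.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

# Label sets copied from config.yaml on this page.
TRIGGERS = {"Alcohol", "Drug", "Tobacco", "Employment", "LivingStatus"}
SUBSTANCE_TRIGGERS = {"Alcohol", "Drug", "Tobacco"}
STATUSES = {"current", "past", "none", "unknown"}
SUBSTANCE_ATTRIBUTES = {
    "Amount mentioned", "Frequency mentioned", "Duration mentioned",
    "Type/Method specified", "Quit attempt mentioned",
}
LIVING_TYPES = {"alone", "with_family", "with_others",
                "homeless", "institution", "not_specified"}
EMPLOYMENT_TYPES = {"employed", "unemployed", "retired", "disabled",
                    "student", "homemaker", "not_specified"}


@dataclass
class SdohEvent:
    """Hypothetical record for one annotated event (names are assumptions)."""
    trigger: str                      # one of TRIGGERS
    trigger_text: str                 # the highlighted span, e.g. "smoking"
    status: str                       # one of STATUSES
    substance_attributes: FrozenSet[str] = frozenset()
    living_type: Optional[str] = None
    employment_type: Optional[str] = None


def validate(event: SdohEvent) -> List[str]:
    """Return a list of problems; an empty list means the event is consistent."""
    problems = []
    if event.trigger not in TRIGGERS:
        problems.append(f"unknown trigger label: {event.trigger!r}")
    if event.status not in STATUSES:
        problems.append(f"unknown status: {event.status!r}")
    unknown_attrs = set(event.substance_attributes) - SUBSTANCE_ATTRIBUTES
    if unknown_attrs:
        problems.append(f"unknown substance attributes: {sorted(unknown_attrs)}")
    if event.substance_attributes and event.trigger not in SUBSTANCE_TRIGGERS:
        problems.append(f"substance_attributes set on a {event.trigger} event")
    if event.living_type is not None:
        if event.living_type not in LIVING_TYPES:
            problems.append(f"unknown living_type: {event.living_type!r}")
        if event.trigger != "LivingStatus":
            problems.append(f"living_type set on a {event.trigger} event")
    if event.employment_type is not None:
        if event.employment_type not in EMPLOYMENT_TYPES:
            problems.append(f"unknown employment_type: {event.employment_type!r}")
        if event.trigger != "Employment":
            problems.append(f"employment_type set on a {event.trigger} event")
    return problems


# The example from the config comments: "Patient quit smoking 5 years ago".
# Choosing "Quit attempt mentioned" here is one reading of that sentence.
example = SdohEvent(
    trigger="Tobacco",
    trigger_text="smoking",
    status="past",
    substance_attributes=frozenset({"Quit attempt mentioned"}),
)
print(validate(example))  # prints []
```

A check like this would run on exported annotations rather than inside the annotation interface, since the configuration itself presents all five schemes for every item.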
Sample Data: sample-data.json
[
{
"id": "sdoh_001",
"text": "Social History: Patient is a 45-year-old male who reports smoking 1 pack per day for the past 20 years. He denies alcohol use. Currently employed as a construction worker. Lives with wife and two children."
},
{
"id": "sdoh_002",
"text": "The patient is a former smoker, quit 5 years ago after 30 pack-year history. She drinks 1-2 glasses of wine on weekends. Retired teacher. Lives alone in an apartment."
}
]
// ... and 6 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/text/information-extraction/sdoh-extraction
potato start config.yaml
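Before starting the server with your own notes in place of the sample file, it may help to confirm that every item carries the keys named under item_properties in config.yaml ("id" and "text"). The sketch below is a standalone sanity check under that assumption. The function names check_items and check_file are hypothetical, and this is not the validation Potato itself performs.

```python
import json
from collections import Counter


def check_items(items, id_key="id", text_key="text"):
    """Return human-readable problems found in a list of loaded items."""
    problems = []
    seen_ids = Counter()
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {index}: expected an object, got {type(item).__name__}")
            continue
        # Both keys must be present and hold non-empty strings.
        for key in (id_key, text_key):
            value = item.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"item {index}: missing or empty {key!r}")
        if isinstance(item.get(id_key), str):
            seen_ids[item[id_key]] += 1
    # Report each id that appears more than once.
    for item_id, count in seen_ids.items():
        if count > 1:
            problems.append(f"duplicate id {item_id!r} appears {count} times")
    return problems


def check_file(path, id_key="id", text_key="text"):
    """Load a JSON array from path and run check_items on it."""
    with open(path, encoding="utf-8") as handle:
        return check_items(json.load(handle), id_key, text_key)
```

Calling check_file("sample-data.json") from the design directory returns a list of problems, which is empty when every item has a unique id and non-empty text.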