Implicit Hate Speech Detection
Detect and categorize implicit hate speech using a six-category taxonomy. Based on ElSherief et al., EMNLP 2021. Identifies grievance, incitement, stereotypes, inferiority, irony, and threats.
text annotation
Configuration File: config.yaml
# Implicit Hate Speech Detection
# Based on ElSherief et al., EMNLP 2021 "Latent Hatred"
# Paper: https://aclanthology.org/2021.emnlp-main.29/
# Dataset: https://github.com/SALT-NLP/implicit-hate
#
# Implicit hate speech is harder to detect than explicit hate because it
# relies on stereotypes, coded language, and indirect expression.
#
# Six-Category Taxonomy:
# 1. Grievance (24.2%): Positions majority groups as unfairly disadvantaged
# 2. Incitement (20.0%): Promotes hate groups/ideologies, flaunts in-group power
# 3. Stereotypes (17.9%): Associates groups with negative attributes via euphemisms
# 4. Inferiority (13.6%): Implies some group is of lesser value
# 5. Irony (12.6%): Uses sarcasm, humor, satire to demean
# 6. Threats (10.5%): Indirect threats to safety, well-being, reputation
#
# Annotation Guidelines:
# 1. First determine if the text contains implicit hate (vs explicit or none)
# 2. Implicit hate lacks slurs but conveys hateful meaning through context
# 3. Consider: Would someone from the target group feel attacked?
# 4. Look for coded language, dog whistles, and stereotypical references
# 5. Irony/sarcasm requires understanding the speaker's actual intent
#
# Key Distinction from Explicit Hate:
# - Explicit: Uses slurs, direct insults, clear threats
# - Implicit: Uses stereotypes, innuendo, coded language, sarcasm
port: 8000
server_name: localhost
task_name: "Implicit Hate Speech Detection"
data_files:
- sample-data.json
id_key: id
text_key: text
output_file: annotations.json
annotation_schemes:
  # Step 1: Hate classification
  - annotation_type: radio
    name: hate_type
    description: "Does this text contain hate speech?"
    labels:
      - "Explicit hate"
      - "Implicit hate"
      - "Not hateful"
    tooltips:
      "Explicit hate": "Contains clear slurs, direct insults, or obvious threats"
      "Implicit hate": "Conveys hate through stereotypes, coded language, or indirect means"
      "Not hateful": "Does not contain hateful content"
  # Step 2: Implicit hate category (if implicit)
  - annotation_type: radio
    name: implicit_category
    description: "What type of implicit hate is expressed?"
    labels:
      - "Grievance"
      - "Incitement"
      - "Stereotypes"
      - "Inferiority"
      - "Irony"
      - "Threats"
    tooltips:
      "Grievance": "Positions majority groups as unfairly disadvantaged (e.g., 'reverse racism', 'they get special treatment')"
      "Incitement": "Promotes hate groups/ideologies, celebrates in-group power (e.g., praising extremist symbols)"
      "Stereotypes": "Associates groups with negative attributes using euphemisms or coded language"
      "Inferiority": "Implies some group or person is of lesser value or capability"
      "Irony": "Uses sarcasm, humor, or satire to mock or demean a group"
      "Threats": "Indirect threats to body, well-being, reputation, or rights"
  # Step 3: Target group
  - annotation_type: radio
    name: target
    description: "Which group is targeted?"
    labels:
      - "Racial/ethnic minority"
      - "Religious group"
      - "Women"
      - "LGBTQ+ people"
      - "Immigrants"
      - "Other group"
      - "No specific group"
    tooltips:
      "Racial/ethnic minority": "Black, Asian, Hispanic, Indigenous, or other racial/ethnic groups"
      "Religious group": "Muslims, Jewish people, or other religious groups"
      "Women": "Women or girls based on gender"
      "LGBTQ+ people": "Gay, lesbian, bisexual, transgender, or other LGBTQ+ individuals"
      "Immigrants": "Immigrants, refugees, or non-citizens"
      "Other group": "Another identifiable group not listed"
      "No specific group": "No clear target group identified"
  # Step 4: Confidence
  - annotation_type: likert
    name: confidence
    description: "How confident are you in your annotation?"
    min_value: 1
    max_value: 5
    labels:
      1: "Very uncertain"
      2: "Somewhat uncertain"
      3: "Moderately confident"
      4: "Confident"
      5: "Very confident"
allow_all_users: true
instances_per_annotator: 100
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
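Because every radio label above doubles as a key in that scheme's tooltips map, a typo in either place silently drops a tooltip. The sketch below is a hypothetical pre-flight check (not part of Potato) that loads config.yaml with PyYAML and reports any label/tooltip mismatch before the server is started.

check_config.py
# Hypothetical pre-flight check for the config shown above.
# Assumes PyYAML is installed (pip install pyyaml).
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

for scheme in config["annotation_schemes"]:
    labels = scheme.get("labels", [])
    tooltips = scheme.get("tooltips", {})
    # Radio schemes store labels as a list; the likert scheme keys its
    # labels by score and defines no tooltips, so it is skipped here.
    if isinstance(labels, list) and tooltips:
        missing = [l for l in labels if l not in tooltips]
        unmatched = [t for t in tooltips if t not in labels]
        if missing or unmatched:
            print(f"{scheme['name']}: missing tooltips {missing}, unmatched keys {unmatched}")
        else:
            print(f"{scheme['name']}: {len(labels)} labels, all tooltips match")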
Sample Data: sample-data.json
[
{
"id": "imp_001",
"text": "It's funny how they get scholarships and jobs handed to them while the rest of us have to actually work for it."
},
{
"id": "imp_002",
"text": "The new employee seems very qualified. I'm excited to work with her on the project."
}
]
// ... and 8 more items
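To annotate your own data, convert it to the same structure as sample-data.json: a JSON list of objects whose keys match the id_key and text_key set in config.yaml. The sketch below assumes a tab-separated source file named posts.tsv with a post column; both names are placeholders, since the layout of your source data is not specified here.

prepare_data.py
# Hypothetical helper that converts a tab-separated file of posts into the
# JSON format expected by data_files / id_key / text_key above.
import csv
import json

records = []
with open("posts.tsv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for i, row in enumerate(reader, start=1):
        records.append({
            "id": f"imp_{i:03d}",         # matches id_key in config.yaml
            "text": row["post"].strip(),  # matches text_key in config.yaml
        })

with open("sample-data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2, ensure_ascii=False)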
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/implicit-hate-speech
potato start config.yaml
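With annotation_per_instance set to 3, each item should end up with three judgments in annotations.json. The sketch below tallies a simple majority vote on hate_type per instance; it assumes the output is a flat list of records carrying id and hate_type fields, which is an assumption about Potato's output format rather than a documented guarantee, so check the actual file before relying on it.

aggregate.py
# Hypothetical majority-vote aggregation over the collected annotations.
# The record structure (one dict per judgment with "id" and "hate_type")
# is an assumption; adapt the field access to Potato's actual output.
import json
from collections import Counter, defaultdict

with open("annotations.json", encoding="utf-8") as f:
    judgments = json.load(f)

votes = defaultdict(Counter)
for j in judgments:
    votes[j["id"]][j["hate_type"]] += 1

for instance_id, counter in votes.items():
    label, count = counter.most_common(1)[0]
    total = sum(counter.values())
    print(f"{instance_id}: {label} ({count}/{total} annotators agree)")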
Found an issue or want to improve this design?
Open an Issue
Related Designs
Dynamic Hate Speech Detection
Hate speech classification with fine-grained type labels based on the Dynamically Generated Hate Speech Dataset (Vidgen et al., ACL 2021). Classify content as hateful or not, then identify hate type (animosity, derogation, dehumanization, threatening, support for hateful entities) and target group.
GoEmotions - Fine-Grained Emotion Classification
Multi-label emotion classification with 27 emotion categories plus neutral, based on the Google Research GoEmotions dataset (Demszky et al., ACL 2020). Taxonomy covers 12 positive, 11 negative, and 4 ambiguous emotions designed for Reddit comment analysis.
HateXplain - Explainable Hate Speech Detection
Multi-task hate speech annotation with classification (hate/offensive/normal), target community identification, and rationale span highlighting. Based on the HateXplain benchmark (Mathew et al., AAAI 2021) - the first dataset covering classification, target identification, and rationale extraction.