HateXplain - Explainable Hate Speech Detection
Multi-task hate speech annotation with classification (hate/offensive/normal), target community identification, and rationale span highlighting. Based on the HateXplain benchmark (Mathew et al., AAAI 2021), the first benchmark dataset to cover classification, target identification, and rationale extraction together.
text annotation
Configuration File: config.yaml
# HateXplain - Explainable Hate Speech Detection
# Based on Mathew et al., AAAI 2021
# Paper: https://ojs.aaai.org/index.php/AAAI/article/view/17745
# Dataset: https://huggingface.co/datasets/hatexplain
#
# Three annotation tasks:
# 1. Classification: hate speech, offensive, or normal
# 2. Target community: which group is targeted (if hate/offensive)
# 3. Rationale spans: which words justify the classification
#
# Guidelines:
# - Hate speech: attacks or demeans a group based on identity
# - Offensive: rude/disrespectful but not targeting identity groups
# - Normal: neither hateful nor offensive
# - Rationale: highlight words that justify your classification (avg 5.5 tokens)
port: 8000
server_name: localhost
task_name: "HateXplain: Explainable Hate Speech Detection"
data_files:
  - sample-data.json
id_key: id
text_key: text
output_file: annotations.json
annotation_schemes:
  # Task 1: Classification
  - annotation_type: radio
    name: classification
    description: "Classify this text as hate speech, offensive, or normal"
    labels:
      - Hate Speech
      - Offensive
      - Normal
    keyboard_shortcuts:
      "Hate Speech": "h"
      "Offensive": "o"
      "Normal": "n"
    tooltips:
      "Hate Speech": "Content that attacks or demeans a group based on identity attributes (race, religion, gender, etc.)"
      "Offensive": "Rude, disrespectful, or profane content that does NOT target identity groups"
      "Normal": "Content that is neither hateful nor offensive"
  # Task 2: Target community (only for hate/offensive)
  - annotation_type: multiselect
    name: target_community
    description: "If hate/offensive, select the targeted community/communities"
    labels:
      - African
      - Arab
      - Asian
      - Caucasian
      - Hispanic
      - Jewish
      - LGBTQ
      - Islam
      - Women
      - Refugee
      - Other
      - None/Not Applicable
    tooltips:
      African: "People of African descent"
      Arab: "People of Arab descent or from Arab countries"
      Asian: "People of Asian descent"
      Caucasian: "People of European/white descent"
      Hispanic: "People of Hispanic/Latino descent"
      Jewish: "Jewish people (ethnic or religious)"
      LGBTQ: "Lesbian, gay, bisexual, transgender, queer individuals"
      Islam: "Muslims or Islamic religion"
      Women: "Women or girls"
      Refugee: "Refugees, immigrants, or asylum seekers"
      Other: "Other identity group not listed"
      "None/Not Applicable": "No specific group targeted (for Normal texts)"
  # Task 3: Rationale span annotation
  - annotation_type: span
    name: rationale
    description: "Highlight the words/phrases that justify your classification decision"
    labels:
      - Rationale
    label_colors:
      Rationale: "#ef4444"
    tooltips:
      Rationale: "Words or phrases that are the reason for classifying as hate/offensive (avg ~5.5 tokens per post)"
    allow_overlapping: false
allow_all_users: true
instances_per_annotator: 100
annotation_per_instance: 3
allow_skip: true
skip_reason_required: true
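To make Task 3 concrete: a rationale highlight ultimately reduces to character offsets into the post. The sketch below shows one hypothetical highlight; the dict fields are illustrative only, not Potato's actual output format.

```python
text = "These people should go back to where they came from. They don't belong here."

# Hypothetical rationale span: the phrase an annotator might highlight
phrase = "go back to where they came from"
start = text.index(phrase)
annotation = {
    "label": "Rationale",
    "start": start,                # character offset where the span begins
    "end": start + len(phrase),    # exclusive end offset
}

# Recover the highlighted text from the offsets
print(text[annotation["start"]:annotation["end"]])  # prints the phrase above
```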
Sample Data: sample-data.json
[
{
"id": "hatex_001",
"text": "I can't believe how beautiful the sunset was today. Nature is truly amazing."
},
{
"id": "hatex_002",
"text": "These people should go back to where they came from. They don't belong here."
}
]
// ... and 10 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/hatexplain
potato start config.yaml
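After cloning, the sample data can be sanity-checked against the config's `id_key`/`text_key` with only the standard library. A minimal sketch, with the two items copied from sample-data.json above:

```python
import json

# config.yaml declares id_key: id and text_key: text, so every item
# must carry both keys before Potato can serve it.
items = json.loads("""
[
  {"id": "hatex_001",
   "text": "I can't believe how beautiful the sunset was today. Nature is truly amazing."},
  {"id": "hatex_002",
   "text": "These people should go back to where they came from. They don't belong here."}
]
""")

for item in items:
    missing = {"id", "text"} - item.keys()
    assert not missing, f"item {item.get('id', '?')} is missing {missing}"

print(f"{len(items)} items OK")  # prints "2 items OK"
```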
Found an issue or want to improve this design?
Open an Issue
Related Designs
Dynamic Hate Speech Detection
Hate speech classification with fine-grained type labels based on the Dynamically Generated Hate Speech Dataset (Vidgen et al., ACL 2021). Classify content as hateful or not, then identify hate type (animosity, derogation, dehumanization, threatening, support for hateful entities) and target group.
Toxic Spans Detection
Character-level toxic span annotation based on SemEval-2021 Task 5 (Pavlopoulos et al., 2021). Instead of binary toxicity classification, annotators identify the specific words/phrases that make a comment toxic, enabling more nuanced content moderation.
Political Discourse Analysis (AgoraSpeech)
Multi-task annotation of political speeches covering sentiment, polarization, populism, topic identification, and named entities. Based on AgoraSpeech (Sermpezis et al., 2025), featuring human-validated labels for comprehensive political discourse analysis.