
Content Moderation Annotation Setup

Configure Potato for toxicity detection, hate speech classification, and sensitive content labeling with annotator wellbeing in mind.

By Potato Team

Content moderation annotation requires careful consideration of annotator wellbeing, clear guidelines for subjective content, and robust quality control. This guide covers best practices for configuring toxicity classification, hate speech labeling, contextual and image moderation, and quality control in Potato.

Annotator Wellbeing

Exposure to harmful content can affect annotator mental health. Implement protective measures:

Wellbeing Configuration

wellbeing:
  # Content warnings
  warnings:
    enabled: true
    show_before_session: true
    message: |
      This task involves reviewing potentially offensive content including
      hate speech, harassment, and explicit material. Take breaks as needed.
 
  # Break reminders
  breaks:
    enabled: true
    reminder_interval: 30  # minutes
    break_duration: 5  # suggested minutes
    message: "Consider taking a short break. Your wellbeing matters."
 
  # Session limits
  limits:
    max_session_duration: 120  # minutes
    max_items_per_session: 100
    cooldown_between_sessions: 60  # minutes
 
  # Easy exit
  exit:
    allow_immediate_exit: true
    no_penalty_exit: true
    exit_button_prominent: true
    exit_message: "No problem. Take care of yourself."
 
  # Support resources
  resources:
    show_support_link: true
    support_url: "https://yourorg.com/support"
    hotline_number: "1-800-XXX-XXXX"

Content Blurring

display:
  # Blur images by default
  image_display:
    blur_by_default: true
    blur_amount: 20
    click_to_reveal: true
    reveal_duration: 10  # auto-blur after 10 seconds
 
  # Text content warnings
  text_display:
    show_severity_indicator: true
    expandable_content: true
    default_collapsed: true

Toxicity Classification

Multi-Level Toxicity

annotation_schemes:
  - annotation_type: radio
    name: toxicity_level
    question: "Rate the toxicity level of this content"
    options:
      - name: none
        label: "Not Toxic"
        description: "No harmful content"
 
      - name: mild
        label: "Mildly Toxic"
        description: "Rude or insensitive but not severe"
 
      - name: moderate
        label: "Moderately Toxic"
        description: "Clearly offensive or harmful"
 
      - name: severe
        label: "Severely Toxic"
        description: "Extremely offensive, threatening, or dangerous"

Toxicity Categories

annotation_schemes:
  - annotation_type: multiselect
    name: toxicity_types
    question: "Select all types of toxicity present"
    options:
      - name: profanity
        label: "Profanity/Obscenity"
        description: "Swear words, vulgar language"
 
      - name: insult
        label: "Insults"
        description: "Personal attacks, name-calling"
 
      - name: threat
        label: "Threats"
        description: "Threats of violence or harm"
 
      - name: hate_speech
        label: "Hate Speech"
        description: "Targeting protected groups"
 
      - name: harassment
        label: "Harassment"
        description: "Targeted, persistent hostility"
 
      - name: sexual
        label: "Sexual Content"
        description: "Explicit or suggestive content"
 
      - name: self_harm
        label: "Self-Harm/Suicide"
        description: "Promoting or glorifying self-harm"
 
      - name: misinformation
        label: "Misinformation"
        description: "Demonstrably false claims"
 
      - name: spam
        label: "Spam/Scam"
        description: "Unwanted promotional content"

Hate Speech Detection

Target Groups

annotation_schemes:
  - annotation_type: multiselect
    name: target_groups
    question: "Which groups are targeted? (if hate speech detected)"
    depends_on:
      field: toxicity_types
      contains: hate_speech
 
    options:
      - name: race_ethnicity
        label: "Race/Ethnicity"
 
      - name: religion
        label: "Religion"
 
      - name: gender
        label: "Gender"
 
      - name: sexual_orientation
        label: "Sexual Orientation"
 
      - name: disability
        label: "Disability"
 
      - name: nationality
        label: "Nationality/Origin"
 
      - name: age
        label: "Age"
 
      - name: other
        label: "Other Protected Group"

Hate Speech Severity

annotation_schemes:
  - annotation_type: radio
    name: hate_severity
    question: "Severity of hate speech"
    depends_on:
      field: toxicity_types
      contains: hate_speech
 
    options:
      - name: implicit
        label: "Implicit"
        description: "Coded language, dog whistles"
 
      - name: explicit_mild
        label: "Explicit - Mild"
        description: "Clear but not threatening"
 
      - name: explicit_severe
        label: "Explicit - Severe"
        description: "Dehumanizing, threatening, or violent"

Contextual Moderation

Platform-Specific Rules

# Context affects what's acceptable
annotation_schemes:
  - annotation_type: radio
    name: context_appropriate
    question: "Is this content appropriate for the platform context?"
    context_info:
      platform: "{{metadata.platform}}"
      community: "{{metadata.community}}"
      audience: "{{metadata.audience}}"
 
    options:
      - name: appropriate
        label: "Appropriate for Context"
 
      - name: borderline
        label: "Borderline"
 
      - name: inappropriate
        label: "Inappropriate for Context"
 
  - annotation_type: text
    name: context_notes
    question: "Explain your contextual reasoning"
    depends_on:
      field: context_appropriate
      value: borderline
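
The `{{metadata.*}}` placeholders above assume that each item in the input file carries a nested metadata object with those fields. As a rough sketch, one line of a JSONL data file might look like the following; `id` and `text` are the keys Potato's examples commonly use for the item identifier and content, while the metadata field names are assumptions that simply need to match whatever your templates reference:

{"id": "c-1042", "text": "you people always ruin every thread", "metadata": {"platform": "forum", "community": "gaming", "audience": "general", "report_reason": "harassment"}}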

Intent Detection

annotation_schemes:
  - annotation_type: radio
    name: intent
    question: "What is the apparent intent?"
    options:
      - name: genuine_attack
        label: "Genuine Attack"
        description: "Intent to harm or offend"
 
      - name: satire
        label: "Satire/Parody"
        description: "Mocking toxic behavior"
 
      - name: quote
        label: "Quote/Report"
        description: "Reporting or discussing toxic content"
 
      - name: reclaimed
        label: "Reclaimed Language"
        description: "In-group use of slurs"
 
      - name: unclear
        label: "Unclear Intent"

Image Content Moderation

Visual Content Classification

annotation_schemes:
  - annotation_type: multiselect
    name: image_violations
    question: "Select all policy violations"
    options:
      - name: nudity
        label: "Nudity/Sexual Content"
 
      - name: violence_graphic
        label: "Graphic Violence"
 
      - name: gore
        label: "Gore/Disturbing Content"
 
      - name: hate_symbols
        label: "Hate Symbols"
 
      - name: dangerous_acts
        label: "Dangerous Acts"
 
      - name: child_safety
        label: "Child Safety Concern"
        priority: critical
        escalate: true
 
      - name: none
        label: "No Violations"
 
  - annotation_type: radio
    name: action_recommendation
    question: "Recommended action"
    options:
      - name: approve
        label: "Approve"
      - name: age_restrict
        label: "Age-Restrict"
      - name: warning_label
        label: "Add Warning Label"
      - name: remove
        label: "Remove"
      - name: escalate
        label: "Escalate to Specialist"

Quality Control

quality_control:
  # Calibration for subjective content
  calibration:
    enabled: true
    frequency: 20  # Every 20 items
    items: calibration/moderation_gold.json
    feedback: true
    recalibrate_on_drift: true
 
  # High redundancy for borderline cases
  redundancy:
    annotations_per_item: 3
    increase_for_borderline: 5
    agreement_threshold: 0.67
 
  # Expert escalation
  escalation:
    enabled: true
    triggers:
      - field: toxicity_level
        value: severe
      - field: image_violations
        contains: child_safety
    escalate_to: trust_safety_team
 
  # Distribution monitoring
  monitoring:
    track_distribution: true
    alert_on_skew: true
    expected_distribution:
      none: 0.4
      mild: 0.3
      moderate: 0.2
      severe: 0.1
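
The `calibration/moderation_gold.json` file referenced above holds pre-labeled items that annotators are periodically checked against. Its exact schema is not shown in this guide, so treat the following as a sketch only: it assumes each entry pairs an item with the expected answers for the scheme names defined earlier.

[
  {
    "id": "gold-007",
    "text": "Example pre-labeled comment",
    "expected": {
      "toxicity_level": "moderate",
      "toxicity_types": ["insult", "profanity"]
    }
  }
]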

Complete Configuration

annotation_task_name: "Content Moderation"
 
# Wellbeing first
wellbeing:
  warnings:
    enabled: true
    message: "This task contains potentially offensive content."
  breaks:
    reminder_interval: 30
    message: "Remember to take breaks."
  limits:
    max_session_duration: 90
    max_items_per_session: 75
 
display:
  # Blur sensitive content
  image_display:
    blur_by_default: true
    click_to_reveal: true
 
  # Show platform context
  metadata_display:
    show_fields: [platform, community, report_reason]
 
annotation_schemes:
  # Toxicity level
  - annotation_type: radio
    name: toxicity
    question: "Toxicity level"
    options:
      - name: none
        label: "None"
      - name: mild
        label: "Mild"
      - name: moderate
        label: "Moderate"
      - name: severe
        label: "Severe"
 
  # Categories
  - annotation_type: multiselect
    name: categories
    question: "Types of harmful content (select all)"
    options:
      - name: hate
        label: "Hate Speech"
      - name: harassment
        label: "Harassment"
      - name: violence
        label: "Violence/Threats"
      - name: sexual
        label: "Sexual Content"
      - name: self_harm
        label: "Self-Harm"
      - name: spam
        label: "Spam"
      - name: none
        label: "None"
 
  # Confidence
  - annotation_type: likert
    name: confidence
    question: "How confident are you?"
    size: 5
    min_label: "Uncertain"
    max_label: "Very Confident"
 
  # Notes
  - annotation_type: text
    name: notes
    question: "Additional notes (optional)"
    multiline: true
 
quality_control:
  redundancy:
    annotations_per_item: 3
  calibration:
    enabled: true
    frequency: 25
  escalation:
    enabled: true
    triggers:
      - field: toxicity
        value: severe
 
output_annotation_dir: annotations/
output_annotation_format: jsonl
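
Once the configuration is saved (here assumed to be config.yaml, a filename chosen for illustration), the pip-installed Potato CLI can typically serve the task with a command along these lines; check your installed version's help output if the syntax differs:

potato start config.yaml -p 8000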

Guidelines Best Practices

When writing moderation guidelines:

  1. Define clear thresholds: What makes something "mild" vs. "moderate"? (see the sketch after this list)
  2. Provide examples: Show edge cases with explanations
  3. Address context: How does platform/audience affect decisions?
  4. Handle ambiguity: What to do when intent is unclear
  5. Update regularly: Guidelines evolve with new content types
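
One way to make thresholds concrete (point 1 above) is to encode them directly in the option descriptions annotators see, so the scale definition lives inside the task rather than only in a separate document. The sketch below reuses only keys already shown in this guide; the specific wording is an example, not a standard:

annotation_schemes:
  - annotation_type: radio
    name: toxicity_level
    question: "Rate the toxicity level of this content"
    options:
      - name: none
        label: "Not Toxic"
        description: "No harmful content, including blunt criticism that stays civil"
      - name: mild
        label: "Mildly Toxic"
        description: "Rude or dismissive, but no slurs, threats, or targeted attacks"
      - name: moderate
        label: "Moderately Toxic"
        description: "Slurs, targeted insults, or demeaning generalizations, but no threats of harm"
      - name: severe
        label: "Severely Toxic"
        description: "Threats, incitement, dehumanization, or content that endangers someone"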

Annotator Support

  • Rotation: Don't assign the same annotator to toxic content continuously (a hypothetical policy sketch follows this list)
  • Support access: Easy access to mental health resources
  • Feedback channels: Let annotators report concerns
  • Recognition: Acknowledge the difficulty of the work
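
Rotation is not covered by any setting shown in this guide, so it is usually handled wherever work gets assigned. The block below is purely hypothetical (none of these keys are confirmed Potato options); it is only meant to show the kind of policy worth writing down and automating:

# Hypothetical scheduling policy -- not a confirmed Potato config;
# adapt to however your team actually assigns queues
assignment_policy:
  max_consecutive_days_on_toxic_queue: 3
  min_days_off_toxic_queue_after: 2
  benign_item_mix_ratio: 0.2   # interleave non-toxic items to reduce sustained exposure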

Full documentation at /docs/core-concepts/annotation-types.