Content Moderation Annotation Setup
Configure Potato for toxicity detection, hate speech classification, and sensitive content labeling with annotator wellbeing in mind.
Labeling toxic content is not like labeling movie reviews. The work wears people down, the guidelines are always a little fuzzy, and the cases that matter most are usually the ones annotators disagree about. This guide covers how to set up a content moderation task in Potato that takes those problems seriously, starting with the people doing the work.
Annotator wellbeing
Reading hateful and graphic content for hours has real effects on the people who do it. A few settings help keep the work bearable.
Wellbeing configuration
wellbeing:
  # Content warnings
  warnings:
    enabled: true
    show_before_session: true
    message: |
      This task involves reviewing potentially offensive content including
      hate speech, harassment, and explicit material. Take breaks as needed.
  # Break reminders
  breaks:
    enabled: true
    reminder_interval: 30  # minutes
    break_duration: 5  # suggested minutes
    message: "Consider taking a short break. Your wellbeing matters."
  # Session limits
  limits:
    max_session_duration: 120  # minutes
    max_items_per_session: 100
    cooldown_between_sessions: 60  # minutes
  # Easy exit
  exit:
    allow_immediate_exit: true
    no_penalty_exit: true
    exit_button_prominent: true
    exit_message: "No problem. Take care of yourself."
  # Support resources
  resources:
    show_support_link: true
    support_url: "https://yourorg.com/support"
    hotline_number: "1-800-XXX-XXXX"
Content Blurring
display:
  # Blur images by default
  image_display:
    blur_by_default: true
    blur_amount: 20
    click_to_reveal: true
    reveal_duration: 10  # auto-blur after 10 seconds
  # Text content warnings
  text_display:
    show_severity_indicator: true
    expandable_content: true
    default_collapsed: true
Toxicity classification
Multi-level toxicity
annotation_schemes:
  - annotation_type: radio
    name: toxicity_level
    question: "Rate the toxicity level of this content"
    options:
      - name: none
        label: "Not Toxic"
        description: "No harmful content"
      - name: mild
        label: "Mildly Toxic"
        description: "Rude or insensitive but not severe"
      - name: moderate
        label: "Moderately Toxic"
        description: "Clearly offensive or harmful"
      - name: severe
        label: "Severely Toxic"
        description: "Extremely offensive, threatening, or dangerous"
Toxicity Categories
annotation_schemes:
  - annotation_type: multiselect
    name: toxicity_types
    question: "Select all types of toxicity present"
    options:
      - name: profanity
        label: "Profanity/Obscenity"
        description: "Swear words, vulgar language"
      - name: insult
        label: "Insults"
        description: "Personal attacks, name-calling"
      - name: threat
        label: "Threats"
        description: "Threats of violence or harm"
      - name: hate_speech
        label: "Hate Speech"
        description: "Targeting protected groups"
      - name: harassment
        label: "Harassment"
        description: "Targeted, persistent hostility"
      - name: sexual
        label: "Sexual Content"
        description: "Explicit or suggestive content"
      - name: self_harm
        label: "Self-Harm/Suicide"
        description: "Promoting or glorifying self-harm"
      - name: misinformation
        label: "Misinformation"
        description: "Demonstrably false claims"
      - name: spam
        label: "Spam/Scam"
        description: "Unwanted promotional content"
Hate Speech Detection
Target Groups
annotation_schemes:
  - annotation_type: multiselect
    name: target_groups
    question: "Which groups are targeted? (if hate speech detected)"
    depends_on:
      field: toxicity_types
      contains: hate_speech
    options:
      - name: race_ethnicity
        label: "Race/Ethnicity"
      - name: religion
        label: "Religion"
      - name: gender
        label: "Gender"
      - name: sexual_orientation
        label: "Sexual Orientation"
      - name: disability
        label: "Disability"
      - name: nationality
        label: "Nationality/Origin"
      - name: age
        label: "Age"
      - name: other
        label: "Other Protected Group"
Hate Speech Severity
annotation_schemes:
  - annotation_type: radio
    name: hate_severity
    question: "Severity of hate speech"
    depends_on:
      field: toxicity_types
      contains: hate_speech
    options:
      - name: implicit
        label: "Implicit"
        description: "Coded language, dog whistles"
      - name: explicit_mild
        label: "Explicit - Mild"
        description: "Clear but not threatening"
      - name: explicit_severe
        label: "Explicit - Severe"
        description: "Dehumanizing, threatening, or violent"
Contextual Moderation
Platform-Specific Rules
# Context affects what's acceptable
annotation_schemes:
  - annotation_type: radio
    name: context_appropriate
    question: "Is this content appropriate for the platform context?"
    context_info:
      platform: "{{metadata.platform}}"
      community: "{{metadata.community}}"
      audience: "{{metadata.audience}}"
    options:
      - name: appropriate
        label: "Appropriate for Context"
      - name: borderline
        label: "Borderline"
      - name: inappropriate
        label: "Inappropriate for Context"
  - annotation_type: text
    name: context_notes
    question: "Explain your contextual reasoning"
    depends_on:
      field: context_appropriate
      value: borderline
Intent Detection
annotation_schemes:
  - annotation_type: radio
    name: intent
    question: "What is the apparent intent?"
    options:
      - name: genuine_attack
        label: "Genuine Attack"
        description: "Intent to harm or offend"
      - name: satire
        label: "Satire/Parody"
        description: "Mocking toxic behavior"
      - name: quote
        label: "Quote/Report"
        description: "Reporting or discussing toxic content"
      - name: reclaimed
        label: "Reclaimed Language"
        description: "In-group use of slurs"
      - name: unclear
        label: "Unclear Intent"
Image Content Moderation
Visual Content Classification
annotation_schemes:
  - annotation_type: multiselect
    name: image_violations
    question: "Select all policy violations"
    options:
      - name: nudity
        label: "Nudity/Sexual Content"
      - name: violence_graphic
        label: "Graphic Violence"
      - name: gore
        label: "Gore/Disturbing Content"
      - name: hate_symbols
        label: "Hate Symbols"
      - name: dangerous_acts
        label: "Dangerous Acts"
      - name: child_safety
        label: "Child Safety Concern"
        priority: critical
        escalate: true
      - name: none
        label: "No Violations"
  - annotation_type: radio
    name: action_recommendation
    question: "Recommended action"
    options:
      - name: approve
        label: "Approve"
      - name: age_restrict
        label: "Age-Restrict"
      - name: warning_label
        label: "Add Warning Label"
      - name: remove
        label: "Remove"
      - name: escalate
        label: "Escalate to Specialist"
Quality Control
quality_control:
  # Calibration for subjective content
  calibration:
    enabled: true
    frequency: 20  # Every 20 items
    items: calibration/moderation_gold.json
    feedback: true
    recalibrate_on_drift: true
  # High redundancy for borderline cases
  redundancy:
    annotations_per_item: 3
    increase_for_borderline: 5
    agreement_threshold: 0.67
  # Expert escalation
  escalation:
    enabled: true
    triggers:
      - field: toxicity_level
        value: severe
      - field: image_violations
        contains: child_safety
    escalate_to: trust_safety_team
  # Distribution monitoring
  monitoring:
    track_distribution: true
    alert_on_skew: true
    expected_distribution:
      none: 0.4
      mild: 0.3
      moderate: 0.2
      severe: 0.1
Complete Configuration
annotation_task_name: "Content Moderation"
# Wellbeing first
wellbeing:
  warnings:
    enabled: true
    message: "This task contains potentially offensive content."
  breaks:
    reminder_interval: 30
    message: "Remember to take breaks."
  limits:
    max_session_duration: 90
    max_items_per_session: 75
display:
  # Blur sensitive content
  image_display:
    blur_by_default: true
    click_to_reveal: true
  # Show platform context
  metadata_display:
    show_fields: [platform, community, report_reason]
annotation_schemes:
  # Toxicity level
  - annotation_type: radio
    name: toxicity
    question: "Toxicity level"
    options:
      - name: none
        label: "None"
      - name: mild
        label: "Mild"
      - name: moderate
        label: "Moderate"
      - name: severe
        label: "Severe"
  # Categories
  - annotation_type: multiselect
    name: categories
    question: "Types of harmful content (select all)"
    options:
      - name: hate
        label: "Hate Speech"
      - name: harassment
        label: "Harassment"
      - name: violence
        label: "Violence/Threats"
      - name: sexual
        label: "Sexual Content"
      - name: self_harm
        label: "Self-Harm"
      - name: spam
        label: "Spam"
      - name: none
        label: "None"
  # Confidence
  - annotation_type: likert
    name: confidence
    question: "How confident are you?"
    size: 5
    min_label: "Uncertain"
    max_label: "Very Confident"
  # Notes
  - annotation_type: text
    name: notes
    question: "Additional notes (optional)"
    multiline: true
quality_control:
  redundancy:
    annotations_per_item: 3
  calibration:
    enabled: true
    frequency: 25
  escalation:
    enabled: true
    triggers:
      - field: toxicity
        value: severe
output_annotation_dir: annotations/
export_annotation_format: jsonl
Writing the guidelines
Most disagreement between annotators traces back to vague guidelines, so this is where the effort pays off. Spell out what separates "mild" from "moderate" instead of leaving annotators to guess. Show the edge cases, with explanations for why each one landed where it did. Say how platform and audience change the call, since the same sentence can be fine in one community and a violation in another. And tell annotators what to do when intent is genuinely unclear rather than pretending that never happens. Expect to revise all of this as new kinds of content show up.
Supporting annotators
Do not park one person on toxic content all day; rotate the work around. Keep mental health resources one click away, give people a real channel to flag concerns, and acknowledge that this is hard work. It is.
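One way to put rotation into practice is to cap how many sensitive items any one person receives in a day and hand out the rest round-robin. The sketch below is illustrative only: `assign_with_rotation` and its `daily_cap` parameter are hypothetical names made up for this example, not Potato settings, and the cap value is an arbitrary placeholder.

```python
from itertools import cycle


def assign_with_rotation(item_ids, annotators, daily_cap):
    """Spread items across annotators round-robin, never exceeding daily_cap each.

    Hypothetical helper, not part of Potato. Returns (assignments, unassigned):
    a dict of annotator -> item ids, and the items left over once everyone
    has reached the cap.
    """
    if not annotators:
        raise ValueError("annotators must not be empty")
    assignments = {name: [] for name in annotators}
    rotation = cycle(annotators)
    unassigned = []
    for item_id in item_ids:
        # Try each annotator once, in rotation order, until one has room.
        for _ in range(len(annotators)):
            name = next(rotation)
            if len(assignments[name]) < daily_cap:
                assignments[name].append(item_id)
                break
        else:
            # Everyone is at the cap; hold the item for a later day.
            unassigned.append(item_id)
    return assignments, unassigned


if __name__ == "__main__":
    assigned, leftover = assign_with_rotation(range(7), ["ana", "ben", "chi"], daily_cap=2)
    print(assigned)
    print(leftover)
```

Items that come back in the leftover list are held rather than pushed onto someone who has already reached the cap, which is the point of rotating in the first place.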
For how the underlying classification schemes work, see the text annotation documentation. The quality control guide covers redundancy and calibration in more detail.
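As a rough illustration of the redundancy settings used in this guide (three annotations per item, an agreement threshold of 0.67, and up to five annotations for borderline items), the sketch below aggregates toxicity labels by majority vote. It is not Potato's implementation. The function name `aggregate_item` is hypothetical, and reading the threshold as the share of annotators who chose the most common label is an assumption made for the example.

```python
from collections import Counter

# Values taken from the quality_control example in this guide.
INCREASE_FOR_BORDERLINE = 5
AGREEMENT_THRESHOLD = 0.67


def aggregate_item(labels):
    """Majority-vote one item's labels and report whether agreement is high enough.

    Hypothetical helper: agreement is assumed to be the fraction of
    annotators who chose the most common label.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    label, votes = Counter(labels).most_common(1)[0]
    agreement = votes / len(labels)
    # Round before comparing so that 2 of 3 (0.666...) counts as meeting 0.67.
    meets_threshold = round(agreement, 2) >= AGREEMENT_THRESHOLD
    # Ask for more annotations only while the item is below the borderline cap.
    needs_more = not meets_threshold and len(labels) < INCREASE_FOR_BORDERLINE
    return {
        "label": label if meets_threshold else None,
        "agreement": agreement,
        "needs_more_annotations": needs_more,
    }


if __name__ == "__main__":
    print(aggregate_item(["mild", "mild", "moderate"]))
    print(aggregate_item(["none", "mild", "severe"]))
```

Items that still fall short of the threshold after five annotations come back with no label and no request for more, which makes them natural candidates for expert review.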
Full documentation at /docs/core-concepts/annotation-schemes.