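Setting Up Content Moderation Annotation
Learn how to configure Potato for toxicity detection, hate speech classification, and sensitive content labeling while keeping annotator wellbeing in mind.
Labeling toxic content is not like labeling movie reviews. The work wears people down, the guidelines are always a little ambiguous, and the cases that matter most are usually the ones annotators disagree on. This guide covers how to set up a content moderation task in Potato that takes these problems seriously, and it starts with the people doing the work.
Annotator Wellbeing
Reading hateful and disturbing content for hours at a time takes a real toll on the people who do it. A few settings help keep the work bearable.
Wellbeing Configuration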
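The `limits` block in this section's configuration caps a session at 120 minutes and 100 items, with a 60-minute cooldown between sessions. As a rough illustration of what those three numbers mean in practice, here is a minimal sketch. It is not Potato's implementation; `SessionLimits`, `may_continue`, and `may_start_new_session` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionLimits:
    """Hypothetical mirror of the `limits` block (all durations in minutes)."""
    max_session_duration: int = 120
    max_items_per_session: int = 100
    cooldown_between_sessions: int = 60


def may_continue(limits: SessionLimits, minutes_elapsed: float, items_done: int) -> bool:
    """Return True while the annotator is under both the time cap and the item cap."""
    return (
        minutes_elapsed < limits.max_session_duration
        and items_done < limits.max_items_per_session
    )


def may_start_new_session(limits: SessionLimits, minutes_since_last: float) -> bool:
    """Return True once the cooldown since the previous session has passed."""
    return minutes_since_last >= limits.cooldown_between_sessions
```

Either cap ends the session on its own, so an annotator who works quickly stops at the item limit well before the time limit.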
wellbeing:
  # Content warnings
  warnings:
    enabled: true
    show_before_session: true
    message: |
      This task involves reviewing potentially offensive content including
      hate speech, harassment, and explicit material. Take breaks as needed.

  # Break reminders
  breaks:
    enabled: true
    reminder_interval: 30  # minutes
    break_duration: 5  # suggested minutes
    message: "Consider taking a short break. Your wellbeing matters."

  # Session limits
  limits:
    max_session_duration: 120  # minutes
    max_items_per_session: 100
    cooldown_between_sessions: 60  # minutes

  # Easy exit
  exit:
    allow_immediate_exit: true
    no_penalty_exit: true
    exit_button_prominent: true
    exit_message: "No problem. Take care of yourself."

  # Support resources
  resources:
    show_support_link: true
    support_url: "https://yourorg.com/support"
    hotline_number: "1-800-XXX-XXXX"

Content Blurring
display:
  # Blur images by default
  image_display:
    blur_by_default: true
    blur_amount: 20
    click_to_reveal: true
    reveal_duration: 10  # auto-blur after 10 seconds

  # Text content warnings
  text_display:
    show_severity_indicator: true
    expandable_content: true
    default_collapsed: true

Toxicity Classification
Multi-Level Toxicity

annotation_schemes:
  - annotation_type: radio
    name: toxicity_level
    question: "Rate the toxicity level of this content"
    options:
      - name: none
        label: "Not Toxic"
        description: "No harmful content"
      - name: mild
        label: "Mildly Toxic"
        description: "Rude or insensitive but not severe"
      - name: moderate
        label: "Moderately Toxic"
        description: "Clearly offensive or harmful"
      - name: severe
        label: "Severely Toxic"
        description: "Extremely offensive, threatening, or dangerous"

Toxicity Categories
annotation_schemes:
  - annotation_type: multiselect
    name: toxicity_types
    question: "Select all types of toxicity present"
    options:
      - name: profanity
        label: "Profanity/Obscenity"
        description: "Swear words, vulgar language"
      - name: insult
        label: "Insults"
        description: "Personal attacks, name-calling"
      - name: threat
        label: "Threats"
        description: "Threats of violence or harm"
      - name: hate_speech
        label: "Hate Speech"
        description: "Targeting protected groups"
      - name: harassment
        label: "Harassment"
        description: "Targeted, persistent hostility"
      - name: sexual
        label: "Sexual Content"
        description: "Explicit or suggestive content"
      - name: self_harm
        label: "Self-Harm/Suicide"
        description: "Promoting or glorifying self-harm"
      - name: misinformation
        label: "Misinformation"
        description: "Demonstrably false claims"
      - name: spam
        label: "Spam/Scam"
        description: "Unwanted promotional content"

Hate Speech Detection
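Both schemes in this section use `depends_on` so that they apply only when `hate_speech` was selected under `toxicity_types`. The sketch below illustrates how such a condition could be evaluated. It is not Potato's implementation: `condition_met` is a hypothetical helper, and the meaning of `contains` (membership in a multiselect answer) and `value` (equality with a single-choice answer) is assumed from how those keys are used in this guide.

```python
from typing import Any, Mapping


def condition_met(depends_on: Mapping[str, Any], answers: Mapping[str, Any]) -> bool:
    """Evaluate a depends_on rule against the answers given so far.

    Assumed semantics: `contains` tests membership in a multiselect answer,
    `value` tests equality with a single-choice answer.
    """
    answer = answers.get(depends_on["field"])
    if answer is None:
        return False  # the parent question has not been answered yet
    if "contains" in depends_on:
        return depends_on["contains"] in answer
    if "value" in depends_on:
        return answer == depends_on["value"]
    return True  # no condition beyond the parent field being answered


hate_rule = {"field": "toxicity_types", "contains": "hate_speech"}
print(condition_met(hate_rule, {"toxicity_types": ["insult", "hate_speech"]}))  # True
print(condition_met(hate_rule, {"toxicity_types": ["profanity"]}))  # False
```

Under these assumptions, an annotator who selects only "Profanity/Obscenity" would never see the hate speech follow-up questions.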
Target Groups

annotation_schemes:
  - annotation_type: multiselect
    name: target_groups
    question: "Which groups are targeted? (if hate speech detected)"
    depends_on:
      field: toxicity_types
      contains: hate_speech
    options:
      - name: race_ethnicity
        label: "Race/Ethnicity"
      - name: religion
        label: "Religion"
      - name: gender
        label: "Gender"
      - name: sexual_orientation
        label: "Sexual Orientation"
      - name: disability
        label: "Disability"
      - name: nationality
        label: "Nationality/Origin"
      - name: age
        label: "Age"
      - name: other
        label: "Other Protected Group"

Hate Speech Severity
annotation_schemes:
  - annotation_type: radio
    name: hate_severity
    question: "Severity of hate speech"
    depends_on:
      field: toxicity_types
      contains: hate_speech
    options:
      - name: implicit
        label: "Implicit"
        description: "Coded language, dog whistles"
      - name: explicit_mild
        label: "Explicit - Mild"
        description: "Clear but not threatening"
      - name: explicit_severe
        label: "Explicit - Severe"
        description: "Dehumanizing, threatening, or violent"

Context-Aware Moderation
Platform-Specific Rules

# Context affects what's acceptable
annotation_schemes:
  - annotation_type: radio
    name: context_appropriate
    question: "Is this content appropriate for the platform context?"
    context_info:
      platform: "{{metadata.platform}}"
      community: "{{metadata.community}}"
      audience: "{{metadata.audience}}"
    options:
      - name: appropriate
        label: "Appropriate for Context"
      - name: borderline
        label: "Borderline"
      - name: inappropriate
        label: "Inappropriate for Context"

  - annotation_type: text
    name: context_notes
    question: "Explain your contextual reasoning"
    depends_on:
      field: context_appropriate
      value: borderline

Intent Detection
annotation_schemes:
  - annotation_type: radio
    name: intent
    question: "What is the apparent intent?"
    options:
      - name: genuine_attack
        label: "Genuine Attack"
        description: "Intent to harm or offend"
      - name: satire
        label: "Satire/Parody"
        description: "Mocking toxic behavior"
      - name: quote
        label: "Quote/Report"
        description: "Reporting or discussing toxic content"
      - name: reclaimed
        label: "Reclaimed Language"
        description: "In-group use of slurs"
      - name: unclear
        label: "Unclear Intent"

Image Content Moderation
Visual Content Classification

annotation_schemes:
  - annotation_type: multiselect
    name: image_violations
    question: "Select all policy violations"
    options:
      - name: nudity
        label: "Nudity/Sexual Content"
      - name: violence_graphic
        label: "Graphic Violence"
      - name: gore
        label: "Gore/Disturbing Content"
      - name: hate_symbols
        label: "Hate Symbols"
      - name: dangerous_acts
        label: "Dangerous Acts"
      - name: child_safety
        label: "Child Safety Concern"
        priority: critical
        escalate: true
      - name: none
        label: "No Violations"

  - annotation_type: radio
    name: action_recommendation
    question: "Recommended action"
    options:
      - name: approve
        label: "Approve"
      - name: age_restrict
        label: "Age-Restrict"
      - name: warning_label
        label: "Add Warning Label"
      - name: remove
        label: "Remove"
      - name: escalate
        label: "Escalate to Specialist"

Quality Control
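The redundancy settings in this section request three annotations per item, raise that to five for borderline items, and set an agreement threshold of 0.67. One plausible reading of those numbers is sketched here, assuming agreement means the share of annotators who chose the most common label. That definition, and the helper names `majority_share` and `needs_more_annotations`, are assumptions made for illustration rather than Potato's documented behavior.

```python
from collections import Counter
from typing import Sequence


def majority_share(labels: Sequence[str]) -> float:
    """Fraction of annotators who chose the most common label."""
    if not labels:
        raise ValueError("need at least one label")
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels)


def needs_more_annotations(
    labels: Sequence[str], threshold: float = 0.67, target: int = 5
) -> bool:
    """True when agreement is below the threshold and fewer than `target` labels exist."""
    return majority_share(labels) < threshold and len(labels) < target


print(round(majority_share(["mild", "mild", "moderate"]), 3))  # 0.667
print(needs_more_annotations(["mild", "mild", "moderate"]))  # True
```

Under this reading, a 2-of-3 majority gives 0.667, which is just below 0.67, so only unanimous triples would pass without extra annotations. Whether the threshold is compared this way is not stated in this guide, so treat the numbers as illustrative.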
quality_control:
  # Calibration for subjective content
  calibration:
    enabled: true
    frequency: 20  # Every 20 items
    items: calibration/moderation_gold.json
    feedback: true
    recalibrate_on_drift: true

  # High redundancy for borderline cases
  redundancy:
    annotations_per_item: 3
    increase_for_borderline: 5
    agreement_threshold: 0.67

  # Expert escalation
  escalation:
    enabled: true
    triggers:
      - field: toxicity_level
        value: severe
      - field: image_violations
        contains: child_safety
    escalate_to: trust_safety_team

  # Distribution monitoring
  monitoring:
    track_distribution: true
    alert_on_skew: true
    expected_distribution:
      none: 0.4
      mild: 0.3
      moderate: 0.2
      severe: 0.1

Full Configuration
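When the individual snippets are combined into one file, names have to stay consistent: the full configuration in this section calls its first scheme `toxicity`, while the standalone snippets earlier use `toxicity_level`. Below is a minimal sketch of a consistency check, assuming that `depends_on` and escalation triggers refer to schemes by their `name`. The `undefined_fields` helper is hypothetical and not part of Potato.

```python
from typing import Any, Mapping


def undefined_fields(config: Mapping[str, Any]) -> list[str]:
    """List `field` references that do not match any scheme name.

    Looks at `depends_on` inside each scheme and at escalation triggers,
    assuming both refer to schemes by their `name`.
    """
    schemes = config.get("annotation_schemes", [])
    defined = {scheme["name"] for scheme in schemes}
    referenced = [s["depends_on"]["field"] for s in schemes if "depends_on" in s]
    triggers = config.get("quality_control", {}).get("escalation", {}).get("triggers", [])
    referenced += [trigger["field"] for trigger in triggers]
    return [field for field in referenced if field not in defined]


# Trimmed-down stand-in for a parsed config file (made-up for this example).
config = {
    "annotation_schemes": [{"name": "toxicity"}, {"name": "categories"}],
    "quality_control": {
        "escalation": {"triggers": [{"field": "toxicity_level", "value": "severe"}]}
    },
}
print(undefined_fields(config))  # ['toxicity_level']
```

Here the trigger still points at `toxicity_level` after the scheme was renamed to `toxicity`, which is exactly the kind of mismatch the check is meant to surface.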
annotation_task_name: "Content Moderation"

# Wellbeing first
wellbeing:
  warnings:
    enabled: true
    message: "This task contains potentially offensive content."
  breaks:
    reminder_interval: 30
    message: "Remember to take breaks."
  limits:
    max_session_duration: 90
    max_items_per_session: 75

display:
  # Blur sensitive content
  image_display:
    blur_by_default: true
    click_to_reveal: true

  # Show platform context
  metadata_display:
    show_fields: [platform, community, report_reason]

annotation_schemes:
  # Toxicity level
  - annotation_type: radio
    name: toxicity
    question: "Toxicity level"
    options:
      - name: none
        label: "None"
      - name: mild
        label: "Mild"
      - name: moderate
        label: "Moderate"
      - name: severe
        label: "Severe"

  # Categories
  - annotation_type: multiselect
    name: categories
    question: "Types of harmful content (select all)"
    options:
      - name: hate
        label: "Hate Speech"
      - name: harassment
        label: "Harassment"
      - name: violence
        label: "Violence/Threats"
      - name: sexual
        label: "Sexual Content"
      - name: self_harm
        label: "Self-Harm"
      - name: spam
        label: "Spam"
      - name: none
        label: "None"

  # Confidence
  - annotation_type: likert
    name: confidence
    question: "How confident are you?"
    size: 5
    min_label: "Uncertain"
    max_label: "Very Confident"

  # Notes
  - annotation_type: text
    name: notes
    question: "Additional notes (optional)"
    multiline: true

quality_control:
  redundancy:
    annotations_per_item: 3
  calibration:
    enabled: true
    frequency: 25
  escalation:
    enabled: true
    triggers:
      - field: toxicity
        value: severe

output_annotation_dir: annotations/
export_annotation_format: jsonl

Writing Guidelines
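Most disagreement between annotators traces back to ambiguous guidelines, so effort spent here pays off. Spell out what separates "mild" from "moderate" instead of leaving annotators to guess. Show borderline cases, and explain why each one landed where it did. The same sentence can be fine in one community and a violation in another, so state how platform and audience change the judgment. Tell annotators what to do when intent is genuinely unclear, rather than pretending that never happens. Expect to revise all of this as new kinds of content appear.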
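One way to find out which boundaries the guidelines leave unclear is to count which pairs of labels annotators most often split between on the same item. The sketch below does that for a list of per-item label sets. The `disagreement_pairs` helper and the sample data are made up for illustration, and how annotations are loaded from the exported files is left out.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, Sequence


def disagreement_pairs(items: Iterable[Sequence[str]]) -> Counter:
    """Count how often each unordered pair of different labels was given to the same item."""
    pairs: Counter = Counter()
    for labels in items:
        for a, b in combinations(labels, 2):
            if a != b:
                pairs[tuple(sorted((a, b)))] += 1
    return pairs


# Each inner list holds the labels three annotators gave one item (made-up data).
annotations = [
    ["mild", "moderate", "mild"],
    ["none", "none", "none"],
    ["moderate", "mild", "severe"],
]
print(disagreement_pairs(annotations).most_common(1))  # [(('mild', 'moderate'), 3)]
```

A pair that dominates the count, such as mild versus moderate here, points at the part of the guidelines that most needs a sharper definition or more worked examples.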
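Supporting Annotators
Rotate assignments rather than keeping one person on toxic content all day. Keep mental health resources one click away, give people a real channel for raising concerns, and acknowledge that this is hard work. It is.
For how the underlying classification schemes work, see the text annotation documentation. The quality control guide covers redundancy and calibration in more detail.
The full documentation is available at /docs/core-concepts/annotation-schemes.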