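Legal Document Annotation Best Practices
Specialized techniques for annotating contracts, court filings, and regulatory filings, all of which require domain expertise.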
Potato Team
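Legal documents call for a specialized annotation approach because of their complex structure, domain-specific terminology, and the high cost of errors. This guide covers strategies for annotating legal text effectively.
Challenges in Legal Annotation
- Dense terminology: legal language requires trained annotators
- Long documents: contracts can run to hundreds of pages
- Cross-references: sections refer to one another
- Precision requirements: errors can have legal consequences
- Context dependence: meaning depends on document type and jurisdiction
Document Segmentation
Splitting Long Documents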
yaml
annotation_task_name: "Legal Document Annotation"
display:
  # Segment by section
  segmentation:
    enabled: true
    method: section_headers
    pattern: '^\d+\.\s+[A-Z]'
  # Show document context
  context:
    show_previous_section: true
    show_section_hierarchy: true
  # Navigation
  navigation:
    show_outline: true
    jump_to_section: true
Section-Level Annotation
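To make section-level segmentation concrete, here is a minimal Python sketch of splitting a contract into one item per numbered section. It is an illustration only, not Potato's implementation: the function name `split_into_sections` and the sample contract are hypothetical, the regular expression is the one from the `segmentation.pattern` setting, and the output fields mirror the item shape shown in the config comments in this section.

```python
import re

# Same expression as `segmentation.pattern`; MULTILINE lets ^ match at every line start.
SECTION_HEADER = re.compile(r"^\d+\.\s+[A-Z]", re.MULTILINE)
# Captures the section number and the rest of the header line as the title.
HEADER_PARTS = re.compile(r"(\d+)\.\s+([^\n]*)")


def split_into_sections(document_id: str, text: str) -> list[dict]:
    """Return one annotation item per top-level numbered section.

    Any text before the first numbered header (a preamble) is dropped.
    """
    starts = [m.start() for m in SECTION_HEADER.finditer(text)]
    items = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        block = text[start:end].strip()
        header = HEADER_PARTS.match(block)  # always matches: block starts at a header
        number = header.group(1)
        items.append({
            "id": f"{document_id}_section_{number}",
            "text": block[header.end():].strip(),
            "section_number": number,
            "section_title": header.group(2).strip(),
            "document_id": document_id,
        })
    return items


contract = """1. Definitions
"Software" means the licensed program.

2. License Grant
The Licensor grants the Licensee a non-exclusive license.
"""

for item in split_into_sections("contract_001", contract):
    print(item["id"], "|", item["section_title"])
```

Note that this pattern only matches top-level headers such as `2. License Grant`. Catching subsections such as `3.2` would need a broader expression, for example one that allows repeated `.\d+` groups.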
yaml
data_files:
  - contracts.json
item_properties:
  id_key: id
  text_key: text
preprocessing:
  segment_by: sections
  preserve_metadata: true
  include_section_number: true
# Each section becomes an annotation item
# {
#   "id": "contract_001_section_3.2",
#   "text": "The Licensor grants...",
#   "section_number": "3.2",
#   "section_title": "License Grant",
#   "document_id": "contract_001"
# }
Legal Entity Recognition
Contract-Specific Entities
yaml
annotation_schemes:
  - annotation_type: span
    name: legal_entities
    labels:
      - name: PARTY
        color: "#FECACA"
        description: "Contracting parties (Licensor, Licensee, Company, etc.)"
      - name: DEFINED_TERM
        color: "#FDE68A"
        description: "Defined terms (usually capitalized)"
      - name: DATE
        color: "#BBF7D0"
        description: "Dates and time periods"
      - name: MONETARY
        color: "#C4B5FD"
        description: "Dollar amounts, fees, penalties"
      - name: OBLIGATION
        color: "#BFDBFE"
        description: "Must, shall, will obligations"
      - name: CONDITION
        color: "#FED7AA"
        description: "If, unless, provided that conditions"
      - name: REFERENCE
        color: "#E0E7FF"
        description: "References to other sections or documents"
Obligation Detection
yaml
annotation_schemes:
  - annotation_type: multiselect
    name: obligation_type
    question: "What type of obligation is this?"
    options:
      - name: performance
        label: "Performance Obligation"
        description: "Party must do something"
      - name: payment
        label: "Payment Obligation"
        description: "Party must pay"
      - name: restriction
        label: "Restriction/Prohibition"
        description: "Party must not do something"
      - name: condition
        label: "Conditional Obligation"
        description: "Obligation triggered by condition"
      - name: warranty
        label: "Warranty/Representation"
        description: "Statement of fact or promise"
Clause Classification
Contract Clause Types
yaml
annotation_schemes:
  - annotation_type: radio
    name: clause_type
    question: "What type of clause is this?"
    options:
      - name: definitions
        label: "Definitions"
      - name: grant
        label: "Grant of Rights/License"
      - name: consideration
        label: "Consideration/Payment"
      - name: term
        label: "Term and Termination"
      - name: representations
        label: "Representations & Warranties"
      - name: indemnification
        label: "Indemnification"
      - name: limitation
        label: "Limitation of Liability"
      - name: confidentiality
        label: "Confidentiality"
      - name: ip
        label: "Intellectual Property"
      - name: dispute
        label: "Dispute Resolution"
      - name: boilerplate
        label: "Boilerplate/Miscellaneous"
Risk Assessment
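The configuration in this section makes the free-text explanation mandatory only when the risk rating is 4 or higher. The following Python sketch shows one way such a conditional-requirement rule could be checked before a response is accepted. It is a hypothetical illustration: the helper `missing_required_fields`, the rule dictionary layout, and the set of supported operators are assumptions rather than Potato's actual validation code.

```python
import operator

# Assumed mapping from `required_if.operator` strings to comparisons.
OPERATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "==": operator.eq,
    "<=": operator.le,
    "<": operator.lt,
}


def missing_required_fields(response: dict, rules: dict) -> list[str]:
    """Return fields whose `required_if` condition holds but which were left empty."""
    missing = []
    for field, rule in rules.items():
        trigger_value = response.get(rule["field"])
        if trigger_value is None:
            continue  # the triggering question has not been answered yet
        condition_holds = OPERATORS[rule["operator"]](trigger_value, rule["value"])
        answer = (response.get(field) or "").strip()
        if condition_holds and not answer:
            missing.append(field)
    return missing


# Mirrors the `required_if` block attached to `risk_notes`.
rules = {"risk_notes": {"field": "risk_level", "operator": ">=", "value": 4}}

print(missing_required_fields({"risk_level": 5, "risk_notes": ""}, rules))
print(missing_required_fields({"risk_level": 2}, rules))
print(missing_required_fields({"risk_level": 4, "risk_notes": "Uncapped indemnity"}, rules))
```

Only the first response is flagged: a rating of 5 with no explanation. Low ratings and explained high ratings pass.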
yaml
annotation_schemes:
  - annotation_type: likert
    name: risk_level
    question: "Rate the risk level of this clause for [Party]"
    min_label: "Low Risk"
    max_label: "High Risk"
    size: 5
  - annotation_type: text
    name: risk_notes
    question: "Explain the risk factors"
    multiline: true
    required_if:
      field: risk_level
      operator: ">="
      value: 4
Court Document Annotation
Case Information Extraction
yaml
annotation_schemes:
  - annotation_type: span
    name: case_entities
    labels:
      - name: CASE_NUMBER
        description: "Case identifier"
      - name: COURT
        description: "Court name and jurisdiction"
      - name: JUDGE
        description: "Presiding judge"
      - name: PLAINTIFF
        description: "Plaintiff/Petitioner"
      - name: DEFENDANT
        description: "Defendant/Respondent"
      - name: ATTORNEY
        description: "Attorneys/Legal representatives"
      - name: LEGAL_CITATION
        description: "Citations to cases, statutes, regulations"
      - name: RULING
        description: "Court's ruling or order"
Argument Structure
yaml
annotation_schemes:
  - annotation_type: span
    name: argument_structure
    labels:
      - name: CLAIM
        color: "#FECACA"
        description: "Main claim or assertion"
      - name: PREMISE
        color: "#BBF7D0"
        description: "Supporting premise"
      - name: EVIDENCE
        color: "#BFDBFE"
        description: "Evidence cited"
      - name: REBUTTAL
        color: "#FED7AA"
        description: "Counter-argument"
      - name: CONCLUSION
        color: "#E0E7FF"
        description: "Conclusion drawn"
Legal Term Highlighting
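Keyword highlighting boils down to matching each category's phrase list against the clause text. One subtlety is that phrases overlap: "shall not" (a prohibition) contains "shall" (an obligation). The Python sketch below handles this by trying longer phrases first. It is a hypothetical illustration, not Potato's highlighter; the function `find_highlights` is an assumed name, and the phrase lists are a subset of the categories configured in this section.

```python
import re

CATEGORIES = {
    "obligation_words": ["shall", "must", "will", "agrees to"],
    "permission_words": ["may", "is entitled to"],
    "prohibition_words": ["shall not", "must not", "may not"],
    "condition_words": ["if", "unless", "provided that", "subject to"],
}

# Reverse lookup from phrase to category.
PHRASE_TO_CATEGORY = {
    phrase: category
    for category, phrases in CATEGORIES.items()
    for phrase in phrases
}

# Longest phrases first, so "shall not" wins over "shall" at the same position.
_alternatives = "|".join(
    re.escape(p) for p in sorted(PHRASE_TO_CATEGORY, key=len, reverse=True)
)
PATTERN = re.compile(rf"\b(?:{_alternatives})\b", re.IGNORECASE)


def find_highlights(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, category) for every keyword match, in text order."""
    return [
        (m.start(), m.end(), PHRASE_TO_CATEGORY[m.group(0).lower()])
        for m in PATTERN.finditer(text)
    ]


clause = "Licensee shall not sublicense the Software unless Licensor agrees to it in writing."
for start, end, category in find_highlights(clause):
    print(f"{clause[start:end]!r} -> {category}")
```

Plain keyword matching will also flag unrelated uses of words like "may" or "will" (a month, a testament), so highlights are best treated as hints for annotators rather than as labels.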
yaml
display:
  keyword_highlighting:
    enabled: true
    categories:
      - name: obligation_words
        color: "#FEE2E2"
        keywords:
          - shall
          - must
          - will
          - agrees to
          - is required to
          - is obligated to
      - name: permission_words
        color: "#D1FAE5"
        keywords:
          - may
          - is permitted to
          - has the right to
          - is entitled to
      - name: prohibition_words
        color: "#FEF3C7"
        keywords:
          - shall not
          - must not
          - may not
          - is prohibited from
      - name: condition_words
        color: "#DBEAFE"
        keywords:
          - if
          - unless
          - provided that
          - subject to
          - contingent upon
          - in the event that
Quality Control for Legal Annotation
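The `redundancy` settings in this section's configuration collect several annotations per item and send disagreements to expert review. The Python sketch below illustrates that routing decision. It assumes agreement is measured as the share of annotators who chose the most common label, which is only one possible definition and may differ from how Potato computes it; the function names are hypothetical.

```python
from collections import Counter


def majority_agreement(labels: list[str]) -> tuple[str, float]:
    """Return the most common label and the fraction of annotators who chose it."""
    if not labels:
        raise ValueError("at least one annotation is required")
    top_label, top_count = Counter(labels).most_common(1)[0]
    return top_label, top_count / len(labels)


def route(labels: list[str], agreement_threshold: float = 0.8) -> dict:
    """Accept the majority label, or escalate to expert review on low agreement."""
    label, agreement = majority_agreement(labels)
    if agreement >= agreement_threshold:
        return {"status": "accepted", "label": label, "agreement": agreement}
    return {"status": "expert_review", "label": None, "agreement": agreement}


unanimous = ["indemnification", "indemnification", "indemnification"]
split_vote = ["indemnification", "indemnification", "limitation"]

print(route(unanimous)["status"])
print(route(split_vote)["status"])
```

Under this definition, three annotators can only produce agreement values of 1/3, 2/3, or 1, so a threshold of 0.8 accepts unanimous items only and escalates every 2-to-1 split. That is a deliberately strict setting, in keeping with the high cost of legal errors.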
yaml
quality_control:
  # Require legal training
  qualification:
    required_training: legal_annotation_training
    training_accuracy: 0.85
  # Domain expertise check
  attention_checks:
    enabled: true
    items:
      - text: |
          "Notwithstanding any provision herein to the contrary,
          Licensee shall indemnify Licensor against all claims."
        expected:
          obligation_type: indemnification
          obligated_party: "Licensee"
        type: domain_knowledge
  # High agreement required
  redundancy:
    annotations_per_item: 3
    agreement_threshold: 0.8
    on_disagreement: expert_review
  # Expert review layer
  expert_review:
    enabled: true
    review_threshold: 0.7
    expert_users: [legal_expert_1, legal_expert_2]
Complete Legal Annotation Configuration
yaml
annotation_task_name: "Contract Clause Analysis"
display:
  text_display: html
  # Section context
  context:
    show_document_metadata: true
    show_section_hierarchy: true
  # Legal term highlighting
  keyword_highlighting:
    enabled: true
    categories:
      - name: obligations
        color: "#FEE2E2"
        keywords: [shall, must, will, agrees]
      - name: conditions
        color: "#DBEAFE"
        keywords: [if, unless, provided that, subject to]
      - name: defined_terms
        pattern: '\b[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+)*\b'
        color: "#FEF3C7"
annotation_schemes:
  # Clause type
  - annotation_type: radio
    name: clause_type
    question: "Classify this clause"
    options:
      - name: license_grant
        label: "License Grant"
      - name: payment
        label: "Payment/Consideration"
      - name: term
        label: "Term/Termination"
      - name: indemnification
        label: "Indemnification"
      - name: limitation
        label: "Limitation of Liability"
      - name: confidentiality
        label: "Confidentiality"
      - name: other
        label: "Other"
  # Entity spans
  - annotation_type: span
    name: entities
    labels:
      - name: PARTY
        color: "#FECACA"
      - name: DEFINED_TERM
        color: "#FDE68A"
      - name: MONETARY
        color: "#C4B5FD"
      - name: DATE
        color: "#BBF7D0"
      - name: OBLIGATION
        color: "#BFDBFE"
  # Risk assessment
  - annotation_type: likert
    name: risk
    question: "Risk level for the receiving party?"
    size: 5
    min_label: "Low"
    max_label: "High"
  # Key issues
  - annotation_type: text
    name: issues
    question: "Note any unusual or problematic language"
    multiline: true
quality_control:
  redundancy:
    annotations_per_item: 2
    agreement_threshold: 0.75
  qualification:
    required_training: true
    training_items: 20
    training_accuracy: 0.8
Annotator Guidelines Example
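When creating legal annotation guidelines:
- Define the scope: which documents and which jurisdictions
- Glossary: define legal terms for annotators
- Edge cases: how to handle ambiguous language
- Cross-references: when to annotate references and when to ignore them
- Precision requirements: exact span boundaries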
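The span-boundary rule in particular is easy to enforce mechanically. As a rough sketch, a post-processing step could tighten each selected span so it never starts or ends on whitespace or stray punctuation. The helper `tighten_span` and the set of trimmed characters are assumptions for illustration; real guidelines may, for instance, want to keep the trailing period in abbreviations such as "Inc.".

```python
# Characters that should not sit at the edge of a span (an assumed convention).
TRIM_CHARS = " \t\n.,;:"


def tighten_span(text: str, start: int, end: int) -> tuple[int, int]:
    """Shrink the half-open span [start, end) until neither edge is a trimmed character."""
    while start < end and text[start] in TRIM_CHARS:
        start += 1
    while end > start and text[end - 1] in TRIM_CHARS:
        end -= 1
    return start, end


clause = "Licensee shall pay Licensor $10,000 within 30 days."

# A sloppy MONETARY selection that grabbed the spaces on both sides.
sloppy_start = clause.find("$") - 1
sloppy_end = clause.find(" within") + 1

start, end = tighten_span(clause, sloppy_start, sloppy_end)
print(repr(clause[sloppy_start:sloppy_end]), "->", repr(clause[start:end]))
```

Only the edges are trimmed, so punctuation inside the span, such as the comma in the amount, is preserved.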
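Best Practices
- Use trained annotators: legal annotation requires domain knowledge
- Segment long documents: split them into manageable sections
- Highlight key terms: draw attention to legal language
- Use high redundancy: legal errors are costly
- Add an expert review layer: have attorneys review edge cases
- Write clear guidelines: define exactly what each label means
- Annotate in context: show document structure and related sections
For full documentation, see /docs/core-concepts/annotation-types.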