Legal Document Annotation Best Practices
Annotate legal documents in Potato, including contracts, court filings, and regulatory text, using span labeling, entity extraction, and privacy-first self-hosted deployment.
Legal text is its own kind of annotation problem. The structure is dense, the vocabulary is specialized, and a mislabel can carry real legal consequences. This guide covers how to annotate it well.
What makes legal annotation hard
The jargon alone means you need trained annotators, not crowd workers picked at random. Documents run long, with contracts sometimes stretching to hundreds of pages, and sections constantly point at other sections. Precision matters more than usual, since a sloppy span can change the meaning of a clause. And much of that meaning depends on the document type and the jurisdiction, so context is never optional.
For the underlying span and text annotation mechanics, see the source documentation.
Document segmentation
Breaking down long documents
annotation_task_name: "Legal Document Annotation"
display:
  # Segment by section
  segmentation:
    enabled: true
    method: section_headers
    pattern: '^\d+\.\s+[A-Z]'
  # Show document context
  context:
    show_previous_section: true
    show_section_hierarchy: true
  # Navigation
  navigation:
    show_outline: true
    jump_to_section: true
Section-Level Annotation
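To see what splitting on the header pattern `^\d+\.\s+[A-Z]` does in practice, here is a minimal Python sketch. It is not Potato's implementation: the function name `split_into_sections` and the sample contract are made up for illustration, and the output fields simply mirror the example item shown in this section.

```python
import re

# Header pattern copied from the segmentation config: a number, a dot,
# whitespace, then a capital letter at the start of a line.
SECTION_HEADER = re.compile(r"^\d+\.\s+[A-Z]", re.MULTILINE)


def split_into_sections(document_id, text):
    """Split contract text into one item per top-level numbered section.

    Hypothetical helper for illustration; any text before the first
    matching header is dropped.
    """
    starts = [m.start() for m in SECTION_HEADER.finditer(text)]
    items = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        body = text[start:end].strip()
        header_line = body.split("\n", 1)[0]
        number, title = header_line.split(".", 1)
        items.append({
            "id": f"{document_id}_section_{number}",
            "text": body,
            "section_number": number,
            "section_title": title.strip(),
            "document_id": document_id,
        })
    return items


# Made-up sample text, not taken from a real contract.
contract = (
    "1. Definitions\n"
    "\"Software\" means the licensed program.\n"
    "2. License Grant\n"
    "The Licensor grants the Licensee a non-exclusive license.\n"
)
sections = split_into_sections("contract_001", contract)
```

Note that this pattern only matches top-level headers such as `2. License Grant`. A subsection header written as `3.2 License Grant` does not match, because the pattern expects whitespace right after the first dot, so a document numbered that way would need a broader pattern.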
data_files:
  - contracts.json
item_properties:
  id_key: id
  text_key: text
preprocessing:
  segment_by: sections
  preserve_metadata: true
  include_section_number: true
# Each section becomes an annotation item
# {
#   "id": "contract_001_section_3.2",
#   "text": "The Licensor grants...",
#   "section_number": "3.2",
#   "section_title": "License Grant",
#   "document_id": "contract_001"
# }
Legal Entity Recognition
Contract-Specific Entities
annotation_schemes:
  - annotation_type: span
    name: legal_entities
    labels:
      - name: PARTY
        color: "#FECACA"
        description: "Contracting parties (Licensor, Licensee, Company, etc.)"
      - name: DEFINED_TERM
        color: "#FDE68A"
        description: "Defined terms (usually capitalized)"
      - name: DATE
        color: "#BBF7D0"
        description: "Dates and time periods"
      - name: MONETARY
        color: "#C4B5FD"
        description: "Dollar amounts, fees, penalties"
      - name: OBLIGATION
        color: "#BFDBFE"
        description: "Must, shall, will obligations"
      - name: CONDITION
        color: "#FED7AA"
        description: "If, unless, provided that conditions"
      - name: REFERENCE
        color: "#E0E7FF"
        description: "References to other sections or documents"
Obligation Detection
annotation_schemes:
  - annotation_type: multiselect
    name: obligation_type
    question: "What type of obligation is this?"
    options:
      - name: performance
        label: "Performance Obligation"
        description: "Party must do something"
      - name: payment
        label: "Payment Obligation"
        description: "Party must pay"
      - name: restriction
        label: "Restriction/Prohibition"
        description: "Party must not do something"
      - name: condition
        label: "Conditional Obligation"
        description: "Obligation triggered by condition"
      - name: warranty
        label: "Warranty/Representation"
        description: "Statement of fact or promise"
Clause Classification
Contract Clause Types
annotation_schemes:
  - annotation_type: radio
    name: clause_type
    question: "What type of clause is this?"
    options:
      - name: definitions
        label: "Definitions"
      - name: grant
        label: "Grant of Rights/License"
      - name: consideration
        label: "Consideration/Payment"
      - name: term
        label: "Term and Termination"
      - name: representations
        label: "Representations & Warranties"
      - name: indemnification
        label: "Indemnification"
      - name: limitation
        label: "Limitation of Liability"
      - name: confidentiality
        label: "Confidentiality"
      - name: ip
        label: "Intellectual Property"
      - name: dispute
        label: "Dispute Resolution"
      - name: boilerplate
        label: "Boilerplate/Miscellaneous"
Risk Assessment
annotation_schemes:
  - annotation_type: likert
    name: risk_level
    question: "Rate the risk level of this clause for [Party]"
    min_label: "Low Risk"
    max_label: "High Risk"
    size: 5
  - annotation_type: text
    name: risk_notes
    question: "Explain the risk factors"
    multiline: true
    required_if:
      field: risk_level
      operator: ">="
      value: 4
Court Document Annotation
Case Information Extraction
annotation_schemes:
  - annotation_type: span
    name: case_entities
    labels:
      - name: CASE_NUMBER
        description: "Case identifier"
      - name: COURT
        description: "Court name and jurisdiction"
      - name: JUDGE
        description: "Presiding judge"
      - name: PLAINTIFF
        description: "Plaintiff/Petitioner"
      - name: DEFENDANT
        description: "Defendant/Respondent"
      - name: ATTORNEY
        description: "Attorneys/Legal representatives"
      - name: LEGAL_CITATION
        description: "Citations to cases, statutes, regulations"
      - name: RULING
        description: "Court's ruling or order"
Argument Structure
annotation_schemes:
  - annotation_type: span
    name: argument_structure
    labels:
      - name: CLAIM
        color: "#FECACA"
        description: "Main claim or assertion"
      - name: PREMISE
        color: "#BBF7D0"
        description: "Supporting premise"
      - name: EVIDENCE
        color: "#BFDBFE"
        description: "Evidence cited"
      - name: REBUTTAL
        color: "#FED7AA"
        description: "Counter-argument"
      - name: CONCLUSION
        color: "#E0E7FF"
        description: "Conclusion drawn"
Highlighting Legal Terms
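Keyword categories for legal modals overlap: "shall" is an obligation word, while "shall not" is a prohibition. The Python sketch below shows one way to resolve that, by trying longer phrases first. It is an illustration only and makes no claim about how Potato matches keywords internally; `build_matcher` and `find_keywords` are hypothetical names, and the keyword lists are a subset of the categories in this section.

```python
import re

# Subset of the keyword categories listed in this section.
CATEGORIES = {
    "obligation_words": ["shall", "must", "will", "agrees to"],
    "permission_words": ["may", "is permitted to"],
    "prohibition_words": ["shall not", "must not", "may not"],
    "condition_words": ["if", "unless", "provided that", "subject to"],
}


def build_matcher(categories):
    """Compile one regex that prefers longer phrases over their prefixes."""
    lookup = {kw: cat for cat, kws in categories.items() for kw in kws}
    # Longest first, so "shall not" wins over "shall".
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(kw) for kw in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE), lookup


def find_keywords(text, categories=CATEGORIES):
    """Return (start, end, matched_text, category) for each keyword hit."""
    regex, lookup = build_matcher(categories)
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in regex.finditer(text)
    ]


# Made-up clause for illustration.
clause = "Licensee shall not sublicense the Software unless Licensor agrees."
hits = find_keywords(clause)
```

Without the longest-first ordering, the alternation would match "shall" and stop, and the prohibition in this clause would be highlighted as an obligation. The word boundaries also keep short keywords such as "if" from matching inside longer words.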
display:
  keyword_highlighting:
    enabled: true
    categories:
      - name: obligation_words
        color: "#FEE2E2"
        keywords:
          - shall
          - must
          - will
          - agrees to
          - is required to
          - is obligated to
      - name: permission_words
        color: "#D1FAE5"
        keywords:
          - may
          - is permitted to
          - has the right to
          - is entitled to
      - name: prohibition_words
        color: "#FEF3C7"
        keywords:
          - shall not
          - must not
          - may not
          - is prohibited from
      - name: condition_words
        color: "#DBEAFE"
        keywords:
          - if
          - unless
          - provided that
          - subject to
          - contingent upon
          - in the event that
Quality Control for Legal
quality_control:
  # Require legal training
  qualification:
    required_training: legal_annotation_training
    training_accuracy: 0.85
  # Domain expertise check
  attention_checks:
    enabled: true
    items:
      - text: |
          "Notwithstanding any provision herein to the contrary,
          Licensee shall indemnify Licensor against all claims."
        expected:
          obligation_type: indemnification
          obligated_party: "Licensee"
        type: domain_knowledge
  # High agreement required
  redundancy:
    annotations_per_item: 3
    agreement_threshold: 0.8
    on_disagreement: expert_review
  # Expert review layer
  expert_review:
    enabled: true
    review_threshold: 0.7
    expert_users: [legal_expert_1, legal_expert_2]
Complete Legal Annotation Config
annotation_task_name: "Contract Clause Analysis"
display:
  text_display: html
  # Section context
  context:
    show_document_metadata: true
    show_section_hierarchy: true
  # Legal term highlighting
  keyword_highlighting:
    enabled: true
    categories:
      - name: obligations
        color: "#FEE2E2"
        keywords: [shall, must, will, agrees]
      - name: conditions
        color: "#DBEAFE"
        keywords: [if, unless, provided that, subject to]
      - name: defined_terms
        pattern: '\b[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+)*\b'
        color: "#FEF3C7"
annotation_schemes:
  # Clause type
  - annotation_type: radio
    name: clause_type
    question: "Classify this clause"
    options:
      - name: license_grant
        label: "License Grant"
      - name: payment
        label: "Payment/Consideration"
      - name: term
        label: "Term/Termination"
      - name: indemnification
        label: "Indemnification"
      - name: limitation
        label: "Limitation of Liability"
      - name: confidentiality
        label: "Confidentiality"
      - name: other
        label: "Other"
  # Entity spans
  - annotation_type: span
    name: entities
    labels:
      - name: PARTY
        color: "#FECACA"
      - name: DEFINED_TERM
        color: "#FDE68A"
      - name: MONETARY
        color: "#C4B5FD"
      - name: DATE
        color: "#BBF7D0"
      - name: OBLIGATION
        color: "#BFDBFE"
  # Risk assessment
  - annotation_type: likert
    name: risk
    question: "Risk level for the receiving party?"
    size: 5
    min_label: "Low"
    max_label: "High"
  # Key issues
  - annotation_type: text
    name: issues
    question: "Note any unusual or problematic language"
    multiline: true
quality_control:
  redundancy:
    annotations_per_item: 2
    agreement_threshold: 0.75
  qualification:
    required_training: true
    training_items: 20
    training_accuracy: 0.8
Writing annotator guidelines
Good guidelines for a legal task spell out the scope (which documents, which jurisdictions) and include a glossary so annotators read terms the same way you do. Be explicit about ambiguous language and about when a cross-reference should be annotated versus ignored. And say what counts as a correct span boundary, since "close enough" rarely is in this domain.
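A small Python sketch can make the span boundary point concrete. Two annotators can agree that a clause contains an obligation and still mark different character ranges. The overlap measure here (intersection over union of character offsets) is an illustrative choice, not a metric the guide or Potato prescribes, and the helper names are hypothetical.

```python
def span_iou(a, b):
    """Character-level overlap ratio of two (start, end) spans, end exclusive."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - overlap
    return overlap / union if union else 0.0


def compare_spans(a, b):
    """Classify a pair of same-label spans as exact, partial, or disjoint."""
    if a == b:
        return "exact"
    return "partial" if span_iou(a, b) > 0 else "disjoint"


text = "Licensee shall indemnify Licensor against all claims."
# Annotator 1 marks the obligation without the obligated party;
# annotator 2 includes the party in the span.
span_1 = (text.index("shall"), text.index(" against"))
span_2 = (0, text.index(" against"))
result = compare_spans(span_1, span_2)
```

Both spans here are defensible readings of OBLIGATION, which is why the guidelines need to state whether the obligated party belongs inside the span. Counting "partial" pairs per label is one way to find the labels whose boundary rules need tightening.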
Best practices
Use annotators who actually know the domain, because legal labeling does not work without it. Break long documents into sections people can hold in their head. Highlight the legal language that carries weight so it does not get skimmed past. Keep redundancy high, since the cost of an error is high. Add an expert review layer where attorneys handle the edge cases. Define each label precisely. And give annotators the surrounding structure and related sections, because a clause in isolation can mislead.
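The redundancy and expert review advice can be sketched as a simple routing rule. The Python below assumes agreement means the share of annotators who chose the majority label, which is only one possible definition and may not be how Potato computes `agreement_threshold`. The function names are hypothetical.

```python
from collections import Counter


def majority_agreement(labels):
    """Return (majority_label, share of annotators who chose it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def route_item(labels, agreement_threshold=0.8):
    """Accept the majority label or send the item to expert review."""
    label, share = majority_agreement(labels)
    if share >= agreement_threshold:
        return {"status": "accepted", "label": label, "agreement": share}
    return {"status": "expert_review", "label": None, "agreement": share}


# Three annotators per item, as in the quality control config.
unanimous = route_item(["indemnification", "indemnification", "indemnification"])
split = route_item(["indemnification", "indemnification", "limitation"])
```

Under this definition, three annotators and a threshold of 0.8 mean that only unanimous items are accepted automatically, since a two-to-one split scores about 0.67. That is a deliberately strict setting, and it is worth checking the expected expert workload before adopting it.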
Full documentation at /docs/core-concepts/annotation-schemes.