Diversity Ordering

Reorder annotation instances in Potato using embedding-based diversity scoring to maximize dataset coverage and reduce redundancy in large unlabeled corpora.

Diversity ordering uses sentence-transformer embeddings to group similar items into clusters, then samples items round-robin across those clusters. Annotators see varied content rather than runs of similar items.
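
The round-robin step can be sketched in a few lines of Python. This is an illustrative sketch only, not Potato's implementation: `round_robin_order` is a hypothetical helper, and the cluster labels are assumed to have been produced already (for example by k-means over the embeddings).

```python
from collections import defaultdict
from itertools import zip_longest


def round_robin_order(item_ids, cluster_labels):
    """Interleave items by taking one from each cluster in turn."""
    # Group item ids by cluster label, preserving the original order.
    clusters = defaultdict(list)
    for item_id, label in zip(item_ids, cluster_labels):
        clusters[label].append(item_id)

    # One pass takes at most one item per cluster; repeat until all are empty.
    _missing = object()
    ordered = []
    for batch in zip_longest(*clusters.values(), fillvalue=_missing):
        ordered.extend(item for item in batch if item is not _missing)
    return ordered


# Hypothetical data: three clusters of unequal size.
items = ["a1", "a2", "a3", "b1", "b2", "c1"]
labels = [0, 0, 0, 1, 1, 2]
order = round_robin_order(items, labels)
print(order)  # ['a1', 'b1', 'c1', 'a2', 'b2', 'a3']
```

When clusters differ in size, the tail of the ordering comes from the largest clusters, so the end of a queue can still contain runs of similar items.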

Benefits

Reduce annotator fatigue from repetitive content
Improve annotation quality through varied context
Cover the full topic space faster

Quick Start

yaml

assignment_strategy: diversity_clustering
 
diversity_ordering:
  enabled: true
  prefill_count: 100

How It Works

Startup: The first N items (N is set by prefill_count) are embedded with sentence-transformers and clustered with k-means
Assignment: Items are sampled round-robin from the clusters for variety
Annotation: New items are embedded asynchronously as they're annotated
Re-clustering: Once a user has sampled from the configured fraction of clusters (recluster_threshold; 1.0 means all clusters), the system re-clusters
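
The re-clustering trigger can be pictured as a simple fraction check. The sketch below assumes that `recluster_threshold` is compared against the share of distinct clusters a user has already sampled from; `should_recluster` is a hypothetical name, and Potato's actual bookkeeping may differ.

```python
def should_recluster(sampled_clusters, num_clusters, recluster_threshold=1.0):
    """Return True once the sampled fraction of clusters reaches the threshold."""
    if num_clusters <= 0:
        return False
    # Count each cluster once, however many items were drawn from it.
    fraction_sampled = len(set(sampled_clusters)) / num_clusters
    return fraction_sampled >= recluster_threshold


# With the default threshold of 1.0, every cluster must be sampled first.
print(should_recluster({0, 1, 2}, num_clusters=5))        # False
print(should_recluster({0, 1, 2, 3, 4}, num_clusters=5))  # True
# A lower threshold triggers re-clustering earlier.
print(should_recluster({0, 1, 2}, num_clusters=5, recluster_threshold=0.5))  # True
```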

Configuration

yaml

diversity_ordering:
  enabled: true
 
  # Sentence-transformer model
  model_name: "all-MiniLM-L6-v2"
 
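  # Clustering parameters
  num_clusters: 10              # Used only when auto_clusters is false
  items_per_cluster: 20         # Target cluster size when auto_clusters is true
  auto_clusters: true           # Auto-calculate cluster count based on data size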
 
  # Prefill on startup
  prefill_count: 100
  batch_size: 32
 
  # Re-clustering behavior
  recluster_threshold: 1.0      # Recluster when all clusters sampled
 
  # Order preservation
  preserve_visited: true
 
  # AI integration
  trigger_ai_prefetch: true

Configuration Reference

Option	Type	Default	Description
`enabled`	boolean	`false`	Enable diversity ordering
`model_name`	string	`"all-MiniLM-L6-v2"`	Sentence-transformers model
`num_clusters`	integer	`10`	Number of clusters (when `auto_clusters=false`)
`items_per_cluster`	integer	`20`	Target cluster size (when `auto_clusters=true`)
`auto_clusters`	boolean	`true`	Automatically calculate cluster count
`prefill_count`	integer	`100`	Items to embed at startup
`batch_size`	integer	`32`	Batch size for embedding computation
`recluster_threshold`	float	`1.0`	Fraction of clusters to sample before reclustering
`preserve_visited`	boolean	`true`	Keep visited/skipped items in place
`trigger_ai_prefetch`	boolean	`true`	Prefetch AI hints for reordered items
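
To show how `num_clusters`, `items_per_cluster`, and `auto_clusters` might interact, here is a rough sketch. The formula is an assumption (item count divided by the target cluster size, rounded up), and `resolve_num_clusters` is a hypothetical helper; the rule Potato actually applies may use different rounding or bounds.

```python
import math


def resolve_num_clusters(n_items, num_clusters=10, items_per_cluster=20,
                         auto_clusters=True):
    """Pick a cluster count from the config values (assumed formula)."""
    if n_items <= 0:
        return 0
    if auto_clusters:
        # Assumption: aim for roughly items_per_cluster items in each cluster.
        wanted = math.ceil(n_items / items_per_cluster)
    else:
        wanted = num_clusters
    # k-means cannot form more clusters than there are items.
    return max(1, min(wanted, n_items))


print(resolve_num_clusters(100))                       # 5
print(resolve_num_clusters(100, auto_clusters=False))  # 10
print(resolve_num_clusters(3, auto_clusters=False))    # 3
```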

Requirements

bash

pip install sentence-transformers scikit-learn

These are optional dependencies. Without them, the feature will be disabled with a warning.
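
If you want to check up front whether the optional packages are present, something like the following works with the standard library alone. It is a sketch of the general pattern, not Potato's own check; `diversity_ordering_available` is a hypothetical name. Note that the pip package scikit-learn is imported as `sklearn`, and sentence-transformers as `sentence_transformers`.

```python
import importlib.util
import warnings

# Import names for the two optional pip packages.
REQUIRED_MODULES = ("sentence_transformers", "sklearn")


def diversity_ordering_available(required=REQUIRED_MODULES):
    """Return True if every optional module can be found, else warn and return False."""
    # find_spec locates a top-level module without importing it.
    missing = [name for name in required if importlib.util.find_spec(name) is None]
    if missing:
        warnings.warn(
            "Diversity ordering disabled; missing optional packages: "
            + ", ".join(missing)
        )
        return False
    return True


print(diversity_ordering_available())
```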

Performance

Startup: ~10 seconds for 100 items, ~30 seconds for 500 items (first run only; embeddings are cached afterwards)
Memory: ~1.5 KB per item (all-MiniLM-L6-v2), so ~15 MB for 10,000 items
Cache: Embeddings persisted to disk in .diversity_cache/
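
The memory figures follow from the embedding size. all-MiniLM-L6-v2 produces 384-dimensional vectors; the sketch below additionally assumes they are stored as 32-bit floats, and it ignores any overhead from the clustering state or the on-disk cache format.

```python
EMBEDDING_DIM = 384   # output dimension of all-MiniLM-L6-v2
BYTES_PER_VALUE = 4   # assumption: embeddings stored as float32


def embedding_memory_bytes(n_items, dim=EMBEDDING_DIM, bytes_per_value=BYTES_PER_VALUE):
    """Raw size of the embedding matrix in bytes."""
    return n_items * dim * bytes_per_value


per_item = embedding_memory_bytes(1)
total = embedding_memory_bytes(10_000)
print(f"{per_item / 1024:.1f} KB per item")       # 1.5 KB per item
print(f"{total / 1e6:.1f} MB for 10,000 items")   # 15.4 MB for 10,000 items
```

A model with a larger output dimension scales these numbers proportionally.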

Interaction with Other Features

AI Support: When trigger_ai_prefetch: true, AI hints are automatically prefetched for reordered items
Active Learning: The two can be combined by starting with diversity clustering for initial coverage, then switching to active learning
Order Preservation: When preserve_visited: true, previously seen items maintain their position
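
One way to picture `preserve_visited` is as a reorder that only touches the slots of unvisited items. The sketch below is an assumption about the semantics, not Potato's code; `reorder_preserving_visited` is a hypothetical helper, and `diverse_order` is assumed to contain every unvisited item from `current_order`.

```python
def reorder_preserving_visited(current_order, visited, diverse_order):
    """Reorder unvisited items while visited ones keep their positions."""
    # Unvisited items, in their new diversity-based order.
    unvisited = iter([item for item in diverse_order if item not in visited])
    # Walk the current order: visited items stay put, open slots are refilled.
    return [item if item in visited else next(unvisited) for item in current_order]


# Hypothetical queue: the annotator has already seen a1 and a2.
current = ["a1", "a2", "a3", "b1", "b2", "c1"]
visited = {"a1", "a2"}
diverse = ["a1", "b1", "c1", "a2", "b2", "a3"]
new_order = reorder_preserving_visited(current, visited, diverse)
print(new_order)  # ['a1', 'a2', 'b1', 'c1', 'b2', 'a3']
```

Keeping visited items in place means navigation history (for example, going back to a previous item) stays stable after a reorder.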

Full Example

yaml

annotation_task_name: "Diversity Ordering Test"
 
assignment_strategy: diversity_clustering
 
diversity_ordering:
  enabled: true
  model_name: "all-MiniLM-L6-v2"
  num_clusters: 5
  auto_clusters: false
  prefill_count: 100
  batch_size: 16
  recluster_threshold: 1.0
  preserve_visited: true
 
annotation_schemes:
  - annotation_type: radio
    name: topic
    description: "What is the main topic of this text?"
    labels:
      - name: Sports
      - name: Technology
      - name: Food
      - name: Travel
      - name: Health
