Diversity Ordering

Embedding-based item diversification to maximize annotation variety.

Diversity ordering uses sentence-transformer embeddings to cluster similar items together, then samples items round-robin from different clusters. This ensures annotators see diverse content rather than similar items in sequence.

Benefits

  • Reduce annotator fatigue from repetitive content
  • Improve annotation quality through varied context
  • Faster coverage of the full topic space

Quick Start

yaml
assignment_strategy: diversity_clustering
 
diversity_ordering:
  enabled: true
  prefill_count: 100

How It Works

  1. Startup: The first N items (N = prefill_count) are embedded using sentence-transformers and clustered with k-means
  2. Assignment: Items are sampled round-robin from clusters, ensuring variety
  3. Annotation: New items are embedded asynchronously as they're annotated
  4. Re-clustering: Once a user has sampled from the configured fraction of clusters (recluster_threshold; 1.0 means all clusters), the system re-clusters
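
The round-robin step above can be illustrated with a minimal sketch. It assumes cluster labels have already been produced (for example, by k-means over the embeddings); the function name `round_robin_order` and the choice to visit clusters in sorted label order are hypothetical and not taken from the actual implementation.

```python
from collections import defaultdict, deque


def round_robin_order(cluster_labels):
    """Return item indices interleaved across clusters (illustrative only)."""
    # Group item indices by cluster, keeping the original order inside each cluster.
    queues = defaultdict(deque)
    for index, label in enumerate(cluster_labels):
        queues[label].append(index)

    order = []
    active = sorted(queues)  # assumption: clusters are visited in label order
    while active:
        still_active = []
        for label in active:
            # Take one item from each cluster per pass.
            order.append(queues[label].popleft())
            if queues[label]:
                still_active.append(label)
        active = still_active
    return order


# Items 0-2 are in cluster 0, items 3-4 in cluster 1, item 5 in cluster 2.
print(round_robin_order([0, 0, 0, 1, 1, 2]))  # [0, 3, 5, 1, 4, 2]
```

Consecutive items come from different clusters until the smaller clusters run out, after which the remaining items of the largest cluster are served.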

Configuration

yaml
diversity_ordering:
  enabled: true
 
  # Sentence-transformer model
  model_name: "all-MiniLM-L6-v2"
 
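  # Clustering parameters
  num_clusters: 10              # Used only when auto_clusters is false
  items_per_cluster: 20         # Target cluster size when auto_clusters is true
  auto_clusters: true           # Auto-calculate cluster count based on data size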
 
  # Prefill on startup
  prefill_count: 100
  batch_size: 32
 
  # Re-clustering behavior
  recluster_threshold: 1.0      # Recluster when all clusters sampled
 
  # Order preservation
  preserve_visited: true
 
  # AI integration
  trigger_ai_prefetch: true

Configuration Reference

| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Enable diversity ordering |
| model_name | string | "all-MiniLM-L6-v2" | Sentence-transformers model |
| num_clusters | integer | 10 | Number of clusters (when auto_clusters=false) |
| items_per_cluster | integer | 20 | Target cluster size (when auto_clusters=true) |
| auto_clusters | boolean | true | Automatically calculate cluster count |
| prefill_count | integer | 100 | Items to embed at startup |
| batch_size | integer | 32 | Batch size for embedding computation |
| recluster_threshold | float | 1.0 | Fraction of clusters to sample before reclustering |
| preserve_visited | boolean | true | Keep visited/skipped items in place |
| trigger_ai_prefetch | boolean | true | Trigger AI cache after reordering |
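
To make the interplay between num_clusters, items_per_cluster, auto_clusters, and recluster_threshold concrete, here is a rough sketch. The exact formula the feature uses is not documented here; dividing the item count by items_per_cluster and rounding up is an assumption, and both function names are hypothetical.

```python
import math


def effective_num_clusters(n_items, num_clusters=10, items_per_cluster=20,
                           auto_clusters=True):
    """Pick a cluster count (assumed formula, not the documented one)."""
    if n_items <= 0:
        return 0
    if auto_clusters:
        # Assumption: aim for roughly items_per_cluster items in each cluster.
        k = math.ceil(n_items / items_per_cluster)
    else:
        k = num_clusters
    # k-means cannot use more clusters than there are items.
    return max(1, min(k, n_items))


def should_recluster(sampled_clusters, total_clusters, recluster_threshold=1.0):
    """True once the fraction of sampled clusters reaches the threshold."""
    if total_clusters == 0:
        return False
    return len(sampled_clusters) / total_clusters >= recluster_threshold


print(effective_num_clusters(100))                       # 5
print(effective_num_clusters(100, auto_clusters=False))  # 10
print(should_recluster({0, 1, 2}, 5))                    # False
print(should_recluster({0, 1, 2}, 5, 0.5))               # True
```

Under this reading, the default prefill of 100 items with auto_clusters enabled would yield 5 clusters, and lowering recluster_threshold triggers re-clustering before every cluster has been visited.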

Requirements

bash
pip install sentence-transformers scikit-learn

These are optional dependencies. Without them, the feature will be disabled with a warning.
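
If you want to check up front whether the optional packages are importable, a small standard-library sketch like the following may help. The function names and the warning text are hypothetical; only the import names (`sentence_transformers` for sentence-transformers, `sklearn` for scikit-learn) are the packages' real module names.

```python
import importlib.util
import warnings

# Import names of the two optional packages.
REQUIRED_MODULES = ("sentence_transformers", "sklearn")


def missing_dependencies(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def diversity_ordering_available():
    """Warn and return False when any optional dependency is missing."""
    missing = missing_dependencies()
    if missing:
        warnings.warn(
            "Diversity ordering disabled; missing packages: " + ", ".join(missing)
        )
        return False
    return True


if __name__ == "__main__":
    print("available:", diversity_ordering_available())
```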

Performance

  • Startup: ~10 seconds for 100 items, ~30 seconds for 500 items (first run; cached after)
  • Memory: ~1.5 KB per item with all-MiniLM-L6-v2 (384-dimensional embeddings), or ~15 MB for 10,000 items
  • Cache: Embeddings persisted to disk in .diversity_cache/
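
The memory figures above can be reproduced with a quick back-of-the-envelope calculation. all-MiniLM-L6-v2 produces 384-dimensional vectors; storing them as 4-byte float32 values is an assumption here, and the helper name is hypothetical.

```python
EMBEDDING_DIM = 384      # output dimension of all-MiniLM-L6-v2
BYTES_PER_FLOAT32 = 4    # assumption: embeddings are stored as float32


def embedding_memory_bytes(n_items, dim=EMBEDDING_DIM,
                           bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of n_items embeddings, ignoring any container overhead."""
    return n_items * dim * bytes_per_value


print(f"{embedding_memory_bytes(1) / 1024:.1f} KB per item")         # 1.5 KB per item
print(f"{embedding_memory_bytes(10_000) / 1e6:.1f} MB for 10,000")   # 15.4 MB for 10,000
```

A model with a larger embedding dimension scales these numbers proportionally.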

Interaction with Other Features

  • AI Support: When trigger_ai_prefetch: true, AI hints are automatically prefetched for reordered items
  • Active Learning: Can be combined by starting with diversity clustering for initial coverage, then switching to active learning
  • Order Preservation: When preserve_visited: true, previously seen items maintain their position
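
One way to picture preserve_visited: true is as a reorder that only fills the slots of unvisited items. The sketch below is an illustration under that assumption; the function name and its arguments are hypothetical and do not describe the feature's internal API.

```python
def reorder_preserving_visited(current_order, visited, new_unvisited_order):
    """Keep visited items in their slots; fill the other slots in the new order."""
    unvisited = [item for item in current_order if item not in visited]
    if sorted(unvisited) != sorted(new_unvisited_order):
        raise ValueError("new_unvisited_order must contain exactly the unvisited items")

    remaining = iter(new_unvisited_order)
    return [item if item in visited else next(remaining) for item in current_order]


current = ["a", "b", "c", "d", "e"]
visited = {"a", "c"}
diversified = ["e", "b", "d"]  # hypothetical diversity-ordered unvisited items
print(reorder_preserving_visited(current, visited, diversified))
# ['a', 'e', 'c', 'b', 'd']
```

Items "a" and "c" stay at positions 0 and 2, so an annotator navigating back finds them where they were.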

Full Example

yaml
annotation_task_name: "Diversity Ordering Test"
 
assignment_strategy: diversity_clustering
 
diversity_ordering:
  enabled: true
  model_name: "all-MiniLM-L6-v2"
  num_clusters: 5
  auto_clusters: false
  prefill_count: 100
  batch_size: 16
  recluster_threshold: 1.0
  preserve_visited: true
 
annotation_schemes:
  - annotation_type: radio
    name: topic
    description: "What is the main topic of this text?"
    labels:
      - name: Sports
      - name: Technology
      - name: Food
      - name: Travel
      - name: Health

Further Reading

For implementation details, see the source documentation.