Diversity Ordering
Embedding-based item diversification to maximize annotation variety.
Diversity ordering uses sentence-transformer embeddings to cluster similar items together, then samples items round-robin from different clusters. This ensures annotators see diverse content rather than similar items in sequence.
Benefits
- Reduce annotator fatigue from repetitive content
- Improve annotation quality through varied context
- Cover the full topic space faster
Quick Start
yaml
assignment_strategy: diversity_clustering
diversity_ordering:
  enabled: true
  prefill_count: 100
How It Works
- Startup: The first `prefill_count` items are embedded using sentence-transformers and clustered with k-means
- Assignment: Items are sampled round-robin from clusters, ensuring variety
- Annotation: New items are embedded asynchronously as they're annotated
- Re-clustering: Once a user has sampled from the fraction of clusters set by `recluster_threshold` (all clusters at the default of 1.0), the system re-clusters
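The assignment step can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function name `round_robin_order` and the input format (one cluster label per item, as k-means over the embeddings would produce) are assumptions made for the example.

```python
from collections import defaultdict
from itertools import zip_longest


def round_robin_order(cluster_labels):
    """Return item indices interleaved across clusters.

    cluster_labels[i] is the (hypothetical) cluster id assigned to item i,
    for example the output of k-means over sentence embeddings.
    """
    # Group item indices by cluster, keeping the original order within each cluster.
    clusters = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        clusters[label].append(index)

    # Take one item from each cluster per round until every cluster is exhausted.
    missing = object()
    ordered = []
    groups = (clusters[key] for key in sorted(clusters))
    for round_items in zip_longest(*groups, fillvalue=missing):
        ordered.extend(i for i in round_items if i is not missing)
    return ordered


labels = [0, 0, 0, 1, 1, 2]
order = round_robin_order(labels)
print(order)
```

With these labels, the first three items served come from three different clusters; later rounds draw only from the clusters that still have items left.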
Configuration
yaml
diversity_ordering:
  enabled: true
  # Sentence-transformer model
  model_name: "all-MiniLM-L6-v2"
  # Clustering parameters
  num_clusters: 10
  items_per_cluster: 20
  auto_clusters: true # Auto-calculate based on data size
  # Prefill on startup
  prefill_count: 100
  batch_size: 32
  # Re-clustering behavior
  recluster_threshold: 1.0 # Recluster when all clusters sampled
  # Order preservation
  preserve_visited: true
  # AI integration
  trigger_ai_prefetch: true
Configuration Reference
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Enable diversity ordering |
| model_name | string | "all-MiniLM-L6-v2" | Sentence-transformers model |
| num_clusters | integer | 10 | Number of clusters (when auto_clusters=false) |
| items_per_cluster | integer | 20 | Target cluster size (when auto_clusters=true) |
| auto_clusters | boolean | true | Automatically calculate cluster count |
| prefill_count | integer | 100 | Items to embed at startup |
| batch_size | integer | 32 | Batch size for embedding computation |
| recluster_threshold | float | 1.0 | Fraction of clusters to sample before reclustering |
| preserve_visited | boolean | true | Keep visited/skipped items in place |
| trigger_ai_prefetch | boolean | true | Trigger AI cache after reordering |
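To make the interaction between `num_clusters`, `items_per_cluster`, `auto_clusters`, and `recluster_threshold` concrete, here is a rough sketch. The helper names are hypothetical, and the auto-calculation formula (item count divided by the target cluster size, rounded up) is an assumption; the table only says the count is derived automatically, so the real formula may differ.

```python
import math


def effective_num_clusters(n_items, num_clusters=10, items_per_cluster=20, auto_clusters=True):
    """Pick a cluster count from the options in the table above.

    Assumption: with auto_clusters enabled, the count is the item count
    divided by the target cluster size, rounded up.
    """
    if n_items <= 0:
        return 0
    if auto_clusters:
        k = math.ceil(n_items / items_per_cluster)
    else:
        k = num_clusters
    # k-means cannot produce more clusters than there are items.
    return max(1, min(k, n_items))


def should_recluster(sampled_clusters, total_clusters, recluster_threshold=1.0):
    """True once the fraction of distinct clusters sampled reaches the threshold."""
    if total_clusters == 0:
        return False
    return len(set(sampled_clusters)) / total_clusters >= recluster_threshold


print(effective_num_clusters(100))        # auto mode: 100 items, target size 20
print(should_recluster([0, 1, 2], 5))     # only 3 of 5 clusters sampled so far
```

Under these assumptions, lowering `recluster_threshold` below 1.0 would trigger re-clustering before a user has seen every cluster.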
Requirements
bash
pip install sentence-transformers scikit-learn
These are optional dependencies. Without them, the feature is disabled with a warning.
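A fallback of this kind could look roughly like the following. It is a sketch using only the standard library; the function names and the warning text are illustrative assumptions, not the tool's actual code.

```python
import importlib.util
import warnings

# Import names of the two optional packages (scikit-learn is imported as "sklearn").
OPTIONAL_MODULES = ("sentence_transformers", "sklearn")


def missing_modules(modules=OPTIONAL_MODULES):
    """Return the names in `modules` that cannot be found on this system."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def diversity_available(modules=OPTIONAL_MODULES):
    """Return True if all optional modules are importable; warn and return False otherwise."""
    missing = missing_modules(modules)
    if missing:
        warnings.warn(
            "Diversity ordering disabled; missing optional dependencies: "
            + ", ".join(missing)
        )
        return False
    return True


if __name__ == "__main__":
    print("diversity ordering available:", diversity_available())
```

Checking with `find_spec` avoids actually importing the heavy packages just to decide whether the feature can be enabled.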
Performance
- Startup: ~10 seconds for 100 items, ~30 seconds for 500 items (first run; cached after)
- Memory: ~1.5 KB per item (all-MiniLM-L6-v2), ~15 MB for 10,000 items
- Cache: Embeddings persisted to disk in `.diversity_cache/`
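The memory figures follow from the embedding size. all-MiniLM-L6-v2 produces 384-dimensional vectors; assuming they are stored as 32-bit floats (an assumption about storage that this page does not state), the arithmetic works out as follows:

```python
EMBEDDING_DIM = 384      # output dimension of all-MiniLM-L6-v2
BYTES_PER_FLOAT32 = 4    # assumption: embeddings are stored as float32

bytes_per_item = EMBEDDING_DIM * BYTES_PER_FLOAT32   # 1536 bytes
kb_per_item = bytes_per_item / 1024                  # 1.5 KB
mb_for_10k = bytes_per_item * 10_000 / 1_000_000     # 15.36 MB

print(f"{kb_per_item} KB per item, {mb_for_10k} MB for 10,000 items")
```

A model with a larger embedding dimension would scale these numbers proportionally.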
Interaction with Other Features
- AI Support: When `trigger_ai_prefetch: true`, AI hints are automatically prefetched for reordered items
- Active Learning: Can be combined by starting with diversity clustering for initial coverage, then switching to active learning
- Order Preservation: When `preserve_visited: true`, previously seen items maintain their position
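One way to picture order preservation: visited items act as fixed slots, and only the remaining slots are refilled with the newly diversified order. The sketch below illustrates that idea under assumed names (`reorder_preserving_visited` and its arguments are hypothetical, not part of the tool's API).

```python
def reorder_preserving_visited(current_order, visited, diversified):
    """Reorder unvisited items while visited ones keep their positions.

    current_order: item ids in their present order
    visited:       set of ids already seen or skipped
    diversified:   the unvisited ids in their new (diversity-based) order
    """
    unvisited = [item for item in current_order if item not in visited]
    if sorted(unvisited) != sorted(diversified):
        raise ValueError("diversified must contain exactly the unvisited items")

    replacement = iter(diversified)
    # Visited items stay put; each unvisited slot takes the next diversified item.
    return [item if item in visited else next(replacement) for item in current_order]


order = ["a", "b", "c", "d", "e"]
new_order = reorder_preserving_visited(order, visited={"a", "c"}, diversified=["e", "b", "d"])
print(new_order)
```

Here "a" and "c" keep their first and third positions, while the other three slots are filled in the diversified order.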
Full Example
yaml
annotation_task_name: "Diversity Ordering Test"
assignment_strategy: diversity_clustering
diversity_ordering:
  enabled: true
  model_name: "all-MiniLM-L6-v2"
  num_clusters: 5
  auto_clusters: false
  prefill_count: 100
  batch_size: 16
  recluster_threshold: 1.0
  preserve_visited: true
annotation_schemes:
  - annotation_type: radio
    name: topic
    description: "What is the main topic of this text?"
    labels:
      - name: Sports
      - name: Technology
      - name: Food
      - name: Travel
      - name: Health
Further Reading
- AI Support - AI label suggestions
- Active Learning - ML-based instance prioritization
- Option Highlighting - AI-assisted option guidance
For implementation details, see the source documentation.