Active Learning: Annotate Smarter, Not Harder

Active learning intelligently selects which items to annotate next, focusing human effort where it matters most. This guide shows how to cut annotation effort by up to 50% while maintaining model quality.

What Is Active Learning?

Instead of randomly sampling data to annotate, active learning:

Trains a model on the current annotations
Identifies the items where the model is uncertain
Prioritizes those items for human annotation
Repeats, continuously improving efficiency
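
The four steps above can be sketched as a small pool-based loop. This is only an illustration with scikit-learn on synthetic data, not Potato's internal implementation: the function name `active_learning_loop`, its parameters, and the use of ground-truth labels as a stand-in for a human annotator are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def active_learning_loop(X, oracle_labels, n_initial=20, batch_size=10,
                         n_rounds=5, seed=0):
    """Run a pool-based active learning loop and return the labeled indices."""
    rng = np.random.default_rng(seed)
    # Start from a small random sample so the first model has something to fit.
    labeled = [int(i) for i in rng.choice(len(X), size=n_initial, replace=False)]
    for _ in range(n_rounds):
        # 1. Train a model on the current annotations.
        model = LogisticRegression(max_iter=1000)
        model.fit(X[labeled], oracle_labels[labeled])
        # 2. Identify the unlabeled items where the model is uncertain.
        pool = np.setdiff1d(np.arange(len(X)), labeled)
        confidence = model.predict_proba(X[pool]).max(axis=1)
        # 3. Prioritize the least confident items for annotation.
        query = pool[np.argsort(confidence)[:batch_size]]
        # 4. "Annotate" them (oracle_labels stands in for a human) and repeat.
        labeled.extend(int(i) for i in query)
    return labeled


# Hypothetical synthetic data, used only to make the sketch runnable.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
labeled = active_learning_loop(X, y)
print(len(labeled))
```

In a real project the labels for the queried items would come from annotators rather than from an array that already holds the answers.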

Why Use Active Learning?

Reduce annotation cost: label fewer items for the same model quality
Faster iteration: get usable models sooner
Focus expertise: direct human attention to the hard cases
Better coverage: ensure edge cases are represented

Basic Active Learning Setup

yaml

annotation_task_name: "Active Learning Classification"
 
data_files:
  - "data/unlabeled_pool.json"
 
# Active learning configuration
active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
 
  # Sampling settings
  max_instances_to_reorder: 1000  # Number of instances to reorder by uncertainty
  random_sample_percent: 0.1  # 10% random sampling to maintain diversity
 
annotation_schemes:
  - annotation_type: radio
    name: category
    labels: [Positive, Negative, Neutral]

How Uncertainty Sampling Works

Potato's active learning uses uncertainty sampling to prioritize the items where the classifier is least confident. The classifier predicts labels for unlabeled instances, and the items with the lowest confidence scores are presented for annotation first.
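
As a rough illustration of "lowest confidence first", the sketch below ranks instances by the probability of their predicted label (the least-confidence criterion). The exact score Potato computes is not specified here, so treat the helper `least_confidence_order` and the sample probabilities as assumptions.

```python
import numpy as np


def least_confidence_order(probabilities):
    """Return instance indices sorted from least to most confident.

    `probabilities` is an (n_instances, n_classes) array, such as the output
    of a scikit-learn classifier's predict_proba.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    # Confidence is the probability assigned to the predicted (top) label.
    confidence = probabilities.max(axis=1)
    return np.argsort(confidence, kind="stable")


# Hypothetical class probabilities for three unlabeled instances.
probs = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # very uncertain
    [0.60, 0.30, 0.10],  # somewhat uncertain
]
order = least_confidence_order(probs)
print(order.tolist())  # [1, 2, 0]
```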

The classifier_name field specifies any scikit-learn compatible classifier by its full module path:

yaml

active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"

Other classifier options include:

sklearn.ensemble.RandomForestClassifier
sklearn.svm.SVC (with probability=True)
sklearn.naive_bayes.MultinomialNB
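
To show what a "full module path" means in practice, here is one way such a dotted string could be turned into a classifier instance in Python. The helper `load_classifier` is hypothetical and is not claimed to be how Potato resolves classifier_name; it only uses the standard library's `importlib`.

```python
import importlib


def load_classifier(dotted_path, **kwargs):
    """Instantiate a class from a dotted path such as 'package.module.ClassName'."""
    # Split "sklearn.linear_model.LogisticRegression" into module and class name.
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    classifier_class = getattr(module, class_name)
    return classifier_class(**kwargs)


clf = load_classifier("sklearn.linear_model.LogisticRegression", max_iter=1000)
print(type(clf).__name__)  # LogisticRegression
```

Whichever classifier is chosen, uncertainty sampling needs confidence scores, which is why SVC is listed with probability=True.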

Complete Configuration

yaml

annotation_task_name: "Active Learning for Sentiment"
 
data_files:
  - "data/reviews.json"
 
active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
 
  # Sampling settings
  max_instances_to_reorder: 2000  # Reorder top N by uncertainty
  random_sample_percent: 0.1  # 10% random to maintain diversity
 
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "Classify the sentiment"
    labels:
      - name: Positive
        key_value: "1"
      - name: Negative
        key_value: "2"
      - name: Neutral
        key_value: "3"
    required: true
 
annotation_guidelines:
  text: |
    ## Sentiment Classification
 
    Items are prioritized by model uncertainty.
    You may see more difficult or ambiguous cases.
 
    Focus on accuracy over speed.

Monitoring Progress

Track annotation progress through Potato's built-in logging. The system logs which instances were selected along with their uncertainty scores, so you can monitor the active learning process.

Best Practices

Cold Start

Start with diverse random sampling by setting a higher random_sample_percent:

yaml

active_learning:
  enabled: true
  classifier_name: "sklearn.linear_model.LogisticRegression"
  random_sample_percent: 0.2  # 20% random for initial diversity

Controlling the Reordering Scope

Use max_instances_to_reorder to control how many instances are ranked by uncertainty. A larger value gives better selection but requires more computation:

yaml

active_learning:
  max_instances_to_reorder: 5000  # Rank top 5000 by uncertainty
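
The trade-off can be pictured with a small sketch in which only the first N entries of the annotation queue are sorted by confidence and the rest keep their original order. That reading of max_instances_to_reorder is an assumption, and `reorder_head` and the sample scores are hypothetical.

```python
def reorder_head(queue, confidence, max_instances_to_reorder):
    """Sort only the first N queue entries by ascending confidence."""
    head = queue[:max_instances_to_reorder]
    tail = queue[max_instances_to_reorder:]
    # Least confident items move to the front; the tail is left untouched.
    head_sorted = sorted(head, key=lambda item: confidence[item])
    return head_sorted + tail


# Hypothetical instance ids and classifier confidence scores.
queue = ["a", "b", "c", "d", "e"]
confidence = {"a": 0.9, "b": 0.5, "c": 0.7, "d": 0.4, "e": 0.3}
print(reorder_head(queue, confidence, 3))  # ['b', 'c', 'a', 'd', 'e']
```

With a scope of 3, the low-confidence items "d" and "e" are never considered, which is the selection quality a larger value buys back at the cost of scoring more instances.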

Maintaining Diversity

The random_sample_percent parameter ensures that some randomly sampled instances are included, which prevents the model from seeing only uncertain edge cases:

yaml

active_learning:
  random_sample_percent: 0.1  # 10% random sampling
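
One way to picture the effect: at each step, with probability random_sample_percent, a random remaining item is served instead of the most uncertain one. This is an illustrative sketch only; how Potato actually interleaves random instances is not described here, and `mix_random` is a hypothetical helper.

```python
import random


def mix_random(ranked, random_sample_percent, seed=0):
    """Build an annotation order where roughly a given share of picks is random."""
    rng = random.Random(seed)
    remaining = list(ranked)  # assumed to be sorted most uncertain first
    order = []
    while remaining:
        if rng.random() < random_sample_percent:
            # Random pick: keeps some coverage of the whole pool.
            pick = remaining.pop(rng.randrange(len(remaining)))
        else:
            # Default: the most uncertain item still waiting.
            pick = remaining.pop(0)
        order.append(pick)
    return order


order = mix_random(list(range(100)), random_sample_percent=0.1)
print(order[:10])
```

Setting the share to 0.0 reproduces pure uncertainty ordering, while 1.0 gives a fully random order.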

Tips for Success

Start diverse: a random initial sample helps cover edge cases
Monitor accuracy: track model performance over time
Don't over-optimize: some random sampling maintains coverage
Handle annotator fatigue: hard items are tiring
Save model checkpoints: enable rollback when needed

Next Steps

Add AI suggestions to speed up uncertain items
Set up quality control for hard cases
Learn about crowdsourcing with active learning

Full active learning documentation is available at /docs/features/active-learning.