Solo Mode: How One Annotator Can Label 10,000 Examples

You have 10,000 product reviews to label for sentiment (Positive, Neutral, Negative). Hiring three annotators to label everything would take weeks and cost thousands of dollars. With Solo Mode, a single domain expert can reach comparable quality by labeling only 500-1,000 instances while an LLM handles the rest, with human review of every decision the LLM is unsure about.

This tutorial walks through the whole process end to end.

What You Will Need

Potato 2.3.0+ with the Solo Mode extras: pip install "potato-annotation[solo]"
An OpenAI or Anthropic API key (for the LLM component)
Your dataset in JSONL format
A knowledgeable annotator (this can be you)

Step 1: Prepare Your Data

Create data/reviews.jsonl with one review per line:

json

{"id": "rev_001", "text": "Absolutely love this product! Best purchase I've made all year.", "source": "amazon"}
{"id": "rev_002", "text": "It works fine. Nothing special but gets the job done.", "source": "amazon"}
{"id": "rev_003", "text": "Broke after two weeks. Complete waste of money.", "source": "amazon"}
{"id": "rev_004", "text": "The quality is decent for the price point. I might buy again.", "source": "amazon"}
{"id": "rev_005", "text": "Arrived damaged and customer service was unhelpful.", "source": "amazon"}

For this tutorial, assume the file contains 10,000 reviews.
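
Before starting the server, it can help to confirm that the file is well formed. The following is a minimal sketch, not part of Potato: `validate_reviews` is a hypothetical helper that checks each line parses as JSON, carries a unique `id`, and has a non-empty `text` (the two keys named under `item_properties` in the configuration).

```python
import json

def validate_reviews(lines):
    """Check that every line is a JSON object with a unique 'id' and non-empty 'text'.

    Returns a list of (line_number, problem) tuples; an empty list means no problems found.
    """
    problems = []
    seen_ids = set()
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_no, "not a JSON object"))
            continue
        review_id = record.get("id")
        if review_id is None:
            problems.append((line_no, "missing 'id'"))
        elif review_id in seen_ids:
            problems.append((line_no, f"duplicate id {review_id!r}"))
        else:
            seen_ids.add(review_id)
        text = record.get("text")
        if not isinstance(text, str) or not text.strip():
            problems.append((line_no, "missing or empty 'text'"))
    return problems

# Small in-memory sample with one duplicate id and one empty text.
sample = [
    '{"id": "rev_001", "text": "Absolutely love this product!", "source": "amazon"}',
    '{"id": "rev_001", "text": "It works fine.", "source": "amazon"}',
    '{"id": "rev_003", "text": "", "source": "amazon"}',
]
problems = validate_reviews(sample)
for line_no, problem in problems:
    print(f"line {line_no}: {problem}")
```

To check the real file, pass an open file handle instead of the sample list, for example `validate_reviews(open("data/reviews.jsonl", encoding="utf-8"))`.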

Step 2: Create the Configuration

Create config.yaml:

yaml

task_name: "Product Review Sentiment (Solo Mode)"
task_dir: "."
 
data_files:
  - "data/reviews.jsonl"
 
item_properties:
  id_key: id
  text_key: text
 
# --- Solo Mode Configuration ---
solo_mode:
  enabled: true
 
  llm:
    endpoint_type: openai
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
    temperature: 0.1
    max_tokens: 64
 
  # Quality targets
  seed_count: 50
  accuracy_threshold: 0.93
  confidence_threshold: 0.85
 
  # Phase-specific settings
  phases:
    seed:
      count: 50
      selection: diversity
      embedding_model: "all-MiniLM-L6-v2"
 
    calibration:
      batch_size: 200
      holdout_fraction: 0.2
 
    labeling_functions:
      enabled: true
      max_functions: 15
      min_precision: 0.92
      min_coverage: 0.01
 
    active_labeling:
      batch_size: 25
      strategy: hybrid
      max_batches: 15
 
    refinement_loop:
      max_iterations: 3
      improvement_threshold: 0.02
 
    disagreement_exploration:
      max_instances: 150
      show_llm_reasoning: true
      show_nearest_neighbors: 3
 
    edge_case_synthesis:
      enabled: true
      count: 30
 
    confidence_escalation:
      escalation_budget: 150
      batch_size: 25
      stop_when_stable: true
 
    prompt_optimization:
      enabled: true
      candidates: 8
      metric: f1_macro
 
    final_validation:
      sample_size: 100
      min_accuracy: 0.93
 
  # Instance prioritization
  prioritization:
    pools:
      - name: uncertain
        weight: 0.30
      - name: disagreement
        weight: 0.25
      - name: boundary
        weight: 0.20
      - name: novel
        weight: 0.10
      - name: error_pattern
        weight: 0.10
      - name: random
        weight: 0.05
 
# --- Annotation Schema ---
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "What is the overall sentiment of this review?"
    labels:
      - "Positive"
      - "Neutral"
      - "Negative"
    label_requirement:
      required: true
    sequential_key_binding: true
 
output_annotation_dir: "output/"
output_annotation_format: "jsonl"
 
parquet_export:
  enabled: true
  output_dir: "output/parquet/"

Step 3: Start the Server

bash

potato start config.yaml -p 8000

Open http://localhost:8000 and log in. The Solo Mode dashboard appears, showing that you are in Phase 1: Seed Annotation.

Step 4: Phase 1 -- Seed Annotation (50 Instances)

Potato has selected 50 diverse reviews using embedding-based clustering. They are not random; they are chosen to maximize coverage of your data distribution.
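
To build intuition for diversity-based selection, here is a minimal sketch using greedy farthest-point sampling. This is a stand-in technique chosen for illustration, not Potato's clustering implementation. The 2-D tuples are toy substitutes for sentence embeddings, and `farthest_point_sample` is a hypothetical name.

```python
import math

def farthest_point_sample(vectors, k):
    """Greedily pick k indices so that each new pick is as far as possible
    from everything already picked (k-center greedy)."""
    if not vectors or k <= 0:
        return []
    chosen = [0]  # start from the first vector; any starting point works
    # min_dist[i] = distance from vector i to its nearest chosen vector
    min_dist = [math.dist(v, vectors[0]) for v in vectors]
    while len(chosen) < min(k, len(vectors)):
        next_idx = max(range(len(vectors)), key=lambda i: min_dist[i])
        chosen.append(next_idx)
        for i, v in enumerate(vectors):
            min_dist[i] = min(min_dist[i], math.dist(v, vectors[next_idx]))
    return chosen

toy_embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # tight cluster A
    (5.0, 5.0), (5.1, 5.0),               # tight cluster B
    (0.0, 9.0),                           # isolated point
]
picked = farthest_point_sample(toy_embeddings, 3)
print(picked)  # one index from each region of the toy space
```

With three picks, the sketch takes one point from each region rather than three near-duplicates from cluster A, which is the property a diverse seed set is after.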

Label each one. This is the most important phase: the quality of your seed labels determines how well the LLM learns. Take your time and be consistent.

Time estimate: roughly 17-25 minutes (50 instances at 20-30 seconds each).

When you finish the 50th instance, Potato automatically advances to Phase 2.

Step 5: Phase 2 -- Initial LLM Calibration

This phase runs automatically. Potato sends the LLM a batch of 200 instances, with your seed labels as few-shot examples. It then compares the LLM's predictions against the 10 held-out seed labels (20% of the 50, per holdout_fraction: 0.2) to estimate a baseline accuracy.

You will see a progress indicator in the dashboard. This typically takes 1-2 minutes, depending on the LLM provider.

Typical result: the LLM reaches 75-85% accuracy on the first calibration. This is expected; the LLM has not yet learned your specific annotation style.

Step 6: Phase 3 -- Confusion Analysis

Potato displays a confusion matrix showing where the LLM disagrees with your labels. A typical output:

text

Confusion Analysis (Round 1)
============================
Overall Accuracy: 0.82 (target: 0.93)

Top Confusion Pairs:
  Neutral -> Positive:  14 instances (7.0%)
  Negative -> Neutral:   9 instances (4.5%)
  Positive -> Neutral:   4 instances (2.0%)

This reveals the LLM's main weakness: it tends to upgrade neutral reviews to positive. This is common; LLMs are often biased toward positive sentiment.

Your action: review the confusion pairs. Click each pair to see the specific instances the LLM got wrong. This helps you understand the LLM's failure modes.
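
The confusion pairs in a report like this can be reproduced from two parallel label lists. The sketch below is illustrative only and makes no claim about Potato's internals. `confusion_pairs` is a hypothetical helper, and each pair reads as human label -> LLM label.

```python
from collections import Counter

def confusion_pairs(human_labels, llm_labels):
    """Count (human -> llm) disagreements and return overall accuracy plus ranked pairs."""
    if len(human_labels) != len(llm_labels):
        raise ValueError("label lists must have the same length")
    pairs = Counter(
        (h, m) for h, m in zip(human_labels, llm_labels) if h != m
    )
    total = len(human_labels)
    accuracy = (total - sum(pairs.values())) / total if total else 0.0
    return accuracy, pairs.most_common()

# Made-up labels for illustration.
human = ["Neutral", "Neutral", "Negative", "Positive", "Positive", "Neutral"]
llm   = ["Positive", "Positive", "Neutral", "Positive", "Positive", "Neutral"]
accuracy, ranked = confusion_pairs(human, llm)
print(f"accuracy={accuracy:.2f}")
for (h, m), n in ranked:
    print(f"  {h} -> {m}: {n}")
```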

Step 7: Phase 4 -- Guideline Refinement

Based on the confusion analysis, Potato generates refined guidelines for the LLM. You see a side-by-side view:

Current guidelines: the initial prompt used for the LLM
Suggested edits: specific changes proposed by the LLM, based on the error patterns

For example, Potato might suggest adding:

"Reviews that describe a product as 'fine', 'okay', or 'decent' without strong emotion should be labeled Neutral, even if they mention buying again."

Review each suggested edit and approve, modify, or reject it. You can also add your own clarifications.

Time estimate: 5-10 minutes.

Step 8: Phase 5 -- Labeling Function Generation

Potato generates programmatic labeling functions from patterns in your seed labels. These are fast, deterministic rules that handle the easy cases:

text

Generated Labeling Functions:
  LF1: Strong positive words (love, amazing, best, excellent)
       Precision: 0.97, Coverage: 0.06
  LF2: Strong negative words (terrible, awful, worst, waste)
       Precision: 0.95, Coverage: 0.04
  LF3: Exclamation + positive adjective
       Precision: 0.94, Coverage: 0.03
  LF4: Return/refund mention + negative context
       Precision: 0.92, Coverage: 0.02
  ...
  Total coverage: 0.18 (1,800 of 10,000 instances)

The labeling functions cover 18% of your dataset at 92%+ precision. These instances are labeled automatically, freeing the LLM and human effort for the harder cases.

Your action: review the generated functions and disable any that look unreliable. This is optional; Potato only keeps functions above your configured precision threshold.
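
To make precision and coverage concrete, here is a minimal sketch of one keyword rule in the spirit of LF1, scored against a few labeled examples. The function names and the regular expression are assumptions for illustration; Potato's generated functions may look quite different.

```python
import re

ABSTAIN = None

def lf_strong_positive(text):
    """Hypothetical keyword rule: vote Positive on strongly positive words, else abstain."""
    if re.search(r"\b(love|amazing|best|excellent)\b", text, flags=re.IGNORECASE):
        return "Positive"
    return ABSTAIN

def precision_and_coverage(lf, labeled_examples):
    """Precision = correct votes / votes cast; coverage = votes cast / all examples."""
    votes = [(lf(text), gold) for text, gold in labeled_examples]
    fired = [(vote, gold) for vote, gold in votes if vote is not ABSTAIN]
    coverage = len(fired) / len(votes) if votes else 0.0
    precision = sum(v == g for v, g in fired) / len(fired) if fired else 0.0
    return precision, coverage

seed = [
    ("Absolutely love this product! Best purchase I've made all year.", "Positive"),
    ("It works fine. Nothing special but gets the job done.", "Neutral"),
    ("Broke after two weeks. Complete waste of money.", "Negative"),
    ("Not the best, but it does the job.", "Neutral"),
]
precision, coverage = precision_and_coverage(lf_strong_positive, seed)
print(f"precision={precision:.2f} coverage={coverage:.2f}")
```

The fourth example shows why a manual review pass is worthwhile: a negated keyword ("Not the best") makes the rule fire incorrectly, which drags precision down.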

Step 9: Phase 6 -- Active Labeling (125-375 Instances)

This is the main human labeling phase. Potato selects instances using a six-pool prioritization system:

Uncertain (30%): reviews where the LLM's confidence is below 85%
Disagreement (25%): reviews where the LLM and the labeling functions assign different labels
Boundary (20%): reviews near the decision boundary in embedding space
Novel (10%): reviews unlike anything you have labeled so far
Error pattern (10%): reviews matching known confusion patterns (e.g., lukewarm-positive)
Random (5%): random reviews for calibration
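
As a worked example of what these weights mean for a single batch, the sketch below splits 25 slots across the six pools with the largest-remainder method. The rounding rule is an assumption, since the tutorial does not say how Potato resolves fractional slots. `allocate_batch` is a hypothetical helper.

```python
# Pool weights from config.yaml, expressed as whole percentages to avoid float rounding.
POOL_WEIGHTS = {
    "uncertain": 30,
    "disagreement": 25,
    "boundary": 20,
    "novel": 10,
    "error_pattern": 10,
    "random": 5,
}

def allocate_batch(batch_size, weights):
    """Split batch_size across pools proportionally (largest-remainder method)."""
    total = sum(weights.values())
    quotas = {name: batch_size * w // total for name, w in weights.items()}
    remainders = {name: batch_size * w % total for name, w in weights.items()}
    leftover = batch_size - sum(quotas.values())
    # Hand leftover slots to the pools with the largest remainders;
    # sorted() is stable, so ties keep the config order.
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[name] += 1
    return quotas

allocation = allocate_batch(25, POOL_WEIGHTS)
print(allocation)
```

Under this rule a 25-item batch contains 8 uncertain, 6 disagreement, 5 boundary, 3 novel, 2 error-pattern, and 1 random instance.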

You label these in batches of 25. After each batch, Potato updates its estimate of the LLM's accuracy and decides whether to continue.

Typical trajectory:

Batches 1-3 (75 instances cumulative): accuracy rises from 82% to 87%
Batches 4-6 (150 instances cumulative): accuracy reaches 90%
Batches 7-10 (250 instances cumulative): accuracy plateaus at 91-92%

If accuracy reaches 93% (your threshold), Solo Mode jumps to Phase 10. Otherwise, it continues to Phase 7.

Time estimate: 45-90 minutes in total, depending on how many batches are needed.

Step 10: Phase 7 -- Automated Refinement Loop

If accuracy is still below the threshold after active labeling, Potato runs another round of the refinement loop:

The LLM re-labels the full dataset with the updated guidelines and more few-shot examples
Accuracy is recomputed against all human labels
New confusion patterns are identified
The guidelines are refined again

This phase is mostly automatic. You only need to approve guideline changes.

Typical result: accuracy improves by 2-4 percentage points per refinement round.
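
The control flow of such a loop can be sketched as follows. This is one plausible reading of the `max_iterations` and `improvement_threshold` settings, not a description of Potato's code. `evaluate_round` is a hypothetical callback standing in for the re-label and re-score steps.

```python
def run_refinement(evaluate_round, start_accuracy, target=0.93,
                   max_iterations=3, improvement_threshold=0.02):
    """Repeat refinement rounds until the target is met, gains stall, or rounds run out.

    evaluate_round(i) is a hypothetical callback returning the accuracy after round i.
    """
    accuracy = start_accuracy
    history = [accuracy]
    for i in range(1, max_iterations + 1):
        new_accuracy = evaluate_round(i)
        history.append(new_accuracy)
        gain = new_accuracy - accuracy
        accuracy = new_accuracy
        if accuracy >= target:
            return "target_reached", history
        if gain < improvement_threshold:
            return "stalled", history
    return "max_iterations", history

# Simulated accuracies per round, for illustration only.
simulated = {1: 0.90, 2: 0.93}
status, history = run_refinement(lambda i: simulated[i], start_accuracy=0.87)
print(status, history)
```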

Step 11: Phase 8 -- Disagreement Exploration

Potato presents the most contested instances: cases where the LLM, the labeling functions, and nearest-neighbor analysis all give different answers. For each instance, you see:

The review text
The LLM prediction and confidence
Labeling function votes
The 3 nearest labeled examples, with their labels
The LLM's chain-of-thought reasoning

These are the genuinely hard cases. Your labels here have the highest marginal value of any annotation in the whole process.

Time estimate: 20-30 minutes for 100-150 instances.

Step 12: Phase 9 -- Edge Case Synthesis

Potato generates synthetic reviews targeting the remaining confusion patterns. For example, if the LLM still struggles with "neutral reviews that mention buying again", it generates examples such as:

"It's an okay product for the price. I might get another one if there's a sale."

You label these synthetic examples, and they are added to the LLM's few-shot context.

Time estimate: 10-15 minutes for 30 examples.

Step 13: Phase 10 -- Cascaded Confidence Escalation

The LLM has now labeled most of the dataset. Potato ranks all LLM-labeled instances by confidence and sends you the lowest-confidence ones in batches of 25.

text

Confidence Escalation Progress:
  Batch 1: 25 instances, 23/25 correct (92%)
  Batch 2: 25 instances, 24/25 correct (96%)
  Batch 3: 25 instances, 25/25 correct (100%)
  -> Stopping: last 3 batches stable

Once three consecutive batches come back stable, with the LLM correct on nearly every instance (92%, 96%, and 100% in the output above), Solo Mode concludes that the remaining high-confidence labels are reliable.
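
The ranking and stopping logic can be sketched as below. The batch size and budget mirror the `confidence_escalation` settings in config.yaml. The stability rule (`window` and `min_accuracy`) is an assumption, because the tutorial does not define exactly what `stop_when_stable` checks.

```python
def escalation_batches(predictions, batch_size=25, budget=150):
    """Yield batches of the lowest-confidence predictions, up to the review budget.

    predictions: list of (instance_id, confidence) pairs.
    """
    ranked = sorted(predictions, key=lambda p: p[1])[:budget]
    for start in range(0, len(ranked), batch_size):
        yield ranked[start:start + batch_size]

def should_stop(batch_accuracies, window=3, min_accuracy=0.90):
    """Assumed stability rule: stop once the last `window` batches all meet min_accuracy."""
    recent = batch_accuracies[-window:]
    return len(recent) == window and all(a >= min_accuracy for a in recent)

# 200 fake predictions with confidences rising from 0.5 to just under 1.0.
preds = [(f"rev_{i:03d}", 0.5 + i * 0.0025) for i in range(200)]
batches = list(escalation_batches(preds))
print(len(batches), batches[0][0][0])
print(should_stop([23 / 25, 24 / 25, 25 / 25]))
```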

Time estimate: 15-20 minutes.

Step 14: Phase 11 -- Prompt Optimization

This phase runs automatically. Potato tries 8 prompt variants and selects the one with the highest macro F1 score on your accumulated human labels:

text

Prompt Optimization Results:
  Variant 1 (direct, 5 examples):     F1=0.91
  Variant 2 (CoT, 5 examples):        F1=0.93
  Variant 3 (direct, 10 examples):    F1=0.92
  Variant 4 (CoT, 10 examples):       F1=0.94  <-- selected
  Variant 5 (direct, 15 examples):    F1=0.92
  Variant 6 (CoT, 15 examples):       F1=0.93
  Variant 7 (self-consistency, 5x):   F1=0.94
  Variant 8 (self-consistency, 10x):  F1=0.94

The best prompt is used for the final re-labeling pass.
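
For reference, macro F1 (the `f1_macro` metric in the configuration) is the unweighted mean of the per-label F1 scores. The sketch below computes it in plain Python and picks the better of two variants. The variant names and their outputs are made up for illustration.

```python
def macro_f1(gold, predicted, labels):
    """Unweighted mean of per-label F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

LABELS = ["Positive", "Neutral", "Negative"]
gold = ["Positive", "Positive", "Neutral", "Neutral", "Negative", "Negative"]

# Hypothetical outputs from two prompt variants on the same human-labeled items.
variant_outputs = {
    "direct": ["Positive", "Positive", "Positive", "Neutral", "Negative", "Neutral"],
    "cot":    ["Positive", "Positive", "Neutral", "Neutral", "Negative", "Neutral"],
}
scores = {name: macro_f1(gold, out, LABELS) for name, out in variant_outputs.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Because every label counts equally, a variant that does well on Positive but poorly on Neutral is penalized more than plain accuracy would suggest.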

Step 15: Phase 12 -- Final Validation

Potato selects 100 random LLM-labeled instances for you to review. You label them, and Potato compares your labels against the LLM's.

text

Final Validation:
  Reviewed: 100 instances
  LLM correct: 94/100 (94%)
  Threshold: 93%
  -> PASSED

If the LLM's accuracy meets your threshold, the dataset is complete. If not, Solo Mode cycles back to Phase 6 for another round of active labeling.

Time estimate: 10-15 minutes.

Results Summary

After running all 12 phases, check the final statistics:

bash

python -m potato.solo status --config config.yaml

text

Solo Mode Complete
==================
Dataset: 10,000 instances
Total human labels: 612
  Seed: 50
  Active labeling: 275
  Disagreement exploration: 137
  Edge case synthesis: 30
  Confidence escalation: 75
  Final validation: 45

LLM labels: 8,200 (accuracy: 94.1%)
LF labels: 1,800 (precision: 95.3%)
Unlabeled: 0

Final label distribution:
  Positive: 4,823 (48.2%)
  Neutral:  3,011 (30.1%)
  Negative: 2,166 (21.7%)

Total human time: ~3.5 hours
Estimated multi-annotator cost (3x): ~$4,500
Solo Mode cost: ~$450 (API fees) + ~$175 (annotator time)
Savings: ~86%

The human labeled 612 of the 10,000 instances (6.1%). The LLM and the labeling functions handled the rest at 94%+ accuracy.

Export the Results

Export the final labeled dataset:

bash

python -m potato.solo export --config config.yaml --output final_labels.jsonl

Each line includes the label and its source:

json

{"id": "rev_001", "sentiment": "Positive", "source": "human", "confidence": 1.0}
{"id": "rev_002", "sentiment": "Neutral", "source": "llm", "confidence": 0.91}
{"id": "rev_003", "sentiment": "Negative", "source": "labeling_function", "confidence": 0.97}
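
Since every row records where its label came from, a quick breakdown by source is a useful sanity check on the export. This is a minimal sketch that assumes the three-field layout shown above. `summarize_export` is a hypothetical helper.

```python
import json
from collections import Counter

def summarize_export(lines):
    """Count exported labels by source and by (source, sentiment)."""
    by_source = Counter()
    by_source_and_label = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        by_source[row["source"]] += 1
        by_source_and_label[(row["source"], row["sentiment"])] += 1
    return by_source, by_source_and_label

sample = [
    '{"id": "rev_001", "sentiment": "Positive", "source": "human", "confidence": 1.0}',
    '{"id": "rev_002", "sentiment": "Neutral", "source": "llm", "confidence": 0.91}',
    '{"id": "rev_003", "sentiment": "Negative", "source": "labeling_function", "confidence": 0.97}',
]
by_source, by_pair = summarize_export(sample)
print(dict(by_source))
```

For the real export, pass an open handle on final_labels.jsonl instead of the sample list.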

For the Parquet export:

python

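import pandas as pd

# Reading Parquet needs pandas plus an engine such as pyarrow.
df = pd.read_parquet("output/parquet/annotations.parquet")
# Count how many rows carry each label value.
print(df["value"].value_counts())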

Quality Assurance: Hybrid Verification

For publication-quality datasets, add a second annotator to review a sample:

yaml

solo_mode:
  verification:
    enabled: true
    sample_fraction: 0.10
    annotator: "reviewer_1"

This assigns 1,000 random instances (10% of the 10,000) to a second annotator. You can then compute inter-annotator agreement between the Solo Mode labels and the reviewer's labels.
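
One common agreement measure is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The tutorial does not say which measure to use, so treat this choice as an assumption. The sketch below is a plain-Python version on made-up labels.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up labels: the two raters disagree on one of eight items.
solo     = ["Positive", "Positive", "Neutral", "Neutral", "Negative", "Negative", "Positive", "Neutral"]
reviewer = ["Positive", "Positive", "Neutral", "Positive", "Negative", "Negative", "Positive", "Neutral"]
kappa = cohens_kappa(solo, reviewer)
print(f"kappa={kappa:.3f}")
```

Here raw agreement is 7/8 = 0.875, but kappa comes out lower (about 0.81) because some of that agreement would occur by chance given how often each label is used.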

Troubleshooting

LLM accuracy plateaus below the threshold

Increase the seed count: try 75-100 seed instances instead of 50
Switch LLMs: try claude-sonnet-4-20250514 instead of GPT-4o (or vice versa)
Lower the threshold: if 93% is not achievable, consider whether 90% is acceptable for your use case
Check your data: some datasets are inherently ambiguous. If human-human agreement would only reach 90%, do not expect the LLM to do better

Phase 6 takes too many batches

Increase the batch size: change batch_size from 25 to 50
Adjust the pool weights: if most escalated instances come from the "uncertain" pool, lower its weight and raise "disagreement" and "error_pattern"

Labeling functions have low coverage

This is normal for tasks without strong lexical signals (e.g., sarcasm detection, implicit sentiment)
Labeling functions work best for explicit, keyword-driven patterns
Solo Mode also works without labeling functions; the LLM picks up the slack

Further Reading

Solo Mode Documentation -- full configuration reference
Active Learning -- underlying selection algorithm
AI Support -- LLM provider configuration
Quality Control -- additional quality assurance options
Parquet Export -- efficient data export