Solo Mode: How One Annotator Can Label 10,000 Examples
Step-by-step tutorial on using Potato's Solo Mode to efficiently label large datasets with human-LLM collaboration, reducing annotation cost by up to 90%.
You have 10,000 product reviews to label for sentiment (Positive, Neutral, Negative). Hiring three annotators to label everything would take weeks and cost thousands of dollars. With Solo Mode, a single domain expert can achieve comparable quality by labeling only 500-1,000 instances while an LLM handles the rest -- with the human reviewing every decision the LLM is uncertain about.
This tutorial walks through the entire process end to end.
What You Will Need
- Potato 2.3.0+ with the Solo Mode extras: `pip install potato-annotation[solo]`
- An OpenAI or Anthropic API key (for the LLM component)
- Your dataset in JSONL format
- One knowledgeable annotator (that could be you)
Step 1: Prepare Your Data
Create data/reviews.jsonl with one review per line:
```jsonl
{"id": "rev_001", "text": "Absolutely love this product! Best purchase I've made all year.", "source": "amazon"}
{"id": "rev_002", "text": "It works fine. Nothing special but gets the job done.", "source": "amazon"}
{"id": "rev_003", "text": "Broke after two weeks. Complete waste of money.", "source": "amazon"}
{"id": "rev_004", "text": "The quality is decent for the price point. I might buy again.", "source": "amazon"}
{"id": "rev_005", "text": "Arrived damaged and customer service was unhelpful.", "source": "amazon"}
```

For this tutorial, imagine this file contains 10,000 reviews.
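Before launching, it can help to sanity-check the file: every line must parse as JSON, carry the keys the config expects (`id` and `text`), and use a unique `id`. A minimal sketch (the helper name is mine, not part of Potato):

```python
import json

def validate_jsonl(path):
    """Return the number of records, raising on malformed or duplicate entries."""
    seen = set()
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            record = json.loads(line)  # raises on invalid JSON
            assert "id" in record and "text" in record, f"line {lineno}: missing key"
            assert record["id"] not in seen, f"line {lineno}: duplicate id"
            seen.add(record["id"])
    return len(seen)
```

Running it once before `potato start` catches formatting problems that would otherwise surface mid-annotation.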
Step 2: Create the Configuration
Create config.yaml:
```yaml
task_name: "Product Review Sentiment (Solo Mode)"
task_dir: "."

data_files:
  - "data/reviews.jsonl"

item_properties:
  id_key: id
  text_key: text

# --- Solo Mode Configuration ---
solo_mode:
  enabled: true

  llm:
    endpoint_type: openai
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
    temperature: 0.1
    max_tokens: 64

  # Quality targets
  seed_count: 50
  accuracy_threshold: 0.93
  confidence_threshold: 0.85

  # Phase-specific settings
  phases:
    seed:
      count: 50
      selection: diversity
      embedding_model: "all-MiniLM-L6-v2"
    calibration:
      batch_size: 200
      holdout_fraction: 0.2
    labeling_functions:
      enabled: true
      max_functions: 15
      min_precision: 0.92
      min_coverage: 0.01
    active_labeling:
      batch_size: 25
      strategy: hybrid
      max_batches: 15
    refinement_loop:
      max_iterations: 3
      improvement_threshold: 0.02
    disagreement_exploration:
      max_instances: 150
      show_llm_reasoning: true
      show_nearest_neighbors: 3
    edge_case_synthesis:
      enabled: true
      count: 30
    confidence_escalation:
      escalation_budget: 150
      batch_size: 25
      stop_when_stable: true
    prompt_optimization:
      enabled: true
      candidates: 8
      metric: f1_macro
    final_validation:
      sample_size: 100
      min_accuracy: 0.93

  # Instance prioritization
  prioritization:
    pools:
      - name: uncertain
        weight: 0.30
      - name: disagreement
        weight: 0.25
      - name: boundary
        weight: 0.20
      - name: novel
        weight: 0.10
      - name: error_pattern
        weight: 0.10
      - name: random
        weight: 0.05

# --- Annotation Schema ---
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "What is the overall sentiment of this review?"
    labels:
      - "Positive"
      - "Neutral"
      - "Negative"
    label_requirement:
      required: true
    sequential_key_binding: true

output_annotation_dir: "output/"
output_annotation_format: "jsonl"
parquet_export:
  enabled: true
  output_dir: "output/parquet/"
```

Step 3: Start the Server
```shell
potato start config.yaml -p 8000
```

Open http://localhost:8000 and log in. The Solo Mode dashboard will appear, showing you are in Phase 1: Seed Annotation.
Step 4: Phase 1 -- Seed Annotation (50 Instances)
Potato has selected 50 diverse reviews using embedding-based clustering. These are not random; they are chosen to maximize coverage of your data distribution.
Label each one. This is the most important phase -- the quality of your seed labels determines how well the LLM will learn. Take your time and be consistent.
Time estimate: 15-25 minutes at 20-30 seconds per instance.
When you finish the 50th instance, Potato automatically advances to Phase 2.
Step 5: Phase 2 -- Initial LLM Calibration
This phase runs automatically. Potato sends the LLM a batch of 200 instances, using 40 of your seed labels as few-shot examples. It then compares the LLM's predictions against the remaining 10 held-out seed labels (20% of 50) to estimate baseline accuracy.
You will see a progress indicator in the dashboard. This typically takes 1-2 minutes depending on the LLM provider.
Typical result: The LLM achieves 75-85% accuracy on the first calibration. This is expected -- the LLM has not yet learned your specific annotation style.
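The split-and-score logic behind calibration is simple. A stdlib sketch of one plausible implementation (function names are mine; Potato's internals may differ):

```python
import random

def calibration_split(seed_labels, holdout_fraction=0.2, rng=None):
    """Split seed labels into few-shot examples and a held-out eval set."""
    rng = rng or random.Random(0)
    items = list(seed_labels)
    rng.shuffle(items)
    n_holdout = int(len(items) * holdout_fraction)
    return items[n_holdout:], items[:n_holdout]  # (few_shot, holdout)

def estimate_accuracy(predictions, holdout):
    """Fraction of held-out seed labels the LLM reproduced."""
    correct = sum(1 for item in holdout if predictions[item["id"]] == item["label"])
    return correct / len(holdout)
```

With 50 seeds and `holdout_fraction: 0.2`, this yields 40 few-shot examples and a 10-instance holdout, matching the numbers above.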
Step 6: Phase 3 -- Confusion Analysis
Potato displays a confusion matrix showing where the LLM disagrees with your labels. A typical output:
```text
Confusion Analysis (Round 1)
============================
Overall Accuracy: 0.82 (target: 0.93)

Top Confusion Pairs:
  Neutral  -> Positive: 14 instances (7.0%)
  Negative -> Neutral:   9 instances (4.5%)
  Positive -> Neutral:   4 instances (2.0%)
```
This tells you the LLM's main weakness: it tends to upgrade neutral reviews to positive. This is common -- LLMs are often biased toward positive sentiment.
Your action: Review the confusion pairs. Click on each pair to see the specific instances the LLM got wrong. This helps you understand the LLM's failure modes.
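A confusion report of this kind needs nothing beyond the paired labels. A minimal stdlib sketch (illustrative, not Potato's API):

```python
from collections import Counter

def confusion_report(gold, predicted):
    """Return overall accuracy plus (gold, predicted) disagreement pairs,
    most frequent first."""
    accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
    pairs = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    return accuracy, pairs.most_common()
```

The most frequent pair tells you which guideline clarification will pay off most in the next phase.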
Step 7: Phase 4 -- Guideline Refinement
Based on the confusion analysis, Potato generates refined guidelines for the LLM. You see a side-by-side view:
- Current guidelines: The initial prompt used for the LLM
- Suggested edits: Specific changes the LLM proposes based on error patterns
For example, Potato might suggest adding:
"Reviews that describe a product as 'fine', 'okay', or 'decent' without strong emotion should be labeled Neutral, even if they mention buying again."
Review each suggested edit. Approve, modify, or reject each one. You can also add your own clarifications.
Time estimate: 5-10 minutes.
Step 8: Phase 5 -- Labeling Function Generation
Potato generates programmatic labeling functions from patterns in your seed labels. These are fast, deterministic rules that handle easy cases:
```text
Generated Labeling Functions:
  LF1: Strong positive words (love, amazing, best, excellent)
       Precision: 0.97, Coverage: 0.06
  LF2: Strong negative words (terrible, awful, worst, waste)
       Precision: 0.95, Coverage: 0.04
  LF3: Exclamation + positive adjective
       Precision: 0.94, Coverage: 0.03
  LF4: Return/refund mention + negative context
       Precision: 0.92, Coverage: 0.02
  ...
Total coverage: 0.18 (1,800 of 10,000 instances)
Labeling functions cover 18% of your dataset at 92%+ precision. These instances are labeled automatically, reserving LLM calls and human effort for the harder cases.
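To make the idea concrete, here is what a keyword-style labeling function might look like. This is a hypothetical sketch of the pattern (Potato's generated functions and internal representation may differ); functions either vote a label or abstain by returning `None`:

```python
# Cue word sets mirror LF1/LF2 from the report above.
POSITIVE_CUES = {"love", "amazing", "best", "excellent"}
NEGATIVE_CUES = {"terrible", "awful", "worst", "waste"}

def _tokens(text):
    """Crude tokenizer: lowercase and strip basic punctuation."""
    return set(text.lower().replace("!", " ").replace(".", " ").split())

def lf_strong_positive(text):
    return "Positive" if _tokens(text) & POSITIVE_CUES else None  # None = abstain

def lf_strong_negative(text):
    return "Negative" if _tokens(text) & NEGATIVE_CUES else None
```

Because each function abstains outside its narrow pattern, precision stays high even though coverage per function is only a few percent.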
Your action: Review the generated functions. Disable any that seem unreliable. This is optional -- Potato only keeps functions above your configured precision threshold.
Step 9: Phase 6 -- Active Labeling (125-375 Instances)
This is the main human labeling phase. Potato selects instances using the six-pool prioritization system:
- Uncertain (30%): Reviews where the LLM's confidence is below 85%
- Disagreement (25%): Reviews where the LLM and labeling functions give different labels
- Boundary (20%): Reviews near the decision boundary in embedding space
- Novel (10%): Reviews unlike anything you have labeled so far
- Error pattern (10%): Reviews matching known confusion patterns (e.g., lukewarm-positive)
- Random (5%): Random reviews for calibration
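The pool weights from the config drive a weighted draw over candidate instances. A minimal sketch of how such a batch could be assembled (the sampling logic is illustrative, not Potato's actual scheduler):

```python
import random

# Pool names and weights mirror the prioritization section of config.yaml.
POOL_WEIGHTS = {
    "uncertain": 0.30, "disagreement": 0.25, "boundary": 0.20,
    "novel": 0.10, "error_pattern": 0.10, "random": 0.05,
}

def draw_batch(pools, batch_size=25, rng=None):
    """pools: dict mapping pool name -> list of candidate instance ids."""
    rng = rng or random.Random(0)
    names = [n for n in POOL_WEIGHTS if pools.get(n)]
    weights = [POOL_WEIGHTS[n] for n in names]
    batch = []
    while len(batch) < batch_size and names:
        name = rng.choices(names, weights=weights, k=1)[0]
        batch.append(pools[name].pop())
        if not pools[name]:  # drop exhausted pools
            idx = names.index(name)
            names.pop(idx)
            weights.pop(idx)
    return batch
```

In expectation a 25-instance batch contains about 7-8 uncertain, 6 disagreement, and 5 boundary instances, with the rest spread across the smaller pools.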
You label these in batches of 25. After each batch, Potato updates the LLM's accuracy estimate and decides whether to continue.
Typical trajectory:
- Batch 1-3 (75 instances): Accuracy climbs from 82% to 87%
- Batch 4-6 (150 instances): Accuracy reaches 90%
- Batch 7-10 (250 instances): Accuracy plateaus at 91-92%
If accuracy reaches 93% (your threshold), Solo Mode jumps ahead to Phase 10. Otherwise, it continues to Phase 7.
Time estimate: 45-90 minutes total, depending on how many batches are needed.
Step 10: Phase 7 -- Automated Refinement Loop
If accuracy is still below threshold after active labeling, Potato runs another round of the refinement loop:
- LLM re-labels the full dataset with updated guidelines and more few-shot examples
- Accuracy is recomputed against all human labels
- New confusion patterns are identified
- Guidelines are refined again
This phase is mostly automatic. You only need to approve guideline changes.
Typical result: Accuracy improves by 2-4% per refinement round.
Step 11: Phase 8 -- Disagreement Exploration
Potato presents the most contentious instances: cases where the LLM, labeling functions, and nearest-neighbor analysis all give different answers. For each instance, you see:
- The review text
- LLM prediction and confidence
- Labeling function votes
- 3 nearest labeled examples with their labels
- The LLM's chain-of-thought reasoning
These are genuinely hard cases. Your labels here have the highest marginal value of any annotation in the entire process.
Time estimate: 20-30 minutes for 100-150 instances.
Step 12: Phase 9 -- Edge Case Synthesis
Potato generates synthetic reviews targeting the remaining confusion patterns. For example, if the LLM still struggles with "neutral reviews that mention buying again," it generates examples like:
"It's an okay product for the price. I might get another one if there's a sale."
You label these synthetic examples, and they are added to the LLM's few-shot context.
Time estimate: 10-15 minutes for 30 examples.
Step 13: Phase 10 -- Cascaded Confidence Escalation
The LLM has now labeled most of the dataset. Potato ranks all LLM-labeled instances by confidence and sends the lowest-confidence ones to you in batches of 25.
```text
Confidence Escalation Progress:
  Batch 1: 25 instances, 23/25 correct (92%)
  Batch 2: 25 instances, 24/25 correct (96%)
  Batch 3: 25 instances, 25/25 correct (100%)
  -> Stopping: last 3 batches stable
```
Once recent batches are stable, with accuracy at or above your threshold and no downward trend, Solo Mode concludes that the remaining high-confidence labels are trustworthy.
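One plausible reading of the `stop_when_stable` rule, sketched in a few lines (this is my interpretation, not Potato's documented criterion):

```python
def should_stop(batch_accuracies, window=3, threshold=0.93):
    """Stop escalating once the last `window` batch accuracies are
    non-decreasing and the most recent one meets the threshold."""
    recent = batch_accuracies[-window:]
    if len(recent) < window:
        return False
    non_decreasing = all(a <= b for a, b in zip(recent, recent[1:]))
    return non_decreasing and recent[-1] >= threshold
```

Applied to the trajectory above (92% -> 96% -> 100%), this rule fires after the third batch.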
Time estimate: 15-20 minutes.
Step 14: Phase 11 -- Prompt Optimization
This phase runs automatically. Potato tries 8 prompt variants and selects the one with the highest F1 score on your accumulated human labels:
```text
Prompt Optimization Results:
  Variant 1 (direct, 5 examples):      F1=0.91
  Variant 2 (CoT, 5 examples):         F1=0.93
  Variant 3 (direct, 10 examples):     F1=0.92
  Variant 4 (CoT, 10 examples):        F1=0.94  <-- selected
  Variant 5 (direct, 15 examples):     F1=0.92
  Variant 6 (CoT, 15 examples):        F1=0.93
  Variant 7 (self-consistency, 5x):    F1=0.94
  Variant 8 (self-consistency, 10x):   F1=0.94
```
The best prompt is used for a final re-labeling pass.
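Selecting by macro-F1 (the `f1_macro` metric in the config) just means scoring each variant's predictions against the accumulated human labels and keeping the argmax. A stdlib sketch (function names are illustrative):

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-label F1 scores."""
    labels = set(gold) | set(predicted)
    f1s = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

def select_best_variant(variants, gold):
    """variants: dict mapping variant name -> list of predicted labels."""
    return max(variants, key=lambda name: macro_f1(gold, variants[name]))
```

Macro averaging matters here because the label distribution is skewed: a variant that over-predicts Positive scores well on accuracy but poorly on Neutral/Negative F1.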
Step 15: Phase 12 -- Final Validation
Potato selects 100 random LLM-labeled instances for you to review. You label them, and Potato compares against the LLM's labels.
```text
Final Validation:
  Reviewed: 100 instances
  LLM correct: 94/100 (94%)
  Threshold: 93%
  -> PASSED
```
If the LLM's accuracy meets your threshold, the dataset is complete. If not, Solo Mode cycles back to Phase 6 for another round of active labeling.
Time estimate: 10-15 minutes.
Results Summary
After running through all 12 phases, check the final statistics:
```shell
python -m potato.solo status --config config.yaml
```

```text
Solo Mode Complete
==================
Dataset: 10,000 instances

Total human labels: 612
  Seed:                      50
  Active labeling:          275
  Disagreement exploration: 137
  Edge case synthesis:       30
  Confidence escalation:     75
  Final validation:          45

LLM labels: 8,200 (accuracy: 94.1%)
LF labels:  1,800 (precision: 95.3%)
Unlabeled:  0

Final label distribution:
  Positive: 4,823 (48.2%)
  Neutral:  3,011 (30.1%)
  Negative: 2,166 (21.7%)

Total human time: ~3.5 hours
Estimated multi-annotator cost (3x): ~$4,500
Solo Mode cost: ~$450 (API fees) + ~$175 (annotator time)
Savings: ~86%
```
The human labeled 612 out of 10,000 instances (6.1%). The LLM and labeling functions handled the rest at 94%+ accuracy.
Exporting Results
Export the final labeled dataset:
```shell
python -m potato.solo export --config config.yaml --output final_labels.jsonl
```

Each line includes the label and its source:

```jsonl
{"id": "rev_001", "sentiment": "Positive", "source": "human", "confidence": 1.0}
{"id": "rev_002", "sentiment": "Neutral", "source": "llm", "confidence": 0.91}
{"id": "rev_003", "sentiment": "Negative", "source": "labeling_function", "confidence": 0.97}
```

For Parquet export:
```python
import pandas as pd

df = pd.read_parquet("output/parquet/annotations.parquet")
print(df["value"].value_counts())
```

Quality Assurance: Hybrid Verification
For publication-quality datasets, add a second annotator to review a sample:
```yaml
solo_mode:
  verification:
    enabled: true
    sample_fraction: 0.10
    annotator: "reviewer_1"
```

This assigns 1,000 random instances to a second annotator. You can then compute inter-annotator agreement between the Solo Mode labels and the reviewer's labels.
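For reporting agreement, Cohen's kappa is the standard choice for two annotators. A self-contained sketch you could run over the exported labels (scikit-learn's `cohen_kappa_score` is the usual off-the-shelf alternative):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over the same instances."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label rates.
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in counts_a.keys() | counts_b.keys()
    )
    return (observed - expected) / (1 - expected)
```

Values above 0.8 are conventionally read as strong agreement; if the reviewer's kappa against the Solo Mode labels is far below the human-human baseline for your task, treat the LLM labels with more caution.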
Troubleshooting
LLM accuracy plateaus below threshold
- Increase seed count: Try 75-100 seed instances instead of 50
- Switch LLM: Try `claude-sonnet-4-20250514` instead of GPT-4o (or vice versa)
- Lower the threshold: If 93% is not achievable, consider whether 90% is acceptable for your use case
- Check your data: Some datasets are inherently ambiguous. If human-human agreement would only be 90%, do not expect the LLM to do better
Phase 6 takes too many batches
- Increase batch size: Change `batch_size` from 25 to 50
- Adjust pool weights: If most escalated instances are from the "uncertain" pool, reduce its weight and increase "disagreement" and "error_pattern"
Labeling functions have low coverage
- This is normal for tasks without strong lexical signals (e.g., sarcasm detection, implicit sentiment)
- Labeling functions work best for explicit, keyword-driven patterns
- Solo Mode still works without labeling functions -- the LLM picks up the slack
Further Reading
- Solo Mode Documentation -- full configuration reference
- Active Learning -- the underlying selection algorithm
- AI Support -- LLM provider configuration
- Quality Control -- additional quality assurance options
- Parquet Export -- efficient data export