Solo Mode
Label entire datasets with a single annotator collaborating with an LLM through a 12-phase intelligent workflow.
New in v2.3.0
Traditional annotation projects require multiple annotators, inter-annotator agreement computation, adjudication rounds, and significant coordination overhead. For many research teams, this is the primary bottleneck: not the annotation interface, but the logistics of hiring, training, and managing a team.
Solo Mode replaces the multi-annotator paradigm with a single human expert collaborating with an LLM. The human provides high-quality labels on a small, strategically selected subset. The LLM learns from those labels, proposes labels for the rest, and the human reviews only the cases where the LLM is uncertain or likely wrong. A 12-phase workflow orchestrates this process automatically.
In internal benchmarks, Solo Mode achieved 95%+ agreement with full multi-annotator pipelines while requiring only 10-15% of the total human labels.
The 12-Phase Workflow
Solo Mode progresses through 12 phases. The system advances automatically based on configurable thresholds, though you can also trigger transitions manually from the admin dashboard.
Phase 1: Seed Annotation
The human annotator labels an initial seed set. Potato selects diverse, representative instances using embedding-based clustering to maximize coverage of the data distribution.
Default seed size: 50 instances (configurable via seed_count)
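Diversity-oriented selection can be approximated with a greedy farthest-point heuristic over the embeddings. This is an illustrative sketch, not Potato's actual selection code; `select_seed` and its inputs are hypothetical:

```python
import math
import random

def select_seed(embeddings, k, seed=0):
    """Greedy farthest-point selection over precomputed embedding
    vectors: start anywhere, then repeatedly add the instance
    farthest from everything chosen so far."""
    rng = random.Random(seed)
    selected = [rng.randrange(len(embeddings))]
    while len(selected) < k:
        far = max(
            (i for i in range(len(embeddings)) if i not in selected),
            key=lambda i: min(math.dist(embeddings[i], embeddings[j])
                              for j in selected),
        )
        selected.append(far)
    return selected

# Two tight clusters; k=2 should pick one instance from each.
points = [(0, 0), (0.1, 0), (10, 10), (10, 10.1)]
print(select_seed(points, 2))
```

A production system would typically cluster the embeddings (e.g., k-means) and sample per cluster, but the farthest-point idea captures the same goal: maximize coverage of the data distribution.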
Phase 2: Initial LLM Calibration
The LLM receives the seed annotations as few-shot examples and labels a calibration batch. Potato compares LLM predictions against held-out seed labels to establish a baseline accuracy.
Phase 3: Confusion Analysis
Potato identifies systematic disagreement patterns between human and LLM. It builds a confusion matrix and surfaces the most common error types (e.g., "LLM labels neutral as positive 40% of the time").
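The confusion-pair summary is straightforward to reproduce. A minimal sketch (function name and data are illustrative, not Potato's API):

```python
from collections import Counter

def top_confusions(human, llm, n=3):
    """Most frequent (human_label, llm_label) disagreement pairs,
    with raw counts and their share of all compared instances."""
    pairs = Counter((h, m) for h, m in zip(human, llm) if h != m)
    total = len(human)
    return [(h, m, c, c / total) for (h, m), c in pairs.most_common(n)]

human = ["neutral"] * 5 + ["positive"] * 5
llm = ["positive", "positive", "neutral", "neutral", "neutral",
       "positive", "positive", "positive", "positive", "neutral"]
for h, m, count, share in top_confusions(human, llm):
    print(f"{h} -> {m}: {count} ({share:.0%})")
```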
Phase 4: Guideline Refinement
Based on confusion analysis, Potato generates refined annotation guidelines for the LLM. The human reviews and edits these guidelines before they are applied. This is an interactive step where the annotator can add examples, clarify edge cases, and adjust label definitions.
Phase 5: Labeling Function Generation
Inspired by the ALCHEmist framework, Potato generates programmatic labeling functions from the existing annotations. These are simple pattern-based rules (e.g., "if the text contains 'excellent' and no negation, label as positive") that can label easy instances with high precision, reserving human and LLM effort for harder cases.
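The acceptance test for candidate functions can be sketched as a precision/coverage check against the existing human labels (hypothetical helper names; the default thresholds mirror the min_precision and min_coverage settings shown in the configuration reference below):

```python
def lf_stats(lf, texts, gold):
    """Precision and coverage of one labeling function, measured on
    instances that already carry human (gold) labels; None = abstain."""
    votes = [(lf(t), g) for t, g in zip(texts, gold)]
    fired = [(v, g) for v, g in votes if v is not None]
    coverage = len(fired) / len(texts)
    precision = sum(v == g for v, g in fired) / len(fired) if fired else 0.0
    return precision, coverage

def keep(lf, texts, gold, min_precision=0.90, min_coverage=0.01):
    """Keep only rules that are precise and cover enough data."""
    precision, coverage = lf_stats(lf, texts, gold)
    return precision >= min_precision and coverage >= min_coverage

# A toy rule: fire "Positive" when the text mentions "excellent".
rule = lambda t: "Positive" if "excellent" in t.lower() else None
print(lf_stats(rule, ["Excellent food", "Terrible"], ["Positive", "Negative"]))  # → (1.0, 0.5)
```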
Phase 6: Active Labeling
The human labels additional instances selected by active learning. Potato prioritizes instances where the LLM is most uncertain, where labeling functions disagree, or where the instance is far from existing training examples in embedding space.
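Uncertainty-based selection amounts to ranking unlabeled instances by the entropy of the LLM's label distribution. An illustrative sketch, not Potato's implementation:

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(predictions, batch_size):
    """Return the IDs of the batch_size instances whose predicted
    label distribution has the highest entropy."""
    ranked = sorted(predictions, key=lambda item: entropy(item[1]), reverse=True)
    return [idx for idx, _ in ranked[:batch_size]]

preds = [(0, [0.98, 0.01, 0.01]),   # confident
         (1, [0.34, 0.33, 0.33]),   # very uncertain
         (2, [0.70, 0.20, 0.10])]   # somewhat uncertain
print(most_uncertain(preds, 2))  # → [1, 2]
```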
Phase 7: Automated Refinement Loop
The LLM re-labels the full dataset with updated guidelines and few-shot examples. Potato compares against all human labels and triggers another cycle of confusion analysis and guideline refinement if accuracy is below the threshold.
Phase 8: Disagreement Exploration
The human reviews all instances where the LLM and labeling functions disagree. These are typically the most informative and difficult examples. The human's labels on these cases provide the highest marginal value.
Phase 9: Edge Case Synthesis
Potato uses the LLM to generate synthetic edge cases based on the identified confusion patterns. The human labels these synthetic examples, which are then added to the LLM's training context to improve performance on the hardest cases.
Phase 10: Cascaded Confidence Escalation
The LLM assigns confidence scores to every remaining unlabeled instance. Instances are escalated to the human in descending order of difficulty (ascending confidence). The human labels until quality metrics stabilize.
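A minimal sketch of the escalation loop, assuming each instance carries an LLM confidence score and a `review` callback stands in for the human annotator (all names hypothetical):

```python
def escalate(llm_labeled, budget, batch_size, review):
    """Escalate LLM-labeled instances to the human from least to most
    confident, one batch at a time; stop early once an entire batch
    comes back unchanged (a simple 'stable' criterion)."""
    queue = sorted(llm_labeled, key=lambda it: it["confidence"])[:budget]
    corrections = 0
    for start in range(0, len(queue), batch_size):
        batch = queue[start:start + batch_size]
        fixed = sum(review(it) != it["label"] for it in batch)
        corrections += fixed
        if fixed == 0:  # quality stabilized; stop spending budget
            break
    return corrections

items = [{"label": "Positive", "confidence": c}
         for c in (0.2, 0.3, 0.9, 0.95)]
# Stand-in reviewer: disagrees only on the two low-confidence items.
review = lambda it: "Negative" if it["confidence"] < 0.5 else it["label"]
print(escalate(items, budget=4, batch_size=2, review=review))  # → 2
```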
Phase 11: Prompt Optimization
Inspired by DSPy, Potato runs automated prompt optimization using the accumulated human labels as a validation set. It tries multiple prompt variations (instruction phrasing, example ordering, chain-of-thought vs. direct) and selects the best-performing prompt.
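At its simplest, the search scores every combination of variations on the validation set. The sketch below uses exhaustive grid search as a stand-in for the Bayesian strategy, and the scorer is faked, since a real one would run the LLM over the validation labels:

```python
import itertools

def best_prompt(variants, score):
    """Score every prompt configuration and return the best one.
    (A Bayesian strategy would sample this space more cleverly.)"""
    grid = [dict(zip(variants, combo))
            for combo in itertools.product(*variants.values())]
    return max(grid, key=score)

variants = {
    "reasoning_mode": ["direct", "chain_of_thought"],
    "example_count": [3, 5, 10],
}
# Stand-in scores; a real scorer would compute f1_macro per variant.
fake_scores = {("direct", 3): 0.81, ("direct", 5): 0.84, ("direct", 10): 0.83,
               ("chain_of_thought", 3): 0.85, ("chain_of_thought", 5): 0.90,
               ("chain_of_thought", 10): 0.88}
winner = best_prompt(
    variants,
    lambda g: fake_scores[(g["reasoning_mode"], g["example_count"])],
)
print(winner)  # → {'reasoning_mode': 'chain_of_thought', 'example_count': 5}
```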
Phase 12: Final Validation
The human performs a final review of a random sample from the LLM-labeled instances. If accuracy meets the threshold, the dataset is complete. If not, the system cycles back to Phase 6.
Configuration
Quick Start
A minimal Solo Mode configuration:
```yaml
task_name: "Sentiment Classification"
task_dir: "."
data_files:
  - "data/reviews.jsonl"
item_properties:
  id_key: id
  text_key: text
solo_mode:
  enabled: true
  # LLM provider
  llm:
    endpoint_type: openai
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
  # Basic thresholds
  seed_count: 50
  accuracy_threshold: 0.92
  confidence_threshold: 0.85
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    labels:
      - Positive
      - Neutral
      - Negative
output_annotation_dir: "output/"
output_annotation_format: "jsonl"
```

Full Configuration Reference
```yaml
solo_mode:
  enabled: true

  # LLM configuration
  llm:
    endpoint_type: openai        # openai, anthropic, ollama, vllm
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
    temperature: 0.1             # low temperature for consistency
    max_tokens: 256

  # Phase control
  phases:
    seed:
      count: 50                  # number of seed instances
      selection: diversity       # diversity, random, or stratified
      embedding_model: "all-MiniLM-L6-v2"
    calibration:
      batch_size: 100
      holdout_fraction: 0.2      # fraction of seed used for validation
    confusion_analysis:
      min_samples: 30
      significance_threshold: 0.05
    guideline_refinement:
      auto_suggest: true         # LLM suggests guideline edits
      require_approval: true     # human must approve changes
    labeling_functions:
      enabled: true
      max_functions: 20
      min_precision: 0.90        # only keep high-precision rules
      min_coverage: 0.01         # must cover at least 1% of data
    active_labeling:
      batch_size: 25
      strategy: uncertainty      # uncertainty, diversity, or hybrid
      max_batches: 10
    refinement_loop:
      max_iterations: 3
      improvement_threshold: 0.02
    disagreement_exploration:
      max_instances: 200
      sort_by: confidence_gap
    edge_case_synthesis:
      enabled: true
      count: 50
      diversity_weight: 0.3
    confidence_escalation:
      escalation_budget: 200     # max instances to escalate
      batch_size: 25
      stop_when_stable: true     # stop if last batch accuracy is 100%
    prompt_optimization:
      enabled: true
      candidates: 10             # number of prompt variants to try
      metric: f1_macro
      search_strategy: bayesian  # bayesian, grid, or random
    final_validation:
      sample_size: 100
      min_accuracy: 0.92
      fallback_phase: 6          # go back to Phase 6 if validation fails

  # Instance prioritization across phases
  prioritization:
    pools:
      - name: uncertain
        weight: 0.30
        description: "LLM confidence below threshold"
      - name: disagreement
        weight: 0.25
        description: "LLM and labeling functions disagree"
      - name: boundary
        weight: 0.20
        description: "Near decision boundary in embedding space"
      - name: novel
        weight: 0.10
        description: "Far from all existing labeled examples"
      - name: error_pattern
        weight: 0.10
        description: "Matches known confusion patterns"
      - name: random
        weight: 0.05
        description: "Random sample for calibration"
```

Key Capabilities
Confusion Analysis
After each labeling round, Potato builds a confusion matrix between human and LLM labels. The admin dashboard shows:
- Per-class precision, recall, and F1 from the LLM's perspective
- Most common confusion pairs (e.g., "neutral misclassified as positive: 23 instances")
- Example instances for each confusion pair
- Trend charts showing improvement across refinement rounds
Access confusion analysis programmatically:
```
python -m potato.solo confusion --config config.yaml
```

Output:

```
Confusion Analysis (Round 2)
============================
Overall Accuracy: 0.87 (target: 0.92)

Top Confusion Pairs:
  neutral  -> positive: 23 instances (15.3%)
  negative -> neutral:  11 instances  (7.3%)
  positive -> neutral:   5 instances  (3.3%)

Per-Class Performance:
  Positive: P=0.91  R=0.94  F1=0.92
  Neutral:  P=0.78  R=0.71  F1=0.74
  Negative: P=0.93  R=0.88  F1=0.90
```
Automated Refinement Loop
The refinement loop iterates between LLM labeling, confusion analysis, and guideline updates. Each iteration:
- LLM labels the full dataset with current guidelines
- Potato compares against all available human labels
- If accuracy is below threshold, confusion analysis runs
- LLM proposes guideline edits based on error patterns
- Human reviews and approves edits
- Cycle repeats (up to max_iterations)
```yaml
solo_mode:
  llm:
    endpoint_type: anthropic
    model: "claude-sonnet-4-20250514"
    api_key: ${ANTHROPIC_API_KEY}
  phases:
    refinement_loop:
      max_iterations: 3
      improvement_threshold: 0.02  # stop if improvement is less than 2%
```

Labeling Functions (ALCHEmist-Inspired)
Potato generates lightweight labeling functions from patterns observed in human annotations. These are not LLM calls; they are fast, deterministic rules.
Example generated labeling functions:
```python
# Auto-generated labeling function 1
# Precision: 0.96, Coverage: 0.08
def lf_strong_positive_words(text):
    positive = {"excellent", "amazing", "fantastic", "outstanding", "perfect"}
    if any(w in text.lower() for w in positive):
        if not any(neg in text.lower() for neg in {"not", "never", "no"}):
            return "Positive"
    return None  # abstain

# Auto-generated labeling function 2
# Precision: 0.93, Coverage: 0.05
def lf_explicit_negative(text):
    negative = {"terrible", "awful", "horrible", "worst", "disgusting"}
    if any(w in text.lower() for w in negative):
        return "Negative"
    return None
```

Configure labeling function behavior:
```yaml
solo_mode:
  phases:
    labeling_functions:
      enabled: true
      max_functions: 20
      min_precision: 0.90
      min_coverage: 0.01
      types:
        - keyword_match
        - regex_pattern
        - length_threshold
        - embedding_cluster
```

Disagreement Explorer
The disagreement explorer presents instances where different signals conflict. For each instance, the annotator sees:
- The LLM's predicted label and confidence
- Labeling function votes (if any)
- Nearest labeled neighbors in embedding space
- The raw text/content
This is the highest-value annotation activity: each label resolves a genuine ambiguity.
```yaml
solo_mode:
  phases:
    disagreement_exploration:
      max_instances: 200
      sort_by: confidence_gap    # or "lf_disagreement" or "random"
      show_llm_reasoning: true   # display LLM's chain-of-thought
      show_nearest_neighbors: 3  # show 3 nearest labeled examples
```

Cascaded Confidence Escalation
After the bulk of the dataset is labeled by the LLM, Potato ranks all LLM-labeled instances by confidence and escalates the least confident ones to the human. This continues in batches until quality stabilizes.
```yaml
solo_mode:
  phases:
    confidence_escalation:
      escalation_budget: 200
      batch_size: 25
      stop_when_stable: true
      stability_window: 3  # stop if last 3 batches are all correct
```

Multi-Signal Instance Prioritization
Across all phases that involve human labeling, Potato uses a weighted pool system to select the most informative instances. Six pools feed into a unified priority queue:
```yaml
solo_mode:
  prioritization:
    pools:
      - name: uncertain
        weight: 0.30
      - name: disagreement
        weight: 0.25
      - name: boundary
        weight: 0.20
      - name: novel
        weight: 0.10
      - name: error_pattern
        weight: 0.10
      - name: random
        weight: 0.05
```

- uncertain: Instances where the LLM's confidence is below confidence_threshold
- disagreement: Instances where the LLM and labeling functions produce different labels
- boundary: Instances near the decision boundary in embedding space
- novel: Instances far from any existing labeled example
- error_pattern: Instances matching known confusion patterns from previous rounds
- random: A small random sample to maintain calibration and catch blind spots
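The weighted draw itself can be sketched as repeated weighted sampling over whichever pools are still non-empty (illustrative only; pool contents here are bare instance IDs, and the function mutates the pools as it draws):

```python
import random

def draw_batch(pools, weights, batch_size, seed=0):
    """Fill a batch by sampling instance IDs from named pools in
    proportion to their weights, skipping pools that run empty."""
    rng = random.Random(seed)
    batch = []
    while len(batch) < batch_size and any(pools.values()):
        active = [name for name, members in pools.items() if members]
        name = rng.choices(active, weights=[weights[n] for n in active])[0]
        batch.append(pools[name].pop())  # consume the drawn instance
    return batch

pools = {"uncertain": [1, 2, 3], "disagreement": [4, 5], "random": [6]}
weights = {"uncertain": 0.30, "disagreement": 0.25, "random": 0.05}
batch = draw_batch(pools, weights, 4)
print(batch)
```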
Edge Case Synthesis
Potato uses the LLM to generate synthetic examples that target known weaknesses:
```yaml
solo_mode:
  phases:
    edge_case_synthesis:
      enabled: true
      count: 50
      diversity_weight: 0.3
      confusion_pairs:  # focus on these error types
        - ["neutral", "positive"]
        - ["negative", "neutral"]
```

The LLM generates examples that are ambiguous between the specified label pairs. The human labels them, and these labels are added to the few-shot context for subsequent LLM labeling rounds.
Prompt Optimization (DSPy-Inspired)
In Phase 11, Potato runs automated prompt optimization to find the best instruction format for the LLM:
```yaml
solo_mode:
  phases:
    prompt_optimization:
      enabled: true
      candidates: 10
      metric: f1_macro
      search_strategy: bayesian
      variations:
        - instruction_style  # formal vs. conversational
        - example_ordering   # random, by-class, by-difficulty
        - reasoning_mode     # direct, chain-of-thought, self-consistency
        - example_count      # 3, 5, 10, 15 few-shot examples
```

Monitoring Progress
The admin dashboard shows Solo Mode progress in real time:
- Current phase and progress within each phase
- Human labels completed vs. total budget
- LLM accuracy over time (per round)
- Labeling function coverage and precision
- Confidence distribution histogram
- Estimated time to completion
Access from the command line:
```
python -m potato.solo status --config config.yaml
```

```
Solo Mode Status
================
Current Phase: 6 (Active Labeling) - Batch 3/10
Human Labels: 142 / ~300 estimated total
LLM Accuracy: 0.89 (target: 0.92)
LF Coverage: 0.23 (labeling functions cover 23% of data)

Dataset Size: 10,000 instances
  - Human labeled: 142
  - LF labeled: 2,300
  - LLM labeled: 7,558
  - Unlabeled: 0
```
When to Use Solo Mode vs. Traditional Multi-Annotator
Use Solo Mode when:
- You have a domain expert who can provide high-quality labels
- Budget or logistics prevent hiring multiple annotators
- The task has clear, well-defined categories
- You need to label a large dataset (1,000+ instances)
- Speed matters more than measuring inter-annotator agreement
Use traditional multi-annotator when:
- You need inter-annotator agreement statistics for publication
- The task is highly subjective (e.g., offensiveness, humor)
- You need to study annotator disagreement patterns
- Regulatory requirements mandate multiple independent annotators
- The label space is complex or evolving (annotation guidelines are still being developed)
Hybrid approach: Use Solo Mode for the initial bulk labeling, then assign a second annotator to a random 10-20% sample to compute agreement statistics. This gives you the efficiency of Solo Mode with the quality assurance of multi-annotator verification.
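The agreement statistic on the verification sample is standard. For two label sequences, Cohen's kappa can be computed with a few lines (the sample data here is illustrative):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators' labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[lab] * cb[lab] for lab in ca.keys() | cb.keys()) / (n * n)
    return (observed - expected) / (1 - expected)

solo = ["P", "P", "N", "N", "P", "N"]    # Solo Mode labels on the sample
second = ["P", "P", "N", "P", "P", "N"]  # second annotator's labels
print(round(cohens_kappa(solo, second), 2))  # → 0.67
```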
```yaml
solo_mode:
  enabled: true
  # ... solo mode config ...

  # Hybrid: assign verification sample to second annotator
  verification:
    enabled: true
    sample_fraction: 0.15
    annotator: "reviewer_1"
```

Further Reading
- Solo Mode Tutorial: Labeling 10,000 Examples -- step-by-step walkthrough
- Active Learning -- the underlying active learning system
- AI Support -- LLM integration configuration
- Quality Control -- quality assurance for annotations
- MACE -- competence estimation (useful for hybrid mode verification)
For implementation details, see the source documentation.