Solo Mode

Label entire datasets with a single annotator collaborating with an LLM through a 12-phase intelligent workflow.

New in v2.3.0

Traditional annotation projects require multiple annotators, inter-annotator agreement computation, adjudication rounds, and significant coordination overhead. For many research teams, this is the primary bottleneck: not the annotation interface, but the logistics of hiring, training, and managing a team.

Solo Mode replaces the multi-annotator paradigm with a single human expert collaborating with an LLM. The human provides high-quality labels on a small, strategically selected subset. The LLM learns from those labels, proposes labels for the rest, and the human reviews only the cases where the LLM is uncertain or likely wrong. A 12-phase workflow orchestrates this process automatically.

In internal benchmarks, Solo Mode achieved 95%+ agreement with full multi-annotator pipelines while requiring only 10-15% of the total human labels.

The 12-Phase Workflow

Solo Mode progresses through 12 phases. The system advances automatically based on configurable thresholds, though you can also trigger transitions manually from the admin dashboard.

Phase 1: Seed Annotation

The human annotator labels an initial seed set. Potato selects diverse, representative instances using embedding-based clustering to maximize coverage of the data distribution.

Default seed size: 50 instances (configurable via seed_count)
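Potato's selection logic is internal, but the idea can be sketched with greedy farthest-point sampling over precomputed embeddings (the function and variable names below are illustrative, not Potato's API):

```python
import math
import random

def select_seed(embeddings, seed_count=50, rng=None):
    """Greedy farthest-point sampling: repeatedly pick the instance
    farthest from everything selected so far, maximizing coverage."""
    rng = rng or random.Random()
    n = len(embeddings)
    selected = [rng.randrange(n)]                       # random starting point
    # distance from each instance to its nearest selected instance
    dist = [_euclidean(embeddings[i], embeddings[selected[0]]) for i in range(n)]
    while len(selected) < min(seed_count, n):
        nxt = max(range(n), key=lambda i: dist[i])      # most isolated instance
        selected.append(nxt)
        for i in range(n):
            dist[i] = min(dist[i], _euclidean(embeddings[i], embeddings[nxt]))
    return selected

def _euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Three tight clusters; three seeds land one per cluster
emb = [(0, 0), (0.1, 0), (10, 10), (10, 10.1), (5, 5), (5.1, 5)]
seed_ids = select_seed(emb, seed_count=3, rng=random.Random(0))
```

Clustering-based alternatives (e.g. picking instances nearest to k-means centroids) achieve the same goal of covering the data distribution.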

Phase 2: Initial LLM Calibration

The LLM receives the seed annotations as few-shot examples and labels a calibration batch. Potato compares LLM predictions against held-out seed labels to establish a baseline accuracy.

Phase 3: Confusion Analysis

Potato identifies systematic disagreement patterns between human and LLM. It builds a confusion matrix and surfaces the most common error types (e.g., "LLM labels neutral as positive 40% of the time").
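The core of this analysis fits in a few lines; here is a sketch (not Potato's internals) of counting (human, LLM) label pairs and ranking the disagreements:

```python
from collections import Counter

def confusion_pairs(human_labels, llm_labels):
    """Count (human, llm) label pairs and rank the disagreements by frequency."""
    pairs = Counter(zip(human_labels, llm_labels))
    disagreements = {p: n for p, n in pairs.items() if p[0] != p[1]}
    total = sum(pairs.values())
    return sorted(
        ((h, l, n, n / total) for (h, l), n in disagreements.items()),
        key=lambda row: -row[2],
    )

human = ["neutral", "neutral", "positive", "negative", "neutral"]
llm   = ["positive", "positive", "positive", "neutral", "neutral"]
for h, l, n, frac in confusion_pairs(human, llm):
    print(f"{h} -> {l}: {n} instances ({frac:.0%})")
# neutral -> positive: 2 instances (40%)
# negative -> neutral: 1 instances (20%)
```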

Phase 4: Guideline Refinement

Based on confusion analysis, Potato generates refined annotation guidelines for the LLM. The human reviews and edits these guidelines before they are applied. This is an interactive step where the annotator can add examples, clarify edge cases, and adjust label definitions.

Phase 5: Labeling Function Generation

Inspired by the ALCHEmist framework, Potato generates programmatic labeling functions from the existing annotations. These are simple pattern-based rules (e.g., "if the text contains 'excellent' and no negation, label as positive") that can label easy instances with high precision, reserving human and LLM effort for harder cases.

Phase 6: Active Labeling

The human labels additional instances selected by active learning. Potato prioritizes instances where the LLM is most uncertain, where labeling functions disagree, or where the instance is far from existing training examples in embedding space.
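The simplest of these strategies, pure uncertainty sampling, can be sketched as follows (Potato's hybrid strategy additionally mixes in diversity and disagreement signals):

```python
def select_batch(confidences, batch_size=25, labeled=frozenset()):
    """Pick the instances the LLM is least confident about, skipping
    anything the human has already labeled."""
    candidates = [i for i in range(len(confidences)) if i not in labeled]
    candidates.sort(key=lambda i: confidences[i])   # least confident first
    return candidates[:batch_size]

conf = [0.99, 0.52, 0.91, 0.61, 0.88]
print(select_batch(conf, batch_size=2))             # -> [1, 3]
```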

Phase 7: Automated Refinement Loop

The LLM re-labels the full dataset with updated guidelines and few-shot examples. Potato compares against all human labels and triggers another cycle of confusion analysis and guideline refinement if accuracy is below the threshold.

Phase 8: Disagreement Exploration

The human reviews all instances where the LLM and labeling functions disagree. These are typically the most informative and difficult examples. The human's labels on these cases provide the highest marginal value.

Phase 9: Edge Case Synthesis

Potato uses the LLM to generate synthetic edge cases based on the identified confusion patterns. The human labels these synthetic examples, which are then added to the LLM's training context to improve performance on the hardest cases.

Phase 10: Cascaded Confidence Escalation

The LLM assigns confidence scores to every remaining unlabeled instance. Instances are escalated to the human in descending order of difficulty (ascending confidence). The human labels until quality metrics stabilize.

Phase 11: Prompt Optimization

Inspired by DSPy, Potato runs automated prompt optimization using the accumulated human labels as a validation set. It tries multiple prompt variations (instruction phrasing, example ordering, chain-of-thought vs. direct) and selects the best-performing prompt.

Phase 12: Final Validation

The human performs a final review of a random sample from the LLM-labeled instances. If accuracy meets the threshold, the dataset is complete. If not, the system cycles back to Phase 6.


Configuration

Quick Start

A minimal Solo Mode configuration:

yaml
task_name: "Sentiment Classification"
task_dir: "."
 
data_files:
  - "data/reviews.jsonl"
 
item_properties:
  id_key: id
  text_key: text
 
solo_mode:
  enabled: true
 
  # LLM provider
  llm:
    endpoint_type: openai
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
 
  # Basic thresholds
  seed_count: 50
  accuracy_threshold: 0.92
  confidence_threshold: 0.85
 
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    labels:
      - Positive
      - Neutral
      - Negative
 
output_annotation_dir: "output/"
output_annotation_format: "jsonl"

Full Configuration Reference

yaml
solo_mode:
  enabled: true
 
  # LLM configuration
  llm:
    endpoint_type: openai        # openai, anthropic, ollama, vllm
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
    temperature: 0.1             # low temperature for consistency
    max_tokens: 256
 
  # Phase control
  phases:
    seed:
      count: 50                  # number of seed instances
      selection: diversity        # diversity, random, or stratified
      embedding_model: "all-MiniLM-L6-v2"
 
    calibration:
      batch_size: 100
      holdout_fraction: 0.2      # fraction of seed used for validation
 
    confusion_analysis:
      min_samples: 30
      significance_threshold: 0.05
 
    guideline_refinement:
      auto_suggest: true         # LLM suggests guideline edits
      require_approval: true     # human must approve changes
 
    labeling_functions:
      enabled: true
      max_functions: 20
      min_precision: 0.90        # only keep high-precision rules
      min_coverage: 0.01         # must cover at least 1% of data
 
    active_labeling:
      batch_size: 25
      strategy: uncertainty       # uncertainty, diversity, or hybrid
      max_batches: 10
 
    refinement_loop:
      max_iterations: 3
      improvement_threshold: 0.02
 
    disagreement_exploration:
      max_instances: 200
      sort_by: confidence_gap
 
    edge_case_synthesis:
      enabled: true
      count: 50
      diversity_weight: 0.3
 
    confidence_escalation:
      escalation_budget: 200     # max instances to escalate
      batch_size: 25
      stop_when_stable: true     # stop once recent batches are fully correct
 
    prompt_optimization:
      enabled: true
      candidates: 10             # number of prompt variants to try
      metric: f1_macro
      search_strategy: bayesian  # bayesian, grid, or random
 
    final_validation:
      sample_size: 100
      min_accuracy: 0.92
      fallback_phase: 6          # go back to Phase 6 if validation fails
 
  # Instance prioritization across phases
  prioritization:
    pools:
      - name: uncertain
        weight: 0.30
        description: "LLM confidence below threshold"
      - name: disagreement
        weight: 0.25
        description: "LLM and labeling functions disagree"
      - name: boundary
        weight: 0.20
        description: "Near decision boundary in embedding space"
      - name: novel
        weight: 0.10
        description: "Far from all existing labeled examples"
      - name: error_pattern
        weight: 0.10
        description: "Matches known confusion patterns"
      - name: random
        weight: 0.05
        description: "Random sample for calibration"

Key Capabilities

Confusion Analysis

After each labeling round, Potato builds a confusion matrix between human and LLM labels. The admin dashboard shows:

  • Per-class precision, recall, and F1 from the LLM's perspective
  • Most common confusion pairs (e.g., "neutral misclassified as positive: 23 instances")
  • Example instances for each confusion pair
  • Trend charts showing improvement across refinement rounds

Access confusion analysis programmatically:

bash
python -m potato.solo confusion --config config.yaml

Output:

text
Confusion Analysis (Round 2)
============================
Overall Accuracy: 0.87 (target: 0.92)

Top Confusion Pairs:
  neutral -> positive:  23 instances (15.3%)
  negative -> neutral:  11 instances (7.3%)
  positive -> neutral:   5 instances (3.3%)

Per-Class Performance:
  Positive:  P=0.91  R=0.94  F1=0.92
  Neutral:   P=0.78  R=0.71  F1=0.74
  Negative:  P=0.93  R=0.88  F1=0.90

Automated Refinement Loop

The refinement loop iterates between LLM labeling, confusion analysis, and guideline updates. Each iteration:

  1. LLM labels the full dataset with current guidelines
  2. Potato compares against all available human labels
  3. If accuracy is below threshold, confusion analysis runs
  4. LLM proposes guideline edits based on error patterns
  5. Human reviews and approves edits
  6. Cycle repeats (up to max_iterations)
Example refinement-loop configuration:

yaml
solo_mode:
  llm:
    endpoint_type: anthropic
    model: "claude-sonnet-4-20250514"
    api_key: ${ANTHROPIC_API_KEY}
 
  phases:
    refinement_loop:
      max_iterations: 3
      improvement_threshold: 0.02    # stop if improvement is less than 2%
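The six steps above can be sketched as a loop. The `llm_label`, `propose_edits`, and `approve_edits` callables below stand in for Potato's internals (the LLM calls and the human review UI) and are not a real API:

```python
def refinement_loop(dataset, human_labels, guidelines, *,
                    llm_label, propose_edits, approve_edits,
                    accuracy_threshold=0.92, max_iterations=3,
                    improvement_threshold=0.02):
    """Iterate: LLM labels -> compare to human labels -> refine guidelines."""
    prev_acc = None
    for _ in range(max_iterations):
        preds = llm_label(dataset, guidelines)
        acc = sum(preds[i] == y for i, y in human_labels.items()) / len(human_labels)
        if acc >= accuracy_threshold:
            break                                        # target reached
        if prev_acc is not None and acc - prev_acc < improvement_threshold:
            break                                        # diminishing returns
        edits = propose_edits(preds, human_labels)       # LLM-suggested edits
        guidelines = approve_edits(guidelines, edits)    # human review step
        prev_acc = acc
    return guidelines, preds
```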

Labeling Functions (ALCHEmist-Inspired)

Potato generates lightweight labeling functions from patterns observed in human annotations. These are not LLM calls; they are fast, deterministic rules.

Example generated labeling functions:

python
# Auto-generated labeling function 1
# Precision: 0.96, Coverage: 0.08
def lf_strong_positive_words(text):
    positive = {"excellent", "amazing", "fantastic", "outstanding", "perfect"}
    if any(w in text.lower() for w in positive):
        if not any(neg in text.lower() for neg in {"not", "never", "no"}):
            return "Positive"
    return None  # abstain
 
# Auto-generated labeling function 2
# Precision: 0.93, Coverage: 0.05
def lf_explicit_negative(text):
    negative = {"terrible", "awful", "horrible", "worst", "disgusting"}
    if any(w in text.lower() for w in negative):
        return "Negative"
    return None
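One plausible way to combine such functions (the unanimous-vote aggregation here is an assumption, not necessarily Potato's rule): each function votes or abstains, and an instance receives an LF label only when all non-abstaining votes agree.

```python
def apply_lfs(text, lfs):
    """Apply labeling functions; label only on unanimous non-abstain agreement."""
    votes = [lab for lf in lfs if (lab := lf(text)) is not None]
    if votes and all(v == votes[0] for v in votes):
        return votes[0]
    return None  # abstain: conflicting or no votes

def coverage(texts, lfs):
    """Fraction of instances that receive a (non-abstain) LF label."""
    return sum(apply_lfs(t, lfs) is not None for t in texts) / len(texts)

# Tiny stand-ins for the generated functions above
lf_pos = lambda t: "Positive" if "excellent" in t.lower() else None
lf_neg = lambda t: "Negative" if "terrible" in t.lower() else None

texts = ["Excellent service", "Terrible food", "It was fine"]
print([apply_lfs(t, [lf_pos, lf_neg]) for t in texts])  # -> ['Positive', 'Negative', None]
print(coverage(texts, [lf_pos, lf_neg]))                # -> 2/3 covered
```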

Configure labeling function behavior:

yaml
solo_mode:
  phases:
    labeling_functions:
      enabled: true
      max_functions: 20
      min_precision: 0.90
      min_coverage: 0.01
      types:
        - keyword_match
        - regex_pattern
        - length_threshold
        - embedding_cluster

Disagreement Explorer

The disagreement explorer presents instances where different signals conflict. For each instance, the annotator sees:

  • The LLM's predicted label and confidence
  • Labeling function votes (if any)
  • Nearest labeled neighbors in embedding space
  • The raw text/content

This is the highest-value annotation activity: each label resolves a genuine ambiguity.

yaml
solo_mode:
  phases:
    disagreement_exploration:
      max_instances: 200
      sort_by: confidence_gap     # or "lf_disagreement" or "random"
      show_llm_reasoning: true    # display LLM's chain-of-thought
      show_nearest_neighbors: 3   # show 3 nearest labeled examples

Cascaded Confidence Escalation

After the bulk of the dataset is labeled by the LLM, Potato ranks all LLM-labeled instances by confidence and escalates the least confident ones to the human. This continues in batches until quality stabilizes.

yaml
solo_mode:
  phases:
    confidence_escalation:
      escalation_budget: 200
      batch_size: 25
      stop_when_stable: true
      stability_window: 3        # stop if last 3 batches are all correct
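A sketch of the escalation loop under these settings; the `get_human_label` callback stands in for the annotation UI, and `instances` is assumed to be `(id, llm_label, confidence)` tuples:

```python
def escalate(instances, *, get_human_label, escalation_budget=200,
             batch_size=25, stability_window=3):
    """Escalate LLM-labeled instances to the human, least confident first,
    stopping once `stability_window` consecutive batches are fully correct."""
    queue = sorted(instances, key=lambda x: x[2])[:escalation_budget]
    corrections, clean_streak = {}, 0
    for start in range(0, len(queue), batch_size):
        batch = queue[start:start + batch_size]
        wrong = {}
        for inst_id, llm_lab, _ in batch:
            human_lab = get_human_label(inst_id)
            if human_lab != llm_lab:
                wrong[inst_id] = human_lab              # record the correction
        corrections.update(wrong)
        clean_streak = clean_streak + 1 if not wrong else 0
        if clean_streak >= stability_window:
            break                                       # quality has stabilized
    return corrections
```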

Multi-Signal Instance Prioritization

Across all phases that involve human labeling, Potato uses a weighted pool system to select the most informative instances. Six pools feed into a unified priority queue:

yaml
solo_mode:
  prioritization:
    pools:
      - name: uncertain
        weight: 0.30
      - name: disagreement
        weight: 0.25
      - name: boundary
        weight: 0.20
      - name: novel
        weight: 0.10
      - name: error_pattern
        weight: 0.10
      - name: random
        weight: 0.05

  • uncertain: Instances where the LLM's confidence is below confidence_threshold
  • disagreement: Instances where the LLM and labeling functions produce different labels
  • boundary: Instances near the decision boundary in embedding space
  • novel: Instances far from any existing labeled example
  • error_pattern: Instances matching known confusion patterns from previous rounds
  • random: A small random sample to maintain calibration and catch blind spots
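Pool selection can be sketched as weighted sampling over the non-empty pools (illustrative only; Potato's scheduler may differ):

```python
import random

def next_instance(pools, weights, rng=None):
    """Pick a pool by weight (skipping empty pools), then pop its head."""
    rng = rng or random.Random()
    names = [n for n in pools if pools[n]]          # non-empty pools only
    if not names:
        return None                                 # everything is labeled
    chosen = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    return pools[chosen].pop(0)

pools = {"uncertain": [101, 102], "disagreement": [201], "boundary": [],
         "novel": [301], "error_pattern": [], "random": [401]}
weights = {"uncertain": 0.30, "disagreement": 0.25, "boundary": 0.20,
           "novel": 0.10, "error_pattern": 0.10, "random": 0.05}
item = next_instance(pools, weights)                # one instance, pool-weighted
```

Renormalizing over the non-empty pools keeps the queue flowing even after a pool (e.g. disagreements) is exhausted.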

Edge Case Synthesis

Potato uses the LLM to generate synthetic examples that target known weaknesses:

yaml
solo_mode:
  phases:
    edge_case_synthesis:
      enabled: true
      count: 50
      diversity_weight: 0.3
      confusion_pairs:            # focus on these error types
        - ["neutral", "positive"]
        - ["negative", "neutral"]

The LLM generates examples that are ambiguous between the specified label pairs. The human labels them, and these labels are added to the few-shot context for subsequent LLM labeling rounds.

Prompt Optimization (DSPy-Inspired)

In Phase 11, Potato runs automated prompt optimization to find the best instruction format for the LLM:

yaml
solo_mode:
  phases:
    prompt_optimization:
      enabled: true
      candidates: 10
      metric: f1_macro
      search_strategy: bayesian
      variations:
        - instruction_style      # formal vs. conversational
        - example_ordering       # random, by-class, by-difficulty
        - reasoning_mode         # direct, chain-of-thought, self-consistency
        - example_count          # 3, 5, 10, 15 few-shot examples
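The `random` search strategy over this space can be sketched as follows; the `evaluate` callable stands in for scoring a prompt variant against the accumulated human labels (e.g. macro-F1) and is not a real Potato API:

```python
import random

SPACE = {
    "instruction_style": ["formal", "conversational"],
    "example_ordering": ["random", "by-class", "by-difficulty"],
    "reasoning_mode": ["direct", "chain-of-thought", "self-consistency"],
    "example_count": [3, 5, 10, 15],
}

def random_search(evaluate, candidates=10, rng=None):
    """Sample prompt variants; keep the one with the best validation score."""
    rng = rng or random.Random()
    best, best_score = None, float("-inf")
    for _ in range(candidates):
        variant = {k: rng.choice(v) for k, v in SPACE.items()}
        score = evaluate(variant)                   # e.g. macro-F1 on human labels
        if score > best_score:
            best, best_score = variant, score
    return best, best_score
```

Grid and Bayesian strategies differ only in how candidate variants are proposed; the evaluate-and-keep-best loop is the same.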

Monitoring Progress

The admin dashboard shows Solo Mode progress in real time:

  • Current phase and progress within each phase
  • Human labels completed vs. total budget
  • LLM accuracy over time (per round)
  • Labeling function coverage and precision
  • Confidence distribution histogram
  • Estimated time to completion

Access from the command line:

bash
python -m potato.solo status --config config.yaml

Output:

text
Solo Mode Status
================
Current Phase: 6 (Active Labeling) - Batch 3/10
Human Labels: 142 / ~300 estimated total
LLM Accuracy: 0.89 (target: 0.92)
LF Coverage: 0.23 (labeling functions cover 23% of data)
Dataset Size: 10,000 instances
  - Human labeled: 142
  - LF labeled: 2,300
  - LLM labeled: 7,558
  - Unlabeled: 0

When to Use Solo Mode vs. Traditional Multi-Annotator

Use Solo Mode when:

  • You have a domain expert who can provide high-quality labels
  • Budget or logistics prevent hiring multiple annotators
  • The task has clear, well-defined categories
  • You need to label a large dataset (1,000+ instances)
  • Speed matters more than measuring inter-annotator agreement

Use traditional multi-annotator when:

  • You need inter-annotator agreement statistics for publication
  • The task is highly subjective (e.g., offensiveness, humor)
  • You need to study annotator disagreement patterns
  • Regulatory requirements mandate multiple independent annotators
  • The label space is complex or evolving (annotation guidelines are still being developed)

Hybrid approach: Use Solo Mode for the initial bulk labeling, then assign a second annotator to a random 10-20% sample to compute agreement statistics. This gives you the efficiency of Solo Mode with the quality assurance of multi-annotator verification.

yaml
solo_mode:
  enabled: true
  # ... solo mode config ...
 
  # Hybrid: assign verification sample to second annotator
  verification:
    enabled: true
    sample_fraction: 0.15
    annotator: "reviewer_1"

Further Reading

For implementation details, see the source documentation.