Solo Mode

Label entire datasets with a single annotator collaborating with an LLM through a 12-phase intelligent workflow.

New in v2.3.0

Traditional annotation projects require multiple annotators, inter-annotator agreement computation, adjudication rounds, and significant coordination overhead. For many research teams, this is the primary bottleneck: not the annotation interface, but the logistics of hiring, training, and managing a team.

Solo Mode replaces the multi-annotator paradigm with a single human expert collaborating with an LLM. The human provides high-quality labels on a small, strategically selected subset. The LLM learns from those labels, proposes labels for the rest, and the human reviews only the cases where the LLM is uncertain or likely wrong. A 12-phase workflow orchestrates this process automatically.

In internal benchmarks, Solo Mode achieved 95%+ agreement with full multi-annotator pipelines while requiring only 10-15% of the total human labels.

The 12-Phase Workflow

Solo Mode progresses through 12 phases. The system advances automatically based on configurable thresholds, though you can also trigger transitions manually from the admin dashboard.

Phase 1: Seed Annotation

The human annotator labels an initial seed set. Potato selects diverse, representative instances using embedding-based clustering to maximize coverage of the data distribution.

Default seed size: 50 instances (configurable via seed_count)
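Potato's selection logic is internal, but the idea can be sketched with greedy farthest-point sampling over precomputed embeddings (the function and variable names below are illustrative, not Potato's API):

```python
import math
import random

def select_seed(embeddings, seed_count=50, rng=None):
    """Greedy farthest-point sampling: repeatedly pick the instance
    farthest from everything selected so far, maximizing coverage."""
    rng = rng or random.Random()
    n = len(embeddings)
    selected = [rng.randrange(n)]                       # random starting point
    # distance from each instance to its nearest selected instance
    dist = [_euclidean(embeddings[i], embeddings[selected[0]]) for i in range(n)]
    while len(selected) < min(seed_count, n):
        nxt = max(range(n), key=lambda i: dist[i])      # most isolated instance
        selected.append(nxt)
        for i in range(n):
            dist[i] = min(dist[i], _euclidean(embeddings[i], embeddings[nxt]))
    return selected

def _euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Three tight clusters; three seeds land one per cluster
emb = [(0, 0), (0.1, 0), (10, 10), (10, 10.1), (5, 5), (5.1, 5)]
seed_ids = select_seed(emb, seed_count=3, rng=random.Random(0))
```

Clustering-based alternatives (e.g. picking instances nearest to k-means centroids) achieve the same goal of covering the data distribution.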

Phase 2: Initial LLM Calibration

The LLM receives the seed annotations as few-shot examples and labels a calibration batch. Potato compares LLM predictions against held-out seed labels to establish a baseline accuracy.

Phase 3: Confusion Analysis

Potato identifies systematic disagreement patterns between human and LLM. It builds a confusion matrix and surfaces the most common error types (e.g., "LLM labels neutral as positive 40% of the time").
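The core of this analysis fits in a few lines; here is a sketch (not Potato's internals) of counting (human, LLM) label pairs and ranking the disagreements:

```python
from collections import Counter

def confusion_pairs(human_labels, llm_labels):
    """Count (human, llm) label pairs and rank the disagreements by frequency."""
    pairs = Counter(zip(human_labels, llm_labels))
    disagreements = {p: n for p, n in pairs.items() if p[0] != p[1]}
    total = sum(pairs.values())
    return sorted(
        ((h, l, n, n / total) for (h, l), n in disagreements.items()),
        key=lambda row: -row[2],
    )

human = ["neutral", "neutral", "positive", "negative", "neutral"]
llm   = ["positive", "positive", "positive", "neutral", "neutral"]
for h, l, n, frac in confusion_pairs(human, llm):
    print(f"{h} -> {l}: {n} instances ({frac:.0%})")
# neutral -> positive: 2 instances (40%)
# negative -> neutral: 1 instances (20%)
```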

Phase 4: Guideline Refinement

Based on confusion analysis, Potato generates refined annotation guidelines for the LLM. The human reviews and edits these guidelines before they are applied. This is an interactive step where the annotator can add examples, clarify edge cases, and adjust label definitions.

Phase 5: Labeling Function Generation

Inspired by the ALCHEmist framework, Potato generates programmatic labeling functions from the existing annotations. These are simple pattern-based rules (e.g., "if the text contains 'excellent' and no negation, label as positive") that can label easy instances with high precision, reserving human and LLM effort for harder cases.

Phase 6: Active Labeling

The human labels additional instances selected by active learning. Potato prioritizes instances where the LLM is most uncertain, where labeling functions disagree, or where the instance is far from existing training examples in embedding space.
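The simplest of these strategies, pure uncertainty sampling, can be sketched as follows (Potato's hybrid strategy additionally mixes in diversity and disagreement signals):

```python
def select_batch(confidences, batch_size=25, labeled=frozenset()):
    """Pick the instances the LLM is least confident about, skipping
    anything the human has already labeled."""
    candidates = [i for i in range(len(confidences)) if i not in labeled]
    candidates.sort(key=lambda i: confidences[i])   # least confident first
    return candidates[:batch_size]

conf = [0.99, 0.52, 0.91, 0.61, 0.88]
print(select_batch(conf, batch_size=2))             # -> [1, 3]
```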

Phase 7: Automated Refinement Loop

The LLM re-labels the full dataset with updated guidelines and few-shot examples. Potato compares against all human labels and triggers another cycle of confusion analysis and guideline refinement if accuracy is below the threshold.

Phase 8: Disagreement Exploration

The human reviews all instances where the LLM and labeling functions disagree. These are typically the most informative and difficult examples. The human's labels on these cases provide the highest marginal value.

Phase 9: Edge Case Synthesis

Potato uses the LLM to generate synthetic edge cases based on the identified confusion patterns. The human labels these synthetic examples, which are then added to the LLM's training context to improve performance on the hardest cases.

Phase 10: Cascaded Confidence Escalation

The LLM assigns confidence scores to every remaining unlabeled instance. Instances are escalated to the human in descending order of difficulty (ascending confidence). The human labels until quality metrics stabilize.

Phase 11: Prompt Optimization

Inspired by DSPy, Potato runs automated prompt optimization using the accumulated human labels as a validation set. It tries multiple prompt variations (instruction phrasing, example ordering, chain-of-thought vs. direct) and selects the best-performing prompt.

Phase 12: Final Validation

The human performs a final review of a random sample from the LLM-labeled instances. If accuracy meets the threshold, the dataset is complete. If not, the system cycles back to Phase 6.


Configuration

Quick Start

A minimal Solo Mode configuration:

yaml
task_name: "Sentiment Classification"
task_dir: "."
 
data_files:
  - "data/reviews.jsonl"
 
item_properties:
  id_key: id
  text_key: text
 
solo_mode:
  enabled: true
 
  # LLM provider
  llm:
    endpoint_type: openai
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
 
  # Basic thresholds
  seed_count: 50
  accuracy_threshold: 0.92
  confidence_threshold: 0.85
 
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    labels:
      - Positive
      - Neutral
      - Negative
 
output_annotation_dir: "output/"
output_annotation_format: "jsonl"

Full Configuration Reference

yaml
solo_mode:
  enabled: true
 
  # LLM configuration
  llm:
    endpoint_type: openai        # openai, anthropic, ollama, vllm
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
    temperature: 0.1             # low temperature for consistency
    max_tokens: 256
 
  # Phase control
  phases:
    seed:
      count: 50                  # number of seed instances
      selection: diversity        # diversity, random, or stratified
      embedding_model: "all-MiniLM-L6-v2"
 
    calibration:
      batch_size: 100
      holdout_fraction: 0.2      # fraction of seed used for validation
 
    confusion_analysis:
      min_samples: 30
      significance_threshold: 0.05
 
    guideline_refinement:
      auto_suggest: true         # LLM suggests guideline edits
      require_approval: true     # human must approve changes
 
    labeling_functions:
      enabled: true
      max_functions: 20
      min_precision: 0.90        # only keep high-precision rules
      min_coverage: 0.01         # must cover at least 1% of data
 
    active_labeling:
      batch_size: 25
      strategy: uncertainty       # uncertainty, diversity, or hybrid
      max_batches: 10
 
    refinement_loop:
      max_iterations: 3
      improvement_threshold: 0.02
 
    disagreement_exploration:
      max_instances: 200
      sort_by: confidence_gap
 
    edge_case_synthesis:
      enabled: true
      count: 50
      diversity_weight: 0.3
 
    confidence_escalation:
      escalation_budget: 200     # max instances to escalate
      batch_size: 25
      stop_when_stable: true     # stop once recent batches are fully correct
 
    prompt_optimization:
      enabled: true
      candidates: 10             # number of prompt variants to try
      metric: f1_macro
      search_strategy: bayesian  # bayesian, grid, or random
 
    final_validation:
      sample_size: 100
      min_accuracy: 0.92
      fallback_phase: 6          # go back to Phase 6 if validation fails
 
  # Instance prioritization across phases
  prioritization:
    pools:
      - name: uncertain
        weight: 0.30
        description: "LLM confidence below threshold"
      - name: disagreement
        weight: 0.25
        description: "LLM and labeling functions disagree"
      - name: boundary
        weight: 0.20
        description: "Near decision boundary in embedding space"
      - name: novel
        weight: 0.10
        description: "Far from all existing labeled examples"
      - name: error_pattern
        weight: 0.10
        description: "Matches known confusion patterns"
      - name: random
        weight: 0.05
        description: "Random sample for calibration"

Key Capabilities

Confusion Analysis

After each labeling round, Potato builds a confusion matrix between human and LLM labels. The admin dashboard shows:

  • Per-class precision, recall, and F1 from the LLM's perspective
  • Most common confusion pairs (e.g., "neutral misclassified as positive: 23 instances")
  • Example instances for each confusion pair
  • Trend charts showing improvement across refinement rounds

Access confusion analysis programmatically:

bash
python -m potato.solo confusion --config config.yaml

Output:

text
Confusion Analysis (Round 2)
============================
Overall Accuracy: 0.87 (target: 0.92)

Top Confusion Pairs:
  neutral -> positive:  23 instances (15.3%)
  negative -> neutral:  11 instances (7.3%)
  positive -> neutral:   5 instances (3.3%)

Per-Class Performance:
  Positive:  P=0.91  R=0.94  F1=0.92
  Neutral:   P=0.78  R=0.71  F1=0.74
  Negative:  P=0.93  R=0.88  F1=0.90

Automated Refinement Loop

The refinement loop iterates between LLM labeling, confusion analysis, and guideline updates. Each iteration:

  1. LLM labels the full dataset with current guidelines
  2. Potato compares against all available human labels
  3. If accuracy is below threshold, confusion analysis runs
  4. LLM proposes guideline edits based on error patterns
  5. Human reviews and approves edits
  6. Cycle repeats (up to max_iterations)
Example refinement-loop configuration:

yaml
solo_mode:
  llm:
    endpoint_type: anthropic
    model: "claude-sonnet-4-20250514"
    api_key: ${ANTHROPIC_API_KEY}
 
  phases:
    refinement_loop:
      max_iterations: 3
      improvement_threshold: 0.02    # stop if improvement is less than 2%
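The six steps above can be sketched as a loop. The `llm_label`, `propose_edits`, and `approve_edits` callables below stand in for Potato's internals (the LLM calls and the human review UI) and are not a real API:

```python
def refinement_loop(dataset, human_labels, guidelines, *,
                    llm_label, propose_edits, approve_edits,
                    accuracy_threshold=0.92, max_iterations=3,
                    improvement_threshold=0.02):
    """Iterate: LLM labels -> compare to human labels -> refine guidelines."""
    prev_acc = None
    for _ in range(max_iterations):
        preds = llm_label(dataset, guidelines)
        acc = sum(preds[i] == y for i, y in human_labels.items()) / len(human_labels)
        if acc >= accuracy_threshold:
            break                                        # target reached
        if prev_acc is not None and acc - prev_acc < improvement_threshold:
            break                                        # diminishing returns
        edits = propose_edits(preds, human_labels)       # LLM-suggested edits
        guidelines = approve_edits(guidelines, edits)    # human review step
        prev_acc = acc
    return guidelines, preds
```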

Labeling Functions (ALCHEmist-Inspired)

Potato generates lightweight labeling functions from patterns observed in human annotations. These are not LLM calls; they are fast, deterministic rules.

Example generated labeling functions:

python
# Auto-generated labeling function 1
# Precision: 0.96, Coverage: 0.08
def lf_strong_positive_words(text):
    positive = {"excellent", "amazing", "fantastic", "outstanding", "perfect"}
    if any(w in text.lower() for w in positive):
        if not any(neg in text.lower() for neg in {"not", "never", "no"}):
            return "Positive"
    return None  # abstain
 
# Auto-generated labeling function 2
# Precision: 0.93, Coverage: 0.05
def lf_explicit_negative(text):
    negative = {"terrible", "awful", "horrible", "worst", "disgusting"}
    if any(w in text.lower() for w in negative):
        return "Negative"
    return None
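One plausible way to combine such functions (the unanimous-vote aggregation here is an assumption, not necessarily Potato's rule): each function votes or abstains, and an instance receives an LF label only when all non-abstaining votes agree.

```python
def apply_lfs(text, lfs):
    """Apply labeling functions; label only on unanimous non-abstain agreement."""
    votes = [lab for lf in lfs if (lab := lf(text)) is not None]
    if votes and all(v == votes[0] for v in votes):
        return votes[0]
    return None  # abstain: conflicting or no votes

def coverage(texts, lfs):
    """Fraction of instances that receive a (non-abstain) LF label."""
    return sum(apply_lfs(t, lfs) is not None for t in texts) / len(texts)

# Tiny stand-ins for the generated functions above
lf_pos = lambda t: "Positive" if "excellent" in t.lower() else None
lf_neg = lambda t: "Negative" if "terrible" in t.lower() else None

texts = ["Excellent service", "Terrible food", "It was fine"]
print([apply_lfs(t, [lf_pos, lf_neg]) for t in texts])  # -> ['Positive', 'Negative', None]
print(coverage(texts, [lf_pos, lf_neg]))                # -> 2/3 covered
```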

Configure labeling function behavior:

yaml
solo_mode:
  phases:
    labeling_functions:
      enabled: true
      max_functions: 20
      min_precision: 0.90
      min_coverage: 0.01
      types:
        - keyword_match
        - regex_pattern
        - length_threshold
        - embedding_cluster

Disagreement Explorer

The disagreement explorer presents instances where different signals conflict. For each instance, the annotator sees:

  • The LLM's predicted label and confidence
  • Labeling function votes (if any)
  • Nearest labeled neighbors in embedding space
  • The raw text/content

This is the highest-value annotation activity: each label resolves a genuine ambiguity.

yaml
solo_mode:
  phases:
    disagreement_exploration:
      max_instances: 200
      sort_by: confidence_gap     # or "lf_disagreement" or "random"
      show_llm_reasoning: true    # display LLM's chain-of-thought
      show_nearest_neighbors: 3   # show 3 nearest labeled examples

Cascaded Confidence Escalation

After the bulk of the dataset is labeled by the LLM, Potato ranks all LLM-labeled instances by confidence and escalates the least confident ones to the human. This continues in batches until quality stabilizes.

yaml
solo_mode:
  phases:
    confidence_escalation:
      escalation_budget: 200
      batch_size: 25
      stop_when_stable: true
      stability_window: 3        # stop if last 3 batches are all correct
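A sketch of the escalation loop under these settings; the `get_human_label` callback stands in for the annotation UI, and `instances` is assumed to be `(id, llm_label, confidence)` tuples:

```python
def escalate(instances, *, get_human_label, escalation_budget=200,
             batch_size=25, stability_window=3):
    """Escalate LLM-labeled instances to the human, least confident first,
    stopping once `stability_window` consecutive batches are fully correct."""
    queue = sorted(instances, key=lambda x: x[2])[:escalation_budget]
    corrections, clean_streak = {}, 0
    for start in range(0, len(queue), batch_size):
        batch = queue[start:start + batch_size]
        wrong = {}
        for inst_id, llm_lab, _ in batch:
            human_lab = get_human_label(inst_id)
            if human_lab != llm_lab:
                wrong[inst_id] = human_lab              # record the correction
        corrections.update(wrong)
        clean_streak = clean_streak + 1 if not wrong else 0
        if clean_streak >= stability_window:
            break                                       # quality has stabilized
    return corrections
```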

Multi-Signal Instance Prioritization

Across all phases that involve human labeling, Potato uses a weighted pool system to select the most informative instances. Six pools feed into a unified priority queue:

yaml
solo_mode:
  prioritization:
    pools:
      - name: uncertain
        weight: 0.30
      - name: disagreement
        weight: 0.25
      - name: boundary
        weight: 0.20
      - name: novel
        weight: 0.10
      - name: error_pattern
        weight: 0.10
      - name: random
        weight: 0.05

  • uncertain: Instances where the LLM's confidence is below confidence_threshold
  • disagreement: Instances where the LLM and labeling functions produce different labels
  • boundary: Instances near the decision boundary in embedding space
  • novel: Instances far from any existing labeled example
  • error_pattern: Instances matching known confusion patterns from previous rounds
  • random: A small random sample to maintain calibration and catch blind spots
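Pool selection can be sketched as weighted sampling over the non-empty pools (illustrative only; Potato's scheduler may differ):

```python
import random

def next_instance(pools, weights, rng=None):
    """Pick a pool by weight (skipping empty pools), then pop its head."""
    rng = rng or random.Random()
    names = [n for n in pools if pools[n]]          # non-empty pools only
    if not names:
        return None                                 # everything is labeled
    chosen = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    return pools[chosen].pop(0)

pools = {"uncertain": [101, 102], "disagreement": [201], "boundary": [],
         "novel": [301], "error_pattern": [], "random": [401]}
weights = {"uncertain": 0.30, "disagreement": 0.25, "boundary": 0.20,
           "novel": 0.10, "error_pattern": 0.10, "random": 0.05}
item = next_instance(pools, weights)                # one instance, pool-weighted
```

Renormalizing over the non-empty pools keeps the queue flowing even after a pool (e.g. disagreements) is exhausted.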

Edge Case Synthesis

Potato uses the LLM to generate synthetic examples that target known weaknesses:

yaml
solo_mode:
  phases:
    edge_case_synthesis:
      enabled: true
      count: 50
      diversity_weight: 0.3
      confusion_pairs:            # focus on these error types
        - ["neutral", "positive"]
        - ["negative", "neutral"]

The LLM generates examples that are ambiguous between the specified label pairs. The human labels them, and these labels are added to the few-shot context for subsequent LLM labeling rounds.

Prompt Optimization (DSPy-Inspired)

In Phase 11, Potato runs automated prompt optimization to find the best instruction format for the LLM:

yaml
solo_mode:
  phases:
    prompt_optimization:
      enabled: true
      candidates: 10
      metric: f1_macro
      search_strategy: bayesian
      variations:
        - instruction_style      # formal vs. conversational
        - example_ordering       # random, by-class, by-difficulty
        - reasoning_mode         # direct, chain-of-thought, self-consistency
        - example_count          # 3, 5, 10, 15 few-shot examples
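The `random` search strategy over this space can be sketched as follows; the `evaluate` callable stands in for scoring a prompt variant against the accumulated human labels (e.g. macro-F1) and is not a real Potato API:

```python
import random

SPACE = {
    "instruction_style": ["formal", "conversational"],
    "example_ordering": ["random", "by-class", "by-difficulty"],
    "reasoning_mode": ["direct", "chain-of-thought", "self-consistency"],
    "example_count": [3, 5, 10, 15],
}

def random_search(evaluate, candidates=10, rng=None):
    """Sample prompt variants; keep the one with the best validation score."""
    rng = rng or random.Random()
    best, best_score = None, float("-inf")
    for _ in range(candidates):
        variant = {k: rng.choice(v) for k, v in SPACE.items()}
        score = evaluate(variant)                   # e.g. macro-F1 on human labels
        if score > best_score:
            best, best_score = variant, score
    return best, best_score
```

Grid and Bayesian strategies differ only in how candidate variants are proposed; the evaluate-and-keep-best loop is the same.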

Monitoring Progress

The admin dashboard shows Solo Mode progress in real time:

  • Current phase and progress within each phase
  • Human labels completed vs. total budget
  • LLM accuracy over time (per round)
  • Labeling function coverage and precision
  • Confidence distribution histogram
  • Estimated time to completion

Access from the command line:

bash
python -m potato.solo status --config config.yaml

Output:

text
Solo Mode Status
================
Current Phase: 6 (Active Labeling) - Batch 3/10
Human Labels: 142 / ~300 estimated total
LLM Accuracy: 0.89 (target: 0.92)
LF Coverage: 0.23 (labeling functions cover 23% of data)
Dataset Size: 10,000 instances
  - Human labeled: 142
  - LF labeled: 2,300
  - LLM labeled: 7,558
  - Unlabeled: 0

When to Use Solo Mode vs. Traditional Multi-Annotator

Use Solo Mode when:

  • You have a domain expert who can provide high-quality labels
  • Budget or logistics prevent hiring multiple annotators
  • The task has clear, well-defined categories
  • You need to label a large dataset (1,000+ instances)
  • Speed matters more than measuring inter-annotator agreement

Use traditional multi-annotator when:

  • You need inter-annotator agreement statistics for publication
  • The task is highly subjective (e.g., offensiveness, humor)
  • You need to study annotator disagreement patterns
  • Regulatory requirements mandate multiple independent annotators
  • The label space is complex or evolving (annotation guidelines are still being developed)

Hybrid approach: Use Solo Mode for the initial bulk labeling, then assign a second annotator to a random 10-20% sample to compute agreement statistics. This gives you the efficiency of Solo Mode with the quality assurance of multi-annotator verification.

yaml
solo_mode:
  enabled: true
  # ... solo mode config ...
 
  # Hybrid: assign verification sample to second annotator
  verification:
    enabled: true
    sample_fraction: 0.15
    annotator: "reviewer_1"

Further Reading

For implementation details, see the source documentation.