
Solo Mode: How One Annotator Can Label 10,000 Examples

Step-by-step tutorial on using Potato's Solo Mode to efficiently label large datasets with human-LLM collaboration, reducing annotation cost by up to 90%.

By Potato Team

You have 10,000 product reviews to label for sentiment (Positive, Neutral, Negative). Hiring three annotators to label everything would take weeks and cost thousands of dollars. With Solo Mode, a single domain expert can achieve comparable quality by labeling only 500-1,000 instances while an LLM handles the rest -- with the human reviewing every decision the LLM is uncertain about.

This tutorial walks through the entire process end to end.


What You Will Need

  • Potato 2.3.0+ with the Solo Mode extras: pip install potato-annotation[solo]
  • An OpenAI or Anthropic API key (for the LLM component)
  • Your dataset in JSONL format
  • One knowledgeable annotator (that could be you)

Step 1: Prepare Your Data

Create data/reviews.jsonl with one review per line:

json
{"id": "rev_001", "text": "Absolutely love this product! Best purchase I've made all year.", "source": "amazon"}
{"id": "rev_002", "text": "It works fine. Nothing special but gets the job done.", "source": "amazon"}
{"id": "rev_003", "text": "Broke after two weeks. Complete waste of money.", "source": "amazon"}
{"id": "rev_004", "text": "The quality is decent for the price point. I might buy again.", "source": "amazon"}
{"id": "rev_005", "text": "Arrived damaged and customer service was unhelpful.", "source": "amazon"}

For this tutorial, imagine this file contains 10,000 reviews.
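
If your reviews start out in a Python list or a CSV, a few lines of standard-library code will produce the JSONL layout Potato expects. The conversion script below is hypothetical (not part of Potato); it just writes one JSON object per line with the `id` and `text` keys used in the config:

```python
import json
import os

# Hypothetical conversion step: turn an in-memory list of reviews
# into the one-object-per-line JSONL format shown above.
reviews = [
    {"id": "rev_001", "text": "Absolutely love this product!", "source": "amazon"},
    {"id": "rev_002", "text": "It works fine. Nothing special.", "source": "amazon"},
]

os.makedirs("data", exist_ok=True)
with open("data/reviews.jsonl", "w", encoding="utf-8") as f:
    for review in reviews:
        # Potato needs the keys named in item_properties (id, text)
        assert "id" in review and "text" in review
        f.write(json.dumps(review, ensure_ascii=False) + "\n")
```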


Step 2: Create the Configuration

Create config.yaml:

yaml
task_name: "Product Review Sentiment (Solo Mode)"
task_dir: "."
 
data_files:
  - "data/reviews.jsonl"
 
item_properties:
  id_key: id
  text_key: text
 
# --- Solo Mode Configuration ---
solo_mode:
  enabled: true
 
  llm:
    endpoint_type: openai
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
    temperature: 0.1
    max_tokens: 64
 
  # Quality targets
  seed_count: 50
  accuracy_threshold: 0.93
  confidence_threshold: 0.85
 
  # Phase-specific settings
  phases:
    seed:
      count: 50
      selection: diversity
      embedding_model: "all-MiniLM-L6-v2"
 
    calibration:
      batch_size: 200
      holdout_fraction: 0.2
 
    labeling_functions:
      enabled: true
      max_functions: 15
      min_precision: 0.92
      min_coverage: 0.01
 
    active_labeling:
      batch_size: 25
      strategy: hybrid
      max_batches: 15
 
    refinement_loop:
      max_iterations: 3
      improvement_threshold: 0.02
 
    disagreement_exploration:
      max_instances: 150
      show_llm_reasoning: true
      show_nearest_neighbors: 3
 
    edge_case_synthesis:
      enabled: true
      count: 30
 
    confidence_escalation:
      escalation_budget: 150
      batch_size: 25
      stop_when_stable: true
 
    prompt_optimization:
      enabled: true
      candidates: 8
      metric: f1_macro
 
    final_validation:
      sample_size: 100
      min_accuracy: 0.93
 
  # Instance prioritization
  prioritization:
    pools:
      - name: uncertain
        weight: 0.30
      - name: disagreement
        weight: 0.25
      - name: boundary
        weight: 0.20
      - name: novel
        weight: 0.10
      - name: error_pattern
        weight: 0.10
      - name: random
        weight: 0.05
 
# --- Annotation Schema ---
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "What is the overall sentiment of this review?"
    labels:
      - "Positive"
      - "Neutral"
      - "Negative"
    label_requirement:
      required: true
    sequential_key_binding: true
 
output_annotation_dir: "output/"
output_annotation_format: "jsonl"
 
parquet_export:
  enabled: true
  output_dir: "output/parquet/"

Step 3: Start the Server

bash
potato start config.yaml -p 8000

Open http://localhost:8000 and log in. The Solo Mode dashboard will appear, showing you are in Phase 1: Seed Annotation.


Step 4: Phase 1 -- Seed Annotation (50 Instances)

Potato has selected 50 diverse reviews using embedding-based clustering. These are not random; they are chosen to maximize coverage of your data distribution.
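
The intuition behind diversity selection can be illustrated with a greedy farthest-point sketch on toy 2-D vectors. This is an illustration of the idea only, not Potato's actual algorithm (which clusters real sentence embeddings):

```python
import math

def farthest_point_seeds(vectors, k):
    """Greedy diversity selection: repeatedly pick the point farthest
    from everything chosen so far (sketch, not Potato's internals)."""
    chosen = [0]  # start from the first point
    while len(chosen) < k:
        best, best_dist = None, -1.0
        for i, v in enumerate(vectors):
            if i in chosen:
                continue
            # distance to the nearest already-chosen point
            d = min(math.dist(v, vectors[j]) for j in chosen)
            if d > best_dist:
                best, best_dist = i, d
        chosen.append(best)
    return chosen

# Toy 2-D "embeddings": two tight clusters and one outlier
points = [(0, 0), (0.1, 0.1), (5, 5), (5.1, 5.0), (10, 0)]
print(farthest_point_seeds(points, 3))  # → [0, 4, 2]
```

Note how the near-duplicate points (indices 1 and 3) are skipped: each seed label covers a distinct region of the data.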

Label each one. This is the most important phase -- the quality of your seed labels determines how well the LLM will learn. Take your time and be consistent.

Time estimate: 15-25 minutes at 20-30 seconds per instance.

When you finish the 50th instance, Potato automatically advances to Phase 2.


Step 5: Phase 2 -- Initial LLM Calibration

This phase runs automatically. Potato sends the LLM a batch of 200 instances, using 40 of your 50 seed labels as few-shot examples and holding out the remaining 10 (holdout_fraction: 0.2). It then compares the LLM's predictions against those 10 held-out seed labels to estimate baseline accuracy.

You will see a progress indicator in the dashboard. This typically takes 1-2 minutes depending on the LLM provider.

Typical result: The LLM achieves 75-85% accuracy on the first calibration. This is expected -- the LLM has not yet learned your specific annotation style.
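
Conceptually, the calibration check is a plain holdout comparison: predictions against held-back human labels. A minimal sketch with made-up labels:

```python
# Sketch of the calibration check: compare LLM predictions against
# held-out human seed labels. These labels are made up for illustration.
gold = ["Positive", "Neutral", "Negative", "Neutral", "Positive",
        "Negative", "Neutral", "Positive", "Negative", "Neutral"]
pred = ["Positive", "Positive", "Negative", "Neutral", "Positive",
        "Neutral", "Neutral", "Positive", "Negative", "Positive"]

correct = sum(g == p for g, p in zip(gold, pred))
accuracy = correct / len(gold)
print(f"Calibration accuracy: {accuracy:.2f}")  # 7/10 correct here
```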


Step 6: Phase 3 -- Confusion Analysis

Potato displays a confusion matrix showing where the LLM disagrees with your labels. A typical output:

text
Confusion Analysis (Round 1)
============================
Overall Accuracy: 0.82 (target: 0.93)

Top Confusion Pairs:
  Neutral -> Positive:  14 instances (7.0%)
  Negative -> Neutral:   9 instances (4.5%)
  Positive -> Neutral:   4 instances (2.0%)

This tells you the LLM's main weakness: it tends to upgrade neutral reviews to positive. This is common -- LLMs are often biased toward positive sentiment.

Your action: Review the confusion pairs. Click on each pair to see the specific instances the LLM got wrong. This helps you understand the LLM's failure modes.
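
The confusion-pair report boils down to counting (human label, LLM label) pairs among the errors. A standard-library sketch with made-up labels:

```python
from collections import Counter

gold = ["Neutral", "Neutral", "Negative", "Positive", "Neutral"]
pred = ["Positive", "Positive", "Neutral", "Positive", "Neutral"]

# Count each (human label -> LLM label) pair where the LLM was wrong
pairs = Counter((g, p) for g, p in zip(gold, pred) if g != p)
for (g, p), n in pairs.most_common():
    print(f"{g} -> {p}: {n} instances ({n / len(gold):.1%})")
```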


Step 7: Phase 4 -- Guideline Refinement

Based on the confusion analysis, Potato generates refined guidelines for the LLM. You see a side-by-side view:

  • Current guidelines: The initial prompt used for the LLM
  • Suggested edits: Specific changes the LLM proposes based on error patterns

For example, Potato might suggest adding:

"Reviews that describe a product as 'fine', 'okay', or 'decent' without strong emotion should be labeled Neutral, even if they mention buying again."

Review each suggested edit. Approve, modify, or reject each one. You can also add your own clarifications.

Time estimate: 5-10 minutes.


Step 8: Phase 5 -- Labeling Function Generation

Potato generates programmatic labeling functions from patterns in your seed labels. These are fast, deterministic rules that handle easy cases:

text
Generated Labeling Functions:
  LF1: Strong positive words (love, amazing, best, excellent)
       Precision: 0.97, Coverage: 0.06
  LF2: Strong negative words (terrible, awful, worst, waste)
       Precision: 0.95, Coverage: 0.04
  LF3: Exclamation + positive adjective
       Precision: 0.94, Coverage: 0.03
  LF4: Return/refund mention + negative context
       Precision: 0.92, Coverage: 0.02
  ...
  Total coverage: 0.18 (1,800 of 10,000 instances)

Labeling functions cover 18% of your dataset with 92%+ precision. These instances are labeled automatically, freeing the LLM and human effort for harder cases.

Your action: Review the generated functions. Disable any that seem unreliable. This is optional -- Potato only keeps functions above your configured precision threshold.
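
A labeling function is simply a deterministic rule that either votes or abstains. Here is a hypothetical keyword-based function in the spirit of LF1 above, with precision and coverage measured against a small labeled sample (the word list and thresholds are illustrative, not Potato's generated output):

```python
POSITIVE_WORDS = {"love", "amazing", "best", "excellent"}

def lf_strong_positive(text):
    """Vote 'Positive' if a strong positive word appears; else abstain."""
    words = set(text.lower().replace("!", "").split())
    return "Positive" if words & POSITIVE_WORDS else None

labeled = [
    ("Absolutely love this product!", "Positive"),
    ("Best purchase I've made all year.", "Positive"),
    ("It works fine.", "Neutral"),
    ("Complete waste of money.", "Negative"),
]

votes = [(lf_strong_positive(text), gold) for text, gold in labeled]
fired = [(v, g) for v, g in votes if v is not None]
coverage = len(fired) / len(labeled)               # fraction it fires on
precision = sum(v == g for v, g in fired) / len(fired)  # accuracy when it fires
print(f"coverage={coverage:.2f} precision={precision:.2f}")
```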


Step 9: Phase 6 -- Active Labeling (125-375 Instances)

This is the main human labeling phase. Potato selects instances using the six-pool prioritization system:

  • Uncertain (30%): Reviews where the LLM's confidence is below 85%
  • Disagreement (25%): Reviews where the LLM and labeling functions give different labels
  • Boundary (20%): Reviews near the decision boundary in embedding space
  • Novel (10%): Reviews unlike anything you have labeled so far
  • Error pattern (10%): Reviews matching known confusion patterns (e.g., lukewarm-positive)
  • Random (5%): Random reviews for calibration

You label these in batches of 25. After each batch, Potato updates the LLM's accuracy estimate and decides whether to continue.
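
The pool weights translate directly into how each batch slot is filled. An illustrative sketch of weighted pool sampling (not Potato's internals):

```python
import random

POOL_WEIGHTS = {
    "uncertain": 0.30, "disagreement": 0.25, "boundary": 0.20,
    "novel": 0.10, "error_pattern": 0.10, "random": 0.05,
}

def draw_batch(pools, batch_size=25, seed=0):
    """Fill a batch by drawing each slot from a pool chosen in
    proportion to its weight (sketch of the six-pool idea)."""
    rng = random.Random(seed)
    names = list(POOL_WEIGHTS)
    weights = [POOL_WEIGHTS[n] for n in names]
    batch = []
    while len(batch) < batch_size:
        pool = rng.choices(names, weights=weights)[0]
        if pools[pool]:               # skip pools that are exhausted
            batch.append(pools[pool].pop())
    return batch
```

With these weights, a batch of 25 contains roughly 7-8 uncertain instances, 6 disagreements, 5 boundary cases, and a handful from the remaining pools.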

Typical trajectory:

  • Batch 1-3 (75 instances): Accuracy climbs from 82% to 87%
  • Batch 4-6 (150 instances): Accuracy reaches 90%
  • Batch 7-10 (250 instances): Accuracy plateaus at 91-92%

If accuracy reaches 93% (your threshold), Solo Mode jumps ahead to Phase 10. Otherwise, it continues to Phase 7.

Time estimate: 45-90 minutes total, depending on how many batches are needed.


Step 10: Phase 7 -- Automated Refinement Loop

If accuracy is still below threshold after active labeling, Potato runs another round of the refinement loop:

  1. LLM re-labels the full dataset with updated guidelines and more few-shot examples
  2. Accuracy is recomputed against all human labels
  3. New confusion patterns are identified
  4. Guidelines are refined again

This phase is mostly automatic. You only need to approve guideline changes.

Typical result: Accuracy improves by 2-4% per refinement round.


Step 11: Phase 8 -- Disagreement Exploration

Potato presents the most contentious instances: cases where the LLM, labeling functions, and nearest-neighbor analysis all give different answers. For each instance, you see:

  • The review text
  • LLM prediction and confidence
  • Labeling function votes
  • 3 nearest labeled examples with their labels
  • The LLM's chain-of-thought reasoning

These are genuinely hard cases. Your labels here have the highest marginal value of any annotation in the entire process.

Time estimate: 20-30 minutes for 100-150 instances.


Step 12: Phase 9 -- Edge Case Synthesis

Potato generates synthetic reviews targeting the remaining confusion patterns. For example, if the LLM still struggles with "neutral reviews that mention buying again," it generates examples like:

"It's an okay product for the price. I might get another one if there's a sale."

You label these synthetic examples, and they are added to the LLM's few-shot context.

Time estimate: 10-15 minutes for 30 examples.


Step 13: Phase 10 -- Cascaded Confidence Escalation

The LLM has now labeled most of the dataset. Potato ranks all LLM-labeled instances by confidence and sends the lowest-confidence ones to you in batches of 25.

text
Confidence Escalation Progress:
  Batch 1: 25 instances, 23/25 correct (92%)
  Batch 2: 25 instances, 24/25 correct (96%)
  Batch 3: 25 instances, 25/25 correct (100%)
  -> Stopping: last 3 batches stable

Once the last three batches come back at stable, high accuracy (92-100% in the example above), Solo Mode concludes that the remaining high-confidence labels are trustworthy.
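
The stop_when_stable rule can be expressed as a simple check over recent batch results. A sketch of the idea; the 0.90 accuracy floor is an assumption for illustration, not Potato's documented cutoff:

```python
def should_stop(batch_results, window=3, min_acc=0.90):
    """Stop escalation once the last `window` batches all meet the
    accuracy floor (sketch of the stop_when_stable idea; the 0.90
    floor is an assumed value, not taken from Potato's docs)."""
    if len(batch_results) < window:
        return False
    return all(correct / total >= min_acc
               for correct, total in batch_results[-window:])

history = [(23, 25), (24, 25), (25, 25)]  # the batches shown above
print(should_stop(history))  # True: all three batches at 92%+
```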

Time estimate: 15-20 minutes.


Step 14: Phase 11 -- Prompt Optimization

This phase runs automatically. Potato tries 8 prompt variants and selects the one with the highest F1 score on your accumulated human labels:

text
Prompt Optimization Results:
  Variant 1 (direct, 5 examples):     F1=0.91
  Variant 2 (CoT, 5 examples):        F1=0.93
  Variant 3 (direct, 10 examples):    F1=0.92
  Variant 4 (CoT, 10 examples):       F1=0.94  <-- selected
  Variant 5 (direct, 15 examples):    F1=0.92
  Variant 6 (CoT, 15 examples):       F1=0.93
  Variant 7 (self-consistency, 5x):   F1=0.94
  Variant 8 (self-consistency, 10x):  F1=0.94

The best prompt is used for a final re-labeling pass.
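
The f1_macro metric weighs every class equally rather than favoring the majority class, which matters for the smaller Negative class here. A standard-library sketch of the computation:

```python
def f1_macro(gold, pred, labels):
    """Macro F1: unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Made-up labels for illustration
gold = ["Positive", "Neutral", "Negative", "Neutral"]
pred = ["Positive", "Positive", "Negative", "Neutral"]
print(round(f1_macro(gold, pred, ["Positive", "Neutral", "Negative"]), 3))
```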


Step 15: Phase 12 -- Final Validation

Potato selects 100 random LLM-labeled instances for you to review. You label them, and Potato compares against the LLM's labels.

text
Final Validation:
  Reviewed: 100 instances
  LLM correct: 94/100 (94%)
  Threshold: 93%
  -> PASSED

If the LLM's accuracy meets your threshold, the dataset is complete. If not, Solo Mode cycles back to Phase 6 for another round of active labeling.

Time estimate: 10-15 minutes.


Results Summary

After running through all 12 phases, check the final statistics:

bash
python -m potato.solo status --config config.yaml

text
Solo Mode Complete
==================
Dataset: 10,000 instances
Total human labels: 667
  Seed: 50
  Active labeling: 275
  Disagreement exploration: 137
  Edge case synthesis: 30
  Confidence escalation: 75
  Final validation: 100

LLM labels: 8,200 (accuracy: 94.1%)
LF labels: 1,800 (precision: 95.3%)
Unlabeled: 0

Final label distribution:
  Positive: 4,823 (48.2%)
  Neutral:  3,011 (30.1%)
  Negative: 2,166 (21.7%)

Total human time: ~3.5 hours
Estimated multi-annotator cost (3x): ~$4,500
Solo Mode cost: ~$450 (API fees) + ~$175 (annotator time)
Savings: ~86%

The human labeled 667 of the 10,000 instances (6.7%). The LLM and labeling functions handled the rest at 94%+ accuracy.


Exporting Results

Export the final labeled dataset:

bash
python -m potato.solo export --config config.yaml --output final_labels.jsonl

Each line includes the label and its source:

json
{"id": "rev_001", "sentiment": "Positive", "source": "human", "confidence": 1.0}
{"id": "rev_002", "sentiment": "Neutral", "source": "llm", "confidence": 0.91}
{"id": "rev_003", "sentiment": "Negative", "source": "labeling_function", "confidence": 0.97}

For Parquet export:

python
import pandas as pd
df = pd.read_parquet("output/parquet/annotations.parquet")
print(df["value"].value_counts())

Quality Assurance: Hybrid Verification

For publication-quality datasets, add a second annotator to review a sample:

yaml
solo_mode:
  verification:
    enabled: true
    sample_fraction: 0.10
    annotator: "reviewer_1"

This assigns 1,000 random instances to a second annotator. You can then compute inter-annotator agreement between the Solo Mode labels and the reviewer's labels.
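
Agreement between the Solo Mode labels and the reviewer's labels is typically reported as Cohen's kappa, which corrects raw agreement for chance. A standard-library sketch with made-up labels:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators over the same instances."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same label at random,
    # given each annotator's label frequencies
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

solo     = ["Pos", "Neu", "Neg", "Pos", "Neu", "Pos"]
reviewer = ["Pos", "Pos", "Neg", "Pos", "Neu", "Pos"]
print(round(cohens_kappa(solo, reviewer), 3))  # → 0.714
```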


Troubleshooting

LLM accuracy plateaus below threshold

  • Increase seed count: Try 75-100 seed instances instead of 50
  • Switch LLM: Try claude-sonnet-4-20250514 instead of GPT-4o (or vice versa)
  • Lower the threshold: If 93% is not achievable, consider whether 90% is acceptable for your use case
  • Check your data: Some datasets are inherently ambiguous. If human-human agreement would only be 90%, do not expect the LLM to do better

Phase 6 takes too many batches

  • Increase batch size: Change batch_size from 25 to 50
  • Adjust pool weights: If most escalated instances are from the "uncertain" pool, reduce its weight and increase "disagreement" and "error_pattern"

Labeling functions have low coverage

  • This is normal for tasks without strong lexical signals (e.g., sarcasm detection, implicit sentiment)
  • Labeling functions work best for explicit, keyword-driven patterns
  • Solo Mode still works without labeling functions -- the LLM picks up the slack

Further Reading