# Solo Mode: How One Annotator Can Label 10,000 Examples

Source: https://www.potatoannotator.com/blog/solo-mode-tutorial

You have 10,000 product reviews to label for sentiment (Positive, Neutral, Negative). Hiring three annotators to label all of them takes weeks and costs thousands of dollars. Solo Mode lets one domain expert get comparable quality by labeling only 500-1,000 instances; an LLM handles the rest, and the human reviews whatever the LLM is unsure about.

This tutorial walks through the whole process.

---

## What you will need

- Potato 2.3.0+ with the Solo Mode extras: `pip install potato-annotation[solo]`
- An OpenAI or Anthropic API key (for the LLM component)
- Your dataset in JSONL format
- One knowledgeable annotator (that could be you)

---

## Step 1: Prepare your data

Create `data/reviews.jsonl` with one review per line:

```json
{"id": "rev_001", "text": "Absolutely love this product! Best purchase I've made all year.", "source": "amazon"}
{"id": "rev_002", "text": "It works fine. Nothing special but gets the job done.", "source": "amazon"}
{"id": "rev_003", "text": "Broke after two weeks. Complete waste of money.", "source": "amazon"}
{"id": "rev_004", "text": "The quality is decent for the price point. I might buy again.", "source": "amazon"}
{"id": "rev_005", "text": "Arrived damaged and customer service was unhelpful.", "source": "amazon"}
```

For this tutorial, imagine this file contains 10,000 reviews.
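Before loading a file this size, it can save time to check that every line parses and carries the two keys the config relies on (`id` and `text`). The helper below is a minimal sketch for that check. `validate_jsonl` is a hypothetical name and is not part of Potato.

```python
import json

def validate_jsonl(lines, id_key="id", text_key="text"):
    """Return (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    seen_ids = set()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "not a JSON object"))
            continue
        rec_id = record.get(id_key)
        if rec_id is None or not str(rec_id).strip():
            problems.append((lineno, f"missing or empty '{id_key}'"))
        elif str(rec_id) in seen_ids:
            problems.append((lineno, f"duplicate id {rec_id!r}"))
        else:
            seen_ids.add(str(rec_id))
        text = record.get(text_key)
        if not isinstance(text, str) or not text.strip():
            problems.append((lineno, f"missing or empty '{text_key}'"))
    return problems

# Small in-memory sample with one duplicate id and one empty text field.
sample = [
    '{"id": "rev_001", "text": "Absolutely love this product!"}',
    '{"id": "rev_001", "text": "Broke after two weeks."}',
    '{"id": "rev_003", "text": ""}',
]
print(validate_jsonl(sample))
```

To check the real file, pass an open file handle instead of the sample list, for example `validate_jsonl(open("data/reviews.jsonl", encoding="utf-8"))`.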

---

## Step 2: Create the configuration

Create `config.yaml`:

```yaml
annotation_task_name: "Product Review Sentiment (Solo Mode)"
task_dir: "."

data_files:
  - "data/reviews.jsonl"

item_properties:
  id_key: id
  text_key: text

# --- Solo Mode Configuration ---
solo_mode:
  enabled: true

  llm:
    endpoint_type: openai
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
    temperature: 0.1
    max_tokens: 64

  # Quality targets
  seed_count: 50
  accuracy_threshold: 0.93
  confidence_threshold: 0.85

  # Phase-specific settings
  phases:
    seed:
      count: 50
      selection: diversity
      embedding_model: "all-MiniLM-L6-v2"

    calibration:
      batch_size: 200
      holdout_fraction: 0.2

    labeling_functions:
      enabled: true
      max_functions: 15
      min_precision: 0.92
      min_coverage: 0.01

    active_labeling:
      batch_size: 25
      strategy: hybrid
      max_batches: 15

    refinement_loop:
      max_iterations: 3
      improvement_threshold: 0.02

    disagreement_exploration:
      max_instances: 150
      show_llm_reasoning: true
      show_nearest_neighbors: 3

    edge_case_synthesis:
      enabled: true
      count: 30

    confidence_escalation:
      escalation_budget: 150
      batch_size: 25
      stop_when_stable: true

    prompt_optimization:
      enabled: true
      candidates: 8
      metric: f1_macro

    final_validation:
      sample_size: 100
      min_accuracy: 0.93

  # Instance prioritization
  prioritization:
    pools:
      - name: uncertain
        weight: 0.30
      - name: disagreement
        weight: 0.25
      - name: boundary
        weight: 0.20
      - name: novel
        weight: 0.10
      - name: error_pattern
        weight: 0.10
      - name: random
        weight: 0.05

# --- Annotation Schema ---
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "What is the overall sentiment of this review?"
    labels:
      - "Positive"
      - "Neutral"
      - "Negative"
    label_requirement:
      required: true
    sequential_key_binding: true

output_annotation_dir: "output/"
export_annotation_format: "jsonl"

parquet_export:
  enabled: true
  output_dir: "output/parquet/"
```

---

## Step 3: Start the server

```bash
potato start config.yaml -p 8000
```

Open `http://localhost:8000` and log in. The Solo Mode dashboard appears, showing you are in Phase 1: Seed Annotation.

---

## Step 4: Phase 1 -- Seed Annotation (50 Instances)

Potato has picked 50 diverse reviews using embedding-based clustering. These are not random; they are chosen to cover your data distribution as broadly as possible.

Label each one. This is the phase that matters most, since the quality of your seed labels sets the ceiling for what the LLM can learn. Take your time and stay consistent.

**Time estimate:** 15-25 minutes at 20-30 seconds per instance.

When you finish the 50th instance, Potato advances to Phase 2 on its own.
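To get a feel for what diversity-based selection does, here is a toy sketch. It uses greedy farthest-point sampling, which is a simpler stand-in and not necessarily the clustering Potato runs internally. The 2-D points stand in for sentence embeddings, and `farthest_point_sample` is a hypothetical helper.

```python
import math

def farthest_point_sample(vectors, k):
    """Greedily pick k indices so each new pick is far from everything already chosen."""
    if not vectors or k <= 0:
        return []
    chosen = [0]  # start from the first item; any starting point works
    # min_dist[i] = distance from item i to its nearest chosen item
    min_dist = [math.dist(v, vectors[0]) for v in vectors]
    min_dist[0] = -1.0  # mark chosen items so they are never picked twice
    while len(chosen) < min(k, len(vectors)):
        nxt = max(range(len(vectors)), key=min_dist.__getitem__)
        chosen.append(nxt)
        min_dist = [min(d, math.dist(v, vectors[nxt])) for d, v in zip(min_dist, vectors)]
        min_dist[nxt] = -1.0
    return chosen

# Toy "embeddings": a tight cluster, a second cluster, and one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
print(farthest_point_sample(points, 3))  # [0, 5, 3]
```

The three picks land in three different regions rather than in the densest cluster, which is the property that makes a small seed set representative.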

---

## Step 5: Phase 2 -- Initial LLM Calibration

This phase runs on its own. Potato holds out 10 of your 50 seed labels (`holdout_fraction: 0.2`), sends the LLM a batch of 200 instances with the remaining 40 seed labels as few-shot examples, and then compares its predictions against the 10 held-out labels to estimate baseline accuracy.

A progress indicator shows up in the dashboard. This usually takes 1-2 minutes depending on the LLM provider.

**Typical result:** The LLM lands at 75-85% accuracy on the first calibration. That's expected. It hasn't learned your annotation style yet.
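The holdout arithmetic is easy to reproduce. The sketch below assumes a plain shuffled split; how Potato actually chooses the held-out seeds is not specified here, and the function names are illustrative only.

```python
import random

def split_seed_labels(seed_labels, holdout_fraction=0.2, rng_seed=0):
    """Shuffle seed labels and split them into few-shot examples and a held-out check set."""
    items = list(seed_labels)
    random.Random(rng_seed).shuffle(items)
    n_holdout = round(len(items) * holdout_fraction)
    return items[n_holdout:], items[:n_holdout]

def accuracy(human, predicted):
    """Fraction of instances where the prediction matches the human label."""
    return sum(h == p for h, p in zip(human, predicted)) / len(human)

# 50 placeholder seed labels, matching seed_count in the config.
seeds = [(f"rev_{i:03d}", "Positive") for i in range(1, 51)]
few_shot, held_out = split_seed_labels(seeds)
print(len(few_shot), len(held_out))  # 40 10
```

With only 10 held-out labels, each mistake moves the estimate by 10 percentage points, so treat the first calibration number as a rough signal.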

---

## Step 6: Phase 3 -- Confusion Analysis

Potato displays a confusion matrix showing where the LLM disagrees with your labels. A typical output:

```
Confusion Analysis (Round 1)
============================
Overall Accuracy: 0.82 (target: 0.93)

Top Confusion Pairs:
  Neutral -> Positive:  14 instances (7.0%)
  Negative -> Neutral:   9 instances (4.5%)
  Positive -> Neutral:   4 instances (2.0%)
```

This points to the LLM's main weakness here: it keeps upgrading neutral reviews to positive. That's a common one, since LLMs tend to lean positive.

**Your action:** Look through the confusion pairs. Click each pair to see the specific instances the LLM got wrong, which is the quickest way to understand how it fails.

---

## Step 7: Phase 4 -- Guideline Refinement

Based on the confusion analysis, Potato drafts refined guidelines for the LLM and shows them side by side: the current prompt on one side, and the specific edits it proposes from the error patterns on the other.

For example, Potato might suggest adding:

> "Reviews that describe a product as 'fine', 'okay', or 'decent' without strong emotion should be labeled Neutral, even if they mention buying again."

Go through each suggested edit and approve, modify, or reject it. You can also write in your own clarifications.

**Time estimate:** 5-10 minutes.

---

## Step 8: Phase 5 -- Labeling Function Generation

Potato generates programmatic labeling functions from patterns in your seed labels. These are fast, deterministic rules that handle easy cases:

```
Generated Labeling Functions:
  LF1: Strong positive words (love, amazing, best, excellent)
       Precision: 0.97, Coverage: 0.06
  LF2: Strong negative words (terrible, awful, worst, waste)
       Precision: 0.95, Coverage: 0.04
  LF3: Exclamation + positive adjective
       Precision: 0.94, Coverage: 0.03
  LF4: Return/refund mention + negative context
       Precision: 0.92, Coverage: 0.02
  ...
  Total coverage: 0.18 (1,800 of 10,000 instances)
```

Labeling functions cover 18% of your dataset at 92%+ precision. Those instances get labeled automatically, which frees up the LLM and the human for the harder cases.

**Your action:** Look over the generated functions and disable any that seem unreliable. This step is optional, since Potato only keeps functions at or above your configured precision threshold (`min_precision: 0.92`) anyway.
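Conceptually, a labeling function either votes for a label or abstains, and precision and coverage are measured over labeled examples. The sketch below hand-writes a rule in the spirit of LF1 to show how the two numbers are computed. The functions Potato generates may look quite different, and all names here are illustrative.

```python
import re

POSITIVE_WORDS = re.compile(r"\b(love|amazing|best|excellent)\b", re.IGNORECASE)

def lf_strong_positive(text):
    """Vote 'Positive' when a strong positive word appears, otherwise abstain (None)."""
    return "Positive" if POSITIVE_WORDS.search(text) else None

def precision_and_coverage(lf, labeled_examples):
    """Precision over the examples the function fires on; coverage is the share it fires on."""
    votes = [(lf(text), gold) for text, gold in labeled_examples]
    fired = [(v, g) for v, g in votes if v is not None]
    coverage = len(fired) / len(votes)
    precision = sum(v == g for v, g in fired) / len(fired) if fired else 0.0
    return precision, coverage

examples = [
    ("Absolutely love this product! Best purchase I've made all year.", "Positive"),
    ("It works fine. Nothing special but gets the job done.", "Neutral"),
    ("Broke after two weeks. Complete waste of money.", "Negative"),
    ("Not the best, but okay for the price.", "Neutral"),
]
print(precision_and_coverage(lf_strong_positive, examples))  # (0.5, 0.5)
```

The last example shows why the precision filter matters: "best" fires on a Neutral review, which is exactly the kind of rule you would disable or tighten.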

---

## Step 9: Phase 6 -- Active Labeling (125-375 Instances)

This is where you do most of your labeling. Potato picks instances using the six-pool prioritization system:

- **Uncertain** (30%): Reviews where the LLM's confidence is below 85%
- **Disagreement** (25%): Reviews where the LLM and labeling functions give different labels
- **Boundary** (20%): Reviews near the decision boundary in embedding space
- **Novel** (10%): Reviews unlike anything you have labeled so far
- **Error pattern** (10%): Reviews matching known confusion patterns (e.g., lukewarm-positive)
- **Random** (5%): Random reviews for calibration

You label these in batches of 25. After each batch, Potato updates the LLM's accuracy estimate and decides whether to keep going.
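The pool weights do not divide a batch of 25 evenly (30% of 25 is 7.5), so some rounding has to happen. The sketch below uses largest-remainder rounding as one plausible scheme; how Potato actually rounds is an assumption here, and `allocate_batch` is a hypothetical helper.

```python
# Pool weights from config.yaml, written as integer percentages to avoid float rounding.
POOL_WEIGHTS_PCT = {
    "uncertain": 30, "disagreement": 25, "boundary": 20,
    "novel": 10, "error_pattern": 10, "random": 5,
}

def allocate_batch(weights_pct, batch_size):
    """Split batch_size across pools in proportion to weights (largest-remainder rounding)."""
    total = sum(weights_pct.values())
    # divmod gives each pool's whole slots plus the remainder used for tie-breaking.
    shares = {name: divmod(batch_size * w, total) for name, w in weights_pct.items()}
    counts = {name: whole for name, (whole, _) in shares.items()}
    leftover = batch_size - sum(counts.values())
    # Hand the remaining slots to the pools with the largest remainders.
    by_remainder = sorted(shares, key=lambda name: shares[name][1], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts

print(allocate_batch(POOL_WEIGHTS_PCT, 25))
```

Under this scheme a batch of 25 works out to 8 uncertain, 6 disagreement, 5 boundary, 3 novel, 2 error-pattern, and 1 random instance.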

**Typical trajectory:**
- Batches 1-3 (75 instances labeled so far): Accuracy climbs from 82% to 87%
- Batches 4-6 (150 instances labeled so far): Accuracy reaches 90%
- Batches 7-10 (250 instances labeled so far): Accuracy plateaus at 91-92%

If accuracy hits 93% (your threshold), Solo Mode jumps ahead to Phase 10. Otherwise it moves on to Phase 7.

**Time estimate:** 45-90 minutes total, depending on how many batches are needed.

---

## Step 10: Phase 7 -- Automated Refinement Loop

If accuracy is still under threshold after active labeling, Potato runs another round of the refinement loop:

1. The LLM re-labels the full dataset with updated guidelines and more few-shot examples
2. Potato recomputes accuracy against all human labels
3. It finds the new confusion patterns
4. It refines the guidelines again

This phase is mostly hands-off. You only need to approve the guideline changes.

**Typical result:** Accuracy improves by 2-4 percentage points per refinement round.
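The `max_iterations` and `improvement_threshold` settings suggest a loop that stops on one of three conditions. The sketch below is one reading of those settings, not Potato's documented behavior: `run_refinement` is hypothetical, and the list of accuracies stands in for the value measured after each round.

```python
def run_refinement(accuracies, start_accuracy, target=0.93,
                   max_iterations=3, improvement_threshold=0.02):
    """Walk through per-round accuracies and report when and why the loop would stop."""
    current = start_accuracy
    rounds_run = 0
    for new_accuracy in accuracies[:max_iterations]:
        rounds_run += 1
        gain = new_accuracy - current
        current = new_accuracy
        if current >= target:
            return rounds_run, current, "target reached"
        if gain < improvement_threshold:
            return rounds_run, current, "improvement below threshold"
    return rounds_run, current, "no rounds left"

# Round 1 gains 3 points, round 2 gains only half a point, so the loop gives up.
print(run_refinement([0.92, 0.925, 0.95], start_accuracy=0.89))
```

Under this reading, a round that adds less than two points ends the loop early even if iterations remain, which is why the later phases exist as a fallback.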

---

## Step 11: Phase 8 -- Disagreement Exploration

Potato presents the most contentious instances: cases where the LLM, the labeling functions, and nearest-neighbor analysis disagree with one another. For each instance, you see:

- The review text
- LLM prediction and confidence
- Labeling function votes
- 3 nearest labeled examples with their labels
- The LLM's chain-of-thought reasoning

These are genuinely hard cases, and your labels here are worth more than any others in the whole process.

**Time estimate:** 20-30 minutes for 100-150 instances.

---

## Step 12: Phase 9 -- Edge Case Synthesis

Potato generates synthetic reviews targeting the remaining confusion patterns. For example, if the LLM still struggles with "neutral reviews that mention buying again," it generates examples like:

> "It's an okay product for the price. I might get another one if there's a sale."

You label these synthetic examples, and Potato adds them to the LLM's few-shot context.

**Time estimate:** 10-15 minutes for 30 examples.

---

## Step 13: Phase 10 -- Cascaded Confidence Escalation

By now the LLM has labeled most of the dataset. Potato ranks all of its labels by confidence and sends you the lowest-confidence ones in batches of 25.

```
Confidence Escalation Progress:
  Batch 1: 25 instances, 23/25 correct (92%)
  Batch 2: 25 instances, 24/25 correct (96%)
  Batch 3: 25 instances, 25/25 correct (100%)
  -> Stopping: last 3 batches stable
```

Once three batches in a row come back with the LLM right on nearly everything (72 of 75 in the run above), Solo Mode stops escalating and treats the remaining, higher-confidence labels as trustworthy.

**Time estimate:** 15-20 minutes.
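The escalation logic can be sketched in a few lines: sort by confidence, review from the bottom up, and stop once recent batches look fine. The exact stability rule is not spelled out in this tutorial, so `is_stable` below, with its three-batch window and 0.90 floor, is an assumption chosen to match the sample output. Both function names are hypothetical.

```python
def escalation_batches(predictions, batch_size=25, budget=150):
    """Yield batches of the lowest-confidence predictions, never exceeding the budget."""
    ranked = sorted(predictions, key=lambda p: p["confidence"])[:budget]
    for start in range(0, len(ranked), batch_size):
        yield ranked[start:start + batch_size]

def is_stable(batch_accuracies, window=3, floor=0.90):
    """Assumed rule: the last `window` batches all scored at or above `floor`."""
    recent = batch_accuracies[-window:]
    return len(recent) == window and all(acc >= floor for acc in recent)

# Per-batch accuracies from the sample output: 23/25, 24/25, 25/25.
history = [23 / 25, 24 / 25, 25 / 25]
print(is_stable(history))  # True
```

Because the batches start with the least confident labels, good results here are a conservative signal about everything ranked above them.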

---

## Step 14: Phase 11 -- Prompt Optimization

This phase runs on its own. Potato tries 8 prompt variants and keeps the one with the highest macro-averaged F1 score (`metric: f1_macro`) on your accumulated human labels:

```
Prompt Optimization Results:
  Variant 1 (direct, 5 examples):     F1=0.91
  Variant 2 (CoT, 5 examples):        F1=0.93
  Variant 3 (direct, 10 examples):    F1=0.92
  Variant 4 (CoT, 10 examples):       F1=0.94  <-- selected
  Variant 5 (direct, 15 examples):    F1=0.92
  Variant 6 (CoT, 15 examples):       F1=0.93
  Variant 7 (self-consistency, 5x):   F1=0.94
  Variant 8 (self-consistency, 10x):  F1=0.94
```

It then uses the best prompt for a final re-labeling pass.
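Macro F1 is the unweighted mean of the per-class F1 scores, so the smaller Negative class counts as much as Positive. The sketch below computes it in plain Python and then picks the top variant from the scores listed above. The tie-break (first maximum wins) is an assumption that happens to match the sample output, not a documented rule.

```python
def macro_f1(gold, predicted, labels=("Positive", "Neutral", "Negative")):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels for six reviews.
gold = ["Positive", "Positive", "Neutral", "Neutral", "Negative", "Negative"]
pred = ["Positive", "Positive", "Positive", "Neutral", "Negative", "Neutral"]
print(round(macro_f1(gold, pred), 3))  # 0.656

# F1 scores per variant, copied from the sample output.
variant_scores = {1: 0.91, 2: 0.93, 3: 0.92, 4: 0.94, 5: 0.92, 6: 0.93, 7: 0.94, 8: 0.94}
best = max(variant_scores, key=variant_scores.get)  # first maximum wins ties
print(best)  # 4
```

Variants 4, 7, and 8 tie at 0.94. Picking the first is also the cheapest choice, since self-consistency prompts call the LLM several times per instance.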

---

## Step 15: Phase 12 -- Final Validation

Potato pulls 100 random LLM-labeled instances for you to review. You label them, and Potato compares your labels against the LLM's.

```
Final Validation:
  Reviewed: 100 instances
  LLM correct: 94/100 (94%)
  Threshold: 93%
  -> PASSED
```

If the LLM clears your threshold, the dataset is done. If not, Solo Mode cycles back to Phase 6 for another round of active labeling.

**Time estimate:** 10-15 minutes.

---

## Results Summary

After all 12 phases, check the final statistics:

```bash
python -m potato.solo status --config config.yaml
```

```
Solo Mode Complete
==================
Dataset: 10,000 instances
Total human labels: 667
  Seed: 50
  Active labeling: 275
  Disagreement exploration: 137
  Edge case synthesis: 30
  Confidence escalation: 75
  Final validation: 100

LLM labels: 8,200 (accuracy: 94.1%)
LF labels: 1,800 (precision: 95.3%)
Unlabeled: 0

Final label distribution:
  Positive: 4,823 (48.2%)
  Neutral:  3,011 (30.1%)
  Negative: 2,166 (21.7%)

Total human time: ~3.5 hours
Estimated multi-annotator cost (3x): ~$4,500
Solo Mode cost: ~$450 (API fees) + ~$175 (annotator time)
Savings: ~86%
```

The human labeled 667 instances, about 7% of the dataset's size. The LLM and labeling functions handled the rest at 94%+ accuracy.
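The cost comparison is simple arithmetic on the rounded dollar figures in the summary. A quick check in Python, with variable names chosen just for illustration:

```python
multi_annotator_cost = 4500  # three annotators labeling everything
api_fees = 450               # LLM API usage
annotator_time_cost = 175    # about 3.5 hours of expert time

solo_cost = api_fees + annotator_time_cost
savings = 1 - solo_cost / multi_annotator_cost
print(f"Solo Mode cost: ${solo_cost}, savings: {savings:.0%}")
```

With these rounded inputs, Solo Mode costs $625 and the savings come out to roughly 86%.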

---

## Exporting Results

Export the final labeled dataset:

```bash
python -m potato.solo export --config config.yaml --output final_labels.jsonl
```

Each line includes the label and its source:

```json
{"id": "rev_001", "sentiment": "Positive", "source": "human", "confidence": 1.0}
{"id": "rev_002", "sentiment": "Neutral", "source": "llm", "confidence": 0.91}
{"id": "rev_003", "sentiment": "Negative", "source": "labeling_function", "confidence": 0.97}
```
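Since every record carries its `source`, a few lines of Python are enough to see how the labels break down. This is a sketch that assumes the export has exactly the fields shown above; `count_by_source` is a hypothetical helper.

```python
import json
from collections import Counter

def count_by_source(lines):
    """Count exported records per (source, sentiment) pair."""
    counts = Counter()
    for line in lines:
        if line.strip():
            record = json.loads(line)
            counts[(record["source"], record["sentiment"])] += 1
    return counts

# The three example records from the export above.
sample = [
    '{"id": "rev_001", "sentiment": "Positive", "source": "human", "confidence": 1.0}',
    '{"id": "rev_002", "sentiment": "Neutral", "source": "llm", "confidence": 0.91}',
    '{"id": "rev_003", "sentiment": "Negative", "source": "labeling_function", "confidence": 0.97}',
]
for (source, label), n in sorted(count_by_source(sample).items()):
    print(source, label, n)
```

For the real export, pass an open file handle for `final_labels.jsonl` in place of the sample list.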

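To load the Parquet export with pandas (which needs `pyarrow` or `fastparquet` installed):

```python
import pandas as pd

# Load the annotation table written to parquet_export.output_dir
df = pd.read_parquet("output/parquet/annotations.parquet")
# Count how many annotations carry each label
print(df["value"].value_counts())
```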

---

## Quality Assurance: Hybrid Verification

For publication-quality datasets, add a second annotator to review a sample. The [Solo Mode source documentation](https://github.com/davidjurgens/potato/blob/master/docs/solo-mode/solo_mode.md) describes the verification options in more detail.

```yaml
solo_mode:
  verification:
    enabled: true
    sample_fraction: 0.10
    annotator: "reviewer_1"
```

With `sample_fraction: 0.10`, this sends 1,000 of the 10,000 instances, chosen at random, to a second annotator. You can then compute inter-annotator agreement between the Solo Mode labels and the reviewer's labels.
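Cohen's kappa is a common agreement measure for two label sets because it corrects raw agreement for chance. The sketch below implements it in plain Python on made-up labels. In practice you would feed it the Solo Mode labels and the reviewer's labels for the same instances.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two label sequences over the same instances."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both sources pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up labels for ten reviews; the two sources disagree on two of them.
solo = ["Positive", "Positive", "Neutral", "Negative", "Neutral",
        "Positive", "Negative", "Neutral", "Positive", "Negative"]
reviewer = ["Positive", "Positive", "Neutral", "Negative", "Positive",
            "Positive", "Negative", "Neutral", "Neutral", "Negative"]
print(round(cohens_kappa(solo, reviewer), 3))  # 0.697
```

Here raw agreement is 80%, but kappa is lower at about 0.70 because some of that agreement would be expected by chance alone.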

---

## Troubleshooting

### LLM accuracy plateaus below threshold

- **Increase seed count:** Try 75-100 seed instances instead of 50
- **Switch LLM:** Try `claude-sonnet-4-20250514` instead of GPT-4o (or vice versa), updating `endpoint_type` and `api_key` in the `llm` block to match the new provider
- **Lower the threshold:** If 93% is not achievable, consider whether 90% is acceptable for your use case
- **Check your data:** Some datasets are inherently ambiguous. If human-human agreement would only be 90%, do not expect the LLM to do better

### Phase 6 takes too many batches

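- **Increase batch size:** Change `batch_size` under `phases.active_labeling` from 25 to 50 (the calibration and confidence-escalation phases have their own `batch_size` settings)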
- **Adjust pool weights:** If most escalated instances are from the "uncertain" pool, reduce its weight and increase "disagreement" and "error_pattern"

### Labeling functions have low coverage

- This is normal for tasks without strong lexical signals (e.g., sarcasm detection, implicit sentiment)
- Labeling functions work best for explicit, keyword-driven patterns
- Solo Mode still works without labeling functions -- the LLM picks up the slack

---

## Further Reading

- [Solo Mode Documentation](/docs/features/solo-mode) -- full configuration reference
- [Active Learning](/docs/features/active-learning) -- the underlying selection algorithm
- [AI Support](/docs/features/ai-support) -- LLM provider configuration
- [Quality Control](/docs/features/quality-control) -- additional quality assurance options
- [Parquet Export](/docs/features/parquet-export) -- efficient data export
