# ICL Labeling

AI-assisted in-context learning with human verification for scalable annotation.
Potato's ICL (In-Context Learning) labeling feature enables AI-assisted annotation by using high-confidence human annotations as in-context examples that guide an LLM in labeling the remaining data. The system tracks the LLM's confidence and routes predictions back to humans for verification.
## Overview

The ICL labeling system:

1. **Collects high-confidence examples**: Identifies instances where annotators agree (e.g., 80%+ agreement)
2. **Labels with an LLM**: Uses those examples to prompt an LLM to label unlabeled instances
3. **Tracks confidence**: Records an LLM confidence score for each prediction
4. **Verifies accuracy**: Routes a sample of LLM-labeled instances to humans for blind verification
5. **Reports metrics**: Calculates and displays LLM accuracy based on verification results
## Features

### Automatic Example Collection
The system automatically identifies high-confidence examples where multiple annotators agree:
- Configurable agreement threshold (default: 80%)
- Minimum annotator count requirement (default: 2)
- Automatic refresh on configurable interval
- Per-schema example pools
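The selection rule above can be sketched as follows. This is an illustrative implementation, not Potato's internal code; the shape of the annotation store (a dict of instance IDs to label lists) is an assumption for the example:

```python
from collections import Counter

def select_high_confidence(annotations_by_instance, min_agreement=0.8, min_annotators=2):
    """Pick instances whose annotators agree strongly enough to serve as ICL examples.

    annotations_by_instance: dict mapping instance id -> list of labels
    (a simplified stand-in for Potato's per-schema annotation store).
    """
    examples = {}
    for instance_id, labels in annotations_by_instance.items():
        if len(labels) < min_annotators:
            continue  # not enough annotators for a consensus judgment
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            examples[instance_id] = label  # majority label becomes the example label
    return examples
```

For instance, an item labeled `["pos", "pos", "neg"]` has 2/3 ≈ 0.67 agreement and is excluded at the default 0.8 threshold, while `["pos", "pos"]` (1.0 agreement, 2 annotators) qualifies.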
### LLM Labeling with Limits
To enable iterative improvement rather than bulk labeling:
- Max total labels: Limit the total number of LLM predictions
- Max unlabeled ratio: Only label a percentage of remaining data
- Pause on low accuracy: Automatically pause if accuracy drops below threshold
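These three gates can be expressed as a single predicate. A minimal sketch, assuming the limit semantics described above (the ratio is taken against the initial unlabeled pool); the function name and exact accounting are illustrative, not Potato's implementation:

```python
def should_label_more(total_labeled, unlabeled_remaining, accuracy,
                      max_total_labels=100, max_unlabeled_ratio=0.5,
                      pause_on_low_accuracy=True, min_accuracy_threshold=0.7):
    """Return True if the LLM labeler may emit another batch."""
    if total_labeled >= max_total_labels:
        return False  # hard cap on total LLM labels
    if unlabeled_remaining > 0:
        # only label up to a fixed fraction of the originally unlabeled pool
        budget = int((total_labeled + unlabeled_remaining) * max_unlabeled_ratio)
        if total_labeled >= budget:
            return False
    if pause_on_low_accuracy and accuracy is not None and accuracy < min_accuracy_threshold:
        return False  # verified accuracy too low; wait for more human examples
    return True
```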
### Blind Verification

Verification uses "blind labeling": annotators see the instance as a normal task, without knowing the LLM's prediction:
- Configurable sample rate (default: 20% of LLM labels)
- Multiple selection strategies: `low_confidence`, `random`, `mixed`
- Verification tasks mixed naturally with regular assignments
## Configuration

ICL labeling requires `ai_support` to be enabled:
```yaml
# AI endpoint configuration (required)
ai_support:
  enabled: true
  endpoint_type: "openai"
  ai_config:
    model: "gpt-4o-mini"
    api_key: "${OPENAI_API_KEY}"

# ICL labeling configuration
icl_labeling:
  enabled: true

  # Example selection settings
  example_selection:
    min_agreement_threshold: 0.8    # 80% of annotators must agree
    min_annotators_per_instance: 2  # Minimum annotations for consensus
    max_examples_per_schema: 10     # Max examples per schema in prompt
    refresh_interval_seconds: 300   # How often to refresh examples

  # LLM labeling settings
  llm_labeling:
    batch_size: 20
    trigger_threshold: 5            # Min examples before LLM labeling starts
    confidence_threshold: 0.7       # Min confidence to accept a prediction
    batch_interval_seconds: 600
    max_total_labels: 100           # Max instances to label in total
    max_unlabeled_ratio: 0.5        # Max portion of unlabeled data to label
    pause_on_low_accuracy: true
    min_accuracy_threshold: 0.7

  # Human verification settings
  verification:
    enabled: true
    sample_rate: 0.2                # 20% of LLM labels verified
    selection_strategy: "low_confidence"
    mix_with_regular_assignments: true
    assignment_mix_rate: 0.2
```

### Selection Strategies
- `low_confidence`: Prioritizes the LLM's least confident predictions for verification first
- `random`: Random sampling from all predictions
- `mixed`: 50% low-confidence + 50% random
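A sketch of how such a strategy could pick the verification sample; illustrative only, with predictions represented as `(instance_id, confidence)` pairs rather than Potato's actual data structures:

```python
import random

def pick_for_verification(predictions, sample_rate=0.2, strategy="low_confidence", rng=None):
    """Choose which LLM predictions to route to human verification.

    predictions: list of (instance_id, confidence) pairs.
    """
    rng = rng or random.Random()
    n = max(1, round(len(predictions) * sample_rate)) if predictions else 0
    if strategy == "low_confidence":
        ranked = sorted(predictions, key=lambda p: p[1])  # least confident first
        return ranked[:n]
    if strategy == "random":
        return rng.sample(predictions, n)
    if strategy == "mixed":
        # half from the low-confidence end, the rest sampled at random
        ranked = sorted(predictions, key=lambda p: p[1])
        low = ranked[: n // 2]
        return low + rng.sample(ranked[n // 2 :], n - len(low))
    raise ValueError(f"unknown strategy: {strategy}")
```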
## Admin API

### Status Endpoint

```
GET /admin/api/icl/status
```

Returns the overall ICL labeler status, including examples per schema, predictions made, verification queue size, and accuracy metrics.
### Examples Endpoint

```
GET /admin/api/icl/examples?schema=sentiment
```

Returns high-confidence examples, optionally filtered by schema.
### Accuracy Endpoint

```
GET /admin/api/icl/accuracy?schema=sentiment
```

Returns accuracy metrics based on human verification results.
### Manual Trigger Endpoint

```
POST /admin/api/icl/trigger
Content-Type: application/json

{"schema_name": "sentiment"}
```

Manually triggers batch labeling for a specific schema.
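For example, the trigger request can be built with Python's standard library; the base URL below is an assumption for a local deployment, and the request is constructed but not sent in this sketch:

```python
import json
import urllib.request

BASE = "http://localhost:8000"  # assumed local Potato server; adjust to your deployment

req = urllib.request.Request(
    f"{BASE}/admin/api/icl/trigger",
    data=json.dumps({"schema_name": "sentiment"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted here.
```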
## Usage Workflow

### 1. Configure Your Project
```yaml
ai_support:
  enabled: true
  endpoint_type: "openai"
  ai_config:
    model: "gpt-4o-mini"
    api_key: "${OPENAI_API_KEY}"

icl_labeling:
  enabled: true
  example_selection:
    min_agreement_threshold: 0.8
  llm_labeling:
    max_total_labels: 50  # Start small
  verification:
    enabled: true
    sample_rate: 0.3      # Verify 30% initially
```

### 2. Collect Human Annotations
Have annotators label data normally. As they reach consensus (80%+ agreement), those instances become available as examples.
### 3. Monitor Progress

```bash
curl http://localhost:8000/admin/api/icl/status
```

### 4. Review Accuracy

```bash
curl http://localhost:8000/admin/api/icl/accuracy
```

### 5. Iterate
Based on accuracy:

- If accuracy is high (>80%), increase `max_total_labels`
- If accuracy is low, add more human examples before continuing
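This iteration policy can be written down explicitly; a hedged sketch using the thresholds suggested above (the function name and doubling factor are illustrative choices, not part of Potato):

```python
def next_label_budget(current_max, verified_accuracy, high=0.8, growth=2.0):
    """Suggest a new max_total_labels given verified LLM accuracy.

    Illustrative policy only: grow the budget when accuracy is high,
    otherwise hold it and collect more human consensus examples first.
    """
    if verified_accuracy > high:
        return int(current_max * growth)
    return current_max  # pause scaling until accuracy improves
```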
## Best Practices

- **Start small**: Begin with conservative limits (`max_total_labels: 50`) to assess accuracy before scaling
- **Verify early**: Use a higher `sample_rate` initially (0.3-0.5) to get confident accuracy estimates
- **Monitor actively**: Check accuracy metrics regularly through the admin API
- **Adjust thresholds**: If LLM accuracy is low:
  - Increase `min_agreement_threshold` for cleaner examples
  - Increase `trigger_threshold` to require more examples before labeling
  - Raise `confidence_threshold` to reject uncertain predictions
- **Use selection strategies**:
  - `low_confidence`: Best for identifying problematic categories
  - `random`: Best for unbiased accuracy estimates
  - `mixed`: Balanced approach
## Troubleshooting

### LLM Not Labeling

- Check that `ai_support` is properly configured
- Verify that enough high-confidence examples exist
- Check whether labeling is paused due to limits or low accuracy

### Low Accuracy

- Increase `min_agreement_threshold` for cleaner examples
- Add more annotation guidelines/instructions
- Review the examples being used (`/admin/api/icl/examples`)

### Verification Tasks Not Appearing

- Verify that `verification.enabled` is true
- Check that `mix_with_regular_assignments` is true
- Verify that there are pending verifications in the queue
## Further Reading
- AI Support - General AI endpoint configuration
- Active Learning - Related AI-assisted features
- Quality Control - Accuracy tracking
For implementation details, see the source documentation.