# ICL Labeling

AI-assisted in-context learning with human verification for scalable annotation.
Potato's ICL (In-Context Learning) labeling feature enables AI-assisted annotation by using high-confidence human annotations as in-context examples that guide an LLM in labeling the remaining data. The system tracks the LLM's confidence and routes predictions back to humans for verification.
## Overview

The ICL labeling system:

1. **Collects high-confidence examples**: Identifies instances where annotators agree (e.g., 80%+ agreement)
2. **Labels with an LLM**: Uses those examples to prompt an LLM to label unlabeled instances
3. **Tracks confidence**: Records an LLM confidence score for each prediction
4. **Verifies accuracy**: Routes a sample of LLM-labeled instances to humans for blind verification
5. **Reports metrics**: Calculates and displays LLM accuracy based on verification results
## Features

### Automatic Example Collection
The system automatically identifies high-confidence examples where multiple annotators agree:
- Configurable agreement threshold (default: 80%)
- Minimum annotator count requirement (default: 2)
- Automatic refresh on configurable interval
- Per-schema example pools
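The selection rule above can be sketched as follows. This is an illustrative implementation, not Potato's internal code; the shape of the annotation store (a dict of instance IDs to label lists) is an assumption for the example:

```python
from collections import Counter

def select_high_confidence(annotations_by_instance, min_agreement=0.8, min_annotators=2):
    """Pick instances whose annotators agree strongly enough to serve as ICL examples.

    annotations_by_instance: dict mapping instance id -> list of labels
    (a simplified stand-in for Potato's per-schema annotation store).
    """
    examples = {}
    for instance_id, labels in annotations_by_instance.items():
        if len(labels) < min_annotators:
            continue  # not enough annotators for a consensus judgment
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            examples[instance_id] = label  # majority label becomes the example label
    return examples
```

For instance, an item labeled `["pos", "pos", "neg"]` has 2/3 ≈ 0.67 agreement and is excluded at the default 0.8 threshold, while `["pos", "pos"]` (1.0 agreement, 2 annotators) qualifies.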
### LLM Labeling with Limits
To enable iterative improvement rather than bulk labeling:
- Max total labels: Limit the total number of LLM predictions
- Max unlabeled ratio: Only label a percentage of remaining data
- Pause on low accuracy: Automatically pause if accuracy drops below threshold
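These three gates can be expressed as a single predicate. A minimal sketch, assuming the limit semantics described above (the ratio is taken against the initial unlabeled pool); the function name and exact accounting are illustrative, not Potato's implementation:

```python
def should_label_more(total_labeled, unlabeled_remaining, accuracy,
                      max_total_labels=100, max_unlabeled_ratio=0.5,
                      pause_on_low_accuracy=True, min_accuracy_threshold=0.7):
    """Return True if the LLM labeler may emit another batch."""
    if total_labeled >= max_total_labels:
        return False  # hard cap on total LLM labels
    if unlabeled_remaining > 0:
        # only label up to a fixed fraction of the originally unlabeled pool
        budget = int((total_labeled + unlabeled_remaining) * max_unlabeled_ratio)
        if total_labeled >= budget:
            return False
    if pause_on_low_accuracy and accuracy is not None and accuracy < min_accuracy_threshold:
        return False  # verified accuracy too low; wait for more human examples
    return True
```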
### Blind Verification

Verification uses "blind labeling": annotators see the instance as a normal task, without knowing the LLM's prediction:
- Configurable sample rate (default: 20% of LLM labels)
- Multiple selection strategies: `low_confidence`, `random`, `mixed`
- Verification tasks mixed naturally with regular assignments
## Configuration

ICL labeling requires `ai_support` to be enabled:
```yaml
# AI endpoint configuration (required)
ai_support:
  enabled: true
  endpoint_type: "openai"
  ai_config:
    model: "gpt-4o-mini"
    api_key: "${OPENAI_API_KEY}"

# ICL labeling configuration
icl_labeling:
  enabled: true

  # Example selection settings
  example_selection:
    min_agreement_threshold: 0.8    # 80% of annotators must agree
    min_annotators_per_instance: 2  # Minimum annotations for consensus
    max_examples_per_schema: 10     # Max examples per schema in prompt
    refresh_interval_seconds: 300   # How often to refresh examples

  # LLM labeling settings
  llm_labeling:
    batch_size: 20
    trigger_threshold: 5            # Min examples before LLM labeling starts
    confidence_threshold: 0.7       # Min confidence to accept a prediction
    batch_interval_seconds: 600
    max_total_labels: 100           # Max instances to label in total
    max_unlabeled_ratio: 0.5        # Max portion of unlabeled data to label
    pause_on_low_accuracy: true
    min_accuracy_threshold: 0.7

  # Human verification settings
  verification:
    enabled: true
    sample_rate: 0.2                # 20% of LLM labels verified
    selection_strategy: "low_confidence"
    mix_with_regular_assignments: true
    assignment_mix_rate: 0.2
```

### Selection Strategies
- `low_confidence`: Prioritizes the LLM's least confident predictions for verification first
- `random`: Random sampling from all predictions
- `mixed`: 50% low-confidence + 50% random
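A sketch of how such a strategy could pick the verification sample; illustrative only, with predictions represented as `(instance_id, confidence)` pairs rather than Potato's actual data structures:

```python
import random

def pick_for_verification(predictions, sample_rate=0.2, strategy="low_confidence", rng=None):
    """Choose which LLM predictions to route to human verification.

    predictions: list of (instance_id, confidence) pairs.
    """
    rng = rng or random.Random()
    n = max(1, round(len(predictions) * sample_rate)) if predictions else 0
    if strategy == "low_confidence":
        ranked = sorted(predictions, key=lambda p: p[1])  # least confident first
        return ranked[:n]
    if strategy == "random":
        return rng.sample(predictions, n)
    if strategy == "mixed":
        # half from the low-confidence end, the rest sampled at random
        ranked = sorted(predictions, key=lambda p: p[1])
        low = ranked[: n // 2]
        return low + rng.sample(ranked[n // 2 :], n - len(low))
    raise ValueError(f"unknown strategy: {strategy}")
```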
## Admin API

### Status Endpoint

```
GET /admin/api/icl/status
```

Returns the overall ICL labeler status, including examples per schema, predictions made, verification queue size, and accuracy metrics.
### Examples Endpoint

```
GET /admin/api/icl/examples?schema=sentiment
```

Returns high-confidence examples, optionally filtered by schema.
### Accuracy Endpoint

```
GET /admin/api/icl/accuracy?schema=sentiment
```

Returns accuracy metrics based on human verification results.
### Manual Trigger Endpoint

```
POST /admin/api/icl/trigger
Content-Type: application/json

{"schema_name": "sentiment"}
```

Manually triggers batch labeling for a specific schema.
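For example, the trigger request can be built with Python's standard library; the base URL below is an assumption for a local deployment, and the request is constructed but not sent in this sketch:

```python
import json
import urllib.request

BASE = "http://localhost:8000"  # assumed local Potato server; adjust to your deployment

req = urllib.request.Request(
    f"{BASE}/admin/api/icl/trigger",
    data=json.dumps({"schema_name": "sentiment"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted here.
```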
## Usage Workflow

### 1. Configure Your Project
```yaml
ai_support:
  enabled: true
  endpoint_type: "openai"
  ai_config:
    model: "gpt-4o-mini"
    api_key: "${OPENAI_API_KEY}"

icl_labeling:
  enabled: true
  example_selection:
    min_agreement_threshold: 0.8
  llm_labeling:
    max_total_labels: 50  # Start small
  verification:
    enabled: true
    sample_rate: 0.3      # Verify 30% initially
```

### 2. Collect Human Annotations
Have annotators label data normally. As they reach consensus (80%+ agreement), those instances become available as examples.
### 3. Monitor Progress

```bash
curl http://localhost:8000/admin/api/icl/status
```

### 4. Review Accuracy

```bash
curl http://localhost:8000/admin/api/icl/accuracy
```

### 5. Iterate
Based on accuracy:

- If accuracy is high (>80%), increase `max_total_labels`
- If accuracy is low, add more human examples before continuing
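This iteration policy can be written down explicitly; a hedged sketch using the thresholds suggested above (the function name and doubling factor are illustrative choices, not part of Potato):

```python
def next_label_budget(current_max, verified_accuracy, high=0.8, growth=2.0):
    """Suggest a new max_total_labels given verified LLM accuracy.

    Illustrative policy only: grow the budget when accuracy is high,
    otherwise hold it and collect more human consensus examples first.
    """
    if verified_accuracy > high:
        return int(current_max * growth)
    return current_max  # pause scaling until accuracy improves
```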
## Best Practices

- **Start small**: Begin with conservative limits (`max_total_labels: 50`) to assess accuracy before scaling
- **Verify early**: Use a higher `sample_rate` initially (0.3-0.5) to get confident accuracy estimates
- **Monitor actively**: Check accuracy metrics regularly through the admin API
- **Adjust thresholds**: If LLM accuracy is low:
  - Increase `min_agreement_threshold` for cleaner examples
  - Increase `trigger_threshold` to require more examples before labeling
  - Raise `confidence_threshold` to reject uncertain predictions
- **Use selection strategies**:
  - `low_confidence`: Best for identifying problematic categories
  - `random`: Best for unbiased accuracy estimates
  - `mixed`: Balanced approach
## Troubleshooting

### LLM Not Labeling

- Check that `ai_support` is properly configured
- Verify that enough high-confidence examples exist
- Check whether labeling is paused due to limits or low accuracy

### Low Accuracy

- Increase `min_agreement_threshold` for cleaner examples
- Add more annotation guidelines/instructions
- Review the examples being used (`/admin/api/icl/examples`)

### Verification Tasks Not Appearing

- Verify that `verification.enabled` is true
- Check that `mix_with_regular_assignments` is true
- Verify that there are pending verifications in the queue
## Further Reading
- AI Support - General AI endpoint configuration
- Active Learning - Related AI-assisted features
- Quality Control - Accuracy tracking
For implementation details, see the source documentation.