
ICL Labeling

AI-assisted in-context learning with human verification for scalable annotation.

AI-Assisted ICL Labeling

Potato's ICL (In-Context Learning) labeling feature enables AI-assisted annotation by using high-confidence human annotations as in-context examples to guide an LLM in labeling remaining data. The system tracks LLM confidence and routes predictions back to humans for verification.

Overview

The ICL labeling system:

  1. Collects High-Confidence Examples: Identifies instances where annotators agree (e.g., 80%+ agreement)
  2. Labels with LLM: Uses examples to prompt an LLM for labeling unlabeled instances
  3. Tracks Confidence: Records LLM confidence scores for each prediction
  4. Verifies Accuracy: Routes a sample of LLM-labeled instances to humans for blind verification
  5. Reports Metrics: Calculates and displays LLM accuracy based on verification results
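Step 2 above can be sketched in miniature: high-confidence examples are formatted into a few-shot prompt for the LLM. The following Python snippet is illustrative only — the function and field names are assumptions, not Potato's actual API:

```python
def build_icl_prompt(examples, instance_text, schema_labels):
    """Assemble a few-shot labeling prompt from high-confidence examples.

    Illustrative sketch: real prompt templates and example formats
    may differ from this.
    """
    lines = [f"Label each text as one of: {', '.join(schema_labels)}.", ""]
    for ex in examples:
        lines.append(f"Text: {ex['text']}")
        lines.append(f"Label: {ex['label']}")
        lines.append("")
    # The unlabeled instance goes last, with the label left blank
    # for the LLM to complete.
    lines.append(f"Text: {instance_text}")
    lines.append("Label:")
    return "\n".join(lines)

prompt = build_icl_prompt(
    [{"text": "Great service!", "label": "positive"},
     {"text": "Terrible wait times.", "label": "negative"}],
    "The food was fine.",
    ["positive", "negative", "neutral"],
)
```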

Features

Automatic Example Collection

The system automatically identifies high-confidence examples where multiple annotators agree:

  • Configurable agreement threshold (default: 80%)
  • Minimum annotator count requirement (default: 2)
  • Automatic refresh on configurable interval
  • Per-schema example pools
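As a sketch, the consensus check might look like the following. The function name is hypothetical and the defaults mirror the settings above; this is not Potato's implementation:

```python
from collections import Counter

def is_high_confidence(labels, min_agreement=0.8, min_annotators=2):
    """Return (eligible, majority_label) for one instance's annotations.

    Mirrors the defaults above: at least 2 annotators, and the majority
    label must cover at least 80% of annotations. Illustrative only.
    """
    if len(labels) < min_annotators:
        return False, None
    label, count = Counter(labels).most_common(1)[0]
    return count / len(labels) >= min_agreement, label
```

For example, three "positive" votes against one "negative" gives only 75% agreement, so the instance would not join the example pool at the default threshold.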

LLM Labeling with Limits

To enable iterative improvement rather than bulk labeling:

  • Max total labels: Limit the total number of LLM predictions
  • Max unlabeled ratio: Only label a percentage of remaining data
  • Pause on low accuracy: Automatically pause if accuracy drops below threshold
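The three limits combine into a simple gate before each batch. A minimal sketch, assuming dict keys that mirror the config names (not Potato's internal data structures):

```python
def should_label_more(stats, cfg):
    """Return True if another LLM labeling batch is allowed to run.

    Illustrative only: `stats` and `cfg` are plain dicts whose keys
    mirror the configuration names, not Potato's internals.
    """
    if stats["llm_labels"] >= cfg["max_total_labels"]:
        return False  # hard cap on total LLM predictions
    if stats["llm_labels"] / max(stats["unlabeled_at_start"], 1) >= cfg["max_unlabeled_ratio"]:
        return False  # already labeled our share of the unlabeled pool
    acc = stats.get("verified_accuracy")
    if cfg["pause_on_low_accuracy"] and acc is not None and acc < cfg["min_accuracy_threshold"]:
        return False  # pause until humans add examples and accuracy recovers
    return True

# Example config using the documented defaults
cfg = {"max_total_labels": 100, "max_unlabeled_ratio": 0.5,
       "pause_on_low_accuracy": True, "min_accuracy_threshold": 0.7}
```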

Blind Verification

Verification uses "blind labeling": annotators see the instance as a normal task, with no indication that it carries an LLM prediction:

  • Configurable sample rate (default: 20% of LLM labels)
  • Multiple selection strategies: low_confidence, random, mixed
  • Verification tasks mixed naturally with regular assignments

Configuration

ICL labeling requires ai_support to be enabled:

```yaml
# AI endpoint configuration (required)
ai_support:
  enabled: true
  endpoint_type: "openai"
  ai_config:
    model: "gpt-4o-mini"
    api_key: "${OPENAI_API_KEY}"

# ICL labeling configuration
icl_labeling:
  enabled: true

  # Example selection settings
  example_selection:
    min_agreement_threshold: 0.8      # 80% of annotators must agree
    min_annotators_per_instance: 2    # Minimum annotations for consensus
    max_examples_per_schema: 10       # Max examples per schema in prompt
    refresh_interval_seconds: 300     # How often to refresh examples

  # LLM labeling settings
  llm_labeling:
    batch_size: 20
    trigger_threshold: 5              # Min examples before LLM labeling starts
    confidence_threshold: 0.7         # Min confidence to accept a prediction
    batch_interval_seconds: 600
    max_total_labels: 100             # Max instances to label in total
    max_unlabeled_ratio: 0.5          # Max portion of unlabeled data to label
    pause_on_low_accuracy: true
    min_accuracy_threshold: 0.7

  # Human verification settings
  verification:
    enabled: true
    sample_rate: 0.2                  # 20% of LLM labels verified
    selection_strategy: "low_confidence"
    mix_with_regular_assignments: true
    assignment_mix_rate: 0.2
```

Selection Strategies

  • low_confidence: Prioritizes verifying LLM's least confident predictions first
  • random: Random sampling from all predictions
  • mixed: 50% low confidence + 50% random
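The three strategies can be sketched as follows. This is an illustrative Python snippet, not Potato's implementation; the function name and the `(instance_id, confidence)` pair format are assumptions:

```python
import random

def pick_for_verification(predictions, sample_rate=0.2,
                          strategy="low_confidence", rng=random):
    """Choose which LLM predictions to route to human verification.

    Illustrative sketch of the three strategies above; `predictions`
    is a list of (instance_id, confidence) pairs.
    """
    k = max(1, round(len(predictions) * sample_rate))
    by_confidence = sorted(predictions, key=lambda p: p[1])  # least confident first
    if strategy == "low_confidence":
        return by_confidence[:k]
    if strategy == "random":
        return rng.sample(predictions, k)
    if strategy == "mixed":  # half low-confidence, remainder random
        low = by_confidence[:k // 2]
        rest = [p for p in predictions if p not in low]
        return low + rng.sample(rest, k - len(low))
    raise ValueError(f"unknown strategy: {strategy!r}")
```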

Admin API

Status Endpoint

```http
GET /admin/api/icl/status
```

Returns overall ICL labeler status including examples per schema, predictions made, verification queue size, and accuracy metrics.

Examples Endpoint

```http
GET /admin/api/icl/examples?schema=sentiment
```

Returns high-confidence examples, optionally filtered by schema.

Accuracy Endpoint

```http
GET /admin/api/icl/accuracy?schema=sentiment
```

Returns accuracy metrics based on human verification results.

Manual Trigger Endpoint

```http
POST /admin/api/icl/trigger
Content-Type: application/json

{"schema_name": "sentiment"}
```

Manually trigger batch labeling for a specific schema.
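For example, with curl (host and port as in the workflow below):

```shell
# Trigger an LLM labeling batch for the "sentiment" schema
curl -X POST http://localhost:8000/admin/api/icl/trigger \
  -H "Content-Type: application/json" \
  -d '{"schema_name": "sentiment"}'
```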

Usage Workflow

1. Configure Your Project

```yaml
ai_support:
  enabled: true
  endpoint_type: "openai"
  ai_config:
    model: "gpt-4o-mini"
    api_key: "${OPENAI_API_KEY}"

icl_labeling:
  enabled: true
  example_selection:
    min_agreement_threshold: 0.8
  llm_labeling:
    max_total_labels: 50  # Start small
  verification:
    enabled: true
    sample_rate: 0.3  # Verify 30% initially
```

2. Collect Human Annotations

Have annotators label data normally. As they reach consensus (80%+ agreement), those instances become available as examples.

3. Monitor Progress

```bash
curl http://localhost:8000/admin/api/icl/status
```

4. Review Accuracy

```bash
curl http://localhost:8000/admin/api/icl/accuracy
```

5. Iterate

Based on accuracy:

  • If accuracy is high (>80%), increase max_total_labels
  • If accuracy is low, add more human examples before continuing
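For instance, after a strong verification run, a scaled-up follow-up configuration might look like this (values illustrative):

```yaml
icl_labeling:
  llm_labeling:
    max_total_labels: 200   # raised after accuracy held above 80%
  verification:
    sample_rate: 0.2        # can drop from 0.3 once estimates stabilize
```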

Best Practices

  1. Start Small: Begin with conservative limits (max_total_labels: 50) to assess accuracy before scaling

  2. Verify Early: Use a higher sample_rate initially (0.3-0.5) to obtain reliable accuracy estimates

  3. Monitor Actively: Check accuracy metrics regularly through the admin API

  4. Adjust Thresholds: If LLM accuracy is low:

    • Increase min_agreement_threshold for cleaner examples
    • Increase trigger_threshold so labeling waits for more examples
    • Raise confidence_threshold so uncertain predictions are rejected

  5. Use Selection Strategies:

    • low_confidence: Best for identifying problematic categories
    • random: Best for unbiased accuracy estimates
    • mixed: Balanced approach

Troubleshooting

LLM Not Labeling

  1. Check if ai_support is properly configured
  2. Verify enough high-confidence examples exist
  3. Check if labeling is paused due to limits or low accuracy

Low Accuracy

  1. Increase min_agreement_threshold for cleaner examples
  2. Add more annotation guidelines/instructions
  3. Review examples being used (/admin/api/icl/examples)

Verification Tasks Not Appearing

  1. Verify verification.enabled is true
  2. Check mix_with_regular_assignments is true
  3. Verify there are pending verifications in the queue

Further Reading

For implementation details, see the source documentation.