LLM-as-Judge Calibration

Auto-label data with one or more LLM judges, then run a blind human calibration pass to measure accuracy, agreement, and calibration error. Answers "should I trust this LLM judge?" with a defensible, reproducible workflow.

Judge Calibration auto-labels your data with one or more LLM judges, then calibrates them against blind human labels so you can quantify how far to trust an LLM-as-a-judge. You write a judge prompt, pick the models, and Potato samples each one k times over your data. You then blind-label a sample without seeing the model answers, and Potato reports per-model accuracy, human↔model and model↔model agreement, calibration error, and confusion matrices.

Using an LLM to grade model outputs is now common in agent and model evaluation, but a judge is only useful if you know how well it tracks human judgment. Calibration is the measurement step that makes that trust defensible.

How it works

```text
SETUP → GENERATING → HUMAN_CALIBRATION → REPORT → COMPLETED
```

Generating — each model is queried k times per item. The modal label is the prediction; the fraction of the k samples that agree with it is the model's confidence. Results go to a dedicated store, never mixed into the annotation data, so humans cannot see them.
Human calibration — Potato draws a random or stratified sample of the labeled items, and one or more humans blind-label them through the normal annotation interface.
Report — metrics are computed over the human∩model overlap and written to the output directory.
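
The Generating step's prediction and confidence can be sketched in a few lines. This is an illustration of the vote-fraction idea described above, not Potato's actual code; the function name `modal_label_and_confidence` and the tie-breaking rule (first label seen wins) are assumptions.

```python
from collections import Counter


def modal_label_and_confidence(samples):
    """Return (prediction, confidence) for one model on one item.

    prediction is the most frequent label among the k samples; confidence is
    the fraction of the k samples that agree with it. Tie-breaking is an
    assumption of this sketch: among equal counts, the first label seen wins.
    """
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(samples)


# Hypothetical k=5 samples from one judge on one item.
prediction, confidence = modal_label_and_confidence(
    ["positive", "positive", "neutral", "positive", "negative"]
)
print(prediction, confidence)  # positive 0.6
```

With k_samples: 5 the confidence can only take the values 0.2, 0.4, 0.6, 0.8, and 1.0, which is worth keeping in mind when reading the reliability bins in the report.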

Because the model labels live in a separate store and are never injected into the UI, blindness is structural rather than a matter of annotator discipline.

Quick start

Run the included example from the repository root:

```bash
python potato/flask_server.py start examples/ai-assisted/judge-calibration/config.yaml -p 8000 --debug
```

Open http://localhost:8000/judge_calibration/admin to configure and run.
When generation finishes, blind-label the sample at http://localhost:8000/annotate.
Click Build report, then open http://localhost:8000/judge_calibration/report.

The example uses a local Ollama model, so no API key is required. Start Ollama and run ollama pull llama3.2:3b first.

Configuration

```yaml
judge_calibration:
  enabled: true
  prompt: |                       # supports {text}, {labels}, {description}
    You are an impartial expert annotator. Classify the sentiment as exactly
    one of: positive, negative, neutral.
  models:
    - endpoint_type: openai        # openai | anthropic | ollama | vllm | gemini | openrouter | huggingface
      model: gpt-4o-mini
      api_key: ${OPENAI_API_KEY}   # env-var expansion supported
      temperature: 0.7             # must be > 0 so the k samples vary
    - endpoint_type: ollama
      model: llama3.1:8b
      base_url: http://localhost:11434
      temperature: 0.7
  k_samples: 5                     # samples per model per item
  max_items: 1000                  # cap on items the LLMs label (null = all)
  sampling:
    strategy: stratified           # random | stratified | all
    sample_size: 200               # how many items humans blind-label
    seed: 42
  human:
    num_raters: 1                  # 1 = solo researcher; N adds human-human IAA
    gold: single                   # single | majority
  schemas: [sentiment]             # annotation_scheme names to evaluate ([] = all)
  output:
    dir: judge_calibration_output
```

You can override most of these in the admin wizard and re-run.
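
To make the sampling options concrete, here is a minimal sketch of what a seeded stratified draw could look like. It is not Potato's implementation: the function name `stratified_sample`, the choice of stratum key (the judge's modal label), and the proportional allocation with a random top-up are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(item_labels, sample_size, seed=42):
    """Pick sample_size item ids, roughly proportional to each stratum's size.

    item_labels maps item id -> stratum key. The stratum key is assumed here
    to be the judge's modal label; the source does not say what Potato uses.
    """
    total = len(item_labels)
    if total == 0:
        return []
    sample_size = min(sample_size, total)
    rng = random.Random(seed)  # fixed seed makes the draw reproducible

    strata = defaultdict(list)
    for item_id, label in sorted(item_labels.items()):
        strata[label].append(item_id)

    chosen, leftovers = [], []
    for label in sorted(strata):
        ids = strata[label]
        rng.shuffle(ids)
        quota = (sample_size * len(ids)) // total  # floor of proportional share
        chosen.extend(ids[:quota])
        leftovers.extend(ids[quota:])

    # Flooring can leave a few slots unfilled; top up at random from the rest.
    rng.shuffle(leftovers)
    chosen.extend(leftovers[: sample_size - len(chosen)])
    return sorted(chosen)


# Hypothetical data: 75 items judged "positive", 25 judged "negative".
labels = {f"item{i:03d}": ("positive" if i % 4 else "negative") for i in range(100)}
sample = stratified_sample(labels, 20)
print(len(sample), sum(labels[i] == "negative" for i in sample))
```

Stratifying keeps rare classes represented in the blind-labeled set, which matters for per-class precision and recall when sample_size is small.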

Set temperature > 0. With k_samples > 1 and temperature 0, the k samples are identical, so confidence is always 1.0 and the calibration report is meaningless. Potato emits a startup warning in that case.
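
If you generate configs programmatically, the same condition is easy to check before launching. The sketch below is a hypothetical pre-flight check, not the warning Potato itself emits; `temperature_warnings` is an invented name, and the decision to flag only an explicit temperature (leaving a missing value to the provider default) is an assumption.

```python
def temperature_warnings(judge_cfg):
    """Return one warning per model whose k samples cannot vary.

    judge_cfg is the parsed judge_calibration mapping. Only an explicit
    temperature <= 0 is flagged; a missing value is left to the provider
    default (an assumption of this sketch).
    """
    k = judge_cfg.get("k_samples", 1)
    found = []
    for model in judge_cfg.get("models", []):
        temp = model.get("temperature")
        if k > 1 and temp is not None and temp <= 0:
            found.append(
                f"{model.get('model', '?')}: temperature={temp} with k_samples={k} "
                "gives identical samples, so confidence is always 1.0"
            )
    return found


# Hypothetical parsed config with one misconfigured model.
cfg = {
    "k_samples": 5,
    "models": [
        {"endpoint_type": "openai", "model": "gpt-4o-mini", "temperature": 0.7},
        {"endpoint_type": "ollama", "model": "llama3.1:8b", "temperature": 0},
    ],
}
for message in temperature_warnings(cfg):
    print("WARNING:", message)
```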

Supported annotation types

| Type | Status | Metrics |
| --- | --- | --- |
| `radio` / `select` | Supported | accuracy, P/R/F1, Cohen/Fleiss κ, Krippendorff α, ECE, confusion |
| `likert` | Supported | the above plus MAE and ordinal Krippendorff α |
| `multiselect` | Supported | per-label P/R/F1, mean Jaccard, exact-match accuracy, calibration |
| `span` | Experimental | IoU-matched P/R/F1, mean IoU, span-F1 agreement, span-level calibration |
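
For the multiselect row, mean Jaccard and exact-match accuracy can be sketched as follows. This is an illustration under stated assumptions rather than Potato's code: the function names are invented, and treating two empty selections as full agreement (Jaccard 1.0) is a convention chosen here.

```python
def jaccard(a, b):
    """Jaccard similarity of two label collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty selections count as full agreement
    return len(a & b) / len(a | b)


def multiselect_scores(gold, pred):
    """gold, pred: equal-length lists of label collections, aligned by item."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    scores = [jaccard(g, p) for g, p in zip(gold, pred)]
    exact = sum(set(g) == set(p) for g, p in zip(gold, pred))
    return {
        "mean_jaccard": sum(scores) / len(scores),
        "exact_match": exact / len(gold),
    }


# Hypothetical human gold and judge predictions for three items.
gold = [{"toxic", "spam"}, {"spam"}, set()]
pred = [{"toxic"}, {"spam"}, {"off_topic"}]
print(multiselect_scores(gold, pred))
```

Here the per-item Jaccard values are 0.5, 1.0, and 0.0, so the mean is 0.5 while only one of three items is an exact match. Jaccard gives partial credit where exact match does not.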

Span support clusters the judge's character-offset spans across the k samples and matches them to gold by intersection-over-union. Both the clustering and the matching are heuristic, so treat span metrics as indicative rather than exact.
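
The IoU matching step can be sketched as below. This covers only the matching of already-clustered judge spans against gold; the clustering across the k samples is not shown. The greedy one-to-one strategy, the 0.5 threshold, and the function names are assumptions for illustration, not Potato's documented behavior.

```python
def span_iou(a, b):
    """Intersection-over-union of two half-open character spans (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_spans(pred, gold, threshold=0.5):
    """Greedy one-to-one matching by descending IoU.

    A predicted span counts as a true positive when it is paired with a gold
    span at IoU >= threshold (0.5 is an assumed default).
    """
    pairs = sorted(
        ((span_iou(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gold)),
        reverse=True,
    )
    used_pred, used_gold, matched = set(), set(), []
    for iou, i, j in pairs:
        if iou < threshold:
            break  # pairs are sorted, so nothing later can qualify
        if i in used_pred or j in used_gold:
            continue
        used_pred.add(i)
        used_gold.add(j)
        matched.append(iou)

    tp = len(matched)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    mean_iou = sum(matched) / tp if tp else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mean_iou": mean_iou}


# Hypothetical judge spans and human gold spans over the same text.
pred_spans = [(0, 10), (20, 30), (50, 55)]
gold_spans = [(2, 10), (22, 34)]
print(match_spans(pred_spans, gold_spans))
```

In this example two of the three predicted spans match gold (IoU 0.8 and about 0.57), so precision is 2/3 and recall is 1.0. Raising the threshold to 0.6 would drop the second match, which is one reason span numbers are sensitive to heuristic choices.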

What the report contains

Accuracy, precision, recall, F1 for each model against the human gold label.
Cohen's κ partitioned into human↔model, model↔model, and human↔human pairs.
Fleiss' κ and Krippendorff's α across all raters.
Expected Calibration Error (ECE), reliability bins, and Brier score, showing how well the vote-fraction confidence tracks correctness.
A confusion matrix per model against the human gold.
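
As a rough guide to how the calibration numbers relate to the vote-fraction confidence, here is a minimal sketch of ECE, reliability bins, and a Brier score. It assumes ten equal-width bins and the binary (top-label) form of the Brier score; Potato's exact binning and Brier variant are not specified here, and `calibration_metrics` is an invented name.

```python
def calibration_metrics(confidences, correct, n_bins=10):
    """Compute ECE, a top-label Brier score, and reliability bins.

    confidences: vote fractions in [0, 1], one per item.
    correct: booleans, True where the model's prediction equals the human gold.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and the same length")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, float(ok)))

    ece = 0.0
    reliability = []
    for idx, members in enumerate(bins):
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        # ECE is the size-weighted average gap between confidence and accuracy.
        ece += len(members) / n * abs(accuracy - mean_conf)
        reliability.append(
            {"bin": idx, "count": len(members),
             "mean_confidence": mean_conf, "accuracy": accuracy}
        )

    brier = sum((c - float(ok)) ** 2 for c, ok in zip(confidences, correct)) / n
    return {"ece": ece, "brier": brier, "reliability": reliability}


# Hypothetical vote fractions and correctness flags for four items.
metrics = calibration_metrics([1.0, 0.6, 0.8, 0.6], [True, False, True, True])
print(metrics["ece"], metrics["brier"])
```

For these four items the ECE comes out at roughly 0.10 and the Brier score at about 0.14. If every confidence were 1.0, as happens at temperature 0, all items would fall in one bin and ECE would reduce to 1 minus accuracy, which says nothing about calibration.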

Metrics are computed over the overlap: items that both the models and the human(s) labeled, restricted to the calibration sample when one was drawn.
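
The overlap restriction and the agreement score can be sketched together. This is a plain-Python illustration of accuracy and Cohen's κ over shared items, assuming labels are stored as item-id-to-label mappings; the function names are invented and this is not Potato's implementation.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two aligned label sequences."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("inputs must be non-empty and the same length")
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def overlap_scores(human, model):
    """Score only the items that both the human and the model labeled."""
    shared = sorted(human.keys() & model.keys())
    if not shared:
        raise ValueError("no overlapping items")
    h = [human[i] for i in shared]
    m = [model[i] for i in shared]
    accuracy = sum(x == y for x, y in zip(h, m)) / len(shared)
    return {"n": len(shared), "accuracy": accuracy, "kappa": cohen_kappa(h, m)}


# Hypothetical labels: the model labeled five items, the human four of them.
human_labels = {"a": "pos", "b": "neg", "c": "pos", "d": "neg"}
model_labels = {"a": "pos", "b": "neg", "c": "neg", "d": "neg", "e": "pos"}
print(overlap_scores(human_labels, model_labels))
```

Item "e" is ignored because no human labeled it. On the four shared items accuracy is 0.75 but κ is 0.5, since half of the observed agreement is what the two label distributions would produce by chance.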

Output is written under output.dir: llm_labels.jsonl (one line per model, item, and schema), report.json, and a human-readable report.html.

Judge Calibration vs. Judge Alignment

Judge Calibration uses multiple judges, empirical confidence (the vote fraction across the k samples), and keeps the human strictly blind. Judge Alignment calibrates a single judge against existing human gold labels, shows its verdict inline during annotation, and is built around iterating on a rubric. Reach for calibration when you are vetting candidate judges; reach for alignment when you are tuning one judge against a fixed gold set.

Related

Judge ↔ Human Alignment — single-judge inline calibration
Solo Mode — full human-LLM collaborative labeling
Inter-annotator agreement guide — the kappa and alpha metrics in depth

For implementation details, see the source documentation.