LLM-as-Judge Calibration
Auto-label data with one or more LLM judges, then run a blind human calibration pass to measure accuracy, agreement, and calibration error. Answers "should I trust this LLM judge?" with a defensible, reproducible workflow.
Judge Calibration auto-labels your data with one or more LLM judges, then calibrates them against blind human labels so you can quantify how far to trust an LLM-as-a-judge. You write a judge prompt, pick the models, and Potato samples each one k times over your data. You then blind-label a sample without seeing the model answers, and Potato reports per-model accuracy, human↔model and model↔model agreement, calibration error, and confusion matrices.
Using an LLM to grade model outputs is now common in agent and model evaluation, but a judge is only useful if you know how well it tracks human judgment. Calibration is the measurement step that makes that trust defensible.
How it works
SETUP → GENERATING → HUMAN_CALIBRATION → REPORT → COMPLETED
- Generating — each model is queried k times per item. The modal label is the prediction; the fraction of the k samples that agree with it is the model's confidence. Results go to a dedicated store, never mixed into the annotation data, so humans cannot see them.
- Human calibration — Potato draws a random or stratified sample of the labeled items, and one or more humans blind-label them through the normal annotation interface.
- Report — metrics are computed over the human∩model overlap and written to the output directory.
Because the model labels live in a separate store and are never injected into the UI, blindness is structural rather than a matter of annotator discipline.
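The prediction and confidence described in the Generating step can be sketched in a few lines. This is a minimal illustration, not Potato's code: the function name `modal_label_and_confidence` is hypothetical, and the tie-breaking rule (first label seen wins) is an assumption, since the text above does not say how ties are resolved.

```python
from collections import Counter


def modal_label_and_confidence(samples):
    """Return (modal label, vote fraction) for one model's k samples on one item."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    # most_common(1) yields the most frequent label; on a tie, the label
    # encountered first wins (an assumption made for this sketch).
    label, votes = counts.most_common(1)[0]
    return label, votes / len(samples)


# k = 5 samples from one model for one item
samples = ["positive", "positive", "neutral", "positive", "negative"]
label, confidence = modal_label_and_confidence(samples)
print(label, confidence)  # positive 0.6
```

With k samples the confidence can only take the values 1/k, 2/k, ..., 1, so a larger k gives a finer-grained confidence scale.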
Quick start
Run the included example from the repository root:

    python potato/flask_server.py start examples/ai-assisted/judge-calibration/config.yaml -p 8000 --debug

- Open http://localhost:8000/judge_calibration/admin to configure and run.
- When generation finishes, blind-label the sample at http://localhost:8000/annotate.
- Click Build report, then open http://localhost:8000/judge_calibration/report.
The example uses a local Ollama model, so no API key is required. Start Ollama and run ollama pull llama3.2:3b first.
Configuration

    judge_calibration:
      enabled: true
      prompt: |                       # supports {text}, {labels}, {description}
        You are an impartial expert annotator. Classify the sentiment as exactly
        one of: positive, negative, neutral.
      models:
        - endpoint_type: openai       # openai | anthropic | ollama | vllm | gemini | openrouter | huggingface
          model: gpt-4o-mini
          api_key: ${OPENAI_API_KEY}  # env-var expansion supported
          temperature: 0.7            # must be > 0 so the k samples vary
        - endpoint_type: ollama
          model: llama3.1:8b
          base_url: http://localhost:11434
          temperature: 0.7
      k_samples: 5                    # samples per model per item
      max_items: 1000                 # cap on items the LLMs label (null = all)
      sampling:
        strategy: stratified          # random | stratified | all
        sample_size: 200              # how many items humans blind-label
        seed: 42
      human:
        num_raters: 1                 # 1 = solo researcher; N adds human-human IAA
        gold: single                  # single | majority
      schemas: [sentiment]            # annotation_scheme names to evaluate ([] = all)
      output:
        dir: judge_calibration_output

You can override most of these in the admin wizard and re-run.
Set temperature > 0. With k_samples > 1 and temperature 0 the samples are identical, confidence is always 1.0, and the calibration report is meaningless; Potato emits a startup warning in that case.
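The same condition can be checked before launching a run. The sketch below is illustrative only: `sampling_warnings` is a hypothetical helper, not part of Potato, and it assumes the `models` list has already been loaded from the config into plain dictionaries. A model with no explicit `temperature` is not flagged, on the assumption that the provider default applies.

```python
def sampling_warnings(models, k_samples):
    """Return one warning per model whose k samples cannot vary."""
    warnings = []
    for model in models:
        temperature = model.get("temperature")  # None means "not set in the config"
        if k_samples > 1 and temperature is not None and temperature <= 0:
            warnings.append(
                f"{model.get('model', '?')}: temperature is {temperature} with "
                f"k_samples={k_samples}; confidence will always be 1.0"
            )
    return warnings


models = [
    {"endpoint_type": "openai", "model": "gpt-4o-mini", "temperature": 0.0},
    {"endpoint_type": "ollama", "model": "llama3.1:8b", "temperature": 0.7},
]
for warning in sampling_warnings(models, k_samples=5):
    print(warning)
```

When every confidence is 1.0, all items fall into the top reliability bin, so ECE collapses to 1 minus accuracy and carries no information beyond the accuracy figure.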
Supported annotation types
| Type | Status | Metrics |
|---|---|---|
| radio / select | Supported | accuracy, P/R/F1, Cohen/Fleiss κ, Krippendorff α, ECE, confusion |
| likert | Supported | the above plus MAE and ordinal Krippendorff α |
| multiselect | Supported | per-label P/R/F1, mean Jaccard, exact-match accuracy, calibration |
| span | Experimental | IoU-matched P/R/F1, mean IoU, span-F1 agreement, span-level calibration |
Span support clusters the judge's character-offset spans across the k samples and matches them to gold spans by intersection-over-union (IoU). The clustering and matching are heuristic, so treat span metrics as directional rather than exact.
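To make IoU matching concrete, here is a minimal sketch for half-open character spans `(start, end)`. It covers only the matching step, not the clustering across the k samples. The greedy one-to-one matching and the 0.5 threshold are assumptions chosen for illustration; Potato's actual heuristics may differ.

```python
def span_iou(a, b):
    """Intersection-over-union of two half-open character spans (start, end)."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def match_spans(pred, gold, threshold=0.5):
    """Greedily pair predicted and gold spans one-to-one, highest IoU first."""
    candidates = sorted(
        ((span_iou(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gold)),
        reverse=True,
    )
    used_pred, used_gold, matches = set(), set(), []
    for iou, i, j in candidates:
        if iou < threshold:
            break  # remaining candidates overlap even less
        if i in used_pred or j in used_gold:
            continue
        used_pred.add(i)
        used_gold.add(j)
        matches.append((i, j, iou))
    return matches


def span_prf(pred, gold, threshold=0.5):
    """Precision, recall, and F1 where a true positive is a matched span pair."""
    tp = len(match_spans(pred, gold, threshold))
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


pred = [(0, 10), (20, 30)]
gold = [(2, 10), (40, 50)]
print(span_iou(pred[0], gold[0]))  # 0.8
print(span_prf(pred, gold))        # (0.5, 0.5, 0.5)
```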
What the report contains
- Accuracy, precision, recall, F1 for each model against the human gold label.
- Cohen's κ partitioned into human↔model, model↔model, and human↔human pairs.
- Fleiss' κ and Krippendorff's α across all raters.
- Expected Calibration Error (ECE), reliability bins, and Brier score, showing how well the vote-fraction confidence tracks correctness.
- A confusion matrix per model against the human gold.
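For readers unfamiliar with the agreement numbers, Cohen's κ for one human↔model pair can be computed as below. This is a textbook sketch, not Potato's implementation; the function name and the convention of returning 1.0 when chance agreement is already 1 (where κ is formally undefined) are assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label; convention for this sketch
    return (observed - expected) / (1 - expected)


human = ["pos", "pos", "neg", "neg"]
model = ["pos", "neg", "neg", "neg"]
print(cohens_kappa(human, model))  # 0.5
```

Here raw agreement is 0.75, but half of that is expected by chance given the label frequencies, which is why κ is lower than accuracy.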
Metrics are computed over the overlap: items that both the models and the human(s) labeled, restricted to the calibration sample when one was drawn.
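The calibration figures can be illustrated with a short sketch that takes, for each overlap item, the model's vote-fraction confidence and whether its modal label matched the human gold. The equal-width binning with 10 bins and the binary form of the Brier score (confidence in the modal label versus correctness) are assumptions for illustration; the report may bin or score differently.

```python
def calibration_metrics(confidences, correct, n_bins=10):
    """Return (ECE, Brier score) from per-item confidence and 0/1 correctness."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need one correctness flag per confidence")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of the items.
        ece += len(members) / n * abs(accuracy - mean_conf)

    brier = sum((conf - ok) ** 2 for conf, ok in zip(confidences, correct)) / n
    return ece, brier


confidences = [1.0, 1.0, 0.6, 0.6]  # vote fractions with k = 5
correct = [1, 1, 1, 0]              # 1 if the modal label equals the human gold
ece, brier = calibration_metrics(confidences, correct)
print(round(ece, 3), round(brier, 3))  # 0.05 0.13
```

In this example the unanimous predictions are all correct, while the 3-of-5 predictions are right only half the time, so the judge is slightly overconfident in its split votes.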
Output is written under output.dir: llm_labels.jsonl (one line per model, item, and schema), report.json, and a human-readable report.html.
Judge Calibration vs. Judge Alignment
Judge Calibration compares multiple judges, derives confidence empirically (the vote fraction across the k samples), and keeps the human strictly blind. Judge Alignment calibrates a single judge against existing human gold labels, shows its verdict inline during annotation, and is built around iterating on a rubric. Reach for calibration when you are vetting candidate judges; reach for alignment when you are tuning one judge against a fixed gold set.
Related
- Judge ↔ Human Alignment — single-judge inline calibration
- Solo Mode — full human-LLM collaborative labeling
- Inter-annotator agreement guide — the kappa and alpha metrics in depth
For implementation details, see the source documentation.