# LLM-as-Judge Calibration

Source: https://www.potatoannotator.com/docs/agent-evaluation/judge-calibration

**Judge Calibration auto-labels your data with one or more LLM judges, then calibrates them against blind human labels** so you can quantify how far to trust an [LLM-as-a-judge](https://huggingface.co/learn/cookbook/en/llm_judge). You write a judge prompt, pick the models, and Potato samples each one *k* times over your data. You then blind-label a sample without seeing the model answers, and Potato reports per-model accuracy, human↔model and model↔model agreement, [calibration error](https://en.wikipedia.org/wiki/Calibration_(statistics)), and confusion matrices.

Using an LLM to grade model outputs is now common in agent and model evaluation, but a judge is only useful if you know how well it tracks human judgment. Calibration is the measurement step that makes that trust defensible.

## How it works

```
SETUP → GENERATING → HUMAN_CALIBRATION → REPORT → COMPLETED
```

1. **Generating** — each model is queried *k* times per item. The modal label is the prediction; the fraction of the *k* samples that agree with it is the model's confidence. Results go to a dedicated store, never mixed into the annotation data, so humans cannot see them.
2. **Human calibration** — Potato draws a random or stratified sample of the labeled items, and one or more humans blind-label them through the normal annotation interface.
3. **Report** — metrics are computed over the human∩model overlap and written to the output directory.

Because the model labels live in a separate store and are never injected into the UI, blindness is structural rather than a matter of annotator discipline.
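The aggregation described in the Generating step can be sketched in a few lines. This is an illustration only, not Potato's implementation: `aggregate_samples` is a hypothetical name, and the tie-breaking rule (first label seen wins) is an assumption the documentation does not specify.

```python
from collections import Counter


def aggregate_samples(samples):
    """Return (modal_label, confidence) for one model's k samples on one item."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    # most_common(1) yields the label with the highest count. Ties resolve to
    # the label encountered first, an arbitrary choice made for this sketch.
    label, votes = counts.most_common(1)[0]
    # Confidence is the fraction of the k samples that agree with the mode.
    return label, votes / len(samples)


label, confidence = aggregate_samples(
    ["positive", "positive", "neutral", "positive", "negative"]
)
print(label, confidence)  # positive 0.6
```

With *k* = 5, confidence can only take the values 0.2, 0.4, 0.6, 0.8, and 1.0, which is worth keeping in mind when reading the reliability bins in the report.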

## Quick start

Run the included example from the repository root:

```bash
python potato/flask_server.py start examples/ai-assisted/judge-calibration/config.yaml -p 8000 --debug
```

- Open `http://localhost:8000/judge_calibration/admin` to configure and run.
- When generation finishes, blind-label the sample at `http://localhost:8000/annotate`.
- Click **Build report**, then open `http://localhost:8000/judge_calibration/report`.

The example uses a local [Ollama](https://ollama.com) model, so no API key is required. Start Ollama and run `ollama pull llama3.2:3b` first.

## Configuration

```yaml
judge_calibration:
  enabled: true
  prompt: |                       # supports {text}, {labels}, {description}
    You are an impartial expert annotator. Classify the sentiment as exactly
    one of: positive, negative, neutral.
  models:
    - endpoint_type: openai        # openai | anthropic | ollama | vllm | gemini | openrouter | huggingface
      model: gpt-4o-mini
      api_key: ${OPENAI_API_KEY}   # env-var expansion supported
      temperature: 0.7             # must be > 0 so the k samples vary
    - endpoint_type: ollama
      model: llama3.1:8b
      base_url: http://localhost:11434
      temperature: 0.7
  k_samples: 5                     # samples per model per item
  max_items: 1000                  # cap on items the LLMs label (null = all)
  sampling:
    strategy: stratified           # random | stratified | all
    sample_size: 200               # how many items humans blind-label
    seed: 42
  human:
    num_raters: 1                  # 1 = solo researcher; N adds human-human IAA
    gold: single                   # single | majority
  schemas: [sentiment]             # annotation_scheme names to evaluate ([] = all)
  output:
    dir: judge_calibration_output
```

You can override most of these in the admin wizard and re-run.
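To make the `sampling` settings more concrete, here is a rough sketch of a seeded, proportional stratified draw. It is not Potato's sampler: the function name is hypothetical, and stratifying on a per-item label (for example, a model's modal label) is an assumption, since this page does not say which variable defines the strata.

```python
import random
from collections import defaultdict


def stratified_sample(item_labels, sample_size, seed=42):
    """Draw a roughly proportional sample of item ids from each label stratum.

    item_labels maps item id -> stratum label (an assumed stratification key).
    """
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    strata = defaultdict(list)
    for item_id, label in sorted(item_labels.items()):
        strata[label].append(item_id)

    total = len(item_labels)
    sample_size = min(sample_size, total)
    chosen = []
    for label in sorted(strata):
        ids = strata[label]
        # Proportional quota, with at least one item per stratum.
        quota = max(1, round(sample_size * len(ids) / total))
        chosen.extend(rng.sample(ids, min(quota, len(ids))))

    rng.shuffle(chosen)
    return chosen[:sample_size]  # rounding can overshoot, so trim


items = {f"item{i}": ("positive" if i < 90 else "negative") for i in range(100)}
sample = stratified_sample(items, sample_size=20, seed=42)
print(len(sample), sum(items[i] == "negative" for i in sample))  # 20 2
```

Compared with a purely random draw, a stratified draw like this keeps rare labels represented in the calibration sample, which matters for per-class precision and recall.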

Set `temperature > 0`. With `k_samples > 1` and a temperature of 0, the *k* samples are effectively identical, so confidence is always 1.0 and the calibration metrics carry no information. Potato emits a startup warning in that case.
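If you generate configs programmatically, a similar check is easy to run before starting the server. The sketch below operates on the `judge_calibration` block after it has been parsed into a dictionary. The function name is hypothetical, and treating a missing `temperature` as 0 is an assumption; it does not reproduce Potato's own warning logic.

```python
def sampling_warnings(judge_cfg):
    """Return one warning per model whose k samples cannot vary."""
    warnings = []
    k = judge_cfg.get("k_samples", 1)
    for model in judge_cfg.get("models", []):
        # Assumption for this sketch: a missing temperature counts as 0.
        temperature = model.get("temperature", 0.0)
        if k > 1 and temperature <= 0:
            name = model.get("model", "<unnamed>")
            warnings.append(
                f"{name}: temperature={temperature} with k_samples={k} "
                "yields identical samples and a confidence of 1.0"
            )
    return warnings


cfg = {
    "k_samples": 5,
    "models": [
        {"endpoint_type": "openai", "model": "gpt-4o-mini", "temperature": 0.7},
        {"endpoint_type": "ollama", "model": "llama3.1:8b", "temperature": 0},
    ],
}
for message in sampling_warnings(cfg):
    print(message)
```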

## Supported annotation types

| Type | Status | Metrics |
|------|--------|---------|
| `radio` / `select` | Supported | accuracy, P/R/F1, Cohen/Fleiss κ, Krippendorff α, ECE, confusion |
| `likert` | Supported | the above plus MAE and ordinal Krippendorff α |
| `multiselect` | Supported | per-label P/R/F1, mean Jaccard, exact-match accuracy, calibration |
| `span` | Experimental | IoU-matched P/R/F1, mean IoU, span-F1 agreement, span-level calibration |

Span support clusters the judge's character-offset spans across the *k* samples and matches them to gold by [intersection-over-union](https://en.wikipedia.org/wiki/Jaccard_index). The clustering and matching are heuristic, so treat span metrics as approximate indicators rather than exact scores.
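For intuition, IoU-matched precision, recall, and F1 over character spans can be sketched as follows. This covers only the matching step, not the clustering across the *k* samples. The half-open `(start, end)` span convention, the greedy one-to-one matching, and the 0.5 threshold are all assumptions of this sketch, and the function names are hypothetical.

```python
def span_iou(a, b):
    """IoU of two half-open character spans given as (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_spans(predicted, gold, threshold=0.5):
    """Greedily pair predicted and gold spans in order of descending IoU."""
    candidates = sorted(
        (
            (span_iou(p, g), i, j)
            for i, p in enumerate(predicted)
            for j, g in enumerate(gold)
        ),
        reverse=True,
    )
    used_pred, used_gold, matches = set(), set(), []
    for iou, i, j in candidates:
        if iou < threshold:
            break  # remaining candidates overlap even less
        if i in used_pred or j in used_gold:
            continue  # each span may be matched at most once
        used_pred.add(i)
        used_gold.add(j)
        matches.append((predicted[i], gold[j], iou))
    return matches


def span_prf(predicted, gold, threshold=0.5):
    """Precision, recall, and F1 where a true positive is a matched pair."""
    tp = len(match_spans(predicted, gold, threshold))
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


predicted = [(0, 10), (20, 30)]
gold = [(2, 10), (40, 50)]
print(span_iou(predicted[0], gold[0]))  # 0.8
print(span_prf(predicted, gold))        # (0.5, 0.5, 0.5)
```

Changing the threshold changes which near-misses count as hits, which is one reason span scores are best compared across judges under identical settings rather than read as absolute values.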

## What the report contains

- **Accuracy, precision, recall, F1** for each model against the human gold label.
- **Cohen's κ** reported separately for human↔model, model↔model, and human↔human pairs (the last only when `num_raters` is greater than 1).
- **Fleiss' κ** and **Krippendorff's α** across all raters.
- **Expected Calibration Error (ECE)**, reliability bins, and Brier score, showing how well the vote-fraction confidence tracks correctness.
- A **confusion matrix** per model against the human gold.
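As a reference for the calibration metrics listed above, here is a minimal sketch of ECE with equal-width bins and of a Brier score computed on top-label correctness. Potato's exact binning and Brier formulation are not documented on this page, so the bin count, the binary form of the Brier score, and the function names are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between average confidence and accuracy per bin."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def brier_score(confidences, correct):
    """Mean squared error between confidence and the 0/1 correctness outcome."""
    pairs = list(zip(confidences, correct))
    return sum((c - ok) ** 2 for c, ok in pairs) / len(pairs)


# Vote-fraction confidences and whether the modal label matched the human gold.
confidences = [1.0, 0.6, 0.6, 0.8]
correct = [1, 1, 0, 1]
print(f"ECE   = {expected_calibration_error(confidences, correct):.3f}")  # 0.100
print(f"Brier = {brier_score(confidences, correct):.3f}")                 # 0.140
```

A judge whose confidence is always 1.0 lands entirely in the last bin, so its ECE reduces to its error rate and says nothing about how confidence relates to correctness.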

Metrics are computed over the overlap: items that both the models and the human(s) labeled, restricted to the calibration sample when one was drawn.
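The overlap restriction can be illustrated with Cohen's κ between one human and one model. In this sketch, labels are stored as dictionaries keyed by item id, which is an assumed data layout, and the helper names are hypothetical. Returning 1.0 when both raters use a single identical label is a convention chosen here, since κ is undefined in that case.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty sequences of equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # convention for this sketch; kappa is undefined here
    return (observed - expected) / (1 - expected)


def kappa_on_overlap(human, model):
    """Compute kappa only over item ids labeled by both raters."""
    shared = sorted(human.keys() & model.keys())
    kappa = cohen_kappa([human[i] for i in shared], [model[i] for i in shared])
    return kappa, len(shared)


human = {"a": "pos", "b": "neg", "c": "pos", "d": "neg"}
model = {"a": "pos", "b": "neg", "c": "neg", "d": "neg", "e": "pos"}
print(kappa_on_overlap(human, model))  # (0.5, 4)
```

Item `e` has a model label but no human label, so it is excluded; reporting the overlap size alongside each metric makes it clear how many items the estimate rests on.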

Output is written under `output.dir`: `llm_labels.jsonl` (one line per model, item, and schema), `report.json`, and a human-readable `report.html`.

## Judge Calibration vs. Judge Alignment

Judge Calibration compares **multiple** judges, uses **empirical** confidence (the vote fraction across the *k* samples), and keeps the human strictly blind. [Judge Alignment](/docs/agent-evaluation/judge-alignment) calibrates a **single** judge against existing human gold labels, shows its verdict inline during annotation, and is built around iterating on a rubric. Reach for calibration when you are vetting candidate judges; reach for alignment when you are tuning one judge against a fixed gold set.

## Related

- [Judge ↔ Human Alignment](/docs/agent-evaluation/judge-alignment) — single-judge inline calibration
- [Solo Mode](/docs/features/solo-mode) — full human-LLM collaborative labeling
- [Inter-annotator agreement guide](/docs/guides/inter-annotator-agreement) — the kappa and alpha metrics in depth

For implementation details, see the [source documentation](https://github.com/davidjurgens/potato/blob/main/docs/ai-intelligence/judge_calibration.md).
