# Can You Trust Your LLM Judge? Calibrating LLM-as-Judge Against Humans

Source: https://www.potatoannotator.com/blog/trust-your-llm-judge-calibration

Using a large language model to grade other model outputs has become the default move in evaluation. You write a rubric, ask GPT-4o or Claude to score a thousand responses, and read off an accuracy number. It is fast, it is cheap, and it scales past anything a human team can label by hand.

It also quietly assumes the thing you most need to check: that the judge agrees with people. An [LLM-as-a-judge](https://huggingface.co/learn/cookbook/en/llm_judge) that is confidently wrong produces a clean-looking leaderboard built on sand. Before you trust a judge's verdicts, you have to measure how well they track human judgment. That measurement step is calibration, and **Potato 2.6** adds a workflow for it.

This post covers Judge Calibration: how it samples models, how it keeps the human pass honest, and what the report actually tells you. The [reference docs](/docs/agent-evaluation/judge-calibration) have the complete option list.

![An LLM judge verdict shown beside a human annotation, with a running kappa](/images/docs/judge-inline.png "Inline judge calibration in Potato")

## The shape of the problem

A judge can fail in two different ways, and you want to catch both.

The first is **disagreement**: the judge's label differs from the one a careful human would assign, for example calling a response "correct" that the human would call "wrong." That is what accuracy and agreement metrics measure.

The second is **bad confidence**: the judge says it is 95% sure and is right 60% of the time. A judge can have decent accuracy and still be badly miscalibrated, which matters the moment you start using its confidence to route work or set thresholds. That is what [calibration error](https://en.wikipedia.org/wiki/Calibration_(statistics)) measures.

Potato's calibration pass is built to surface both at once.

## How it works

The workflow runs as a short state machine:

```
SETUP → GENERATING → HUMAN_CALIBRATION → REPORT → COMPLETED
```

**Generating.** Each model is queried *k* times per item. The modal label across those *k* samples is the model's prediction, and the fraction of samples that agree with it is the model's confidence. Sampling *k* times instead of once is what gives you an empirical confidence signal rather than a number the model made up about itself. These results go to a dedicated store and are never written into the annotation data.
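
The voting rule described above is small enough to sketch directly. This is a minimal illustration of the modal label and agreement fraction, not Potato's actual code; the function name `vote` and the tie-breaking rule (first label seen wins) are assumptions.

```python
from collections import Counter


def vote(samples):
    """Return (modal_label, confidence) for one item's k sampled labels."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    # most_common sorts by count; ties keep first-seen order (an assumption here).
    label, n_agree = counts.most_common(1)[0]
    return label, n_agree / len(samples)


# Five samples from one model on one item (k = 5).
samples = ["positive", "positive", "neutral", "positive", "negative"]
label, confidence = vote(samples)
print(label, confidence)  # positive 0.6
```

With *k* = 5 the confidence can only take a handful of values (0.2 steps), which is worth keeping in mind when reading the reliability bins later.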

**Human calibration.** Potato draws a random or stratified sample of the items and routes them to one or more humans, who label them through the normal annotation interface, without ever seeing the model's answers.
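
For intuition, a stratified draw might look like the sketch below. The post does not say what Potato stratifies on, so the stratum key here (the judge's predicted label) is an assumption, as are the function name `stratified_sample` and the proportional-allocation rule.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(items, key, sample_size, seed=42):
    """Draw roughly sample_size items, proportionally from each stratum."""
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    picked = []
    for name in sorted(strata, key=str):
        group = strata[name]
        # Proportional share, but never drop a stratum entirely.
        share = max(1, round(sample_size * len(group) / len(items)))
        picked.extend(rng.sample(group, min(share, len(group))))
    return picked


# Hypothetical pool: 80 items predicted "pos", 20 predicted "neg".
items = [{"id": i, "pred": "pos" if i < 80 else "neg"} for i in range(100)]
sample = stratified_sample(items, key=lambda it: it["pred"], sample_size=10)
counts = Counter(it["pred"] for it in sample)
print(sorted(counts.items()))  # [('neg', 2), ('pos', 8)]
```

Because each stratum's share is rounded independently, the total can differ slightly from `sample_size` on less tidy data. The practical benefit over a purely random draw is that rare classes are guaranteed to show up in the human pass.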

**Report.** Metrics are computed over the overlap between what the models labeled and what the humans labeled, then written to disk.
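
As a rough sketch of what "computed over the overlap" means, the snippet below intersects model and human labels by item ID and scores accuracy and Cohen's κ on that intersection. The dictionaries and the `cohen_kappa` helper are illustrative, not Potato's internal data model.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels keyed by item ID; the two sets only partly overlap.
model = {"a1": "pos", "a2": "neg", "a3": "pos", "a4": "neg", "a5": "pos"}
human = {"a2": "neg", "a3": "neg", "a4": "neg", "a5": "pos", "a9": "pos"}

overlap = sorted(model.keys() & human.keys())  # ['a2', 'a3', 'a4', 'a5']
m = [model[i] for i in overlap]
h = [human[i] for i in overlap]

accuracy = sum(x == y for x, y in zip(m, h)) / len(overlap)
print(accuracy, cohen_kappa(m, h))  # 0.75 0.5
```

Here 75% raw agreement shrinks to κ = 0.5 once chance agreement is removed, which is why the report carries both numbers.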

The blindness here is the important part. Because the model labels live in a separate store and are never injected into the UI, the human cannot be anchored by them even accidentally. Blindness is structural, not a matter of asking annotators to look away.

![The calibration pipeline: k-sample model voting, a blind human pass, and a comparison report](/images/blog/judge-calibration-pipeline.svg "How Potato calibrates a judge against blind human labels")

## Configuration

A judge calibration is one config block. You write the judge prompt, list the models, and set how many times to sample each:

```yaml
judge_calibration:
  enabled: true
  prompt: |                       # supports {text}, {labels}, {description}
    You are an impartial expert annotator. Classify the sentiment as exactly
    one of: positive, negative, neutral.
  models:
    - endpoint_type: openai        # openai | anthropic | ollama | vllm | gemini | ...
      model: gpt-4o-mini
      api_key: ${OPENAI_API_KEY}
      temperature: 0.7             # must be > 0 so the k samples vary
    - endpoint_type: ollama
      model: llama3.1:8b
      base_url: http://localhost:11434
      temperature: 0.7
  k_samples: 5                     # samples per model per item
  max_items: 1000                  # cap on items the LLMs label (null = all)
  sampling:
    strategy: stratified           # random | stratified | all
    sample_size: 200               # how many items humans blind-label
    seed: 42
  human:
    num_raters: 1                  # 1 = solo researcher; N adds human-human IAA
    gold: single                   # single | majority
  schemas: [sentiment]
  output:
    dir: judge_calibration_output
```

> **Warning:** Set `temperature > 0`. With `k_samples > 1` and temperature 0, the samples come out essentially identical, confidence is pinned at or near 1.0, and the calibration report is meaningless. Potato emits a startup warning when it sees that combination.
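
If you generate configs programmatically, the same check is easy to run before launch. The sketch below is a hypothetical pre-flight helper, not Potato's startup code; it works on a plain dict shaped like the YAML above and skips models with no explicit `temperature`, since the default is not stated here.

```python
def temperature_warnings(cfg):
    """Return one warning per model whose k samples cannot vary."""
    jc = cfg.get("judge_calibration", {})
    k = jc.get("k_samples", 1)
    warnings = []
    for model in jc.get("models", []):
        temp = model.get("temperature")
        # Only flag an explicit non-positive temperature combined with k > 1.
        if k > 1 and temp is not None and temp <= 0:
            warnings.append(
                f"{model.get('model')}: temperature must be > 0 when k_samples={k}"
            )
    return warnings


cfg = {
    "judge_calibration": {
        "k_samples": 5,
        "models": [
            {"endpoint_type": "openai", "model": "gpt-4o-mini", "temperature": 0.0},
            {"endpoint_type": "ollama", "model": "llama3.1:8b", "temperature": 0.7},
        ],
    }
}
for message in temperature_warnings(cfg):
    print(message)
```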

You can list more than one model and calibrate them side by side, which is the natural way to choose between a cheap local judge and an expensive hosted one.

## Trying it without an API key

The bundled example uses a local [Ollama](https://ollama.com) model, so you can run the whole loop offline. Start Ollama, pull the model, and launch:

```bash
ollama pull llama3.2:3b
python potato/flask_server.py start examples/ai-assisted/judge-calibration/config.yaml -p 8000 --debug
```

Open `http://localhost:8000/judge_calibration/admin` to configure and run, blind-label the sample at `/annotate`, then build the report and read it at `/judge_calibration/report`.

## What the report tells you

The report is built to answer "should I trust this judge?" with numbers you can put in a methods section:

- **Accuracy, precision, recall, F1** for each model against the human gold label.
- **Cohen's κ** broken out into human↔model, model↔model, and human↔human pairs, so you can see whether the judge agrees with people as well as people agree with each other.
- **Fleiss' κ** and **Krippendorff's α** across all raters.
- **Expected Calibration Error (ECE)**, reliability bins, and a Brier score: the answer to the bad-confidence failure mode.
- A **confusion matrix** per model, which usually tells the real story: a judge that is fine on the easy classes and falls apart on one hard distinction.

Everything is computed over the overlap: items both the models and the humans labeled, restricted to the calibration sample. Output lands under `output.dir` as `llm_labels.jsonl`, `report.json`, and a readable `report.html`.
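
To make the calibration numbers concrete, here is a minimal sketch of ECE and a Brier score computed from per-item confidence and correctness. It uses equal-width bins and the top-label form of the Brier score; both choices, and the bin count of 10, are assumptions and may not match how Potato computes its report.

```python
def ece_and_brier(confidences, correct, n_bins=10):
    """Expected Calibration Error and top-label Brier score.

    confidences: confidence in the predicted label, each in [0, 1]
    correct:     1 if the prediction matched the human gold label, else 0
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(y for _, y in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of the items.
        ece += len(members) / n * abs(acc - avg_conf)
    brier = sum((c - y) ** 2 for c, y in zip(confidences, correct)) / n
    return ece, brier


# Two unanimous, correct items and three 3-of-5 items, one of them correct.
confidences = [1.0, 1.0, 0.6, 0.6, 0.6]
correct = [1, 1, 1, 0, 0]
ece, brier = ece_and_brier(confidences, correct)
print(round(ece, 3), round(brier, 3))  # 0.16 0.176
```

In this toy case the judge is perfectly calibrated when unanimous and overconfident at 0.6, where it is right only a third of the time; the ECE of 0.16 comes entirely from that bin.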

## What it handles

Calibration is fully supported on the categorical, ordinal, and multi-label schemes most judges use, with experimental support for spans:

| Type | Status | Metrics |
|------|--------|---------|
| `radio` / `select` | Supported | accuracy, P/R/F1, Cohen/Fleiss κ, Krippendorff α, ECE, confusion |
| `likert` | Supported | the above plus MAE and ordinal Krippendorff α |
| `multiselect` | Supported | per-label P/R/F1, mean Jaccard, exact-match accuracy, calibration |
| `span` | Experimental | [IoU](https://en.wikipedia.org/wiki/Jaccard_index)-matched P/R/F1, mean IoU, span-F1, span-level calibration |

Span calibration clusters the judge's character-offset spans across the *k* samples and matches them to gold by intersection-over-union; treat its numbers as directional rather than exact.
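
The IoU matching step can be sketched as follows. This covers only the match against gold, not the clustering across the *k* samples; the greedy one-to-one matching and the 0.5 threshold are assumptions, and spans are treated as half-open `(start, end)` character offsets.

```python
def span_iou(a, b):
    """Intersection-over-union of two half-open character spans (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def match_spans(pred, gold, threshold=0.5):
    """Greedy one-to-one matching by IoU; returns (precision, recall, f1)."""
    pairs = sorted(
        ((span_iou(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gold)),
        reverse=True,
    )
    used_pred, used_gold = set(), set()
    true_pos = 0
    for iou, i, j in pairs:
        if iou < threshold:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_gold:
            continue
        used_pred.add(i)
        used_gold.add(j)
        true_pos += 1
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


pred = [(0, 10), (20, 30)]   # judge spans
gold = [(2, 10), (40, 50)]   # human spans
print(span_iou(pred[0], gold[0]))  # 0.8
print(match_spans(pred, gold))     # (0.5, 0.5, 0.5)
```

The threshold is what makes these numbers directional: a span that is off by a few characters counts as a full hit or a full miss depending on where the cutoff sits.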

## Calibration versus alignment

Potato ships a second, related workflow that is easy to confuse with this one. [Judge Alignment](/docs/agent-evaluation/judge-alignment) calibrates a **single** judge against an existing human gold set, shows its verdict inline during annotation, and is built around iterating on a rubric until agreement climbs.

The rule of thumb: reach for **calibration** when you are vetting candidate judges and want a blind, empirical confidence number; reach for **alignment** when you have settled on one judge and are tuning its rubric against a fixed gold set. The two are covered together in [Closing the Loop](/docs/agent-evaluation/triage-queue).

LLM judges are not going away; there is too much to evaluate and too few people to do it by hand. The point of calibration is not to replace the judge with humans, but to know, with a number, exactly how far the judge can be trusted before a human has to look.

The [Judge Calibration docs](/docs/agent-evaluation/judge-calibration) cover every option, and the [inter-annotator agreement guide](/docs/guides/inter-annotator-agreement) explains the kappa and alpha metrics in depth.
