# Judge ↔ Human Alignment

Source: https://www.potatoannotator.com/docs/agent-evaluation/judge-alignment

**Judge Alignment measures and tunes how well an LLM judge agrees with your human gold labels.** Potato runs a configurable [LLM-as-a-judge](https://huggingface.co/learn/cookbook/en/llm_judge) over instances your annotators have already labeled, computes [Cohen's κ](https://en.wikipedia.org/wiki/Cohen%27s_kappa), a confusion matrix, and a disagreement list, and tracks κ as you edit the judge rubric. With inline mode on, the judge's verdict appears beside the human label during annotation, with a running κ.

This is the standard "align your judge to roughly 100–200 gold labels" loop used by tools like LangSmith Align Evals and Evidently: collect human labels, run the judge, inspect disagreements, refine the rubric, and re-run until agreement is high.

![Inline judge suggestion beside the human label](/images/docs/judge-inline.png "An LLM judge verdict shown next to the human annotation with a running kappa")

## Configuration

```yaml
# The judge uses Potato's standard AI endpoint machinery.
ai_support:
  enabled: true
  endpoint_type: "ollama"        # ollama (local) | openai | anthropic | vllm | ...
  ai_config:
    model: "llama3.2"
    temperature: 0.0
    # openai/anthropic: add api_key: "<key>"

judge_alignment:
  enabled: true
  schemas:
    correctness:                 # per annotation-scheme rubric (editable)
      rubric: >
        Label 'correct' only if the agent's answer is factually right and fully
        satisfies the request; otherwise 'incorrect'.
  few_shot:
    enabled: false               # seed the judge prompt with gold examples
    max_examples: 4              # drawn from high-agreement human labels
    min_agreement: 0.8
  inline:
    enabled: true                # show the judge verdict beside the human label
    schemas: [correctness]
    compute_on_demand: false     # call the judge live when no cached verdict exists
```

The judge covers single-choice categorical schemes only (`radio`, `select`, `likert`). If `judge_alignment.schemas` is set, only the schemes listed there are judged; otherwise every categorical scheme is judged.
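The scoping rule can be sketched in a few lines of Python. This is an illustration only, not Potato's implementation: the function name `schemes_to_judge` is hypothetical, the scheme dictionaries are assumed to carry `name` and `annotation_type` keys, and the sketch assumes that a listed scheme which is not categorical is skipped.

```python
# Single-choice categorical scheme types named in this page.
CATEGORICAL_TYPES = {"radio", "select", "likert"}


def schemes_to_judge(annotation_schemes, judge_schemas=None):
    """Return the names of the schemes the judge would cover.

    annotation_schemes: list of dicts with "name" and "annotation_type"
        keys (assumed shape, for illustration).
    judge_schemas: the mapping under judge_alignment.schemas, or
        None/empty when that key is not set.
    """
    categorical = [
        s["name"]
        for s in annotation_schemes
        if s["annotation_type"] in CATEGORICAL_TYPES
    ]
    if judge_schemas:
        # Only the listed schemes are judged (assumption: non-categorical
        # entries are ignored rather than raising an error).
        return [name for name in categorical if name in judge_schemas]
    # Nothing listed: every categorical scheme is judged.
    return categorical
```

With one `radio` scheme, one `likert` scheme, and one free-text scheme configured, leaving `judge_alignment.schemas` unset would select the first two, while listing only `correctness` would select just that one.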

## Running the judge

Run the judge from the admin API. Predictions are cached per prompt version, so re-running with an unchanged rubric is cheap:

```bash
# Generate or refresh judge verdicts over human-annotated instances
curl -X POST localhost:8000/admin/api/judge-alignment/run \
  -H "X-API-Key: <admin-key>" \
  -H "Content-Type: application/json" \
  -d '{"max_per_schema": 200}'
```

To calibrate, pass an edited rubric. That creates a new prompt version, so you can compare κ across rounds:

```bash
curl -X POST localhost:8000/admin/api/judge-alignment/run \
  -H "X-API-Key: <admin-key>" -H "Content-Type: application/json" \
  -d '{"rubrics": {"correctness": "Stricter rubric text..."}}'
```
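To make the relationship between rubric edits, prompt versions, and caching concrete, here is a minimal sketch. How Potato actually derives version identifiers is not described on this page, so the hashing scheme, the `prompt_version` helper, the `VerdictCache` class, and the `judge_fn` callable are all hypothetical.

```python
import hashlib


def prompt_version(rubric: str) -> str:
    """Derive a short version id from the rubric text (assumed scheme)."""
    digest = hashlib.sha256(rubric.strip().encode("utf-8")).hexdigest()
    return "v_" + digest[:6]


class VerdictCache:
    """Cache judge verdicts keyed by (prompt version, instance id)."""

    def __init__(self):
        self._store = {}
        self.judge_calls = 0  # counts real judge invocations

    def get_or_compute(self, rubric, instance_id, judge_fn):
        key = (prompt_version(rubric), instance_id)
        if key not in self._store:
            # Cache miss: the rubric (and so the version) or the
            # instance is new, so the judge has to be called.
            self.judge_calls += 1
            self._store[key] = judge_fn(rubric, instance_id)
        return self._store[key]
```

Under this scheme, re-running with the same rubric hits the cache for every instance, while an edited rubric produces a new version id and a fresh set of verdicts that can be compared against the earlier ones.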

## The alignment report

```text
GET /admin/judge-alignment                      # JSON
GET /admin/judge-alignment?format=html          # rendered page
GET /admin/judge-alignment?prompt_version=v_abc123
```

Send the admin key in the `X-API-Key` header, as with the run endpoint. Per schema, the report shows:

- **Cohen's κ** with a [Landis–Koch](https://en.wikipedia.org/wiki/Fleiss%27_kappa#Interpretation) interpretation, the agreement rate, and the number of instances compared.
- A **confusion matrix** (rows are human gold, columns are the judge).
- A **disagreement table** with the instance, human label, judge label, confidence, and judge reasoning.
- **Prompt-version history** with mean κ per version, so calibration progress is visible.
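For readers who want to check the headline numbers by hand, the sketch below computes a confusion matrix, Cohen's κ, and the Landis–Koch band from two parallel label lists. The function names are illustrative and are not part of Potato's API; the formula is the standard κ = (p_o − p_e) / (1 − p_e).

```python
from collections import Counter


def confusion_matrix(human, judge, labels):
    """Rows are human gold, columns are the judge."""
    counts = Counter(zip(human, judge))
    return [[counts[(h, j)] for j in labels] for h in labels]


def cohens_kappa(human, judge):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: fraction of instances with identical labels.
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement from each rater's marginal label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    p_e = sum(h_counts[l] * j_counts[l] for l in h_counts) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label: kappa is 0/0.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


def landis_koch(kappa):
    """Map kappa to the Landis-Koch descriptive bands."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial")]
    for upper, name in bands:
        if kappa <= upper:
            return name
    return "almost perfect"
```

As a worked case, take 10 instances where humans labeled 6 `correct` and 4 `incorrect`, and the judge disagrees on one instance of each. The confusion matrix is `[[5, 1], [1, 3]]`, observed agreement is 0.80, chance agreement is 0.52, and κ = 0.28 / 0.48 ≈ 0.58, which falls in the "moderate" band.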

Human gold is the majority vote across annotators for each instance.
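A majority vote over one instance's annotator labels might look like the following sketch. The function name `majority_gold` is hypothetical, and this page does not say how ties are resolved, so returning `None` on a tie (and for an instance with no labels) is purely an assumption made for illustration.

```python
from collections import Counter


def majority_gold(labels):
    """Return the most common label, or None if there is no clear winner.

    labels: the labels that different annotators gave one instance.
    """
    ranked = Counter(labels).most_common()
    if not ranked:
        return None  # no human labels for this instance
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: assumed to yield no gold label
    return ranked[0][0]
```

For example, `majority_gold(["correct", "correct", "incorrect"])` yields `"correct"`, while a one-to-one split yields `None` under the tie assumption above.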

## Inline mode

With `inline.enabled`, each annotation page shows the judge's cached verdict for the instance (its label, confidence, and expandable reasoning) alongside a running κ for the task. Clicking "Accept" selects the matching choice. Every human save records a human↔judge comparison that feeds the running agreement. Set `compute_on_demand: true` to call the judge live when no cached verdict exists; otherwise, pre-run the batch, which is faster.
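One way to keep a running κ without rescanning every comparison on each save is to maintain agreement and marginal counts incrementally. The `RunningKappa` class below is a hypothetical sketch of that idea, not Potato's internal implementation.

```python
from collections import Counter


class RunningKappa:
    """Incrementally updated Cohen's kappa over human/judge label pairs."""

    def __init__(self):
        self.n = 0              # comparisons recorded so far
        self.agree = 0          # comparisons where the labels matched
        self.human = Counter()  # human label frequencies
        self.judge = Counter()  # judge label frequencies

    def record(self, human_label, judge_label):
        """Call once per human save that has a judge verdict."""
        self.n += 1
        self.agree += int(human_label == judge_label)
        self.human[human_label] += 1
        self.judge[judge_label] += 1

    def kappa(self):
        """Current kappa, or None while it is still undefined."""
        if self.n == 0:
            return None
        p_o = self.agree / self.n
        p_e = sum(self.human[l] * self.judge[l]
                  for l in self.human) / self.n ** 2
        if p_e == 1.0:
            return None  # only one label seen so far on both sides
        return (p_o - p_e) / (1 - p_e)
```

Each `record` call is constant time, and `kappa()` only iterates over the distinct labels, so the displayed value can be refreshed after every save. Early values computed from a handful of comparisons swing widely and are best read as provisional.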

## Notes and limitations

- Calibration is manual in this version: edit the rubric and re-run. Automated prompt optimization is out of scope.
- Scope is single-choice categorical schemes. Span and free-text judging is future work.
- Run the judge over a focused gold set of roughly 100–200 labeled instances for a stable κ.

## Related

- [LLM-as-Judge Calibration](/docs/agent-evaluation/judge-calibration) — multi-judge, blind-human calibration with calibration error
- [Triage Queue](/docs/agent-evaluation/triage-queue) — route the most informative items to humans first
- [Inter-annotator agreement guide](/docs/guides/inter-annotator-agreement) — the kappa metrics in depth

For implementation details, see the [source documentation](https://github.com/davidjurgens/potato/blob/main/docs/agent-evaluation/judge_alignment.md).