Closing the Loop: Routing Agent Errors and Judge Disagreements Back to Humans
Human review time is the scarcest resource in agent evaluation. Potato 2.6 pairs a signal-based triage queue with judge-human alignment so the worst traces reach people first and your LLM judge keeps getting better.
Once you are evaluating agents at any scale, the constraint stops being "can we label this" and becomes "whose attention do we spend, and on what." You have thousands of production traces and a handful of reviewers. An LLM judge can pre-screen everything, but it is imperfect, and the cases where it is wrong are exactly the cases worth a human's time.
Two features in Potato 2.6 work together to manage that scarcity. A signal-based triage queue decides what humans see first. Judge-human alignment measures how much you can lean on the judge, and improves it. Run them together and you get an active evaluation loop: the judge handles the easy volume, the suspicious cases jump the queue to humans, and the disagreements feed back into a better judge.
This post covers both halves and how they connect.
The triage queue badge in Potato
The triage half: worst first, not first-in-first-out
By default an annotation queue is FIFO: items are served in the order they loaded. That is the wrong order when review time is scarce. A clean trace and a trace where the agent threw an error are worth very different amounts of human attention, and FIFO treats them the same.
The triage queue reorders the queue by a per-item quality signal. The signal can be an agent error, a production thumbs-down, a low automated score, or any field in your data:
triage:
enabled: true
order: desc # high priority first (default)
show_badge: true # banner during annotation explaining the priority
rules: # evaluated in order; highest matching priority wins
- name: "Agent errored"
priority: 100
when:
field: status
equals: error
- name: "Negative feedback"
priority: 80
when:
field: feedback
in: [thumbs_down, negative]
- name: "Low quality score"
priority: 60
when:
field: score
lt: 0.5
assignment_strategy: priorityRules are evaluated top to bottom and the highest matching priority wins, so an errored trace that also has negative feedback still lands at 100. If you skip rules entirely, Potato falls back to a sensible default set (error status at 100, negative feedback at 80, score below 0.5 at 60), so the turnkey behavior is reasonable before you tune anything.
The condition operators cover the comparisons you actually need:
| Operator | Meaning |
|---|---|
equals | exact match (strings are case-insensitive) |
in | value is one of a list |
contains | list contains, or substring match |
lt / lte / gt / gte | numeric comparison |
exists | field present or absent |
When the signal is already a number, you can read it straight off the field instead of writing rules:
triage:
enabled: true
signal_field: quality_score
invert_signal: true # lower score => higher priorityIt works on live traffic too
The priority score is computed once when an item loads or is ingested, then stored on the item, so assignment stays cheap. That same design means runtime ingestion just works: a trace pushed in mid-session over the webhook endpoint or a Langfuse poller is scored as it arrives and slots into the priority order. A low-scoring or errored trace that lands at 2pm jumps ahead of the clean ones still waiting from this morning. Setting assignment_strategy: priority is what makes the queue actually serve in that order; show_badge is independent, so the "why was this flagged" banner shows even if you keep a different strategy.
The alignment half: how much to trust the judge
Triage decides what humans see. Alignment decides how much of the rest you can hand to the judge unsupervised, and it tightens the judge over time.
Judge Alignment runs a configurable LLM judge over instances your annotators have already labeled, then reports Cohen's κ, a confusion matrix, and a disagreement list against the human gold. The standard practice (align a judge to roughly 100–200 gold labels, inspect where it disagrees, rewrite the rubric, and re-run) is the loop this is built around.
ai_support:
enabled: true
endpoint_type: "ollama"
ai_config:
model: "llama3.2"
temperature: 0.0
judge_alignment:
enabled: true
schemas:
correctness:
rubric: >
Label 'correct' only if the agent's answer is factually right and fully
satisfies the request; otherwise 'incorrect'.
inline:
enabled: true # show the judge verdict beside the human label
schemas: [correctness]You run the judge from the admin API, and predictions are cached per prompt version so re-runs are cheap:
curl -X POST localhost:8000/admin/api/judge-alignment/run \
-H "X-API-Key: <admin-key>" \
-H "Content-Type: application/json" \
-d '{"max_per_schema": 200}'When you want to calibrate, pass an edited rubric. That creates a new prompt version, so you can compare κ across rounds and actually see whether your rewrite helped:
curl -X POST localhost:8000/admin/api/judge-alignment/run \
-H "X-API-Key: <admin-key>" -H "Content-Type: application/json" \
-d '{"rubrics": {"correctness": "Stricter rubric text..."}}'The report, available as JSON or a rendered page at /admin/judge-alignment, shows κ with a Landis–Koch interpretation, the confusion matrix, a disagreement table with the judge's reasoning, and a prompt-version history so calibration progress is visible across rounds.
Inline mode puts it in front of the annotator
With inline.enabled, each annotation page shows the judge's cached verdict next to the human label (its choice, confidence, and expandable reasoning) alongside a running κ for the task. "Accept" fills the matching choice. Every human save records a human↔judge comparison that feeds the running agreement, so the κ you are tuning toward updates as people work.
Putting the two together
The features are designed to compose into one loop:
The active evaluation loop: triage, human review, judge alignment, rubric refinement
- Triage pushes errored and low-confidence traces to the front of the human queue.
- Humans review those high-value items, producing fresh gold labels exactly where the system is least sure.
- Alignment scores the judge against that gold, and the disagreement list shows precisely where the judge and the humans part ways.
- You refine the rubric, re-run, and watch κ move, then let the better-calibrated judge absorb more of the easy volume so human time keeps flowing to the hard cases.
Each turn of the loop spends human attention where it is worth the most and converts it into a judge you can trust a little further. That is the whole point: not to remove people from agent evaluation, but to aim them.
Both features ship in Potato 2.6. See the triage queue docs and the judge alignment docs for the full reference, and the eval_trace display for reading the prioritized traces quickly.