Once you are evaluating agents at any scale, the constraint stops being "can we label this" and becomes "whose attention do we spend, and on what." You have thousands of production traces and a handful of reviewers. An LLM judge can pre-screen everything, but it is imperfect, and the cases where it is wrong are exactly the cases worth a human's time.

Two features in Potato 2.6 work together to manage that scarcity. A signal-based triage queue decides what humans see first. Judge-human alignment measures how much you can lean on the judge, and improves it. Run them together and you get an active evaluation loop: the judge handles the easy volume, the suspicious cases jump the queue to humans, and the disagreements feed back into a better judge.

This post covers both halves and how they connect.

Figure: The triage queue badge in Potato. A priority badge shown during annotation explains why an item was flagged for review.

The triage half: worst first, not first-in-first-out

By default an annotation queue is FIFO: items are served in the order they loaded. That is the wrong order when review time is scarce. A clean trace and a trace where the agent threw an error are worth very different amounts of human attention, and FIFO treats them the same.

The triage queue reorders items by a per-item quality signal. The signal can be an agent error, a production thumbs-down, a low automated score, or any other field in your data:

yaml

triage:
  enabled: true
  order: desc            # high priority first (default)
  show_badge: true       # banner during annotation explaining the priority
  rules:                 # evaluated in order; highest matching priority wins
    - name: "Agent errored"
      priority: 100
      when:
        field: status
        equals: error
    - name: "Negative feedback"
      priority: 80
      when:
        field: feedback
        in: [thumbs_down, negative]
    - name: "Low quality score"
      priority: 60
      when:
        field: score
        lt: 0.5
 
assignment_strategy: priority

Rules are evaluated top to bottom and the highest matching priority wins, so an errored trace that also has negative feedback still lands at 100. If you skip rules entirely, Potato falls back to a sensible default set (error status at 100, negative feedback at 80, score below 0.5 at 60), so the turnkey behavior is reasonable before you tune anything.

The condition operators cover the comparisons you actually need:

| Operator | Meaning |
| --- | --- |
| `equals` | exact match (strings are case-insensitive) |
| `in` | value is one of a list |
| `contains` | list contains, or substring match |
| `lt` / `lte` / `gt` / `gte` | numeric comparison |
| `exists` | field present or absent |
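
To make the rule semantics concrete, here is a minimal Python sketch of how rules like the ones above could be evaluated. It is not Potato's implementation: the function names `matches` and `priority`, the default of 0 for items that match no rule, and the handling of missing fields are all assumptions made for illustration.

```python
import operator
from typing import Any

# Numeric operators from the table, mapped to Python comparisons.
NUMERIC_OPS = {"lt": operator.lt, "lte": operator.le,
               "gt": operator.gt, "gte": operator.ge}


def matches(cond: dict[str, Any], item: dict[str, Any]) -> bool:
    """Return True if one `when` condition holds for an item (illustrative)."""
    value = item.get(cond["field"])
    if "exists" in cond:
        return (value is not None) == bool(cond["exists"])
    if value is None:
        return False  # assumption: a missing field never matches
    if "equals" in cond:
        target = cond["equals"]
        if isinstance(value, str) and isinstance(target, str):
            return value.lower() == target.lower()  # case-insensitive strings
        return value == target
    if "in" in cond:
        return value in cond["in"]
    if "contains" in cond:
        needle = cond["contains"]
        if isinstance(value, str):
            return str(needle) in value  # substring match
        if isinstance(value, list):
            return needle in value       # list membership
        return False
    for op, compare in NUMERIC_OPS.items():
        if op in cond:
            try:
                return compare(float(value), float(cond[op]))
            except (TypeError, ValueError):
                return False  # non-numeric values never match
    return False


def priority(item: dict[str, Any], rules: list[dict[str, Any]], default: int = 0) -> int:
    """Highest priority among matching rules; `default` is a hypothetical fallback."""
    matched = [rule["priority"] for rule in rules if matches(rule["when"], item)]
    return max(matched, default=default)


# The three rules from the config above, as plain dicts.
RULES = [
    {"name": "Agent errored", "priority": 100,
     "when": {"field": "status", "equals": "error"}},
    {"name": "Negative feedback", "priority": 80,
     "when": {"field": "feedback", "in": ["thumbs_down", "negative"]}},
    {"name": "Low quality score", "priority": 60,
     "when": {"field": "score", "lt": 0.5}},
]
```

Under these assumptions, `priority({"status": "error", "feedback": "thumbs_down"}, RULES)` takes the maximum over both matching rules, which is the "still lands at 100" behavior described above.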

When the signal is already a number, you can read it straight off the field instead of writing rules:

yaml

triage:
  enabled: true
  signal_field: quality_score
  invert_signal: true           # lower score => higher priority
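
As a rough illustration of what inverting a numeric signal means for ordering, the sketch below negates the value so that lower scores sort first under a descending order. Whether Potato negates, subtracts from 1, or does something else internally is not stated here, so treat `signal_priority` and its treatment of missing values as hypothetical.

```python
from typing import Any


def signal_priority(item: dict[str, Any], signal_field: str,
                    invert_signal: bool = False) -> float:
    """Read a numeric signal off the item; negate it when inverted (assumption)."""
    value = item.get(signal_field)
    if value is None:
        return float("-inf")  # assumption: items without a signal go last
    value = float(value)
    return -value if invert_signal else value


items = [
    {"id": "a", "quality_score": 0.9},
    {"id": "b", "quality_score": 0.2},
    {"id": "c", "quality_score": 0.5},
    {"id": "d"},  # no signal at all
]

# Descending priority with invert_signal=True: lowest quality_score first.
ordered = sorted(
    items,
    key=lambda it: signal_priority(it, "quality_score", invert_signal=True),
    reverse=True,
)
ordered_ids = [it["id"] for it in ordered]
```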

It works on live traffic too

The priority score is computed once when an item loads or is ingested, then stored on the item, so assignment stays cheap. That same design covers runtime ingestion: a trace pushed in mid-session through the webhook endpoint or a Langfuse poller is scored as it arrives and slots into the priority order. A low-scoring or errored trace that lands at 2pm jumps ahead of the clean ones still waiting from this morning. Setting `assignment_strategy: priority` is what makes the queue serve items in that order; `show_badge` is independent, so the "why was this flagged" banner appears even if you keep a different strategy.
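
The "score once, store, then serve in order" design can be sketched with a heap. This is an illustration of the idea, not Potato's code: the `TriageQueue` class, its method names, and the FIFO tie-break among equal priorities are assumptions.

```python
import heapq
import itertools
from typing import Any, Optional


class TriageQueue:
    """Serves the highest-priority item first; ties are served in arrival order."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, dict[str, Any]]] = []
        self._arrival = itertools.count()  # unique, increasing tie-breaker

    def ingest(self, item: dict[str, Any], priority: int) -> None:
        # The priority is computed once by the caller and stored on the item,
        # so serving never has to re-evaluate rules.
        item["priority"] = priority
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), item))

    def next_item(self) -> Optional[dict[str, Any]]:
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = TriageQueue()
queue.ingest({"id": "morning-1"}, priority=0)    # clean trace, loaded at startup
queue.ingest({"id": "morning-2"}, priority=0)    # clean trace, loaded at startup
queue.ingest({"id": "2pm-error"}, priority=100)  # errored trace arriving mid-session

served = [queue.next_item()["id"] for _ in range(3)]
```

Each push and pop costs O(log n), which is one way to keep assignment cheap even when traces keep arriving while annotators work.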

The alignment half: how much to trust the judge

Triage decides what humans see. Alignment decides how much of the rest you can hand to the judge unsupervised, and it tightens the judge over time.

Judge Alignment runs a configurable LLM judge over instances your annotators have already labeled, then reports Cohen's κ, a confusion matrix, and a disagreement list against the human gold. The standard practice (align a judge to roughly 100–200 gold labels, inspect where it disagrees, rewrite the rubric, and re-run) is the loop this is built around.
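
For readers who want to see what those three outputs are, here is a small self-contained sketch that computes Cohen's κ, a confusion matrix, and a disagreement list from paired human and judge labels. It uses the standard definition κ = (p_o − p_e) / (1 − p_e); the function names and the example labels are illustrative and are not taken from Potato.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(human)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[lab] * j_counts[lab] for lab in h_counts) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both raters use one identical label throughout;
        # returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def confusion_matrix(human: Sequence[Hashable], judge: Sequence[Hashable]) -> Counter:
    """Counts keyed by (human_label, judge_label)."""
    return Counter(zip(human, judge))


def disagreements(ids: Sequence[str], human: Sequence[Hashable],
                  judge: Sequence[Hashable]) -> list[tuple]:
    """(id, human_label, judge_label) for every item where the two differ."""
    return [(i, h, j) for i, h, j in zip(ids, human, judge) if h != j]


# Illustrative labels only, not real results.
ids = [f"trace-{k}" for k in range(8)]
human = ["correct"] * 4 + ["incorrect"] * 4
judge = ["correct", "correct", "correct", "incorrect",
         "incorrect", "incorrect", "incorrect", "correct"]

kappa = cohens_kappa(human, judge)
matrix = confusion_matrix(human, judge)
diffs = disagreements(ids, human, judge)
```

In this toy example the raters agree on 6 of 8 items (p_o = 0.75) and chance agreement is 0.5, so κ is 0.5. The disagreement list is the part you read when rewriting the rubric.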

yaml

ai_support:
  enabled: true
  endpoint_type: "ollama"
  ai_config:
    model: "llama3.2"
    temperature: 0.0
 
judge_alignment:
  enabled: true
  schemas:
    correctness:
      rubric: >
        Label 'correct' only if the agent's answer is factually right and fully
        satisfies the request; otherwise 'incorrect'.
  inline:
    enabled: true                # show the judge verdict beside the human label
    schemas: [correctness]

You run the judge from the admin API, and predictions are cached per prompt version so re-runs are cheap:

bash

curl -X POST localhost:8000/admin/api/judge-alignment/run \
  -H "X-API-Key: <admin-key>" \
  -H "Content-Type: application/json" \
  -d '{"max_per_schema": 200}'

When you want to calibrate, pass an edited rubric. That creates a new prompt version, so you can compare κ across rounds and actually see whether your rewrite helped:

bash

curl -X POST localhost:8000/admin/api/judge-alignment/run \
  -H "X-API-Key: <admin-key>" -H "Content-Type: application/json" \
  -d '{"rubrics": {"correctness": "Stricter rubric text..."}}'

The report, available as JSON or a rendered page at /admin/judge-alignment, shows κ with a Landis–Koch interpretation, the confusion matrix, a disagreement table with the judge's reasoning, and a prompt-version history so calibration progress is visible across rounds.
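
The Landis–Koch interpretation is a fixed set of bands over κ. The sketch below encodes the commonly cited bands from Landis and Koch (1977); the function name is hypothetical, and whether Potato's report uses exactly these label strings is an assumption.

```python
def landis_koch(kappa: float) -> str:
    """Map a kappa value to the Landis-Koch agreement band."""
    if kappa < 0:
        return "poor"
    # Upper bound of each band, in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError(f"kappa must be at most 1.0, got {kappa}")
```

These bands are a rule of thumb rather than a statistical test, so the more useful signal across calibration rounds is usually the direction and size of the change in κ between prompt versions.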

Inline mode puts it in front of the annotator

With `inline.enabled`, each annotation page shows the judge's cached verdict next to the human label (its choice, confidence, and expandable reasoning) alongside a running κ for the task. "Accept" fills the matching choice. Every human save records a human↔judge comparison that feeds the running agreement, so the κ you are tuning toward updates as people work.
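
One way to picture a running κ is as a pair counter that is updated on every save and re-evaluated on demand. The class below is a sketch under that assumption; `RunningAgreement` and its methods are hypothetical names, and Potato's actual bookkeeping may differ.

```python
from collections import Counter
from typing import Hashable, Optional


class RunningAgreement:
    """Accumulates (human, judge) label pairs and reports kappa so far."""

    def __init__(self) -> None:
        self.pairs: Counter = Counter()

    def record(self, human: Hashable, judge: Hashable) -> None:
        # Called once per human save that has a cached judge verdict.
        self.pairs[(human, judge)] += 1

    def kappa(self) -> Optional[float]:
        n = sum(self.pairs.values())
        if n == 0:
            return None  # nothing recorded yet
        observed = sum(c for (h, j), c in self.pairs.items() if h == j) / n
        human_totals, judge_totals = Counter(), Counter()
        for (h, j), c in self.pairs.items():
            human_totals[h] += c
            judge_totals[j] += c
        expected = sum(human_totals[lab] * judge_totals[lab]
                       for lab in human_totals) / (n * n)
        if expected == 1.0:
            return None  # kappa is undefined while only one label has been seen
        return (observed - expected) / (1 - expected)


running = RunningAgreement()
running.record("correct", "correct")
after_first_save = running.kappa()  # only one label seen so far

for human_label, judge_label in [
    ("correct", "correct"), ("correct", "correct"), ("correct", "incorrect"),
    ("incorrect", "incorrect"), ("incorrect", "incorrect"),
    ("incorrect", "incorrect"), ("incorrect", "correct"),
]:
    running.record(human_label, judge_label)
after_eight_saves = running.kappa()
```

Early values are noisy or undefined, which is worth keeping in mind when reading a running κ during the first few dozen annotations.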

Putting the two together

The features are designed to compose into one loop:

Figure: The active evaluation loop (triage, human review, judge alignment, rubric refinement). Production signals feed a priority queue; humans review the top items; their labels measure judge κ; a refined rubric feeds back.

1. Triage pushes errored and low-scoring traces to the front of the human queue.
2. Humans review those high-value items, producing fresh gold labels exactly where the system is least sure.
3. Alignment scores the judge against that gold, and the disagreement list shows precisely where the judge and the humans part ways.
4. You refine the rubric, re-run, and watch κ move, then let the better-calibrated judge absorb more of the easy volume so human time keeps flowing to the hard cases.

Each turn of the loop spends human attention where it is worth the most and converts it into a judge you can trust a little further. That is the whole point: not to remove people from agent evaluation, but to aim them.

Both features ship in Potato 2.6. See the triage queue docs and the judge alignment docs for the full reference, and the eval_trace display for reading the prioritized traces quickly.