Announcements · 6 min read

Potato 2.6: Qualitative Data Analysis Meets Agent Evaluation

Potato 2.6 is out: QDA Mode for qualitative coding, an LLM-as-judge calibration and alignment workflow, trajectory editing that produces SFT and DPO training data, a 3x faster boot, and a relicense to GPL-3.0-or-later.

Potato Team

Potato 2.6 is out. It is a release with two centers of gravity. On one side, it brings Potato into qualitative data analysis, the world of interview transcripts, codebooks, and memos that has lived in proprietary desktop tools. On the other, it deepens the agent-evaluation toolkit toward producing training data, not just scores. And underneath both, it gets meaningfully faster and changes its license.

If you have been following along, the last few weeks of posts previewed these features one at a time. This is the release that ships them.

[Figure: Potato 2.6, split between Qualitative Data Analysis and Agent Evaluation]

A note up front: Potato is now GPL-3.0

Potato is relicensed to GPL-3.0-or-later, from PolyForm Shield. This is the kind of change that is easy to bury and shouldn't be, because it changes what you are allowed to do with the project.

Under GPL-3.0-or-later you can use, modify, and redistribute Potato, including commercially, as long as derivative works stay under the GPL. The old PolyForm Shield license carried a non-compete restriction that made some adopters (and their legal teams) hesitate. GPL is a license those teams already understand. If a licensing question was holding back your lab or company, that question now has a familiar answer. See the About page for the details.

QDA Mode

The headline for qualitative researchers is QDA Mode: a single switch that turns Potato into a collaborative qualitative-coding workspace.

```yaml
qda_mode:
  enabled: true            # codebook + memos + cases + search
codebook_invivo_key: i     # mint a code from a text selection
```

Enabling it composes a living codebook, in-vivo coding, analyst memos, cases, and full-text search, with defaults tuned for one analyst coding a whole corpus. You can grow and reorganize the codebook as you read, mint a code straight from a highlighted passage with a keystroke, attach private or shared memos to any excerpt, group excerpts into cases by participant, and run full-text search (SQLite FTS5) across the corpus. It is a free, open-source, web-based alternative to NVivo, ATLAS.ti, MAXQDA, and Dedoose, sitting in the same tool as the rest of your annotation work.

We wrote about the design in Bringing Qualitative Coding to Potato. Full reference: QDA Mode docs.

LLM-as-judge: calibration, alignment, and triage

Using an LLM to grade model outputs is now routine. Knowing how far to trust it is the part 2.6 addresses, with three features that work together.

Judge Calibration auto-labels your data with one or more LLM judges and samples each judge k times to estimate an empirical confidence. It then runs a blind human pass (annotators never see the model labels) and reports accuracy, Cohen's and Fleiss' kappa, Krippendorff's alpha, and Expected Calibration Error. It answers "should I trust this judge?" with numbers you can defend. We covered it in Can You Trust Your LLM Judge?.
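To make the reported numbers concrete, here is a minimal sketch of three of them: an empirical confidence from k samples, Cohen's kappa between a human and a judge, and Expected Calibration Error. This is not Potato's implementation; the function names and the equal-width binning are assumptions chosen for illustration.

```python
from collections import Counter


def majority_with_confidence(samples):
    """Majority label over k judge samples, plus the share of samples that agree."""
    label, count = Counter(samples).most_common(1)[0]
    return label, count / len(samples)


def cohens_kappa(human, judge):
    """Cohen's kappa between two equal-length label sequences."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted mean |accuracy - confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece
```

A judge that is right whenever it is confident and wrong only when it is unsure scores a low ECE; a judge that is confidently wrong scores a high one, even if its raw accuracy looks acceptable.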

Judge Alignment tunes a single judge against your human gold labels, tracking Cohen's kappa as you refine the rubric, with an optional inline verdict shown beside the human label during annotation.

The Triage Queue prioritizes the annotation queue by a per-item signal (an agent error, a production thumbs-down, a low score) so reviewers see the most-suspect traces first instead of in arrival order.

```yaml
triage:
  enabled: true
  signal_field: quality_score
  invert_signal: true
assignment_strategy: priority
```
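The ordering this config asks for can be sketched in a few lines. The helper below is hypothetical and rests on one assumption about semantics: a higher signal means more suspect, unless invert_signal is set, in which case a lower value (such as a low quality score) is reviewed first.

```python
def triage_order(items, signal_field, invert_signal=False):
    """Return items with the most-suspect first (hypothetical helper).

    Assumption: higher signal = more suspect, unless invert_signal is True,
    in which case lower signal = more suspect. Items without a signal keep
    their arrival order at the end of the queue.
    """
    scored = [it for it in items if it.get(signal_field) is not None]
    unscored = [it for it in items if it.get(signal_field) is None]
    # sorted-in-place on a copy; Python's sort is stable, so ties keep arrival order.
    scored.sort(key=lambda it: it[signal_field], reverse=not invert_signal)
    return scored + unscored


traces = [
    {"id": "t1", "quality_score": 0.9},
    {"id": "t2", "quality_score": 0.2},
    {"id": "t3"},
    {"id": "t4", "quality_score": 0.5},
]
queue = triage_order(traces, "quality_score", invert_signal=True)
```

With invert_signal set, the lowest-scoring trace leads the queue and the unscored trace falls to the end.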

Alignment and triage compose into an active evaluation loop, which we walked through in Closing the Loop. Reference docs: Judge Calibration, Judge Alignment, Triage Queue.

Trajectory editing for SFT and DPO

The new trajectory_edit schema lets annotators rewrite the steps of an agent trace, with a live word-level diff, to fix a wrong reasoning step, repair a tool call, or strengthen the final answer. The trajectory_correction exporter then turns each original/corrected pair into training data: supervised fine-tuning targets in trajectory_sft.jsonl and DPO preference pairs in trajectory_dpo.jsonl. Unedited traces are skipped, since training on an unchanged trajectory teaches nothing.
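As a rough illustration of what an original/corrected pair becomes, the sketch below builds one SFT record and one DPO record and skips unedited traces. The record field names (prompt, completion, chosen, rejected) are assumptions in a common convention, not the exporter's documented format.

```python
import json


def correction_records(prompt, original_steps, corrected_steps):
    """Build (sft, dpo) records from an edited trajectory, or None if unedited.

    All field names here are hypothetical.
    """
    if original_steps == corrected_steps:
        return None  # nothing changed, so there is nothing to learn from
    chosen = "\n".join(corrected_steps)
    rejected = "\n".join(original_steps)
    sft = {"prompt": prompt, "completion": chosen}
    dpo = {"prompt": prompt, "chosen": chosen, "rejected": rejected}
    return sft, dpo


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


original = ["think: use search", "call: search('potato')", "answer: a tuber"]
corrected = ["think: use lookup", "call: lookup('potato')", "answer: a tuber"]
pair = correction_records("What is a potato?", original, corrected)
```

The corrected trajectory serves as both the SFT target and the DPO "chosen" side, while the annotator's starting point becomes the "rejected" side.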

This makes Potato a training-data production tool, not only an evaluation tool. The full walkthrough is in From Evaluation to Training Data; reference in the trajectory editing docs.

The eval_trace display

Reading agent traces quickly is its own problem. The new eval_trace display splits a single trace into three synchronized panes (Reasoning, Function Calls, and Final Answer) so an evaluator sees what the agent thought, did, and produced at a glance. It is built for continuous evaluation, where traces arrive over a webhook, a Langfuse poller, or a watched directory and have to be judged as they land. See the eval_trace docs.

Workflow and deployment

A batch of operational features rounds out the release:

  • Heterogeneous coverage. Assign different numbers of annotators to different items: one on most, three on a stratified sample, with adaptive disagreement boosts and automatic adjudication routing. Covered in Beyond Full Overlap; reference in the heterogeneous coverage docs.
  • Reclaim abandoned assignments. Recover items left by Prolific or QC-blocked workers, with configurable retention and idempotent reclaim. See task assignment.
  • Custom batch assignment. Assign predefined batches of items to specific annotators, built for repeat-round study designs.
  • Reverse-proxy URL prefixes. Serve Potato under a sub-path behind a reverse proxy. See the reverse proxy docs.
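The "one on most, three on a stratified sample" idea from the heterogeneous coverage item can be sketched as a plan that maps each item to an annotator count. This is a hypothetical helper, not Potato's assignment logic; the parameter names, the per-stratum fraction, and the seeded sampling are all assumptions, and adaptive boosts and adjudication routing are left out.

```python
import math
import random
from collections import defaultdict


def coverage_plan(items, stratum_of, fraction=0.2, base=1, boosted=3, seed=0):
    """Map item id -> number of annotators to assign (hypothetical helper).

    Every item gets `base` annotators; a `fraction` of each stratum
    (at least one item) is sampled to get `boosted` annotators instead.
    """
    rng = random.Random(seed)  # seeded so the plan is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item["id"])
    plan = {item["id"]: base for item in items}
    for ids in strata.values():
        k = max(1, math.ceil(fraction * len(ids)))
        for item_id in rng.sample(ids, k):
            plan[item_id] = boosted
    return plan


items = [{"id": i, "source": "forum" if i % 2 else "news"} for i in range(8)]
plan = coverage_plan(items, lambda item: item["source"], fraction=0.5)
```

Sampling within each stratum keeps the multiply-annotated subset representative, so agreement measured on it says something about the whole corpus.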

Faster, and a schema rename

Two changes affect every project.

Boot is roughly 3x faster by import time. The machine-learning stack is no longer eager-loaded at startup; it loads on first use instead. Import time dropped from about 6.5 seconds to 2 (roughly 3x), a 50,000-item boot from about 10 seconds to 5.7 (roughly 1.75x), and resident memory from about 750 MB to 365 MB. Container restarts are quicker, and the memory footprint for horizontal scaling is roughly halved.

annotation_type: highlight is now span. A migration is in place, and existing span configs are unaffected. Update old configs by renaming the type. "span" is the standard term across NLP, and the rename brings the annotation type in line with it.
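For projects with many configs, the rename can be scripted. The sketch below walks an already-parsed config (nested dicts and lists) and rewrites the old value wherever the annotation_type key appears; the function name is hypothetical, and the annotation_schemes key in the example is only illustrative.

```python
def rename_highlight_to_span(node):
    """Return a copy of a parsed config with annotation_type: highlight -> span."""
    if isinstance(node, dict):
        return {
            key: "span"
            if key == "annotation_type" and value == "highlight"
            else rename_highlight_to_span(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [rename_highlight_to_span(child) for child in node]
    return node  # scalars pass through unchanged


old_config = {
    "annotation_schemes": [  # illustrative structure
        {"annotation_type": "highlight", "name": "entities"},
        {"annotation_type": "radio", "name": "sentiment"},
    ]
}
new_config = rename_highlight_to_span(old_config)
```

Because the function returns a new structure instead of mutating its input, you can diff the old and new configs before writing anything back to disk.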

Catching up: 2.5 and 2.4.5

A couple of releases between 2.4 and 2.6 went out without a post here. The highlights are worth calling out, since the qualitative-coding work in particular underpins QDA Mode:

2.5.0 was the qualitative-coding wave. It added Cohen's kappa and Fleiss' kappa alongside Krippendorff's alpha, the codebook and quotation_report exporters, and admin analytics for code co-occurrence and a codes-by-attribute crosstab. These are the reliability and export pieces QDA Mode builds on.

2.4.5 brought a validated-refinement framework for improving annotation guidelines in solo mode, a config validator CLI (python -m potato.validate_cli), and a security fix for a path-traversal bypass (GHSA-q9m2-fhv9-3jcf). If you are on an older 2.4.x, upgrading picks up that fix.

The complete history lives on the What's New page.

Getting it

```bash
pip install --upgrade potato-annotation
```

Then point Potato at one of the bundled examples (examples/advanced/qda-mode-example/, examples/ai-assisted/judge-calibration/, examples/agent-traces/trajectory-correction/) to see the new surfaces running. Each release here started as a question from someone using the tool; if 2.6 raises one for you, the GitHub repository is the place to ask.