For decades a codebook was a document you handed to people. It told a team of coders what each label meant, which cases counted, and where the tricky boundaries were. Now the coder reading it is often a language model, and a codebook written for a graduate student on their third training session does not transfer cleanly to a model that will read the whole thing once and never ask a clarifying question.

A codebook is the contract between your labels and the world: for each code, what it means, what counts, what does not, and an example or two. To use an LLM as the annotator you rewrite that contract so a model can execute it without back-and-forth, then you check its output against human coders before you trust it. This post is about the rewrite and the check. The Potato config at the end shows one way to run the whole loop.

What a codebook actually is

In content analysis and qualitative research, a codebook is the shared definition of every code in a study. The canonical reference is MacQueen and colleagues' 1998 template, which gives each code a name, a short definition, a fuller description of when it applies and when it does not, and example passages. The point of writing all that down is reliability: two coders who read the same codebook should reach the same label on the same text, and you can measure whether they do.

Codebooks come in two temperaments. A fixed codebook is settled before coding starts, which is how most machine-learning label sets and crowdsourced tasks work. A living codebook grows as you read, in the grounded-theory tradition: you notice a recurring idea, name it, and later merge two codes once you see they are the same thing. Both can drive an LLM annotator, but they fail in different ways: a fixed codebook has no place for the cases it never anticipated, and a living one keeps moving under the model while it codes.

Why a codebook built for humans breaks an LLM

A human coder fills gaps in a codebook with judgment and with the training session where you talked through hard cases. A model has neither. It reads the words on the page, and the places you left implicit are exactly where it goes wrong.

Undefined boundaries. "Code cost concerns when the participant mentions money" leaves out whether an offhand "it wasn't cheap" counts. A person asks; a model guesses, and guesses inconsistently across a corpus.
Missing negative examples. Human coders learn as much from "this looks like X but isn't" as from positive cases. Codebooks rarely write those down because the trainer supplies them out loud.
Instruction literalism. Tell a model "apply up to three codes" and it will often apply three whether or not the text warrants them, padding to the cap. Humans read that as a ceiling; models read it as a target.
No segmentation step. A person naturally chunks a transcript into codeable units. A model needs to be told to segment first, then code each unit, or it will code the whole passage as one blob and lose the granularity you wanted.

None of these are reasons to avoid LLM annotators. They are the edits that turn a human codebook into one a model can run.

Writing a codebook an LLM can follow

The rewrite is mechanical once you know what to add. For each code, spell out the four things a human would otherwise infer:

A one-line definition in plain language, not a synonym for the code name.
Inclusion rules: the signals that mean the code applies.
Exclusion rules: the near-misses that do not count, with a short example of each.
Two or three real examples, ideally including one that a naive reader would code wrong.
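One way to keep those four parts from drifting apart is to store each code as a small record and render it into prompt text. The sketch below is illustrative only: the `CodeEntry` class, the `render_entry` helper, and the wording of the sample entry are assumptions made for this example, not part of any tool's API. In particular, treating "it wasn't cheap" as a near-miss is one possible ruling, shown only to make the exclusion slot concrete.

```python
from dataclasses import dataclass, field


@dataclass
class CodeEntry:
    """One codebook entry holding the four parts a model cannot infer."""
    name: str
    definition: str
    include: list[str]   # signals that mean the code applies
    exclude: list[str]   # near-misses that do not count, each with an example
    examples: list[str] = field(default_factory=list)


def render_entry(entry: CodeEntry) -> str:
    """Render one entry as plain text that can be pasted into a prompt."""
    lines = [f"CODE: {entry.name}", f"Definition: {entry.definition}", "Applies when:"]
    lines += [f"  - {rule}" for rule in entry.include]
    lines.append("Does NOT apply when:")
    lines += [f"  - {rule}" for rule in entry.exclude]
    lines.append("Examples:")
    lines += [f"  - {example}" for example in entry.examples]
    return "\n".join(lines)


# Hypothetical entry; the rules and quotes are invented for illustration.
cost = CodeEntry(
    name="cost concerns",
    definition="The participant says price or affordability affected a decision about care.",
    include=[
        "Names a price, bill, or fee as a barrier",
        "Says they delayed or skipped care because of money",
    ],
    exclude=["Offhand remarks with no stated consequence, such as 'it wasn't cheap'"],
    examples=["I put off the scan because I couldn't cover the copay."],
)

print(render_entry(cost))
```

Keeping the exclusion rules in their own field makes it obvious when a code has none, which is usually the first thing to fix.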

Then handle structure separately from labels. Ask the model to segment the text into units first, code each unit against the codebook, and abstain when nothing fits rather than reaching for the closest code. An explicit "none of these" option does more for data quality than another paragraph of instructions, because it gives the model a place to put the cases your codebook does not cover, instead of forcing them into a code that does not belong.
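The segment-then-code step with an explicit abstain option can be sketched in a few lines. Everything here is an assumption for illustration: the sentence-level `segment` function is deliberately naive, `fake_coder` is a keyword stand-in for a real model call, and the `NONE` label name is arbitrary. The point is the shape: each unit is coded on its own, unknown labels are dropped, and an empty result becomes an explicit abstention.

```python
import re
from typing import Callable

CODES = ["access barriers", "cost concerns", "provider trust"]
NONE = "none of these"  # explicit abstain label (name is an assumption)


def segment(text: str) -> list[str]:
    """Naive segmentation: split after sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def code_units(
    units: list[str], coder: Callable[[str], list[str]]
) -> list[tuple[str, list[str]]]:
    """Code each unit separately; keep known codes, abstain when none remain."""
    results = []
    for unit in units:
        proposed = coder(unit)
        valid = [code for code in proposed if code in CODES]
        results.append((unit, valid or [NONE]))
    return results


def fake_coder(unit: str) -> list[str]:
    """Stand-in for a model call: keyword matching, for demonstration only."""
    hits = []
    if "afford" in unit or "cost" in unit:
        hits.append("cost concerns")
    if "trust" in unit:
        hits.append("provider trust")
    return hits


text = "I could not afford the visit. The weather was fine. I trust my doctor."
for unit, codes in code_units(segment(text), fake_coder):
    print(codes, "<-", unit)
```

In a real run the `coder` callable would wrap the model request, and the segmentation rule would follow whatever unit the study defines (turn, sentence, or paragraph).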

The reliability question

You check an LLM annotator not because it is bad, but because you cannot tell in advance. On several established tasks it is genuinely good: Gilardi, Alizadeh, and Kubli (2023) found ChatGPT matched or beat crowd workers on relevance, stance, and frame detection, with higher intercoder agreement and a per-label cost under a cent. But "good on those tasks" tells you nothing about your task, and the only way to find out is to measure.

The measurement is the same one you would run on two human coders. Have people label a sample, have the model label the same sample, and compute a chance-corrected agreement statistic such as Cohen's kappa or Krippendorff's alpha. Where agreement is high, the model can carry the bulk of the corpus. Where it is low, you have found a code whose definition is doing less work than you thought, and the fix is usually in the codebook, not the model.
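For the simplest case, where each unit gets exactly one label from each coder, Cohen's kappa is short enough to compute by hand. The sketch below is a plain-Python version under that single-label assumption. The sample labels are made up, and multi-label span coding would need a different setup, for example one binary comparison per code.

```python
from collections import Counter


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two coders who each gave one label per item."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each coder's label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for eight units.
human = ["cost", "cost", "trust", "none", "trust", "cost", "none", "trust"]
model = ["cost", "trust", "trust", "none", "trust", "cost", "cost", "trust"]

print(round(cohens_kappa(human, model), 3))  # 0.61
```

Raw agreement here is 6 of 8, or 0.75, while kappa comes out near 0.61 because some of that agreement would be expected by chance given how often each label is used.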

[Figure: From a human codebook to an LLM annotator, with the refinement loop. A codebook feeds a prompt and schema, an LLM applies codes to each unit, a human verifies a sample against gold items, and gaps flow back to refine the codebook.]

Two failure modes deserve separate attention. Models over-apply codes when you give them a cap and no reason to stay under it, so agreement can look fine on presence and fall apart on count. And when a human verifies the model instead of labeling fresh, automation bias creeps in: it is faster to accept a plausible code than to challenge it, so the verifier quietly ratifies the model's mistakes. Both are reasons to keep a slice of genuinely blind, human-only labels as your yardstick.
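Padding to the cap shows up when you compare how many codes each side applies per unit, not only whether they share a code. The following is a minimal sketch with made-up data. The `padding_report` function and its field names are hypothetical, and the loose "overlap" measure (any shared code, or both empty) is just one way to define agreement on presence.

```python
def padding_report(
    human: list[set[str]], model: list[set[str]], cap: int
) -> dict[str, float]:
    """Compare codes-per-unit for human and model to spot padding to the cap."""
    if len(human) != len(model) or not human:
        raise ValueError("need two equal-length, non-empty lists of code sets")
    n = len(human)
    return {
        "human_mean_codes": sum(len(h) for h in human) / n,
        "model_mean_codes": sum(len(m) for m in model) / n,
        "human_at_cap": sum(len(h) >= cap for h in human) / n,
        "model_at_cap": sum(len(m) >= cap for m in model) / n,
        # Loose presence agreement: any shared code, or both sides empty.
        "overlap_rate": sum(
            bool(h & m) or (not h and not m) for h, m in zip(human, model)
        ) / n,
    }


# Made-up codes for four units, with a cap of three codes per unit.
all_three = {"access barriers", "cost concerns", "provider trust"}
human = [{"cost concerns"}, {"provider trust"}, set(), {"access barriers", "cost concerns"}]
model = [set(all_three), set(all_three), {"cost concerns"}, set(all_three)]

for key, value in padding_report(human, model, cap=3).items():
    print(f"{key}: {value}")
```

In this toy data the overlap rate is 0.75, which looks respectable, while the model sits at the cap on three of four units and the humans never do. That gap is the signal to look for.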

Keeping a human in the loop

The workable arrangement is not "the model labels everything" or "people label everything." It is a split you decide from the agreement number.

[Figure: Route by agreement, accepting the confident codes and sending the rest back. The validation loop measures agreement on a gold sample, accepts and spot-checks when it is high, and revises the codebook and re-runs when it is low.]

Run the model on a labeled gold sample, look at where it agrees with people, and route accordingly. Codes the model gets reliably right go through with a light spot-check. Codes it gets wrong, and the units where it abstained or looked unsure, go to human coders. As people resolve those, the disagreements feed back into the codebook, and the next pass is better for it. It is the human-in-the-loop idea behind pre-annotation, scaled up from a single label to a whole coding scheme.
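The routing decision can be made per code: treat each code as a present-or-absent judgment on every gold unit, compute agreement for that code alone, and compare it to a threshold. The sketch below assumes a threshold of 0.7, which is a placeholder and not a recommendation. The function names, route labels, and sample data are also invented for illustration.

```python
def binary_kappa(a: list[bool], b: list[bool]) -> float:
    """Cohen's kappa for one code treated as present or absent per unit."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    rate_a, rate_b = sum(a) / n, sum(b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        return 1.0  # both sides always (or never) applied the code
    return (observed - expected) / (1 - expected)


def route_codes(
    codes: list[str],
    human: list[set[str]],
    model: list[set[str]],
    threshold: float = 0.7,  # placeholder cut-off; choose one for your study
) -> dict[str, str]:
    """Send each code to a light spot-check or to full human review."""
    routes = {}
    for code in codes:
        h = [code in unit for unit in human]
        m = [code in unit for unit in model]
        agreed = binary_kappa(h, m) >= threshold
        routes[code] = "spot_check" if agreed else "human_review"
    return routes


# Made-up gold sample of six units.
codes = ["access barriers", "cost concerns", "provider trust"]
human = [{"cost concerns"}, {"provider trust"}, {"cost concerns"},
         set(), {"provider trust"}, {"access barriers"}]
model = [{"cost concerns"}, {"provider trust"}, {"cost concerns"},
         set(), {"access barriers"}, {"provider trust"}]

print(route_codes(codes, human, model))
```

A real gold sample would need far more than six units for the per-code numbers to mean anything. The small data here only shows how one reliable code can pass while two confusable codes are sent back.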

Doing it in Potato

Potato runs this loop in one tool: a codebook-backed coding scheme, an LLM that pre-annotates, human coders who verify, and reliability metrics over the result. The codebook lives in a span scheme marked as codebook-backed, which is what turns a plain label set into an editable, hierarchical coding scheme.

yaml

annotation_schemes:
- annotation_type: span
  name: codes
  description: Highlight a passage and apply a code from the codebook
  codebook: true
  labels: [access barriers, cost concerns, provider trust]

Under QDA Mode the codebook is open by default, so coders can add, rename, or merge codes as the scheme settles. Once you have a stable codebook, switch it to fixed before you scale up, so the shared scheme stops moving under the model and the coders.

To have a model pre-apply the codes, turn on AI support and point it at the endpoint you use:

yaml

ai_support:
  enabled: true
  endpoint_type: anthropic     # or openai, gemini, ollama, ...
  ai_config:
    model: claude-opus-4-8
    api_key: ${ANTHROPIC_API_KEY}
    temperature: 0.2

The model proposes codes; the annotator confirms or corrects them. To keep automation bias measurable, do not pre-fill the items you reserve for agreement. Leave a blind slice for humans only and compare against it, as the pre-annotation guide describes.
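Choosing which items belong to the blind slice is easiest when the choice is deterministic, so the same item lands in the same slice on every run. The sketch below hashes an item id into a bucket. It is not a Potato feature, and how the chosen items are excluded from pre-filling depends on your setup. The function name, the 10% fraction, and the salt string are all assumptions.

```python
import hashlib


def in_blind_slice(item_id: str, fraction: float = 0.1, salt: str = "blind-v1") -> bool:
    """Deterministically assign roughly `fraction` of items to the human-only slice."""
    digest = hashlib.sha256(f"{salt}:{item_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


# Hypothetical item ids; real ids come from your data files.
item_ids = [f"item-{i}" for i in range(1000)]
blind = [item_id for item_id in item_ids if in_blind_slice(item_id)]
print(len(blind), "of", len(item_ids), "items reserved for blind human labels")
```

Changing the salt draws a fresh slice, which is useful if an earlier blind slice has already been exposed to model suggestions.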

When a pass is done, two exporters give you the deliverables and the audit trail:

bash

python -m potato.export config.yaml --format codebook -o codebook.csv
python -m potato.export config.yaml --format quotation_report \
  --option include_memos=true -o quotations.csv

The codebook export is one row per code with its description and use count, so you can see which codes the model leaned on and which never fired. The quotation_report is one row per coded span, which is the file you actually check the model against. Potato reports Cohen's and Fleiss' kappa over the codes, so the model-versus-human comparison is a number, not a vibe.

Where to go next

Writing Effective Annotation Guidelines, the human side of the same craft.
LLM and Vision Pre-Annotation, for the mechanics of model suggestions and the automation-bias guardrails.
Inter-Annotator Agreement Explained, for the reliability statistics that decide the split.
Bringing Qualitative Coding to Potato, for the codebook, memos, and cases that this workflow builds on.

Codebook-heavy datasets show what a well-specified scheme looks like in practice: the fine-grained emotion codes in GoEmotions, the social-rule judgments in Social Chemistry, and the framing labels in Media Frames.