# LLM Annotators vs Humans: When to Automate a Labeling Job and When Not To

Source: https://www.potatoannotator.com/blog/llm-annotators-vs-humans

The question comes up on every labeling project now: do we still need people for this, or can a model just do it? It is a fair question. LLM annotators are fast, they do not get tired, and they cost a fraction of what a crowd does. The honest answer is that it depends on the task in ways you can actually predict, and the projects that go wrong are usually the ones that never checked.

**An LLM is a good annotator when the task is well-defined, the labels are objective, and you can measure it against a human gold sample. It is a poor annotator when the labels are subjective, culturally loaded, or novel enough that no ground truth exists yet. The safe default is to automate what the model does reliably, verify a sample of the rest, and keep the hard cases with people.** This post is about telling those cases apart.

## The honest answer: it depends on the task

The research does not say "LLMs replace annotators" or "LLMs cannot annotate." It says something more useful, which is that performance splits by task type. [Gilardi, Alizadeh, and Kubli (2023)](https://www.pnas.org/doi/10.1073/pnas.2305016120) found ChatGPT beat crowd workers on relevance, stance, topic, and frame detection, with higher intercoder agreement and a per-annotation cost roughly thirty times lower than the crowd's. But [Ziems and colleagues (2024)](https://aclanthology.org/2024.cl-1.8/), testing 13 models across 25 computational-social-science benchmarks, found the picture is uneven: on classification tasks LLMs reach only fair agreement with humans and do not beat the best fine-tuned models, while on free-form explanation they often produce output that reads better than the crowd's reference answers.

So "can an LLM label this?" is really two questions. Is this the kind of task where models do well? And on my specific data, does the model actually agree with people? You can reason about the first from the task type. The second you have to measure.

## Where LLM annotators fail

The failures are not random. They cluster, which means you can anticipate them.

- **Subjective and cultural labels.** Toxicity, offensiveness, humor, politeness, and moral judgment depend on who is reading. A single model gives one flattened answer where a diverse annotator pool would disagree in informative ways, and that disagreement is often the signal you wanted.
- **Systematic bias in comparisons.** When an LLM judges two responses, it is not a neutral referee. [Zheng and colleagues (2023)](https://arxiv.org/abs/2306.05685) documented position bias (it favors a particular slot, most often the first option shown), verbosity bias (it rewards longer answers even when they are no better), and self-enhancement bias (it prefers answers it generated itself). These are consistent, so they push your whole dataset in one direction rather than adding noise.
- **No ground truth yet.** If you are building a brand-new coding scheme, there is nothing to validate the model against, and a confident wrong label is worse than an honest gap. New schemes need human coders first, if only to create the gold set.
- **Silent drift.** A model's behavior shifts across a long corpus and across versions. Without a fixed gold sample you re-check against, you will not notice the label distribution moving under you.

None of these mean "never use a model." They mean the model's output is a strong draft, not a finished label.
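Position bias in particular is something you can test for directly. The sketch below assumes a hypothetical `judge(first, second)` callable that wraps your model and returns `"first"` or `"second"`; that interface is an illustration, not part of any library. It asks every comparison twice with the order swapped, then reports how often the verdict survives the swap and how often the first slot wins.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical interface: judge(first, second) returns "first" or "second".
Judge = Callable[[str, str], str]


def position_bias_report(judge: Judge, pairs: Iterable[Tuple[str, str]]) -> dict:
    """Ask every comparison in both orders and summarize the verdicts."""
    total = 0
    consistent = 0
    first_slot_wins = 0
    for a, b in pairs:
        forward = judge(a, b)    # a is shown first
        backward = judge(b, a)   # b is shown first
        winner_forward = a if forward == "first" else b
        winner_backward = b if backward == "first" else a
        total += 1
        # A position-independent judge picks the same response both times.
        consistent += winner_forward == winner_backward
        first_slot_wins += (forward == "first") + (backward == "first")
    if total == 0:
        raise ValueError("no pairs to evaluate")
    return {
        "consistency": consistent / total,
        "first_slot_rate": first_slot_wins / (2 * total),
    }
```

A judge with no position preference should land near a `first_slot_rate` of 0.5. A rate well above that, paired with low `consistency`, suggests the slot is deciding the outcome rather than the content.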

## The pattern that works

The workable setup is not all-model or all-human. It is a triage: send each item down the lane that fits its difficulty.

![A decision flow sorting items into three lanes: automate the objective high-agreement items, verify a sample of the medium ones, and send subjective or high-stakes items to human annotators.](/images/blog/annotate-automate-verify-human.svg "Automate what the model does well, verify a sample, keep the hard cases with people")

Run the model on a labeled gold sample first and read the agreement per label, not just overall. Where it agrees with people, let it carry those items and spot-check a slice. Where agreement is middling, keep the model's suggestion but have a person confirm every one. Where the label is subjective or the decision is high-stakes, leave it with human annotators and use the model, at most, as a hint. The proportions shift over a project as you learn where the model is trustworthy, but the shape stays the same.
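As a rough illustration of that routing, here is a minimal sketch that maps per-label agreement scores to lanes. The function name, the lane names, and the 0.80 and 0.60 cut-offs are all assumptions for the example, not recommendations from any study. Sending low-agreement labels to people is likewise an assumption, since the lanes above do not name that case explicitly.

```python
from typing import Dict, Iterable


def assign_lanes(
    kappa_by_label: Dict[str, float],
    human_only: Iterable[str] = (),
    automate_at: float = 0.80,  # assumed cut-off, tune per project
    verify_at: float = 0.60,    # assumed cut-off, tune per project
) -> Dict[str, str]:
    """Route each label to a lane based on model-vs-human agreement."""
    forced = set(human_only)  # subjective or high-stakes labels
    lanes = {}
    for label, kappa in kappa_by_label.items():
        if label in forced:
            lanes[label] = "human"
        elif kappa >= automate_at:
            lanes[label] = "automate_and_spot_check"
        elif kappa >= verify_at:
            lanes[label] = "model_suggests_human_confirms"
        else:
            # Assumption: low agreement is treated like a hard case.
            lanes[label] = "human"
    return lanes


# Hypothetical agreement scores, for illustration only.
lanes = assign_lanes(
    {"spam": 0.91, "topic": 0.68, "toxicity": 0.85, "sarcasm": 0.31},
    human_only=["toxicity"],
)
```

Note that `toxicity` stays with people despite its high score: the override list encodes the judgment that a label is subjective or high-stakes, which an agreement number alone cannot tell you.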

One guardrail matters throughout: keep a blind, human-only slice that the model never touches. That is your yardstick. Without it, [automation bias](https://en.wikipedia.org/wiki/Automation_bias) sets in, verifiers rubber-stamp plausible suggestions, and your measured agreement drifts upward while real quality does not.
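One way to keep that slice stable is to decide membership from the item ID itself, so an item's status never changes between runs or pipeline stages. The following is a sketch under assumptions: the function name, the 10% default, and the salt string are all hypothetical, and nothing here is a built-in feature of any annotation tool.

```python
import hashlib


def in_blind_slice(item_id: str, fraction: float = 0.10, salt: str = "blind-v1") -> bool:
    """Deterministically reserve about `fraction` of items for human-only labeling."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    digest = hashlib.sha256(f"{salt}:{item_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


# Items in the blind slice skip model pre-annotation entirely.
item_ids = [f"doc-{i}" for i in range(1000)]  # hypothetical IDs
blind = [i for i in item_ids if in_blind_slice(i)]
prefill = [i for i in item_ids if not in_blind_slice(i)]
```

Hashing instead of random sampling means the split needs no stored state, and changing the salt gives you a fresh slice if the old one is ever contaminated.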

## Cost and quality are not the same axis

It is easy to frame this as cheap-model versus expensive-humans, but that hides the real trade. A model label costs almost nothing to produce and something real to trust: the gold set you built to validate it, the human verification pass, the spot-checks. A human label costs more up front and less to trust. For a large, objective task the model wins on total cost once the validation is amortized. For a small or subjective task, the validation overhead can cost more than just having people label it. Do the arithmetic on your task instead of assuming the model is cheaper.
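That arithmetic fits in a few lines. The sketch below uses a deliberately simple cost model, and every part of it is an assumption: a one-off human-labeled gold set, a per-item model cost, and a fixed fraction of model labels that a person verifies. The prices in the usage lines are made up for illustration.

```python
from typing import Optional


def human_cost(n_items: int, human_per_item: float) -> float:
    """Cost of having people label everything."""
    return n_items * human_per_item


def model_cost(
    n_items: int,
    model_per_item: float,
    human_per_item: float,
    gold_items: int,
    verify_fraction: float,
) -> float:
    """Cost of model labels plus the work needed to trust them."""
    validation = gold_items * human_per_item                    # one-off gold set
    production = n_items * model_per_item                       # model calls
    verification = n_items * verify_fraction * human_per_item   # human checks
    return validation + production + verification


def break_even_items(
    model_per_item: float,
    human_per_item: float,
    gold_items: int,
    verify_fraction: float,
) -> Optional[float]:
    """Number of items at which the model route starts to be cheaper."""
    saving = human_per_item - model_per_item - verify_fraction * human_per_item
    if saving <= 0:
        return None  # verification eats the saving; the model never wins
    return gold_items * human_per_item / saving


# Hypothetical prices: $0.10 per human label, $0.002 per model label,
# a 500-item gold set, and 20% of model labels verified by a person.
n = break_even_items(0.002, 0.10, gold_items=500, verify_fraction=0.20)
```

With those made-up numbers the break-even point is roughly 640 items. Below that, the gold set alone costs more than it saves, which is the small-task case described above.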

## Doing it in Potato

Potato is built to run the mixed workflow rather than forcing an all-or-nothing choice. Turn on [AI support](/docs/features/ai-support) to have a model pre-annotate, then let people verify:

```yaml
ai_support:
  enabled: true
  endpoint_type: openai       # or anthropic, gemini, ollama, ...
  ai_config:
    model: gpt-4
    api_key: ${OPENAI_API_KEY}
    temperature: 0.2
```

The model proposes a label; the annotator confirms or corrects it, and the verified label is what gets saved. For the routing itself, a [triage](/showcase/triage) scheme lets a person move fast through model suggestions, keeping the clear ones and flagging the rest for closer annotation.

To measure the model against people, do not pre-fill the items you reserve for agreement. Leave a blind slice, have humans label it, and compare their labels with the model's using [Cohen's or Fleiss' kappa](/docs/guides/inter-annotator-agreement): Cohen's for two raters, Fleiss' for three or more. That number, per label, is what decides which lane each part of your task belongs in. The [pre-annotation guide](/docs/guides/llm-pre-annotation) covers the automation-bias guardrails in more detail.
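If you export the blind slice and want to check the per-label numbers yourself, a one-vs-rest Cohen's kappa needs only the standard library. This is a sketch and not Potato's own implementation: the function names are hypothetical, and it assumes one human label and one model label per item, aligned by position.

```python
from collections import Counter
from typing import Dict, Sequence


def cohen_kappa(a: Sequence, b: Sequence) -> float:
    """Cohen's kappa for two raters over the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1.0 - expected)


def kappa_per_label(human: Sequence[str], model: Sequence[str]) -> Dict[str, float]:
    """One-vs-rest kappa: treat each label as a yes/no decision."""
    labels = sorted(set(human) | set(model))
    return {
        label: cohen_kappa(
            [h == label for h in human],
            [m == label for m in model],
        )
        for label in labels
    }


# Hypothetical labels from a blind slice, for illustration only.
human = ["pos", "pos", "neg", "neg", "neu", "neu"]
model = ["pos", "pos", "neg", "pos", "neu", "neu"]
scores = kappa_per_label(human, model)
```

The per-label view matters because an overall kappa can look healthy while one rare label is close to chance; in the toy data above, `neu` agrees perfectly while `pos` and `neg` do not.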

## Further reading

- [Codebooks for AI Annotators](/blog/codebooks-for-ai-annotators), for turning a coding scheme into instructions an LLM can follow.
- [LLM and Vision Pre-Annotation](/docs/guides/llm-pre-annotation), for the mechanics of model suggestions and verification.
- [Can You Trust Your LLM Judge?](/blog/trust-your-llm-judge-calibration), on calibrating an LLM judge against human ratings.
- [Active Learning for Annotation](/docs/guides/active-learning), for spending human effort on the items that teach the model the most.