# Beyond Full Overlap: Adaptive Annotator Coverage for Large Datasets

Source: https://www.potatoannotator.com/blog/adaptive-annotator-coverage

There is a standard tension in any sizeable annotation project. If every item gets two or three annotators, you can measure agreement and trust your labels, but you have just multiplied your budget by two or three. If every item gets one annotator, you label two or three times as much data for the same money and have no idea how reliable any of it is.

The usual compromise is well known to anyone who has run a study: single-annotate most of the corpus, and double- or triple-annotate a small sample to keep an eye on quality. The problem has always been making the tooling do that cleanly, and then doing something with the overlap once you have it. **Potato 2.6** builds that design in, through two config blocks (`num_annotators_per_item` and `per_annotator_quota`) plus adaptive boosts and adjudication routing.

This post walks through the coverage design from the simple case to the adaptive one. The [heterogeneous coverage docs](/docs/deployment/heterogeneous-coverage) have the full reference.

![Default single coverage, a stratified overlap sample at three annotators, adaptive boosts on disagreement, and adjudication routing](/images/blog/adaptive-coverage.svg "Adaptive annotator coverage in Potato")

## Per-item caps with an overlap sample

`num_annotators_per_item` accepts a single integer for a uniform cap, or a structured mapping when you want different items covered differently. The common shape is a default of one, with a stratified sample raised to three:

```yaml
num_annotators_per_item:
  default: 1
  overlap_sample:
    fraction: 0.1
    count: 3
    stratify_by: domain
    seed: 42
  min: 1
```

The `overlap_sample` block raises the cap on a deterministic subset of items. Sampling happens once at startup, and the chosen items are stamped internally so the assignment logic treats them as high-coverage from then on. The fields are straightforward: `fraction` is the proportion sampled, `count` is the raised cap (it must exceed the default), and `seed` makes the choice reproducible across restarts.

The detail worth dwelling on is `stratify_by`. Point it at a field in your data (`domain` here) and the fraction is applied *per stratum* rather than across the whole pool. Every category contributes to the overlap sample proportionally, so you are not measuring agreement on a sample that happens to be 90% one domain. If your corpus mixes news, social media, and clinical text, each shows up in the quality sample in proportion to its size.
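To make the per-stratum behaviour concrete, here is a minimal Python sketch of stratified sampling. It is not Potato's implementation: the function name, the item shape (`id` plus a `domain` field), and the rounding rule (round to nearest, at least one item per stratum) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_overlap_sample(items, fraction, stratify_by, seed):
    """Pick item ids for the raised cap, applying `fraction` inside each stratum."""
    strata = defaultdict(list)
    for item in items:
        strata[item[stratify_by]].append(item["id"])

    rng = random.Random(seed)  # fixed seed: same draw on every run
    chosen = set()
    for key in sorted(strata):  # sorted so the draw does not depend on input order
        ids = sorted(strata[key])
        # Assumed rounding rule; the real tool may round differently.
        k = max(1, round(fraction * len(ids)))
        chosen.update(rng.sample(ids, k))
    return chosen


# Hypothetical corpus: 50 news items and 10 clinical items.
items = (
    [{"id": f"news-{i}", "domain": "news"} for i in range(50)]
    + [{"id": f"clin-{i}", "domain": "clinical"} for i in range(10)]
)
sample = stratified_overlap_sample(items, fraction=0.1, stratify_by="domain", seed=42)
news_picked = sum(1 for s in sample if s.startswith("news-"))
clin_picked = sum(1 for s in sample if s.startswith("clin-"))
print(news_picked, clin_picked)
```

With these inputs the sample holds five news items and one clinical item. An unstratified 10% draw of the same 60 items could easily contain no clinical text at all.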

## Adaptive boost: spend more where it is hard

A fixed overlap sample is chosen blind, before anyone has annotated anything. But the items that most need a second and third look are the ones where annotators actually disagree, and you only learn which those are after the first pass. The adaptive boost handles exactly that:

```yaml
num_annotators_per_item:
  default: 1
  adaptive:
    enabled: true
    disagreement_threshold: 0.5
    boost_to: 3
```

Once an item has at least two annotations and its disagreement score crosses `disagreement_threshold`, its cap is raised to `boost_to` and the item re-enters the assignment queue for another pass. The boost is one-shot per item, so a contentious item gets one escalation rather than spiraling. This is coverage that follows the difficulty of the data instead of guessing at it up front.
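The escalation rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Potato's code: the disagreement score used here (one minus the share of the most common label) and the `>=` comparison are guesses at reasonable behaviour, and the item dictionary with `labels`, `cap`, and `boosted` keys is hypothetical.

```python
from collections import Counter


def disagreement(labels):
    """Assumed score: 1 minus the share of the most common label."""
    top_count = Counter(labels).most_common(1)[0][1]
    return 1 - top_count / len(labels)


def maybe_boost(item, threshold=0.5, boost_to=3):
    """Raise the cap once if the item has >= 2 labels and disagreement reaches the threshold."""
    if item["boosted"] or len(item["labels"]) < 2:
        return False
    if disagreement(item["labels"]) >= threshold:
        item["cap"] = max(item["cap"], boost_to)
        item["boosted"] = True  # one-shot: later calls do nothing
        return True
    return False


contested = {"labels": ["positive", "negative"], "cap": 2, "boosted": False}
settled = {"labels": ["positive", "positive"], "cap": 2, "boosted": False}

first = maybe_boost(contested)   # escalates: the two labels split evenly
second = maybe_boost(contested)  # no-op: already boosted once
calm = maybe_boost(settled)      # no-op: annotators agree
```

The `boosted` flag is what keeps a contentious item from being escalated again after the third annotation comes in.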

## Per-annotator quotas

Coverage caps control how many annotators each *item* gets. A separate block controls how many items each *annotator* gets, which you usually want to vary by expertise or contract:

```yaml
per_annotator_quota:
  default: 100
  by_user:
    alice: 30
  by_user_role:
    expert: 30
    novice: 200

user_roles:
  alice: expert
  carol: novice
```

Resolution runs most-specific first: `by_user[uid]`, then `by_user_role[user_roles[uid]]`, then `default`. So you can cap a specific expert at 30 items, every other expert at 30 by role, and novices at 200. Quotas limit annotators, not items, so none of this interferes with the per-item caps above.
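The lookup order is easy to express directly. The sketch below mirrors the config above as plain dictionaries; the `resolve_quota` function and the user `dave` are hypothetical names used only to show the fallthrough.

```python
def resolve_quota(uid, quota, user_roles):
    """Most-specific first: per-user entry, then role entry, then default."""
    by_user = quota.get("by_user", {})
    if uid in by_user:
        return by_user[uid]
    role = user_roles.get(uid)
    by_role = quota.get("by_user_role", {})
    if role in by_role:
        return by_role[role]
    return quota["default"]


quota = {
    "default": 100,
    "by_user": {"alice": 30},
    "by_user_role": {"expert": 30, "novice": 200},
}
user_roles = {"alice": "expert", "carol": "novice"}

alice_quota = resolve_quota("alice", quota, user_roles)  # matched by user id
carol_quota = resolve_quota("carol", quota, user_roles)  # matched by role
dave_quota = resolve_quota("dave", quota, user_roles)    # no entry, no role: default
print(alice_quota, carol_quota, dave_quota)
```

Here `alice` resolves to 30 through her own entry, `carol` to 200 through the `novice` role, and the unlisted `dave` to the default of 100.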

## Turning overlap into a decision

Collecting overlap is only half the job; the point is to act on the disagreements. With the adjudication block enabled, overlap-sample items that reach their cap are scored automatically and pushed into an adjudication queue when agreement falls below your threshold:

```yaml
adjudication:
  enabled: true
  adjudicator_users: [admin]
  min_annotations: 2
  agreement_threshold: 0.75
```

The effect is that low-agreement items surface the moment the sample saturates, rather than waiting for someone to remember to rebuild the queue by hand. An adjudicator opens the queue and sees the genuinely contested items, already filtered down from the bulk that everyone agreed on.
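As a rough picture of the routing decision, the Python sketch below scores an item once it has reached its cap and flags it when agreement is below the threshold. The agreement measure (mean pairwise exact match) and the way `min_annotations` is applied are assumptions for illustration; Potato's actual scoring may differ.

```python
from itertools import combinations


def pairwise_agreement(labels):
    """Assumed measure: share of annotator pairs that chose the same label."""
    pairs = list(combinations(labels, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


def needs_adjudication(labels, cap, min_annotations=2, agreement_threshold=0.75):
    """Flag an item only once it is saturated and its agreement is below the threshold."""
    if len(labels) < cap or len(labels) < min_annotations:
        return False  # still collecting annotations
    return pairwise_agreement(labels) < agreement_threshold


split = needs_adjudication(["toxic", "toxic", "benign"], cap=3)     # 1 of 3 pairs agree
unanimous = needs_adjudication(["toxic", "toxic", "toxic"], cap=3)  # all pairs agree
pending = needs_adjudication(["toxic", "benign"], cap=3)            # not at cap yet
```

Only the split item is queued: the unanimous one clears the threshold, and the pending one is not scored until its third annotation arrives.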

## Reading the agreement

Once overlap-sample items saturate, agreement statistics are available at `/admin/iaa`. The endpoint computes the metric appropriate to each scheme's type rather than forcing one number on everything: [Cohen's and Fleiss' kappa](/docs/guides/inter-annotator-agreement) for nominal schemes, weighted kappa for ordinal ones, and token-level kappa plus span F1 for spans. That matters because a κ computed as if your ordinal Likert ratings were unordered categories counts a near-miss (a 4 against a 5) as a full disagreement, and so will usually understate real agreement.
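A small worked example shows the size of that effect. The sketch below is a plain-Python Cohen's kappa for two annotators with optional linear weights; it is a textbook formulation written for illustration, not the code behind `/admin/iaa`, and the ratings are made up.

```python
def cohen_kappa(a, b, categories, weights="none"):
    """Cohen's kappa for two annotators; weights is "none" or "linear"."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(a)

    # Observed joint proportions and each annotator's marginals.
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[idx[x]][idx[y]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def penalty(i, j):
        if weights == "linear":
            return abs(i - j) / (k - 1)  # near-misses cost less than far misses
        return 0.0 if i == j else 1.0    # every mismatch costs the same

    obs = sum(penalty(i, j) * observed[i][j] for i in range(k) for j in range(k))
    exp = sum(penalty(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    return 1 - obs / exp  # undefined if both annotators use a single category


# Hypothetical 5-point Likert ratings: 6 exact matches, 4 off by one point.
rater_a = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
rater_b = [1, 3, 3, 5, 5, 2, 2, 4, 4, 5]
scale = [1, 2, 3, 4, 5]

unweighted = cohen_kappa(rater_a, rater_b, scale)
linear = cohen_kappa(rater_a, rater_b, scale, weights="linear")
print(round(unweighted, 2), round(linear, 2))
```

On these ratings the unweighted κ is 0.50 while the linearly weighted κ is 0.75, even though the two annotators never differ by more than one point.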

## Trying it

A runnable demonstration ships with the release. From the repository root:

```bash
python potato/flask_server.py start examples/advanced/heterogeneous-coverage/config.yaml -p 8000
```

It uses 20 items across two domains, samples 20% for three-annotator overlap stratified by domain, enables an adaptive boost at a 0.5 threshold, defines two expertise tiers, and routes low-agreement items into adjudication: the whole design above, end to end.

## The shape of a good coverage plan

Put together, the design lets you decide where your annotation budget goes instead of spreading it uniformly. Most items get one pass. A stratified slice gets three, so you can report reliability across the whole corpus, not just one corner of it. Items that turn out to be genuinely hard get escalated automatically, and the contested ones route to an adjudicator. You spend the most on the data that is most uncertain, and you can defend every coverage decision in a methods section.
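The budget effect of such a plan can be estimated with simple arithmetic. The helper below is a back-of-the-envelope sketch, not part of Potato: it ignores adaptive boosts and per-stratum rounding, and the 10,000-item corpus is a made-up figure.

```python
def planned_annotations(n_items, default=1, fraction=0.1, count=3):
    """Baseline annotation count: overlap items at `count`, the rest at `default`."""
    overlap_items = round(n_items * fraction)
    return (n_items - overlap_items) * default + overlap_items * count


n = 10_000
mixed = planned_annotations(n)  # 9,000 single-annotated + 1,000 triple-annotated
full_triple = n * 3             # every item annotated three times
print(mixed, full_triple)
```

For 10,000 items, a 10% overlap sample at three annotators costs 12,000 annotations before any adaptive boosts, against 30,000 for full triple coverage.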

How many annotators you actually need for a given task is its own question; the [how many annotators do you need](/blog/how-many-annotators-do-you-need) post works through the rules of thumb. This release is about making whatever answer you land on easy to express. Heterogeneous coverage ships in **Potato 2.6**; see the [heterogeneous coverage docs](/docs/deployment/heterogeneous-coverage) and the [task assignment reference](/docs/deployment/task-assignment) for everything the blocks above can do.