Statistical Power and Sample Size for Annotation Studies

How many items you need for a result to mean something, why that is a different question from how many annotators per item, and how to avoid underpowered, over-claimed annotation and evaluation studies.

"How many annotators?" and "how many items?" are two different questions that get confused constantly. Annotator overlap controls how reliable each item's label is; the number of items controls whether a difference you observe is real or noise. A study can have five annotators per item and still be too small to support its conclusion. This guide is about the second axis, statistical power, and how to keep an annotation or evaluation study from over-claiming.

Two budgets, not one

Every annotation project spends effort along two independent axes, and it helps to name them separately:

Overlap (annotators per item): buys you label reliability, the confidence that a single item's label is right. This is the subject of "How Many Annotators Do You Need?"
Sample size (number of items): buys you statistical power, the ability to detect a real difference between conditions, models, or groups.

They trade against a fixed budget but solve different problems. Ten annotators labeling 50 items gives you very reliable labels for a sample too small to compare anything. One annotator labeling 5,000 items gives you noisy labels but enough of them to detect a real effect. Which mistake you are about to make depends on which question you are actually asking.
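To make the trade concrete, here is a rough sketch of how one label budget splits across overlap levels. It assumes the quantity of interest is a proportion near 50% and uses the textbook normal-approximation margin of error, which ignores label noise entirely. The function name and the budget of 5,000 labels are illustrative.

```python
from math import sqrt

def budget_options(total_labels, overlaps=(1, 3, 5, 10)):
    """For a fixed label budget, return (overlap, items, margin) tuples.

    margin is the approximate 95% margin of error on a proportion near 0.5
    estimated from that many items, ignoring label noise.
    """
    options = []
    for k in overlaps:
        n_items = total_labels // k          # items affordable at this overlap
        margin = 1.96 * sqrt(0.25 / n_items)  # worst-case binomial margin
        options.append((k, n_items, round(margin, 3)))
    return options

for overlap, n_items, margin in budget_options(5000):
    print(f"overlap={overlap:>2}  items={n_items:>4}  margin=+/-{margin}")
```

Each extra annotator per item shrinks the item count and widens the margin on any comparison, which is the cost you weigh against cleaner individual labels.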

What statistical power is

Statistical power is the probability that your study detects an effect that is genuinely there. Low power means that even when model A really is better than model B, your experiment often fails to show it, and, less obviously, that the "significant" results you do get are more likely to be flukes with inflated effect sizes. The convention is to aim for 80% power, which requires deciding in advance the smallest effect worth detecting and sizing the study to catch it.
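Both effects can be seen in a small simulation. This is a minimal sketch, assuming a pairwise preference study in which each item independently yields a win for model A with a fixed probability, ties are ignored, and significance is judged with a two-sided normal-approximation test against a 50% win rate. The function name and parameters are illustrative and do not come from any library.

```python
import random
from statistics import NormalDist

def simulate_power(true_win_rate, n_items, alpha=0.05, n_sims=2000, seed=0):
    """Return (estimated power, mean observed win rate among significant runs)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    std_err = (0.25 / n_items) ** 0.5  # standard error under the 50% null
    significant_rates = []
    for _ in range(n_sims):
        # One simulated study: count items where A wins.
        wins = sum(rng.random() < true_win_rate for _ in range(n_items))
        observed = wins / n_items
        if abs(observed - 0.5) / std_err > z_crit:
            significant_rates.append(observed)
    power = len(significant_rates) / n_sims
    mean_sig = (sum(significant_rates) / len(significant_rates)
                if significant_rates else None)
    return power, mean_sig

for n in (100, 1000):
    power, mean_sig = simulate_power(0.55, n)
    print(f"items={n}  power={power:.2f}  mean significant win rate={mean_sig}")
```

Under these assumptions, a true 55% win rate is detected in well under half of the simulated 100-item studies, and the runs that do reach significance report an average win rate above the true 55%. With 1,000 items, most runs detect the effect.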

The uncomfortable finding is how often this is skipped. Card et al. (2020) ran power analyses across common NLP setups and found that many published comparisons are badly underpowered: to reliably detect the small differences that typical papers claim, especially in human evaluation, you often need hundreds to thousands of items, far more than studies actually use. Their practical takeaway is to run the power calculation before collecting data, not to reverse-engineer significance after.

Getting the significance test right

Having enough items is necessary but not sufficient; you also have to test correctly. Dror et al. (2018) is the standard reference here, and its advice is concrete:

Match the test to the data. NLP metrics are usually not normally distributed, so prefer nonparametric options such as bootstrap and permutation tests rather than assuming a t-test applies.
Correct for multiple comparisons. Testing many models, metrics, or subgroups inflates false positives; adjust (Bonferroni, or better, Benjamini-Hochberg) when you run many tests.
Report effect size and a confidence interval, not just a p-value. With enough items, a difference can be statistically significant and practically meaningless. The effect size and interval tell the reader whether to care.
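The three pieces of advice above can be sketched with the standard library alone. This is an illustration, not the procedure from Dror et al.: it assumes two systems scored on the same items, uses a sign-flip permutation test on the per-item differences, a percentile bootstrap for the interval, and the Benjamini-Hochberg step-up rule. All function names are hypothetical.

```python
import random

def paired_permutation_pvalue(scores_a, scores_b, n_resamples=5000, seed=0):
    """Two-sided sign-flip permutation test on per-item score differences."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each item's difference is equally likely to flip sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_resamples + 1)

def bootstrap_ci(scores_a, scores_b, level=0.95, n_resamples=5000, seed=0):
    """Percentile bootstrap interval for the mean per-item difference."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((1 - level) / 2 * n_resamples)]
    hi = means[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi

def benjamini_hochberg(pvalues, q=0.05):
    """Return one reject flag per p-value, controlling the false discovery rate at q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= q * rank / m:
            cutoff = rank  # largest rank that passes its threshold
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected

scores_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
scores_b = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]
print(paired_permutation_pvalue(scores_a, scores_b))
print(bootstrap_ci(scores_a, scores_b))
print(benjamini_hochberg([0.001, 0.02, 0.03, 0.5]))
```

In the last call, Benjamini-Hochberg rejects the first three hypotheses, whereas a Bonferroni threshold of 0.05 / 4 would reject only the first. That gap is why the less conservative correction is often preferred when many comparisons are run.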

A workable recipe

State the smallest difference that would matter (say, a 2-point difference in win rate).
Run a power analysis for that effect at 80% power to get a target number of items.
Decide overlap separately, based on how subjective the labels are (see the annotator-count guide).
After collection, use a bootstrap or permutation test, correct for the number of comparisons, and report effect sizes with intervals.

The order matters: sizing the study after seeing the data is how underpowered results get dressed up as findings.
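Steps 1 and 2 of the recipe can be sketched with a closed-form approximation. The sketch assumes the "difference" is a win rate tested against a 50% null with a two-sided one-sample proportion test, and it uses the usual normal-approximation sample-size formula. The function name is hypothetical, and reading a "2-point difference" as 52% versus 50% is an assumption.

```python
from math import ceil, sqrt
from statistics import NormalDist

def items_needed(win_rate, null_rate=0.5, alpha=0.05, power=0.80):
    """Approximate number of items needed to detect win_rate against null_rate."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(null_rate * (1 - null_rate))
                 + z_power * sqrt(win_rate * (1 - win_rate)))
    return ceil((numerator / (win_rate - null_rate)) ** 2)

for rate in (0.52, 0.55, 0.60):
    print(f"true win rate {rate:.2f}: about {items_needed(rate)} items")
```

Under these assumptions, a 52% win rate needs on the order of several thousand items, while a 60% win rate needs only a couple of hundred. Required sample size grows roughly with the inverse square of the effect, so the smallest difference you commit to in step 1 dominates the budget.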

Doing it in Potato

Power is a design decision, not a config key, but Potato's job is to give you clean data to run the analysis on. Set overlap for reliability and instance counts for sample size in task assignment:

```yaml
automatic_assignment:
  on: true
  instance_per_annotator: 400    # sample size: items each annotator sees
  labels_per_instance: 3         # overlap: reliability per item
```

The two knobs are independent on purpose. The export keeps every annotator's individual label with their ID and timestamp, which is what lets you bootstrap-resample by item and by annotator when you compute significance offline. Preserving the per-annotator labels rather than only the aggregate is what makes a proper power-aware analysis possible; collapse to one gold label too early and you lose the variance the bootstrap needs.
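A two-level bootstrap over such an export might look like the sketch below. It assumes the export has already been loaded into a list of dicts. The field names `instance_id`, `annotator_id`, and `label` are hypothetical placeholders, not Potato's documented schema, so check them against your own output files. Items are resampled with replacement, then each sampled item's annotator labels are resampled with replacement.

```python
import random
from collections import defaultdict

def two_level_bootstrap(rows, statistic, n_resamples=2000, level=0.95, seed=0):
    """Percentile interval for statistic, resampling items and then labels per item."""
    rng = random.Random(seed)
    by_item = defaultdict(list)
    for row in rows:
        # Hypothetical field names; adapt to the actual export schema.
        by_item[row["instance_id"]].append(row["label"])
    items = list(by_item.values())
    stats = []
    for _ in range(n_resamples):
        sample = [rng.choices(labels, k=len(labels))          # resample annotators
                  for labels in rng.choices(items, k=len(items))]  # resample items
        stats.append(statistic(sample))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_resamples)]
    hi = stats[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi

def win_rate_for_a(sample):
    """Share of items whose strict-majority label is 'A' (ties count as not A)."""
    wins = sum(labels.count("A") * 2 > len(labels) for labels in sample)
    return wins / len(sample)

rows = [
    {"instance_id": i, "annotator_id": a, "label": "A" if (i + a) % 3 else "B"}
    for i in range(30) for a in range(3)
]
print(two_level_bootstrap(rows, win_rate_for_a, n_resamples=500))
```

Because the majority vote is recomputed inside every resample, the interval reflects both item sampling and annotator disagreement. That information is gone once the data is reduced to a single gold label per item.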