# Documenting Your Annotation Dataset: Data Statements, Datasheets, and Why Undocumented Data Ages Badly

Source: https://www.potatoannotator.com/blog/documenting-annotation-datasets

You finish the annotation, export the labels, and hand off a file. Six months later someone trains a model on it, gets a strange result, and cannot tell whether the problem is their model or your data, because nobody wrote down who annotated it, how it was sampled, or what the labels were supposed to mean. The labels outlived the context that made them interpretable. This happens constantly, and it is almost entirely preventable with documentation you could have written while the project was fresh.

**A dataset is only as trustworthy as its documentation. A data statement records the curation rationale, the language and its speakers, the annotators and their demographics, the guidelines, and the intended use, so downstream users can judge how the labels will generalize and where they will not. Write it alongside the data, not as an afterthought, and most of it falls out of the annotation project you already ran.** This post is about what to document and how the Potato config already captures a good chunk of it.

## What undocumented data costs you

Two problems show up, and both are expensive later.

The first is irreproducibility. Without the sampling method, the guideline version, and the annotator pool, nobody can rebuild your dataset or explain a discrepancy against it. The data becomes a black box that people either trust blindly or discard.

The second is hidden bias. A model trained on labels from a narrow annotator pool inherits that pool's blind spots, and if the pool was never documented, the bias is invisible until it surfaces in production. This is the exact failure that documentation frameworks were invented to prevent: making the who and how of a dataset legible so the biases can be seen before they ship.

## What a data statement covers

The [data statement (Bender and Friedman, 2018)](https://aclanthology.org/Q18-1041/) is the NLP-specific answer, a schema for characterizing a language dataset so users can understand how results might generalize and what biases the data carries. The parts worth writing down:

- **Curation rationale.** What is in the dataset and why, including how items were sampled. A sample chosen from one subreddit is not the same dataset as a representative draw, and the rationale is where you say so.
- **Language variety.** The specific language and dialect, not just "English." A model built on one variety may fail on another.
- **Speaker demographics.** Who produced the source text.
- **Annotator demographics.** Who produced the labels. On subjective tasks this is decisive, because the annotator pool's composition shapes the labels, which is the whole argument for [collecting demographics in the first place](/blog/collecting-annotator-demographics-responsibly).
- **Annotation guidelines.** The instructions the labels were produced under. The same label name means different things under different guidelines.
- **Intended use.** What the dataset is for, and just as usefully, what it is not for.

![The anatomy of a data statement: a documentation card with sections for curation rationale, language variety, speaker demographics, annotator demographics, annotation guidelines, and intended use, each answering a question a downstream user would otherwise have to guess.](/images/blog/anatomy-of-a-data-statement.svg "A data statement records who and how, so downstream users can judge where the labels generalize")
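
One way to keep those six sections from drifting into a forgotten document is to hold them as a small structured record next to the data. The following is a minimal sketch, not part of Potato: the `DataStatement` class, its field names, and the example values are all hypothetical, chosen to mirror the list above.

```python
from dataclasses import dataclass, fields


@dataclass
class DataStatement:
    """Hypothetical container: one free-text field per data statement section."""
    curation_rationale: str = ""
    language_variety: str = ""
    speaker_demographics: str = ""
    annotator_demographics: str = ""
    annotation_guidelines: str = ""
    intended_use: str = ""

    def to_markdown(self) -> str:
        """Render each section as a heading, flagging sections left empty."""
        parts = []
        for f in fields(self):
            title = f.name.replace("_", " ").capitalize()
            body = getattr(self, f.name).strip() or "TODO: not yet documented"
            parts.append(f"## {title}\n\n{body}")
        return "\n\n".join(parts)


# Illustrative values only; a real statement would describe your own dataset.
statement = DataStatement(
    curation_rationale="Hypothetical example: posts sampled from a single subreddit.",
    language_variety="English (en-US), informal online writing",
)
print(statement.to_markdown())
```

Empty sections render as a visible `TODO`, so a half-written statement shows its gaps instead of hiding them.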

## Datasheets and model cards

Two neighboring frameworks round this out. [Datasheets for datasets (Gebru et al., 2018)](https://arxiv.org/abs/1803.09010) borrow the idea from the electronics industry, where every component ships with a datasheet: every dataset should ship with a document covering its motivation, composition, collection process, recommended uses, and maintenance. Data statements are the language-specific cousin; datasheets are the general-purpose version, and the two overlap heavily.

Downstream of the data sits the [model card (Mitchell et al., 2019)](https://arxiv.org/abs/1810.03993), which documents a trained model's intended use and its performance broken down across demographic and other groups. The three form a chain: a datasheet or data statement documents the data, a model card documents what was built on it, and the annotator-demographics section of the first is what makes the group-wise evaluation in the last interpretable. Document the annotation well and you are already halfway to a defensible model card.

## A release checklist

Before you release, confirm you can answer:

- How were items sampled, and from where?
- What language variety is this, and who wrote the source text?
- Who annotated it, how many people, and what is the demographic makeup of the pool?
- What guidelines did they follow, and which version?
- How was disagreement handled: aggregated to a gold label, or [kept as a distribution](/blog/disagreement-is-signal-not-noise)? Report agreement either way.
- What is this dataset for, and what should it not be used for?

If a question has no answer, that is a gap to close before release, not after.
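
The checklist can also run as a mechanical gate before release. This is a rough sketch under assumptions: the keys in `CHECKLIST` and the `find_gaps` helper are hypothetical names, and the answers would come from wherever you keep your project notes.

```python
# Hypothetical keys mapped to the release questions listed above.
CHECKLIST = {
    "sampling": "How were items sampled, and from where?",
    "language_variety": "What language variety is this, and who wrote the source text?",
    "annotators": "Who annotated it, how many people, and what is the demographic makeup of the pool?",
    "guidelines": "What guidelines did they follow, and which version?",
    "disagreement": "How was disagreement handled, and what was the agreement?",
    "intended_use": "What is this dataset for, and what should it not be used for?",
}


def find_gaps(answers: dict) -> list:
    """Return every checklist question that has no non-empty answer."""
    return [
        question
        for key, question in CHECKLIST.items()
        if not str(answers.get(key, "")).strip()
    ]


# Illustrative answers only: two questions answered, four still open.
answers = {
    "sampling": "Random draw from a hypothetical 2023 forum crawl.",
    "guidelines": "Guidelines v3, stored in the project repository.",
}
for question in find_gaps(answers):
    print("UNANSWERED:", question)
```

Failing a release script when `find_gaps` returns anything is one simple way to enforce "before release, not after."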

## Doing it in Potato

The useful thing about running annotation in Potato is that much of the data statement already exists as project artifacts. You do not start the documentation from a blank page.

The config *is* documentation. The YAML records the annotation schemes, the label sets, and the task structure, so the "what were the labels and how were they defined" part of a data statement is version-controlled alongside the data. The instructions you wrote for annotators (see [annotation guidelines](/docs/guides/writing-annotation-guidelines)) are the guideline section, verbatim.

The demographics are already collected. If you ran a prestudy phase, the [annotator demographics](/blog/collecting-annotator-demographics-responsibly) are stored per annotator. Report them as aggregate distributions, never as individual records, and they become the annotator-demographics section, ready to paste in.
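
Turning per-annotator answers into a distribution takes only a few lines. The sketch below assumes the prestudy answers have already been loaded into a list of dictionaries; the record layout, the `age_range` field, and the `demographic_distribution` helper are hypothetical and do not reflect Potato's actual storage format.

```python
from collections import Counter


def demographic_distribution(records, field):
    """Return the share of annotators per value of `field`, with no annotator IDs."""
    counts = Counter((r.get(field) or "not reported") for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: round(n / total, 3) for value, n in counts.items()}


# Illustrative pool of four annotators; one skipped the question.
pool = [
    {"annotator": "user_1", "age_range": "25-34"},
    {"annotator": "user_2", "age_range": "25-34"},
    {"annotator": "user_3", "age_range": "35-44"},
    {"annotator": "user_4"},
]
print(demographic_distribution(pool, "age_range"))
```

Only the proportions leave the function, so the published section describes the pool without exposing any one annotator's answers.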

The export carries the metadata. Potato's [export formats](/docs/features/export-formats) keep the annotator and timestamp on every label, so provenance travels with the data rather than getting stripped at the export step:

```json
{
  "id": "doc_001",
  "annotations": { "sentiment": "positive" },
  "annotator": "user_1",
  "timestamp": "2024-01-15T10:30:00Z"
}
```
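
Because each record carries its annotator and timestamp, the basic provenance facts can be computed straight from the export. This sketch assumes the export is read as JSON Lines, one record shaped like the example above per line; that layout, the `summarize_provenance` helper, and the sample records are assumptions for illustration, not a description of every Potato export format.

```python
import json
from collections import Counter
from datetime import datetime


def summarize_provenance(lines, scheme):
    """Summarize who labeled what, and when, from JSON Lines export records."""
    annotators = set()
    labels = Counter()
    timestamps = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        annotators.add(record["annotator"])
        label = record["annotations"].get(scheme)
        if label is not None:
            labels[label] += 1
        # Rewrite a trailing "Z" as an explicit UTC offset for fromisoformat().
        stamp = record["timestamp"].replace("Z", "+00:00")
        timestamps.append(datetime.fromisoformat(stamp))
    return {
        "annotator_count": len(annotators),
        "label_counts": dict(labels),
        "first_label_at": min(timestamps).isoformat() if timestamps else None,
        "last_label_at": max(timestamps).isoformat() if timestamps else None,
    }


# Illustrative records in the same shape as the export example.
sample = [
    '{"id": "doc_001", "annotations": {"sentiment": "positive"}, "annotator": "user_1", "timestamp": "2024-01-15T10:30:00Z"}',
    '{"id": "doc_001", "annotations": {"sentiment": "negative"}, "annotator": "user_2", "timestamp": "2024-01-15T11:05:00Z"}',
    '{"id": "doc_002", "annotations": {"sentiment": "positive"}, "annotator": "user_1", "timestamp": "2024-01-16T09:00:00Z"}',
]
summary = summarize_provenance(sample, "sentiment")
print(summary)
```

The annotator count, label distribution, and collection window it returns can feed the annotator and collection-process sections of the documentation.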

When you publish to the Hugging Face Hub, generate a dataset card as part of the export, as the [exporting to HuggingFace](/blog/exporting-to-huggingface) walkthrough shows, and fill its sections from the config, the guidelines, and the prestudy demographics you already have. The documentation stops being a separate writing project and becomes the last step of the annotation one.

## Where to go next

- [Collecting Annotator Demographics Responsibly](/blog/collecting-annotator-demographics-responsibly), for the annotator-demographics section done right.
- [Disagreement Is Signal, Not Noise](/blog/disagreement-is-signal-not-noise), for documenting how you handled disagreement.
- [Writing Effective Annotation Guidelines](/docs/guides/writing-annotation-guidelines), which double as the guideline section of a data statement.
- [Exporting Annotations for ML](/docs/guides/exporting-annotations-for-ml), for getting the labels and their metadata out cleanly.
