Documenting Datasets and Models: Datasheets, Data Statements, and Model Cards

A reference to the three standard documentation frameworks for annotated data and the models built on it: what each covers, when to use which, and how reproducibility reporting ties them together.

Three documentation standards have become the norm for machine-learning data: data statements and datasheets for the dataset, model cards for what you train on it. They overlap heavily and none is optional if you want the data trusted and reused. This guide is a reference to what each covers and when to reach for it. For a narrative walk-through of writing one, see the companion post on documenting your annotation dataset; this page is the standards comparison.

Why structured documentation, not a README

An annotated dataset without documentation ages badly. Six months on, nobody can tell how it was sampled, who labeled it, or what a label was supposed to mean, so the data becomes a black box people either trust blindly or throw away. Two specific costs recur: irreproducibility (you cannot rebuild the dataset or explain a discrepancy without the sampling method, guideline version, and annotator pool) and hidden bias (labels from an undocumented, narrow pool carry blind spots that stay invisible until they surface in production). The frameworks below exist to make the who and how legible before either bites.

The three standards

Each framework targets a different artifact and audience, but they were designed to interlock.

Data statements (Bender and Friedman, 2018) are the NLP-specific schema. They characterize a language dataset (its curation rationale, language variety and speakers, annotator demographics, guidelines, and intended use) so a reader can judge how results will generalize and which populations the data underrepresents. Reach for a data statement when the data is text and language variety matters.

Datasheets for datasets (Gebru et al., 2021) are the general-purpose version, borrowed from electronics, where every component ships with a datasheet. They ask a standard question set across motivation, composition, collection process, preprocessing, recommended uses, and maintenance. Use a datasheet for any ML dataset, text or not; it overlaps a data statement heavily, so on a language dataset you are really choosing which question set to organize around, not doing both from scratch.

Model cards (Mitchell et al., 2019) document the model, not the data: its intended use, and, crucially, its performance broken down across demographic and other groups rather than as one aggregate number. A model card is where a fairness problem becomes visible.
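As a rough illustration of what "broken down across groups" means in practice, the sketch below computes accuracy overall and per group. The record layout (`gold`, `pred`, `group`) and the toy data are assumptions made for this example, not a format taken from the model cards paper.

```python
from collections import defaultdict

def disaggregated_accuracy(records, group_key="group"):
    """Return (overall accuracy, {group: accuracy}).

    Each record is a dict with the hypothetical keys "gold", "pred",
    and group_key.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[group_key]
        total[group] += 1
        correct[group] += int(record["gold"] == record["pred"])
    per_group = {group: correct[group] / total[group] for group in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_group

# Toy predictions: group A is classified perfectly, group B is not.
records = [
    {"group": "A", "gold": 1, "pred": 1},
    {"group": "A", "gold": 0, "pred": 0},
    {"group": "A", "gold": 1, "pred": 1},
    {"group": "B", "gold": 1, "pred": 0},
    {"group": "B", "gold": 0, "pred": 0},
]

overall, per_group = disaggregated_accuracy(records)
print(overall)    # 0.8
print(per_group)  # {'A': 1.0, 'B': 0.5}
```

The aggregate of 0.8 looks acceptable on its own; only the per-group breakdown shows that group B sits at 0.5.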

The three form a chain. A datasheet or data statement documents the data; a model card documents what was built on it; and the annotator-demographics section of the first is exactly what makes the group-wise evaluation in the last interpretable. Document the annotation well and you are already most of the way to a defensible model card.

Framework	Documents	Best for	Key sections
Data statement	A language dataset	NLP / text data	Curation rationale, language variety, speaker + annotator demographics, guidelines
Datasheet	Any ML dataset	General ML data	Motivation, composition, collection, uses, maintenance
Model card	A trained model	Any released model	Intended use, disaggregated evaluation, limitations

Reproducibility is the fourth leg

Documentation and reproducibility are the same goal from two angles. Pineau et al. (2021) reported on the NeurIPS reproducibility program and distilled it into a reproducibility checklist: report the exact data, the collection and preprocessing steps, the evaluation setup, and enough detail to rerun the work. For an annotation project specifically, the reproducibility-critical facts are the ones a datasheet already asks for: how items were sampled, what guideline version was used, who annotated, and how disagreement was handled. If you can answer those, the dataset is both documented and reproducible; if you cannot, that is a gap to close before release, not after.
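One way to pin down "the exact data" is to ship a small manifest next to the release that records a checksum of the data file together with the four reproducibility-critical facts. The sketch below is only illustrative: the field names and sample values are hypothetical and are not prescribed by the checklist or by any of the three frameworks.

```python
import hashlib
import json

def build_manifest(data_bytes, sampling, guideline_version,
                   annotator_ids, disagreement_policy):
    """Bundle a checksum of the released data with the facts needed to rebuild it."""
    return {
        # SHA-256 of the exact bytes released, so readers can verify their copy.
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "n_bytes": len(data_bytes),
        "sampling": sampling,
        "guideline_version": guideline_version,
        "n_annotators": len(set(annotator_ids)),
        "disagreement_policy": disagreement_policy,
    }

# Stand-in for the contents of the released data file.
data_bytes = b'{"id": "doc_001", "annotations": {"sentiment": "positive"}}\n'

manifest = build_manifest(
    data_bytes,
    sampling="uniform random sample from the source corpus",  # hypothetical
    guideline_version="v1.2",                                  # hypothetical
    annotator_ids=["user_1", "user_2", "user_1"],
    disagreement_policy="majority vote; per-item distribution also released",
)
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
print(manifest_json)
```

For a real release, the bytes would be read from the data file in binary mode, and the manifest would be version-controlled alongside the guidelines it names.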

A release checklist

Before you publish, confirm you can answer:

How were items sampled, and from where?
What language variety is this, and who wrote the source text?
Who annotated it, how many people, and what is the pool's demographic makeup?
What guidelines did they follow, and which version?
Was disagreement aggregated to a gold label or kept as a distribution? Report agreement either way.
What is this dataset for, and what should it not be used for?
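The disagreement question in the checklist can be made concrete with a few lines of code. The sketch below shows both options for one item (a majority-vote gold label and a label distribution) plus a simple pairwise agreement figure. The item layout is an assumption for illustration, and raw pairwise agreement is not chance-corrected; chance-corrected measures such as Cohen's kappa or Krippendorff's alpha are commonly reported alongside or instead of it.

```python
from collections import Counter
from itertools import combinations

def label_distribution(labels):
    """Keep disagreement: the fraction of annotators choosing each label."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}

def majority_label(labels):
    """Collapse disagreement to one gold label.

    Ties go to the label encountered first; a real project should
    document whatever tie-breaking rule it actually uses.
    """
    return Counter(labels).most_common(1)[0][0]

def pairwise_agreement(items):
    """Fraction of annotator pairs, pooled over items, that chose the same label."""
    agree = 0
    pairs = 0
    for labels in items.values():
        for a, b in combinations(labels, 2):
            pairs += 1
            agree += int(a == b)
    return agree / pairs if pairs else float("nan")

# Hypothetical layout: item id -> labels from each annotator.
items = {
    "doc_001": ["positive", "positive", "negative"],
    "doc_002": ["neutral", "neutral", "neutral"],
}

gold = {item_id: majority_label(labels) for item_id, labels in items.items()}
soft = {item_id: label_distribution(labels) for item_id, labels in items.items()}
agreement = pairwise_agreement(items)  # 4 agreeing pairs out of 6
```

Whichever form is released, stating the aggregation rule and the agreement figure lets a reader see how much signal the gold label discards.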

Doing it in Potato

Most of a dataset's documentation already exists as project artifacts, so you are not starting from a blank page. The config is documentation: the YAML records the schemes, label sets, and task structure and is version-controlled next to the data, and the instructions you wrote serve verbatim as the guideline section. If you ran a prestudy phase, the annotator demographics are already stored per annotator; aggregate them into distributions for the annotator-demographics section. And the export keeps the annotator and timestamp on every label, so provenance travels with the data instead of getting stripped:

```json
{
  "id": "doc_001",
  "annotations": { "sentiment": "positive" },
  "annotator": "user_1",
  "timestamp": "2024-01-15T10:30:00Z"
}
```
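A minimal sketch of the aggregation step follows. It assumes the export is available as one JSON record per line in the shape shown above, and that prestudy answers can be loaded into a dict keyed by annotator ID; both the storage layout and the demographic fields (`age_band`, `first_language`) are hypothetical, not Potato's documented format.

```python
import json
from collections import Counter

# Assumed export: one JSON record per line, shaped like the record above.
export_lines = [
    '{"id": "doc_001", "annotations": {"sentiment": "positive"}, "annotator": "user_1", "timestamp": "2024-01-15T10:30:00Z"}',
    '{"id": "doc_002", "annotations": {"sentiment": "negative"}, "annotator": "user_1", "timestamp": "2024-01-15T10:31:10Z"}',
    '{"id": "doc_001", "annotations": {"sentiment": "positive"}, "annotator": "user_2", "timestamp": "2024-01-16T09:02:45Z"}',
]

# Hypothetical prestudy answers keyed by annotator ID.
demographics = {
    "user_1": {"age_band": "25-34", "first_language": "English"},
    "user_2": {"age_band": "35-44", "first_language": "Spanish"},
    "user_3": {"age_band": "25-34", "first_language": "English"},  # contributed no labels
}

records = [json.loads(line) for line in export_lines]

# Provenance summary: how many labels each annotator contributed.
labels_per_annotator = Counter(record["annotator"] for record in records)

# ISO 8601 UTC timestamps in a uniform format sort correctly as strings.
timestamps = [record["timestamp"] for record in records]
collection_window = (min(timestamps), max(timestamps))

def demographic_distribution(demographics, annotators, field):
    """Share of the given annotators falling into each value of one field."""
    counts = Counter(demographics[annotator][field] for annotator in annotators)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Describe only the annotators whose labels are actually in the release.
contributors = sorted(labels_per_annotator)
language_dist = demographic_distribution(demographics, contributors, "first_language")
```

Reporting distributions over contributors rather than over everyone who took the prestudy keeps the demographics section aligned with the labels that ship, and publishing only aggregates avoids exposing individual annotators.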

When you publish to the Hub, generate the dataset card as the last step of the export and fill its sections from the config, the guidelines, and the prestudy demographics you already have.
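The last step might look like the sketch below, which renders a card as Markdown text from material you already have. It assumes "the Hub" means the Hugging Face Hub, where a dataset card is a README.md with a YAML metadata block at the top. The function, the section headings, and the sample values are illustrative choices, not the Hub's official template and not something Potato is known to generate.

```python
def render_dataset_card(title, language, license_id, sections):
    """Render a dataset card as Markdown with a small YAML metadata header.

    `sections` maps a heading to its body text; empty bodies are flagged
    so that gaps stay visible instead of being silently dropped.
    """
    lines = [
        "---",
        "language:",
        f"- {language}",
        f"license: {license_id}",
        "---",
        "",
        f"# {title}",
        "",
    ]
    for heading, body in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        lines.append(body.strip() or "TODO: not yet documented")
        lines.append("")
    return "\n".join(lines)

# Hypothetical values pulled from the config, guidelines, and prestudy data.
card = render_dataset_card(
    title="Sentiment annotations",
    language="en",
    license_id="cc-by-4.0",
    sections={
        "Curation rationale": "Documents sampled to cover product reviews.",
        "Annotation guidelines": "Guideline version v1.2, included in the repository.",
        "Annotator demographics": "",
        "Intended and out-of-scope uses": "Research on sentiment classification.",
    },
)
print(card)
```

Flagging empty sections with a TODO is a deliberate choice here: an unanswered question in the card is easier to catch before release than a section that was quietly left out.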