
Designing Data Formats for Annotation

How to structure input data (JSON, JSONL, CSV) for an annotation project, what fields Potato expects, and how to plan for clean export to training pipelines.

Good annotation starts with well-structured input. Each item needs a stable unique identifier and the content to be labeled; everything else is optional context. Getting this right at the start saves painful re-runs later, because annotations are keyed to your item IDs.

Common interchange formats are JSON, JSON Lines (one object per line, ideal for large datasets), and CSV. Potato reads all three. For the full reference see Data Formats.
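If your source data starts life as CSV and you prefer JSON Lines for a large dataset, the conversion is small. The sketch below uses only the Python standard library; the function name `csv_to_jsonl` and the sample rows are illustrative assumptions, not part of Potato.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str) -> str:
    """Convert CSV text with a header row into JSON Lines text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in reader)


# Illustrative sample data: the header row supplies the JSON keys.
sample_csv = (
    "id,text\n"
    'rev_001,"The battery lasts all day. Highly recommend."\n'
    "rev_002,Stopped working after a week.\n"
)

jsonl_text = csv_to_jsonl(sample_csv)
print(jsonl_text, end="")
```

Note that `csv.DictReader` reads every value as a string, so IDs such as `007` keep their leading zeros, which helps keep identifiers stable across conversions.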

The minimum each item needs

  • A unique ID that never changes. Annotations are stored against this ID, so if you renumber items mid-project you lose the link to existing labels.
  • The content to annotate: a text field, an image URL, an audio path, or a structured trace.

A JSONL file for a text task looks like this:

json
{"id": "rev_001", "text": "The battery lasts all day. Highly recommend."}
{"id": "rev_002", "text": "Stopped working after a week."}

You tell Potato which keys to use:

yaml
item_properties:
  id_key: id
  text_key: text
 
data_files:
  - "data/reviews.jsonl"
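Because annotations are keyed to item IDs, it can be worth checking the input before the project starts. The following is a minimal sketch of such a check, not a Potato feature: `validate_items` is a hypothetical helper, and its `id_key` and `text_key` arguments simply mirror the configuration keys shown above.

```python
import json
from collections import Counter


def validate_items(jsonl_text, id_key="id", text_key="text"):
    """Return a list of human-readable problems found in JSONL input."""
    problems = []
    ids = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            item = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(item, dict):
            problems.append(f"line {line_no}: expected a JSON object")
            continue
        if id_key in item:
            ids.append(str(item[id_key]))
        else:
            problems.append(f"line {line_no}: missing '{id_key}'")
        text = item.get(text_key)
        if not isinstance(text, str) or not text.strip():
            problems.append(f"line {line_no}: missing or empty '{text_key}'")
    # Duplicate IDs would make annotations ambiguous, so report each one.
    for item_id, count in Counter(ids).items():
        if count > 1:
            problems.append(f"duplicate id '{item_id}' appears {count} times")
    return problems


good = (
    '{"id": "rev_001", "text": "The battery lasts all day. Highly recommend."}\n'
    '{"id": "rev_002", "text": "Stopped working after a week."}\n'
)
bad = '{"id": "a", "text": "ok"}\n{"id": "a", "text": ""}\nnot json\n'

print(validate_items(good))
for problem in validate_items(bad):
    print(problem)
```

An empty list means the file passed these checks. Running this once before annotators begin is cheaper than discovering a duplicate ID after labels have been collected.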

Carry context, but keep it separate from labels

Extra fields such as a source URL, a timestamp, or a model name can ride along on each item and be shown to annotators without becoming labels. Name them clearly so the export is easy to read later.
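One way to keep context and labels apart is to check that no context field shares a name with a label you plan to collect. The sketch below is an illustration under assumptions: the field names (`source_url`, `collected_at`, `model_name`), the URL, and the label names in `LABEL_NAMES` are all hypothetical.

```python
# A hypothetical item: id and text are the required fields, the rest is context.
item = {
    "id": "rev_001",
    "text": "The battery lasts all day. Highly recommend.",
    "source_url": "https://example.com/reviews/1",
    "collected_at": "2024-01-15T09:30:00Z",
    "model_name": "example-model",
}

# Hypothetical names of the labels annotators will produce.
LABEL_NAMES = {"sentiment", "quality"}


def context_fields(item, id_key="id", text_key="text"):
    """Return every field that is neither the ID nor the content to annotate."""
    return {k: v for k, v in item.items() if k not in (id_key, text_key)}


def name_collisions(item, label_names, id_key="id", text_key="text"):
    """Return context field names that clash with planned label names."""
    return sorted(set(context_fields(item, id_key, text_key)) & set(label_names))


print(sorted(context_fields(item)))
print(name_collisions(item, LABEL_NAMES))
```

If `name_collisions` returns anything, renaming the context field (for example to `source_sentiment`) before labeling starts avoids ambiguity in the exported file.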

Plan the export before you label

Decide early how labeled data will feed your pipeline. Potato exports to JSON, JSONL, and CSV, and to ML-native formats such as CoNLL for sequence labeling, Hugging Face Datasets, spaCy, and COCO/YOLO for vision. Choosing the target format up front tells you which fields and ID scheme to use now. See Exporting Annotations for ML.

yaml
output_annotation_dir: "annotation_output/"
output_annotation_format: "jsonl"
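Whatever export format you choose, a training pipeline usually needs labels joined back to the original items by ID. The sketch below shows that join in plain Python. The shape of the annotation records (`{"id": ..., "label": ...}`) is an assumption made for illustration and may not match what Potato actually writes; `load_jsonl` and `join_labels` are hypothetical helpers.

```python
import json


def load_jsonl(text):
    """Parse JSON Lines text into a list of objects, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def join_labels(items, annotations, id_key="id"):
    """Attach each annotation's label to its source item, matched by ID.

    Returns (joined, orphans), where orphans lists annotation IDs that have
    no matching item, for example because items were renumbered mid-project.
    """
    by_id = {item[id_key]: item for item in items}
    joined, orphans = [], []
    for ann in annotations:
        item = by_id.get(ann[id_key])
        if item is None:
            orphans.append(ann[id_key])
            continue
        joined.append({**item, "label": ann["label"]})
    return joined, orphans


items = load_jsonl(
    '{"id": "rev_001", "text": "The battery lasts all day. Highly recommend."}\n'
    '{"id": "rev_002", "text": "Stopped working after a week."}\n'
)
# Assumed record shape, for illustration only.
annotations = load_jsonl(
    '{"id": "rev_001", "label": "positive"}\n'
    '{"id": "rev_002", "label": "negative"}\n'
    '{"id": "rev_999", "label": "positive"}\n'
)

joined, orphans = join_labels(items, annotations)
print(joined)
print(orphans)
```

A non-empty `orphans` list is the symptom of the ID problem described earlier: labels exist, but the items they point to can no longer be found.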

Further reading