Export Formats
Export Potato annotations to JSON, JSONL, CSV, CoNLL, HuggingFace Datasets, spaCy, COCO, YOLO, Pascal VOC, and Apache Parquet for downstream ML pipelines.
Export Formats
Potato provides two levels of export:
- Native Export - Annotations are automatically saved in JSON/JSONL/CSV/TSV/Parquet format as configured
- Export CLI (New in v2.2.0) -
python -m potato.exportconverts annotations to specialized formats (COCO, YOLO, Pascal VOC, CoNLL-2003, CoNLL-U, Mask PNG, Parquet, EAF, TextGrid)
This page covers both the built-in formats and the export CLI, plus example conversion scripts for common targets.
Basic Export Formats
JSON
The default output format. Each annotator's work is saved as a JSON file:
{
"id": "doc_001",
"annotations": {
"sentiment": "positive",
"confidence": 4
},
"annotator": "user_1",
"timestamp": "2024-01-15T10:30:00Z"
}Configure in YAML:
export_annotation_format: "json"
output_annotation_dir: "output/"JSON Lines (JSONL)
One annotation per line, ideal for streaming and large datasets:
{"id": "doc_001", "annotations": {"sentiment": "positive"}, "annotator": "user_1"}
{"id": "doc_002", "annotations": {"sentiment": "negative"}, "annotator": "user_1"}export_annotation_format: "jsonl"CSV
Tabular format for spreadsheet analysis:
id,annotator,sentiment,confidence,timestamp
doc_001,user_1,positive,4,2024-01-15T10:30:00Z
doc_002,user_1,negative,2,2024-01-15T10:31:00Zexport_annotation_format: "csv"TSV
Tab-separated values:
export_annotation_format: "tsv"Export CLI
New in v2.2.0
The export CLI converts Potato annotations to specialized formats with a single command:
# List available export formats
python -m potato.export --list-formats
# Export to COCO format
python -m potato.export --config config.yaml --format coco --output ./export/
# Export to YOLO format
python -m potato.export --config config.yaml --format yolo --output ./export/
# Export with options
python -m potato.export --config config.yaml --format coco --output ./export/ \
--option split_ratio=0.8 --option include_unlabeled=falseCLI Options
| Option | Description |
|---|---|
--config, -c | Path to Potato YAML config file |
--format, -f | Export format (coco, yolo, pascal_voc, conll_2003, conll_u, mask, parquet, eaf, textgrid) |
--output, -o | Output directory (default: ./export_output) |
--option | Format-specific option as key=value (repeatable) |
--list-formats | List available formats and exit |
--verbose, -v | Enable verbose logging |
Supported Export Formats
| Format | ID | Best For |
|---|---|---|
| COCO | coco | Object detection, instance segmentation |
| YOLO | yolo | YOLO model training |
| Pascal VOC | pascal_voc | Object detection (XML) |
| CoNLL-2003 | conll_2003 | NER, sequence labeling |
| CoNLL-U | conll_u | POS tagging, dependency parsing |
| Segmentation Masks | mask | Semantic/instance segmentation |
| Apache Parquet | parquet | Columnar storage for large datasets |
| ELAN EAF | eaf | Speech/video annotation timelines |
| Praat TextGrid | textgrid | Phonetic/speech annotation |
Format Compatibility Matrix
| Annotation Type | COCO | YOLO | Pascal VOC | CoNLL-2003 | CoNLL-U | Mask |
|---|---|---|---|---|---|---|
| Bounding boxes | Yes | Yes | Yes | - | - | - |
| Polygons | Yes | - | - | - | - | Yes |
| Keypoints | Yes | - | - | - | - | - |
| Text spans | - | - | - | Yes | Yes | - |
| Classifications | Partial | - | - | - | - | - |
Programmatic Export
Use the export registry directly in Python:
from potato.export.registry import export_registry
from potato.export.cli import build_export_context
context = build_export_context("path/to/config.yaml")
result = export_registry.export("coco", context, "./output/")
if result.success:
print(f"Exported {len(result.files_written)} files")Custom Exporters
Create custom exporters by subclassing BaseExporter:
from potato.export.base import BaseExporter, ExportContext, ExportResult
class MyExporter(BaseExporter):
format_name = "my_format"
description = "My custom export format"
file_extensions = [".myformat"]
def can_export(self, context: ExportContext) -> tuple:
has_spans = any(ann.get("spans") for ann in context.annotations)
if not has_spans:
return False, "No span annotations found"
return True, None
def export(self, context: ExportContext, output_path: str,
options: dict = None) -> ExportResult:
# Perform the export
return ExportResult(
success=True,
format_name=self.format_name,
files_written=["output.myformat"],
stats={"annotations": len(context.annotations)}
)
from potato.export.registry import export_registry
export_registry.register(MyExporter())NLP Export Formats
CoNLL Format
Standard format for sequence labeling tasks (NER, POS tagging):
The O
quick B-ADJ
brown I-ADJ
fox B-NOUN
jumps B-VERB
Example conversion script that reads Potato span annotations and writes CoNLL format:
import json
from pathlib import Path
def convert_to_conll(annotations_dir, output_file, scheme_name="entities"):
"""Convert Potato span annotations to CoNLL format."""
with open(output_file, "w") as out:
for file in sorted(Path(annotations_dir).glob("*.json")):
with open(file) as f:
data = json.load(f)
for item_id, item_data in data.items():
text = item_data.get("text", "")
tokens = text.split()
labels = ["O"] * len(tokens)
spans = item_data.get("annotations", {}).get(scheme_name, [])
for span in spans:
start_tok = span.get("start")
end_tok = span.get("end")
label = span.get("label", "ENT")
for i in range(start_tok, min(end_tok, len(tokens))):
prefix = "B" if i == start_tok else "I"
labels[i] = f"{prefix}-{label}"
for token, label in zip(tokens, labels):
out.write(f"{token}\t{label}\n")
out.write("\n")
# Usage
convert_to_conll("annotation_output/", "annotations.conll", "entities")IOB2 Format
Inside-Outside-Beginning tagging for entity recognition:
John B-PER
Smith I-PER
works O
at O
Google B-ORG
spaCy Format
Example conversion script that reads Potato output and creates a spaCy DocBin for NER training:
import json
import spacy
from spacy.tokens import DocBin
from pathlib import Path
def convert_to_spacy(annotations_dir, output_file, scheme_name="entities"):
"""Convert Potato span annotations to spaCy DocBin format."""
nlp = spacy.blank("en")
doc_bin = DocBin()
for file in sorted(Path(annotations_dir).glob("*.json")):
with open(file) as f:
data = json.load(f)
for item_id, item_data in data.items():
text = item_data.get("text", "")
doc = nlp.make_doc(text)
ents = []
spans = item_data.get("annotations", {}).get(scheme_name, [])
for span in spans:
char_span = doc.char_span(
span["start_offset"], span["end_offset"],
label=span["label"]
)
if char_span is not None:
ents.append(char_span)
doc.ents = ents
doc_bin.add(doc)
doc_bin.to_disk(output_file)
# Usage
convert_to_spacy("annotation_output/", "train.spacy", "entities")The output can be directly used with spacy train:
python -m spacy train config.cfg --paths.train ./annotations.spacyHuggingFace Datasets
Example conversion script using the datasets library to convert Potato output into a HuggingFace Dataset:
import json
from pathlib import Path
from datasets import Dataset, DatasetDict
def convert_to_huggingface(annotations_dir, output_dir, scheme_names):
"""Convert Potato annotations to a HuggingFace Dataset."""
records = []
for file in sorted(Path(annotations_dir).glob("*.json")):
with open(file) as f:
data = json.load(f)
for item_id, item_data in data.items():
record = {"id": item_id, "text": item_data.get("text", "")}
annotations = item_data.get("annotations", {})
for scheme in scheme_names:
record[scheme] = annotations.get(scheme)
records.append(record)
dataset = Dataset.from_list(records)
dataset.save_to_disk(output_dir)
print(f"Saved {len(records)} examples to {output_dir}")
# Usage
convert_to_huggingface("annotation_output/", "hf_dataset/", ["sentiment", "entities"])Load in your training script:
from datasets import load_from_disk
dataset = load_from_disk("hf_dataset/")Computer Vision Export Formats
COCO Format
Standard format for object detection and segmentation:
{
"images": [
{"id": 1, "file_name": "image_001.jpg", "width": 640, "height": 480}
],
"annotations": [
{
"id": 1,
"image_id": 1,
"category_id": 1,
"bbox": [100, 150, 200, 300],
"area": 60000,
"segmentation": [[100, 150, 300, 150, 300, 450, 100, 450]]
}
],
"categories": [
{"id": 1, "name": "person"}
]
}Example conversion script that reads Potato bounding box annotations and writes COCO JSON:
import json
from pathlib import Path
from PIL import Image
def convert_to_coco(annotations_dir, images_dir, output_file, scheme_name="objects"):
"""Convert Potato bounding box annotations to COCO format."""
coco = {"images": [], "annotations": [], "categories": []}
category_map = {}
ann_id = 1
for file in sorted(Path(annotations_dir).glob("*.json")):
with open(file) as f:
data = json.load(f)
for img_idx, (item_id, item_data) in enumerate(data.items(), start=1):
# Get image dimensions
img_path = Path(images_dir) / item_data.get("filename", f"{item_id}.jpg")
if img_path.exists():
img = Image.open(img_path)
w, h = img.size
else:
w, h = item_data.get("width", 0), item_data.get("height", 0)
coco["images"].append({
"id": img_idx,
"file_name": img_path.name,
"width": w, "height": h
})
bboxes = item_data.get("annotations", {}).get(scheme_name, [])
for bbox in bboxes:
label = bbox["label"]
if label not in category_map:
cat_id = len(category_map) + 1
category_map[label] = cat_id
coco["categories"].append({"id": cat_id, "name": label})
x, y = bbox["x"], bbox["y"]
bw, bh = bbox["width"], bbox["height"]
coco["annotations"].append({
"id": ann_id, "image_id": img_idx,
"category_id": category_map[label],
"bbox": [x, y, bw, bh],
"area": bw * bh, "iscrowd": 0
})
ann_id += 1
with open(output_file, "w") as f:
json.dump(coco, f, indent=2)
# Usage
convert_to_coco("annotation_output/", "images/", "coco_annotations.json", "objects")YOLO Format
One text file per image for YOLO training:
# class_id center_x center_y width height (normalized 0-1)
0 0.5 0.5 0.3 0.4
1 0.2 0.3 0.1 0.2
Example conversion script that writes YOLO-format label files from Potato annotations:
import json
from pathlib import Path
from PIL import Image
def convert_to_yolo(annotations_dir, images_dir, output_dir, scheme_name="objects",
class_names=None):
"""Convert Potato bounding box annotations to YOLO format."""
Path(output_dir).mkdir(parents=True, exist_ok=True)
class_names = class_names or []
for file in sorted(Path(annotations_dir).glob("*.json")):
with open(file) as f:
data = json.load(f)
for item_id, item_data in data.items():
filename = item_data.get("filename", f"{item_id}.jpg")
img_path = Path(images_dir) / filename
if img_path.exists():
img = Image.open(img_path)
img_w, img_h = img.size
else:
img_w = item_data.get("width", 1)
img_h = item_data.get("height", 1)
bboxes = item_data.get("annotations", {}).get(scheme_name, [])
label_file = Path(output_dir) / (Path(filename).stem + ".txt")
with open(label_file, "w") as out:
for bbox in bboxes:
label = bbox["label"]
class_id = class_names.index(label) if label in class_names else 0
cx = (bbox["x"] + bbox["width"] / 2) / img_w
cy = (bbox["y"] + bbox["height"] / 2) / img_h
nw = bbox["width"] / img_w
nh = bbox["height"] / img_h
out.write(f"{class_id} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}\n")
# Usage
convert_to_yolo(
"annotation_output/", "images/", "yolo_labels/",
"objects", class_names=["person", "car", "dog"]
)Pascal VOC Format
XML format used by many detection frameworks:
<annotation>
<filename>image_001.jpg</filename>
<size>
<width>640</width>
<height>480</height>
</size>
<object>
<name>person</name>
<bndbox>
<xmin>100</xmin>
<ymin>150</ymin>
<xmax>300</xmax>
<ymax>450</ymax>
</bndbox>
</object>
</annotation>Example conversion script that writes Pascal VOC XML files from Potato annotations:
import json
import xml.etree.ElementTree as ET
from pathlib import Path
from PIL import Image
def convert_to_voc(annotations_dir, images_dir, output_dir, scheme_name="objects"):
"""Convert Potato bounding box annotations to Pascal VOC XML format."""
Path(output_dir).mkdir(parents=True, exist_ok=True)
for file in sorted(Path(annotations_dir).glob("*.json")):
with open(file) as f:
data = json.load(f)
for item_id, item_data in data.items():
filename = item_data.get("filename", f"{item_id}.jpg")
img_path = Path(images_dir) / filename
if img_path.exists():
img = Image.open(img_path)
w, h = img.size
else:
w = item_data.get("width", 0)
h = item_data.get("height", 0)
root = ET.Element("annotation")
ET.SubElement(root, "filename").text = filename
size_el = ET.SubElement(root, "size")
ET.SubElement(size_el, "width").text = str(w)
ET.SubElement(size_el, "height").text = str(h)
ET.SubElement(size_el, "depth").text = "3"
bboxes = item_data.get("annotations", {}).get(scheme_name, [])
for bbox in bboxes:
obj = ET.SubElement(root, "object")
ET.SubElement(obj, "name").text = bbox["label"]
bndbox = ET.SubElement(obj, "bndbox")
ET.SubElement(bndbox, "xmin").text = str(int(bbox["x"]))
ET.SubElement(bndbox, "ymin").text = str(int(bbox["y"]))
ET.SubElement(bndbox, "xmax").text = str(int(bbox["x"] + bbox["width"]))
ET.SubElement(bndbox, "ymax").text = str(int(bbox["y"] + bbox["height"]))
tree = ET.ElementTree(root)
xml_file = Path(output_dir) / (Path(filename).stem + ".xml")
tree.write(xml_file, encoding="unicode", xml_declaration=True)
# Usage
convert_to_voc("annotation_output/", "images/", "voc_annotations/", "objects")Custom Export Scripts
Basic Export Script
import json
import os
from pathlib import Path
def export_annotations(input_dir, output_file, format="json"):
"""Combine all annotator files into a single export."""
all_annotations = []
for file in Path(input_dir).glob("*.json"):
with open(file) as f:
data = json.load(f)
all_annotations.extend(data)
# Deduplicate by ID (keep latest)
by_id = {}
for ann in all_annotations:
by_id[ann["id"]] = ann
with open(output_file, "w") as f:
json.dump(list(by_id.values()), f, indent=2)
# Usage
export_annotations("output/", "combined_annotations.json")Aggregating Multiple Annotators
from collections import Counter
def aggregate_labels(annotations_dir, scheme_name):
"""Majority vote aggregation for classification tasks."""
from pathlib import Path
import json
# Collect all labels per item
item_labels = {}
for file in Path(annotations_dir).glob("*.json"):
with open(file) as f:
for ann in json.load(f):
item_id = ann["id"]
label = ann["annotations"].get(scheme_name)
if item_id not in item_labels:
item_labels[item_id] = []
item_labels[item_id].append(label)
# Majority vote
aggregated = {}
for item_id, labels in item_labels.items():
counter = Counter(labels)
aggregated[item_id] = counter.most_common(1)[0][0]
return aggregatedComputing Inter-Annotator Agreement
from sklearn.metrics import cohen_kappa_score
import numpy as np
def compute_agreement(annotations_dir, scheme_name):
"""Compute Cohen's Kappa for overlapping annotations."""
# Load annotations from two annotators
ann1 = load_annotations(f"{annotations_dir}/user_1.json")
ann2 = load_annotations(f"{annotations_dir}/user_2.json")
# Find overlapping items
common_ids = set(ann1.keys()) & set(ann2.keys())
labels1 = [ann1[id][scheme_name] for id in common_ids]
labels2 = [ann2[id][scheme_name] for id in common_ids]
kappa = cohen_kappa_score(labels1, labels2)
return kappaBest Practices
1. Export Regularly
Set up automated exports for backup and analysis:
# Add to your workflow
import schedule
def daily_export():
export_annotations("output/", f"exports/annotations_{date.today()}.json")
schedule.every().day.at("18:00").do(daily_export)2. Include Metadata
Preserve context in exports:
export_data = {
"metadata": {
"task_name": "Sentiment Analysis",
"exported_at": datetime.now().isoformat(),
"total_annotations": len(annotations),
"annotators": list(set(a["annotator"] for a in annotations))
},
"annotations": annotations
}3. Validate Exports
Check export integrity:
def validate_export(export_file, original_count):
with open(export_file) as f:
exported = json.load(f)
assert len(exported) == original_count, "Missing annotations"
assert all("id" in a for a in exported), "Missing IDs"
print(f"Export validated: {len(exported)} annotations")4. Version Your Exports
Use timestamps or version numbers:
exports/
annotations_v1_2024-01-15.json
annotations_v2_2024-01-20.json
annotations_final_2024-01-25.json
Integration Examples
Training a HuggingFace Model
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer
# Load exported data
with open("aggregated_annotations.json") as f:
data = json.load(f)
# Create dataset
dataset = Dataset.from_list([
{"text": item["text"], "label": item["sentiment"]}
for item in data
])
# Train model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
# ... continue with trainingTraining spaCy NER
import spacy
from spacy.tokens import DocBin
# Load exported spans
with open("ner_annotations.json") as f:
data = json.load(f)
nlp = spacy.blank("en")
doc_bin = DocBin()
for item in data:
doc = nlp.make_doc(item["text"])
ents = []
for span in item["entities"]:
ent = doc.char_span(span["start"], span["end"], label=span["label"])
if ent:
ents.append(ent)
doc.ents = ents
doc_bin.add(doc)
doc_bin.to_disk("./train.spacy")YOLO Training
# After exporting to YOLO format
yolo train data=dataset.yaml model=yolov8n.pt epochs=100dataset.yaml:
train: ./images/train
val: ./images/val
nc: 3
names: ['person', 'car', 'dog']