Export Formats
Export annotations to various formats for ML frameworks and analysis tools.
Potato offers two export levels:
- Native export – annotations are saved automatically in the configured JSON/JSONL/CSV/TSV format
- Export CLI (new in v2.2.0) – python -m potato.export converts annotations into specialized formats (COCO, YOLO, Pascal VOC, CoNLL-2003, CoNLL-U, mask PNGs)
This page covers the built-in formats, the export CLI, and example conversion scripts for common target formats.
Basic Export Formats
JSON
The default output format. Each annotator's work is saved as a JSON file:
{
"id": "doc_001",
"annotations": {
"sentiment": "positive",
"confidence": 4
},
"annotator": "user_1",
"timestamp": "2024-01-15T10:30:00Z"
}
Configure it in YAML:
output_annotation_format: "json"
output_annotation_dir: "output/"
JSON Lines (JSONL)
One annotation per line, ideal for streaming and large datasets:
{"id": "doc_001", "annotations": {"sentiment": "positive"}, "annotator": "user_1"}
{"id": "doc_002", "annotations": {"sentiment": "negative"}, "annotator": "user_1"}output_annotation_format: "jsonl"CSV
Tabellarisches Format für Tabellenkalkulationsanalysen:
id,annotator,sentiment,confidence,timestamp
doc_001,user_1,positive,4,2024-01-15T10:30:00Z
doc_002,user_1,negative,2,2024-01-15T10:31:00Z
output_annotation_format: "csv"
TSV
Tab-separated values:
output_annotation_format: "tsv"
Export CLI
New in v2.2.0
The export CLI converts Potato annotations into specialized formats with a single command:
# List available export formats
python -m potato.export --list-formats
# Export to COCO format
python -m potato.export --config config.yaml --format coco --output ./export/
# Export to YOLO format
python -m potato.export --config config.yaml --format yolo --output ./export/
# Export with options
python -m potato.export --config config.yaml --format coco --output ./export/ \
  --option split_ratio=0.8 --option include_unlabeled=false
CLI Options
| Option | Description |
|---|---|
| --config, -c | Path to the Potato YAML configuration file |
| --format, -f | Export format (coco, yolo, pascal_voc, conll_2003, conll_u, mask) |
| --output, -o | Output directory (default: ./export_output) |
| --option | Format-specific option as key=value (repeatable) |
| --list-formats | List the available formats and exit |
| --verbose, -v | Enable verbose logging |
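The repeatable --option flag accumulates key=value pairs. As a sketch of how such pairs can be coerced into typed values (parse_option_pairs is a hypothetical helper for illustration, not part of Potato's CLI):

```python
def parse_option_pairs(pairs):
    """Coerce repeated key=value strings into a typed options dict."""
    opts = {}
    for pair in pairs:
        key, _, raw = pair.partition("=")
        low = raw.lower()
        if low in ("true", "false"):
            opts[key] = (low == "true")  # booleans like include_unlabeled=false
        else:
            try:
                # numbers like split_ratio=0.8
                opts[key] = float(raw) if "." in raw else int(raw)
            except ValueError:
                opts[key] = raw  # fall back to the raw string
    return opts
```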
Supported Export Formats
| Format | ID | Suitable for |
|---|---|---|
| COCO | coco | Object detection, instance segmentation |
| YOLO | yolo | YOLO model training |
| Pascal VOC | pascal_voc | Object detection (XML) |
| CoNLL-2003 | conll_2003 | NER, sequence labeling |
| CoNLL-U | conll_u | POS tagging, dependency parsing |
| Segmentation masks | mask | Semantic/instance segmentation |
Format Compatibility Matrix
| Annotation type | COCO | YOLO | Pascal VOC | CoNLL-2003 | CoNLL-U | Mask |
|---|---|---|---|---|---|---|
| Bounding boxes | Yes | Yes | Yes | – | – | – |
| Polygons | Yes | – | – | – | – | Yes |
| Keypoints | Yes | – | – | – | – | – |
| Text spans | – | – | – | Yes | Yes | – |
| Classifications | Partial | – | – | – | – | – |
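The matrix can be mirrored as a plain lookup table, e.g. to fail fast before invoking an exporter. A minimal sketch (the table and helper are illustrative, not part of Potato's API):

```python
# Mirrors the compatibility matrix above; illustrative only
COMPATIBLE_FORMATS = {
    "bounding_boxes": {"coco", "yolo", "pascal_voc"},
    "polygons": {"coco", "mask"},
    "keypoints": {"coco"},
    "text_spans": {"conll_2003", "conll_u"},
    "classifications": {"coco"},  # partial support only
}

def formats_for(annotation_type):
    """Return the export format IDs that support a given annotation type."""
    return sorted(COMPATIBLE_FORMATS.get(annotation_type, set()))
```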
Programmatic Export
Use the export registry directly from Python:
from potato.export.registry import export_registry
from potato.export.cli import build_export_context
context = build_export_context("path/to/config.yaml")
result = export_registry.export("coco", context, "./output/")
if result.success:
print(f"Exported {len(result.files_written)} files")Benutzerdefinierte Exporter
Benutzerdefinierte Exporter durch Unterklassenbildung von BaseExporter erstellen:
from potato.export.base import BaseExporter, ExportContext, ExportResult
class MyExporter(BaseExporter):
format_name = "my_format"
description = "My custom export format"
file_extensions = [".myformat"]
def can_export(self, context: ExportContext) -> tuple:
has_spans = any(ann.get("spans") for ann in context.annotations)
if not has_spans:
return False, "No span annotations found"
return True, None
def export(self, context: ExportContext, output_path: str,
options: dict = None) -> ExportResult:
# Perform the export
return ExportResult(
success=True,
format_name=self.format_name,
files_written=["output.myformat"],
stats={"annotations": len(context.annotations)}
)
from potato.export.registry import export_registry
export_registry.register(MyExporter())
NLP Export Formats
CoNLL Format
The standard format for sequence labeling tasks (NER, POS tagging):
The O
quick B-ADJ
brown I-ADJ
fox B-NOUN
jumps B-VERB
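For round-tripping, a file in this format can be read back into sentences of (token, tag) pairs. A minimal sketch (parse_conll is a hypothetical helper, not part of Potato):

```python
def parse_conll(lines):
    """Parse CoNLL-style lines into sentences of (token, tag) pairs.

    Sentences are separated by blank lines; each line holds a token and a tag.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # Blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split()[:2]
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences
```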
An example conversion script that reads Potato span annotations and writes CoNLL output:
import json
from pathlib import Path
def convert_to_conll(annotations_dir, output_file, scheme_name="entities"):
"""Convert Potato span annotations to CoNLL format."""
with open(output_file, "w") as out:
for file in sorted(Path(annotations_dir).glob("*.json")):
with open(file) as f:
data = json.load(f)
for item_id, item_data in data.items():
text = item_data.get("text", "")
tokens = text.split()
labels = ["O"] * len(tokens)
spans = item_data.get("annotations", {}).get(scheme_name, [])
for span in spans:
start_tok = span.get("start")
end_tok = span.get("end")
label = span.get("label", "ENT")
for i in range(start_tok, min(end_tok, len(tokens))):
prefix = "B" if i == start_tok else "I"
labels[i] = f"{prefix}-{label}"
for token, label in zip(tokens, labels):
out.write(f"{token}\t{label}\n")
out.write("\n")
# Usage
convert_to_conll("annotation_output/", "annotations.conll", "entities")IOB2-Format
Inside-Outside-Beginning-Tagging für die Entitätserkennung:
John B-PER
Smith I-PER
works O
at O
Google B-ORG
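An IOB2 tag sequence like the one above can be collapsed back into entity spans. A small sketch (iob2_to_spans is a hypothetical helper, not part of Potato):

```python
def iob2_to_spans(tags):
    """Collapse an IOB2 tag sequence into (start, end, label) token spans."""
    spans = []
    start = label = None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and label != tag[2:]):
            # New entity begins (also repairs a dangling I- tag)
            if label is not None:
                spans.append((start, i, label))
            start, label = i, tag[2:]
        elif tag == "O":
            # Outside tag closes any open entity
            if label is not None:
                spans.append((start, i, label))
            start = label = None
    if label is not None:
        spans.append((start, len(tags), label))
    return spans
```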
spaCy Format
An example conversion script that reads Potato output and builds a spaCy DocBin for NER training:
import json
import spacy
from spacy.tokens import DocBin
from pathlib import Path
def convert_to_spacy(annotations_dir, output_file, scheme_name="entities"):
"""Convert Potato span annotations to spaCy DocBin format."""
nlp = spacy.blank("en")
doc_bin = DocBin()
for file in sorted(Path(annotations_dir).glob("*.json")):
with open(file) as f:
data = json.load(f)
for item_id, item_data in data.items():
text = item_data.get("text", "")
doc = nlp.make_doc(text)
ents = []
spans = item_data.get("annotations", {}).get(scheme_name, [])
for span in spans:
char_span = doc.char_span(
span["start_offset"], span["end_offset"],
label=span["label"]
)
if char_span is not None:
ents.append(char_span)
doc.ents = ents
doc_bin.add(doc)
doc_bin.to_disk(output_file)
# Usage
convert_to_spacy("annotation_output/", "train.spacy", "entities")Die Ausgabe kann direkt mit spacy train genutzt werden:
python -m spacy train config.cfg --paths.train ./annotations.spacyHuggingFace Datasets
An example conversion script using the datasets library to convert Potato output into a HuggingFace Dataset:
import json
from pathlib import Path
from datasets import Dataset, DatasetDict
def convert_to_huggingface(annotations_dir, output_dir, scheme_names):
"""Convert Potato annotations to a HuggingFace Dataset."""
records = []
for file in sorted(Path(annotations_dir).glob("*.json")):
with open(file) as f:
data = json.load(f)
for item_id, item_data in data.items():
record = {"id": item_id, "text": item_data.get("text", "")}
annotations = item_data.get("annotations", {})
for scheme in scheme_names:
record[scheme] = annotations.get(scheme)
records.append(record)
dataset = Dataset.from_list(records)
dataset.save_to_disk(output_dir)
print(f"Saved {len(records)} examples to {output_dir}")
# Usage
convert_to_huggingface("annotation_output/", "hf_dataset/", ["sentiment", "entities"])
Load it in your training script:
from datasets import load_from_disk
dataset = load_from_disk("hf_dataset/")
Computer Vision Export Formats
COCO Format
The standard format for object detection and segmentation:
{
"images": [
{"id": 1, "file_name": "image_001.jpg", "width": 640, "height": 480}
],
"annotations": [
{
"id": 1,
"image_id": 1,
"category_id": 1,
"bbox": [100, 150, 200, 300],
"area": 60000,
"segmentation": [[100, 150, 300, 150, 300, 450, 100, 450]]
}
],
"categories": [
{"id": 1, "name": "person"}
]
}
An example conversion script that reads Potato bounding box annotations and writes COCO JSON:
import json
from pathlib import Path
from PIL import Image
def convert_to_coco(annotations_dir, images_dir, output_file, scheme_name="objects"):
"""Convert Potato bounding box annotations to COCO format."""
coco = {"images": [], "annotations": [], "categories": []}
category_map = {}
ann_id = 1
for file in sorted(Path(annotations_dir).glob("*.json")):
with open(file) as f:
data = json.load(f)
for img_idx, (item_id, item_data) in enumerate(data.items(), start=1):
# Get image dimensions
img_path = Path(images_dir) / item_data.get("filename", f"{item_id}.jpg")
if img_path.exists():
img = Image.open(img_path)
w, h = img.size
else:
w, h = item_data.get("width", 0), item_data.get("height", 0)
coco["images"].append({
"id": img_idx,
"file_name": img_path.name,
"width": w, "height": h
})
bboxes = item_data.get("annotations", {}).get(scheme_name, [])
for bbox in bboxes:
label = bbox["label"]
if label not in category_map:
cat_id = len(category_map) + 1
category_map[label] = cat_id
coco["categories"].append({"id": cat_id, "name": label})
x, y = bbox["x"], bbox["y"]
bw, bh = bbox["width"], bbox["height"]
coco["annotations"].append({
"id": ann_id, "image_id": img_idx,
"category_id": category_map[label],
"bbox": [x, y, bw, bh],
"area": bw * bh, "iscrowd": 0
})
ann_id += 1
with open(output_file, "w") as f:
json.dump(coco, f, indent=2)
# Usage
convert_to_coco("annotation_output/", "images/", "coco_annotations.json", "objects")
YOLO Format
One text file per image for YOLO training:
# class_id center_x center_y width height (normalized 0-1)
0 0.5 0.5 0.3 0.4
1 0.2 0.3 0.1 0.2
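Going the other way, normalized YOLO coordinates convert back to pixel-space corner coordinates. A small sketch (yolo_to_pixel_box is a hypothetical helper, not part of Potato):

```python
def yolo_to_pixel_box(cx, cy, w, h, img_w, img_h):
    """Convert a normalized YOLO box (center x/y, width, height)
    to pixel (xmin, ymin, xmax, ymax)."""
    xmin = (cx - w / 2) * img_w
    ymin = (cy - h / 2) * img_h
    xmax = (cx + w / 2) * img_w
    ymax = (cy + h / 2) * img_h
    return xmin, ymin, xmax, ymax
```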
An example conversion script that writes YOLO-format label files from Potato annotations:
import json
from pathlib import Path
from PIL import Image
def convert_to_yolo(annotations_dir, images_dir, output_dir, scheme_name="objects",
class_names=None):
"""Convert Potato bounding box annotations to YOLO format."""
Path(output_dir).mkdir(parents=True, exist_ok=True)
class_names = class_names or []
for file in sorted(Path(annotations_dir).glob("*.json")):
with open(file) as f:
data = json.load(f)
for item_id, item_data in data.items():
filename = item_data.get("filename", f"{item_id}.jpg")
img_path = Path(images_dir) / filename
if img_path.exists():
img = Image.open(img_path)
img_w, img_h = img.size
else:
img_w = item_data.get("width", 1)
img_h = item_data.get("height", 1)
bboxes = item_data.get("annotations", {}).get(scheme_name, [])
label_file = Path(output_dir) / (Path(filename).stem + ".txt")
with open(label_file, "w") as out:
for bbox in bboxes:
label = bbox["label"]
class_id = class_names.index(label) if label in class_names else 0
cx = (bbox["x"] + bbox["width"] / 2) / img_w
cy = (bbox["y"] + bbox["height"] / 2) / img_h
nw = bbox["width"] / img_w
nh = bbox["height"] / img_h
out.write(f"{class_id} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}\n")
# Usage
convert_to_yolo(
"annotation_output/", "images/", "yolo_labels/",
"objects", class_names=["person", "car", "dog"]
)
Pascal VOC Format
XML format used by many detection frameworks:
<annotation>
<filename>image_001.jpg</filename>
<size>
<width>640</width>
<height>480</height>
</size>
<object>
<name>person</name>
<bndbox>
<xmin>100</xmin>
<ymin>150</ymin>
<xmax>300</xmax>
<ymax>450</ymax>
</bndbox>
</object>
</annotation>
An example conversion script that writes Pascal VOC XML files from Potato annotations:
import json
import xml.etree.ElementTree as ET
from pathlib import Path
from PIL import Image
def convert_to_voc(annotations_dir, images_dir, output_dir, scheme_name="objects"):
"""Convert Potato bounding box annotations to Pascal VOC XML format."""
Path(output_dir).mkdir(parents=True, exist_ok=True)
for file in sorted(Path(annotations_dir).glob("*.json")):
with open(file) as f:
data = json.load(f)
for item_id, item_data in data.items():
filename = item_data.get("filename", f"{item_id}.jpg")
img_path = Path(images_dir) / filename
if img_path.exists():
img = Image.open(img_path)
w, h = img.size
else:
w = item_data.get("width", 0)
h = item_data.get("height", 0)
root = ET.Element("annotation")
ET.SubElement(root, "filename").text = filename
size_el = ET.SubElement(root, "size")
ET.SubElement(size_el, "width").text = str(w)
ET.SubElement(size_el, "height").text = str(h)
ET.SubElement(size_el, "depth").text = "3"
bboxes = item_data.get("annotations", {}).get(scheme_name, [])
for bbox in bboxes:
obj = ET.SubElement(root, "object")
ET.SubElement(obj, "name").text = bbox["label"]
bndbox = ET.SubElement(obj, "bndbox")
ET.SubElement(bndbox, "xmin").text = str(int(bbox["x"]))
ET.SubElement(bndbox, "ymin").text = str(int(bbox["y"]))
ET.SubElement(bndbox, "xmax").text = str(int(bbox["x"] + bbox["width"]))
ET.SubElement(bndbox, "ymax").text = str(int(bbox["y"] + bbox["height"]))
tree = ET.ElementTree(root)
xml_file = Path(output_dir) / (Path(filename).stem + ".xml")
tree.write(xml_file, encoding="unicode", xml_declaration=True)
# Usage
convert_to_voc("annotation_output/", "images/", "voc_annotations/", "objects")
Custom Export Scripts
Basic Export Script
import json
from pathlib import Path
def export_annotations(input_dir, output_file, format="json"):
    """Combine all annotator files into a single export."""
    all_annotations = []
    # Sorted so that later files win during deduplication below
    for file in sorted(Path(input_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
        all_annotations.extend(data)
# Deduplicate by ID (keep latest)
by_id = {}
for ann in all_annotations:
by_id[ann["id"]] = ann
with open(output_file, "w") as f:
json.dump(list(by_id.values()), f, indent=2)
# Usage
export_annotations("output/", "combined_annotations.json")Mehrere Annotatoren zusammenführen
from collections import Counter
def aggregate_labels(annotations_dir, scheme_name):
"""Majority vote aggregation for classification tasks."""
from pathlib import Path
import json
# Collect all labels per item
item_labels = {}
for file in Path(annotations_dir).glob("*.json"):
with open(file) as f:
for ann in json.load(f):
item_id = ann["id"]
label = ann["annotations"].get(scheme_name)
if item_id not in item_labels:
item_labels[item_id] = []
item_labels[item_id].append(label)
# Majority vote
aggregated = {}
for item_id, labels in item_labels.items():
counter = Counter(labels)
aggregated[item_id] = counter.most_common(1)[0][0]
    return aggregated
Computing Inter-Annotator Agreement
import json
from sklearn.metrics import cohen_kappa_score
def load_annotations(path):
    """Load one annotator's file as {item_id: annotations}; adjust to your output format."""
    with open(path) as f:
        return {ann["id"]: ann["annotations"] for ann in json.load(f)}
def compute_agreement(annotations_dir, scheme_name):
    """Compute Cohen's kappa for overlapping annotations."""
    # Load annotations from two annotators
    ann1 = load_annotations(f"{annotations_dir}/user_1.json")
    ann2 = load_annotations(f"{annotations_dir}/user_2.json")
    # Find overlapping items (sorted for a stable pairing)
    common_ids = sorted(set(ann1) & set(ann2))
    labels1 = [ann1[item_id][scheme_name] for item_id in common_ids]
    labels2 = [ann2[item_id][scheme_name] for item_id in common_ids]
    kappa = cohen_kappa_score(labels1, labels2)
    return kappa
Best Practices
1. Export Regularly
Set up automated exports for backup and analysis:
# Add to your workflow (requires the schedule package)
import schedule
from datetime import date
def daily_export():
    export_annotations("output/", f"exports/annotations_{date.today()}.json")
schedule.every().day.at("18:00").do(daily_export)
2. Include Metadata
Preserve context in your exports:
from datetime import datetime
export_data = {
"metadata": {
"task_name": "Sentiment Analysis",
"exported_at": datetime.now().isoformat(),
"total_annotations": len(annotations),
"annotators": list(set(a["annotator"] for a in annotations))
},
"annotations": annotations
}
3. Validate Exports
Check export integrity:
import json
def validate_export(export_file, original_count):
with open(export_file) as f:
exported = json.load(f)
assert len(exported) == original_count, "Missing annotations"
assert all("id" in a for a in exported), "Missing IDs"
print(f"Export validated: {len(exported)} annotations")4. Exporte versionieren
Zeitstempel oder Versionsnummern verwenden:
exports/
annotations_v1_2024-01-15.json
annotations_v2_2024-01-20.json
annotations_final_2024-01-25.json
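A naming scheme like this can be generated rather than typed by hand. A minimal sketch (versioned_export_name is a hypothetical helper, not part of Potato):

```python
from datetime import date

def versioned_export_name(name, version, when=None):
    """Build a filename like annotations_v1_2024-01-15.json."""
    when = when or date.today()
    return f"{name}_v{version}_{when.isoformat()}.json"
```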
Integration Examples
Training a HuggingFace Model
import json
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer
# Load exported data
with open("aggregated_annotations.json") as f:
data = json.load(f)
# Create dataset
dataset = Dataset.from_list([
{"text": item["text"], "label": item["sentiment"]}
for item in data
])
# Train model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
# ... continue with training
Training spaCy NER
import json
import spacy
from spacy.tokens import DocBin
# Load exported spans
with open("ner_annotations.json") as f:
data = json.load(f)
nlp = spacy.blank("en")
doc_bin = DocBin()
for item in data:
doc = nlp.make_doc(item["text"])
ents = []
for span in item["entities"]:
ent = doc.char_span(span["start"], span["end"], label=span["label"])
if ent:
ents.append(ent)
doc.ents = ents
doc_bin.add(doc)
doc_bin.to_disk("./train.spacy")YOLO-Training
# After exporting to YOLO format
yolo train data=dataset.yaml model=yolov8n.pt epochs=100
dataset.yaml:
train: ./images/train
val: ./images/val
nc: 3
names: ['person', 'car', 'dog']
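When the class list already lives in your conversion script, the dataset.yaml can be rendered from it instead of written by hand. A minimal sketch (render_dataset_yaml is a hypothetical helper, not part of Potato):

```python
def render_dataset_yaml(train_dir, val_dir, names):
    """Render a minimal YOLO dataset.yaml from a class-name list."""
    return "\n".join([
        f"train: {train_dir}",
        f"val: {val_dir}",
        f"nc: {len(names)}",
        f"names: {names!r}",
    ]) + "\n"
```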