Export Formats
Export annotations to a range of formats for machine learning frameworks and analysis tools.
Potato provides two levels of export:
- Native export - Annotations are saved automatically in JSON, JSONL, CSV, or TSV format, depending on the configuration
- Export CLI (new in v2.2.0) - python -m potato.export converts annotations to specialized formats (COCO, YOLO, Pascal VOC, CoNLL-2003, CoNLL-U, Mask PNG)
This page covers both the built-in formats and the export CLI, along with example conversion scripts for common targets.
Basic Export Formats
JSON
The default output format. Each annotator's work is saved as a JSON file:
{
  "id": "doc_001",
  "annotations": {
    "sentiment": "positive",
    "confidence": 4
  },
  "annotator": "user_1",
  "timestamp": "2024-01-15T10:30:00Z"
}
Configure in YAML:
output_annotation_format: "json"
output_annotation_dir: "output/"
JSON Lines (JSONL)
One annotation per line, well suited to streaming and large datasets:
{"id": "doc_001", "annotations": {"sentiment": "positive"}, "annotator": "user_1"}
{"id": "doc_002", "annotations": {"sentiment": "negative"}, "annotator": "user_1"}
output_annotation_format: "jsonl"
CSV
Tabular format for analysis in spreadsheets:
id,annotator,sentiment,confidence,timestamp
doc_001,user_1,positive,4,2024-01-15T10:30:00Z
doc_002,user_1,negative,2,2024-01-15T10:31:00Z
output_annotation_format: "csv"
TSV
Tab-separated values:
output_annotation_format: "tsv"
Export CLI
New in v2.2.0
The export CLI converts Potato annotations to specialized formats with a single command:
# List available export formats
python -m potato.export --list-formats
# Export to COCO format
python -m potato.export --config config.yaml --format coco --output ./export/
# Export to YOLO format
python -m potato.export --config config.yaml --format yolo --output ./export/
# Export with options
python -m potato.export --config config.yaml --format coco --output ./export/ \
  --option split_ratio=0.8 --option include_unlabeled=false
CLI Options
| Option | Description |
|---|---|
| --config, -c | Path to the Potato YAML configuration file |
| --format, -f | Export format (coco, yolo, pascal_voc, conll_2003, conll_u, mask) |
| --output, -o | Output directory (default: ./export_output) |
| --option | Format-specific option as key=value (repeatable) |
| --list-formats | List available formats and exit |
| --verbose, -v | Enable verbose logging |
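When exports are triggered from a script rather than typed by hand, it can help to assemble the command as an argument list. The sketch below uses only the flags listed in the table above; the helper name build_export_command is hypothetical and not part of Potato.

```python
import shlex

def build_export_command(config, fmt, output="./export_output",
                         options=None, verbose=False):
    """Return the argument list for one `python -m potato.export` call.

    Hypothetical helper: it only strings together the documented flags.
    """
    cmd = ["python", "-m", "potato.export",
           "--config", config, "--format", fmt, "--output", output]
    # Each format-specific option becomes its own repeatable --option pair.
    for key, value in (options or {}).items():
        cmd += ["--option", f"{key}={value}"]
    if verbose:
        cmd.append("--verbose")
    return cmd

cmd = build_export_command(
    "config.yaml", "coco", "./export/",
    options={"split_ratio": 0.8, "include_unlabeled": "false"},
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The resulting list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting issues in paths.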
Supported Export Formats
| Format | ID | Best For |
|---|---|---|
| COCO | coco | Object detection, instance segmentation |
| YOLO | yolo | Training YOLO models |
| Pascal VOC | pascal_voc | Object detection (XML) |
| CoNLL-2003 | conll_2003 | NER, sequence labeling |
| CoNLL-U | conll_u | POS tagging, dependency parsing |
| Segmentation masks | mask | Semantic/instance segmentation |
Format Compatibility Matrix
| Annotation Type | COCO | YOLO | Pascal VOC | CoNLL-2003 | CoNLL-U | Mask |
|---|---|---|---|---|---|---|
| Bounding boxes | Yes | Yes | Yes | - | - | - |
| Polygons | Yes | - | - | - | - | Yes |
| Keypoints | Yes | - | - | - | - | - |
| Text spans | - | - | - | Yes | Yes | - |
| Classifications | Partial | - | - | - | - | - |
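The matrix can also be consulted from code when a project mixes annotation types. The following is a minimal sketch that transcribes the table by hand; the dictionary keys (bounding_boxes, text_spans, and so on) and the function formats_for are illustrative names, not Potato identifiers.

```python
# Full support per annotation type, transcribed from the matrix above.
COMPATIBILITY = {
    "bounding_boxes": {"coco", "yolo", "pascal_voc"},
    "polygons": {"coco", "mask"},
    "keypoints": {"coco"},
    "text_spans": {"conll_2003", "conll_u"},
    "classifications": set(),
}
# Cells marked "Partial" in the matrix are tracked separately.
PARTIAL = {"classifications": {"coco"}}

def formats_for(annotation_types, allow_partial=False):
    """Return the sorted format IDs that support every given annotation type."""
    candidates = None
    for ann_type in annotation_types:
        supported = set(COMPATIBILITY[ann_type])
        if allow_partial:
            supported |= PARTIAL.get(ann_type, set())
        # Intersect so only formats covering all requested types remain.
        candidates = supported if candidates is None else candidates & supported
    return sorted(candidates or [])

print(formats_for(["bounding_boxes", "polygons"]))
```

An empty result means no single format covers the combination, so the annotation types would need to be exported separately.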
Programmatic Export
Use the export registry directly from Python:
from potato.export.registry import export_registry
from potato.export.cli import build_export_context
context = build_export_context("path/to/config.yaml")
result = export_registry.export("coco", context, "./output/")
if result.success:
    print(f"Exported {len(result.files_written)} files")
Custom Exporters
Create custom exporters by subclassing BaseExporter:
from potato.export.base import BaseExporter, ExportContext, ExportResult
class MyExporter(BaseExporter):
    format_name = "my_format"
    description = "My custom export format"
    file_extensions = [".myformat"]
    def can_export(self, context: ExportContext) -> tuple:
        # Returns (ok, reason); reason is None when the export can proceed
        has_spans = any(ann.get("spans") for ann in context.annotations)
        if not has_spans:
            return False, "No span annotations found"
        return True, None
    def export(self, context: ExportContext, output_path: str,
               options: dict = None) -> ExportResult:
        # Perform the export
        return ExportResult(
            success=True,
            format_name=self.format_name,
            files_written=["output.myformat"],
            stats={"annotations": len(context.annotations)}
        )
from potato.export.registry import export_registry
export_registry.register(MyExporter())
NLP Export Formats
CoNLL Format
Standard format for sequence labeling tasks (NER, POS tagging):
The O
quick B-ADJ
brown I-ADJ
fox B-NOUN
jumps B-VERB
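To sanity-check a file in this layout, it can be read back into sentences of (token, label) pairs. The reader below is a minimal sketch, not a Potato API: it assumes one token per line with the label in the last whitespace-separated column and a blank line between sentences. The name read_conll is hypothetical.

```python
def read_conll(lines):
    """Parse token/label lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        parts = line.split()
        if not parts:
            # Blank line: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        # First column is the token, last column is the label.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences

sample = "The O\nquick B-ADJ\nbrown I-ADJ\nfox B-NOUN\njumps B-VERB\n"
sentences = read_conll(sample.splitlines())
print(sentences[0][:2])
```

Taking the last column as the label keeps the reader usable for files that carry extra columns between the token and the tag.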
Example conversion script that reads Potato span annotations and writes CoNLL format:
import json
from pathlib import Path
def convert_to_conll(annotations_dir, output_file, scheme_name="entities"):
    """Convert Potato span annotations to CoNLL format."""
    with open(output_file, "w", encoding="utf-8") as out:
        for file in sorted(Path(annotations_dir).glob("*.json")):
            with open(file, encoding="utf-8") as f:
                data = json.load(f)
            for item_id, item_data in data.items():
                text = item_data.get("text", "")
                tokens = text.split()
                labels = ["O"] * len(tokens)
                spans = item_data.get("annotations", {}).get(scheme_name, [])
                for span in spans:
                    # "start"/"end" are used here as whitespace-token indices (end exclusive)
                    start_tok = span.get("start")
                    end_tok = span.get("end")
                    if start_tok is None or end_tok is None:
                        continue  # skip spans that lack a boundary
                    label = span.get("label", "ENT")
                    for i in range(start_tok, min(end_tok, len(tokens))):
                        prefix = "B" if i == start_tok else "I"
                        labels[i] = f"{prefix}-{label}"
                for token, label in zip(tokens, labels):
                    out.write(f"{token}\t{label}\n")
                out.write("\n")  # blank line separates items
# Usage
convert_to_conll("annotation_output/", "annotations.conll", "entities")
IOB2 Format
Inside-Outside-Beginning tagging for entity recognition:
John B-PER
Smith I-PER
works O
at O
Google B-ORG
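Going the other way, IOB2 tags can be collapsed back into entity spans, which is useful for checking a converted file. This is a minimal sketch with a hypothetical function name, iob2_to_spans. It is lenient by assumption: an I- tag that does not continue the current entity starts a new span instead of raising an error.

```python
def iob2_to_spans(labels):
    """Collapse IOB2 tags into (start, end, label) token spans, end exclusive."""
    spans, start, current = [], None, None
    for i, tag in enumerate(labels):
        prefix, _, name = tag.partition("-")
        # A span continues only on an I- tag carrying the same label.
        if prefix == "I" and current == name:
            continue
        # Anything else closes the span that was open, if there was one.
        if current is not None:
            spans.append((start, i, current))
            start, current = None, None
        # B- opens a span; a stray I- is treated as an opening tag too.
        if prefix in ("B", "I"):
            start, current = i, name
    if current is not None:
        spans.append((start, len(labels), current))
    return spans

tags = ["B-PER", "I-PER", "O", "O", "B-ORG"]
print(iob2_to_spans(tags))  # [(0, 2, 'PER'), (4, 5, 'ORG')]
```

With the example above, tokens 0 to 1 ("John Smith") form the PER span and token 4 ("Google") forms the ORG span.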
spaCy Format
Example conversion script that reads Potato output and builds a spaCy DocBin for NER training:
import json
import spacy
from spacy.tokens import DocBin
from pathlib import Path
def convert_to_spacy(annotations_dir, output_file, scheme_name="entities"):
    """Convert Potato span annotations to spaCy DocBin format."""
    nlp = spacy.blank("en")
    doc_bin = DocBin()
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
        for item_id, item_data in data.items():
            text = item_data.get("text", "")
            doc = nlp.make_doc(text)
            ents = []
            spans = item_data.get("annotations", {}).get(scheme_name, [])
            for span in spans:
                char_span = doc.char_span(
                    span["start_offset"], span["end_offset"],
                    label=span["label"]
                )
                # char_span returns None when the offsets do not align with token boundaries
                if char_span is not None:
                    ents.append(char_span)
            doc.ents = ents
            doc_bin.add(doc)
    doc_bin.to_disk(output_file)
# Usage
convert_to_spacy("annotation_output/", "train.spacy", "entities")
The output can be used directly with spacy train:
python -m spacy train config.cfg --paths.train ./train.spacy
HuggingFace Datasets
Example conversion script that uses the datasets library to convert Potato output into a HuggingFace Dataset:
import json
from pathlib import Path
from datasets import Dataset
def convert_to_huggingface(annotations_dir, output_dir, scheme_names):
    """Convert Potato annotations to a HuggingFace Dataset."""
    records = []
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
        for item_id, item_data in data.items():
            record = {"id": item_id, "text": item_data.get("text", "")}
            annotations = item_data.get("annotations", {})
            # One column per annotation scheme; missing schemes become None
            for scheme in scheme_names:
                record[scheme] = annotations.get(scheme)
            records.append(record)
    dataset = Dataset.from_list(records)
    dataset.save_to_disk(output_dir)
    print(f"Saved {len(records)} examples to {output_dir}")
# Usage
convert_to_huggingface("annotation_output/", "hf_dataset/", ["sentiment", "entities"])
Load it in your training script:
from datasets import load_from_disk
dataset = load_from_disk("hf_dataset/")
Computer Vision Export Formats
COCO Format
Standard format for object detection and segmentation:
{
  "images": [
    {"id": 1, "file_name": "image_001.jpg", "width": 640, "height": 480}
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [100, 150, 200, 300],
      "area": 60000,
      "segmentation": [[100, 150, 300, 150, 300, 450, 100, 450]]
    }
  ],
  "categories": [
    {"id": 1, "name": "person"}
  ]
}
Example conversion script that reads Potato bounding box annotations and writes COCO-format JSON:
import json
from pathlib import Path
from PIL import Image
def convert_to_coco(annotations_dir, images_dir, output_file, scheme_name="objects"):
    """Convert Potato bounding box annotations to COCO format."""
    coco = {"images": [], "annotations": [], "categories": []}
    category_map = {}
    ann_id = 1
    img_id = 0  # running counter so image IDs stay unique across annotation files
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
        for item_id, item_data in data.items():
            img_id += 1
            # Get image dimensions from the file, falling back to stored metadata
            img_path = Path(images_dir) / item_data.get("filename", f"{item_id}.jpg")
            if img_path.exists():
                with Image.open(img_path) as img:
                    w, h = img.size
            else:
                w, h = item_data.get("width", 0), item_data.get("height", 0)
            coco["images"].append({
                "id": img_id,
                "file_name": img_path.name,
                "width": w, "height": h
            })
            bboxes = item_data.get("annotations", {}).get(scheme_name, [])
            for bbox in bboxes:
                label = bbox["label"]
                # Assign category IDs in order of first appearance
                if label not in category_map:
                    cat_id = len(category_map) + 1
                    category_map[label] = cat_id
                    coco["categories"].append({"id": cat_id, "name": label})
                x, y = bbox["x"], bbox["y"]
                bw, bh = bbox["width"], bbox["height"]
                coco["annotations"].append({
                    "id": ann_id, "image_id": img_id,
                    "category_id": category_map[label],
                    "bbox": [x, y, bw, bh],
                    "area": bw * bh, "iscrowd": 0
                })
                ann_id += 1
    with open(output_file, "w") as f:
        json.dump(coco, f, indent=2)
# Usage
convert_to_coco("annotation_output/", "images/", "coco_annotations.json", "objects")
YOLO Format
One text file per image for YOLO training:
# class_id center_x center_y width height (normalized 0-1)
0 0.5 0.5 0.3 0.4
1 0.2 0.3 0.1 0.2
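The normalization behind these lines is simple arithmetic: the box center and size are divided by the image dimensions. The sketch below assumes boxes are stored as a top-left corner plus width and height in pixels, the same x/y/width/height fields the conversion script in this section reads. The function names are illustrative.

```python
def to_yolo_box(x, y, width, height, img_w, img_h):
    """Convert a top-left pixel box to normalized YOLO (cx, cy, w, h)."""
    if img_w <= 0 or img_h <= 0:
        raise ValueError("image dimensions must be positive")
    cx = (x + width / 2) / img_w   # box center, as a fraction of image width
    cy = (y + height / 2) / img_h  # box center, as a fraction of image height
    return cx, cy, width / img_w, height / img_h

def from_yolo_box(cx, cy, w, h, img_w, img_h):
    """Inverse conversion: normalized YOLO box back to top-left pixel box."""
    return (cx - w / 2) * img_w, (cy - h / 2) * img_h, w * img_w, h * img_h

# A 200x300 box at (100, 150) in a 640x480 image.
box = to_yolo_box(100, 150, 200, 300, 640, 480)
print(box)  # (0.3125, 0.625, 0.3125, 0.625)
```

Round-tripping a few boxes through from_yolo_box is a quick way to confirm that the image dimensions used during export were the right ones.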
Example conversion script that writes YOLO-format label files from Potato annotations:
import json
from pathlib import Path
from PIL import Image
def convert_to_yolo(annotations_dir, images_dir, output_dir, scheme_name="objects",
                    class_names=None):
    """Convert Potato bounding box annotations to YOLO format."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    class_names = list(class_names or [])
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
        for item_id, item_data in data.items():
            filename = item_data.get("filename", f"{item_id}.jpg")
            img_path = Path(images_dir) / filename
            if img_path.exists():
                with Image.open(img_path) as img:
                    img_w, img_h = img.size
            else:
                # Without real dimensions the coordinates below are not truly normalized
                img_w = item_data.get("width", 1)
                img_h = item_data.get("height", 1)
            bboxes = item_data.get("annotations", {}).get(scheme_name, [])
            label_file = Path(output_dir) / (Path(filename).stem + ".txt")
            with open(label_file, "w") as out:
                for bbox in bboxes:
                    label = bbox["label"]
                    # Unknown labels get a new class ID instead of silently becoming class 0
                    if label not in class_names:
                        class_names.append(label)
                    class_id = class_names.index(label)
                    cx = (bbox["x"] + bbox["width"] / 2) / img_w
                    cy = (bbox["y"] + bbox["height"] / 2) / img_h
                    nw = bbox["width"] / img_w
                    nh = bbox["height"] / img_h
                    out.write(f"{class_id} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}\n")
    return class_names  # final class order, needed for dataset.yaml
# Usage
convert_to_yolo(
    "annotation_output/", "images/", "yolo_labels/",
    "objects", class_names=["person", "car", "dog"]
)
Pascal VOC Format
XML format used by many detection frameworks:
<annotation>
  <filename>image_001.jpg</filename>
  <size>
    <width>640</width>
    <height>480</height>
  </size>
  <object>
    <name>person</name>
    <bndbox>
      <xmin>100</xmin>
      <ymin>150</ymin>
      <xmax>300</xmax>
      <ymax>450</ymax>
    </bndbox>
  </object>
</annotation>
Example conversion script that writes Pascal VOC XML files from Potato annotations:
import json
import xml.etree.ElementTree as ET
from pathlib import Path
from PIL import Image
def convert_to_voc(annotations_dir, images_dir, output_dir, scheme_name="objects"):
    """Convert Potato bounding box annotations to Pascal VOC XML format."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
        for item_id, item_data in data.items():
            filename = item_data.get("filename", f"{item_id}.jpg")
            img_path = Path(images_dir) / filename
            if img_path.exists():
                with Image.open(img_path) as img:
                    w, h = img.size
            else:
                w = item_data.get("width", 0)
                h = item_data.get("height", 0)
            root = ET.Element("annotation")
            ET.SubElement(root, "filename").text = filename
            size_el = ET.SubElement(root, "size")
            ET.SubElement(size_el, "width").text = str(w)
            ET.SubElement(size_el, "height").text = str(h)
            ET.SubElement(size_el, "depth").text = "3"  # assumes 3-channel (RGB) images
            bboxes = item_data.get("annotations", {}).get(scheme_name, [])
            for bbox in bboxes:
                obj = ET.SubElement(root, "object")
                ET.SubElement(obj, "name").text = bbox["label"]
                bndbox = ET.SubElement(obj, "bndbox")
                # VOC stores corner coordinates rather than x/y/width/height
                ET.SubElement(bndbox, "xmin").text = str(int(bbox["x"]))
                ET.SubElement(bndbox, "ymin").text = str(int(bbox["y"]))
                ET.SubElement(bndbox, "xmax").text = str(int(bbox["x"] + bbox["width"]))
                ET.SubElement(bndbox, "ymax").text = str(int(bbox["y"] + bbox["height"]))
            tree = ET.ElementTree(root)
            xml_file = Path(output_dir) / (Path(filename).stem + ".xml")
            tree.write(xml_file, encoding="unicode", xml_declaration=True)
# Usage
convert_to_voc("annotation_output/", "images/", "voc_annotations/", "objects")
Custom Export Scripts
Basic Export Script
import json
from pathlib import Path
def export_annotations(input_dir, output_file, format="json"):
    """Combine all annotator files into a single export."""
    # Note: the format argument is currently unused; only JSON is written.
    all_annotations = []
    # Sorted so that "latest" below is deterministic; each file is expected to hold a list of records
    for file in sorted(Path(input_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
        all_annotations.extend(data)
    # Deduplicate by ID (keep latest); this keeps one record per item across annotators
    by_id = {}
    for ann in all_annotations:
        by_id[ann["id"]] = ann
    with open(output_file, "w") as f:
        json.dump(list(by_id.values()), f, indent=2)
# Usage
export_annotations("output/", "combined_annotations.json")
Aggregating Multiple Annotators
import json
from collections import Counter
from pathlib import Path
def aggregate_labels(annotations_dir, scheme_name):
    """Majority vote aggregation for classification tasks."""
    # Collect all labels per item
    item_labels = {}
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            for ann in json.load(f):
                item_id = ann["id"]
                label = ann["annotations"].get(scheme_name)
                if label is None:
                    continue  # this annotator did not label the scheme
                if item_id not in item_labels:
                    item_labels[item_id] = []
                item_labels[item_id].append(label)
    # Majority vote; ties go to the label that was seen first
    aggregated = {}
    for item_id, labels in item_labels.items():
        counter = Counter(labels)
        aggregated[item_id] = counter.most_common(1)[0][0]
    return aggregated
Computing Inter-Annotator Agreement
import json
from sklearn.metrics import cohen_kappa_score
def load_annotations(path):
    """Illustrative helper; assumes a JSON list of records with "id" and "annotations"."""
    with open(path) as f:
        return {ann["id"]: ann["annotations"] for ann in json.load(f)}
def compute_agreement(annotations_dir, scheme_name):
    """Compute Cohen's Kappa for overlapping annotations."""
    # Load annotations from two annotators
    ann1 = load_annotations(f"{annotations_dir}/user_1.json")
    ann2 = load_annotations(f"{annotations_dir}/user_2.json")
    # Find overlapping items
    common_ids = set(ann1.keys()) & set(ann2.keys())
    labels1 = [ann1[item_id][scheme_name] for item_id in common_ids]
    labels2 = [ann2[item_id][scheme_name] for item_id in common_ids]
    kappa = cohen_kappa_score(labels1, labels2)
    return kappa
Best Practices
1. Export Regularly
Set up automated exports for backup and analysis:
# Add to your workflow
import time
from datetime import date
import schedule
def daily_export():
    export_annotations("output/", f"exports/annotations_{date.today()}.json")
schedule.every().day.at("18:00").do(daily_export)
while True:  # jobs only fire while the scheduler is polled
    schedule.run_pending()
    time.sleep(60)
2. Include Metadata
Preserve context in exports:
from datetime import datetime
# annotations: the list of annotation records loaded earlier
export_data = {
    "metadata": {
        "task_name": "Sentiment Analysis",
        "exported_at": datetime.now().isoformat(),
        "total_annotations": len(annotations),
        "annotators": list(set(a["annotator"] for a in annotations))
    },
    "annotations": annotations
}
3. Validate Exports
Verify export integrity:
import json
def validate_export(export_file, original_count):
    with open(export_file) as f:
        exported = json.load(f)
    assert len(exported) == original_count, "Missing annotations"
    assert all("id" in a for a in exported), "Missing IDs"
    print(f"Export validated: {len(exported)} annotations")
4. Version Your Exports
Use timestamps or version numbers:
exports/
  annotations_v1_2024-01-15.json
  annotations_v2_2024-01-20.json
  annotations_final_2024-01-25.json
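File names in this style can be generated rather than typed, which keeps the date format consistent. A minimal sketch follows; the helper versioned_export_path and its prefix parameter are hypothetical.

```python
from datetime import date
from pathlib import Path

def versioned_export_path(export_dir, version, day=None, prefix="annotations"):
    """Build a path such as exports/annotations_v1_2024-01-15.json."""
    day = day or date.today()  # default to today's date
    return Path(export_dir) / f"{prefix}_v{version}_{day.isoformat()}.json"

path = versioned_export_path("exports", 1, date(2024, 1, 15))
print(path.name)  # annotations_v1_2024-01-15.json
```

Because ISO dates sort lexicographically, exports with the same version number list in chronological order.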
Integration Examples
Training a HuggingFace Model
import json
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer
# Load exported data
with open("aggregated_annotations.json") as f:
    data = json.load(f)
# Map string labels to the integer class IDs the model expects
label2id = {label: i for i, label in enumerate(sorted({item["sentiment"] for item in data}))}
# Create dataset
dataset = Dataset.from_list([
    {"text": item["text"], "label": label2id[item["sentiment"]]}
    for item in data
])
# Train model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(label2id)
)
# ... continue with training
Training NER with spaCy
import json
import spacy
from spacy.tokens import DocBin
# Load exported spans
with open("ner_annotations.json") as f:
    data = json.load(f)
nlp = spacy.blank("en")
doc_bin = DocBin()
for item in data:
    doc = nlp.make_doc(item["text"])
    ents = []
    for span in item["entities"]:
        ent = doc.char_span(span["start"], span["end"], label=span["label"])
        # Skip spans whose character offsets do not align with token boundaries
        if ent is not None:
            ents.append(ent)
    doc.ents = ents
    doc_bin.add(doc)
doc_bin.to_disk("./train.spacy")
Training with YOLO
# After exporting to YOLO format
yolo train data=dataset.yaml model=yolov8n.pt epochs=100
dataset.yaml:
train: ./images/train
val: ./images/val
nc: 3
names: ['person', 'car', 'dog']
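The class order in dataset.yaml has to match the class IDs written into the label files, so it can be safer to generate the file from the same class list used during conversion. The sketch below produces the four keys shown above as plain text; build_dataset_yaml is a hypothetical helper, and it assumes class names contain no quote characters.

```python
def build_dataset_yaml(class_names, train_dir="./images/train", val_dir="./images/val"):
    """Render a minimal dataset.yaml with train, val, nc, and names keys."""
    # Assumes names contain no single quotes; such names would need YAML escaping.
    names = ", ".join(f"'{name}'" for name in class_names)
    return (
        f"train: {train_dir}\n"
        f"val: {val_dir}\n"
        f"nc: {len(class_names)}\n"
        f"names: [{names}]\n"
    )

yaml_text = build_dataset_yaml(["person", "car", "dog"])
print(yaml_text, end="")
```

Writing yaml_text to dataset.yaml with the list passed as class_names to the YOLO conversion script keeps index 0 as person, 1 as car, and 2 as dog in both places.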