Export Formats

Export Potato annotations to JSON, JSONL, CSV, CoNLL, HuggingFace Datasets, spaCy, COCO, YOLO, Pascal VOC, and Apache Parquet for downstream ML pipelines.

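Potato provides two levels of export:

Native export - annotations are saved automatically in JSON, JSONL, CSV, TSV, or Parquet format, depending on your configuration
Export CLI (new in v2.2.0) - python -m potato.export converts annotations to specialized formats (COCO, YOLO, Pascal VOC, CoNLL-2003, CoNLL-U, Mask PNG, Parquet, EAF, TextGrid)

This page covers both the built-in formats and the export CLI, along with example conversion scripts for common targets.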

Basic Export Formats

JSON

The default output format. Each annotator's work is saved as a JSON file.

json

{
  "id": "doc_001",
  "annotations": {
    "sentiment": "positive",
    "confidence": 4
  },
  "annotator": "user_1",
  "timestamp": "2024-01-15T10:30:00Z"
}

Configure it in YAML:

yaml

export_annotation_format: "json"
output_annotation_dir: "output/"

JSON Lines (JSONL)

One annotation per line, which suits streaming and large datasets.

jsonl

{"id": "doc_001", "annotations": {"sentiment": "positive"}, "annotator": "user_1"}
{"id": "doc_002", "annotations": {"sentiment": "negative"}, "annotator": "user_1"}

yaml

export_annotation_format: "jsonl"
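Because each line is an independent JSON object, a JSONL export can be processed one record at a time. The sketch below assumes the record layout shown in the sample above; the helper name `iter_jsonl` and the scratch file path are illustrative and not part of Potato.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Write the two sample records to a scratch file (illustrative location).
sample = Path(tempfile.mkdtemp()) / "sample_annotations.jsonl"
sample.write_text(
    '{"id": "doc_001", "annotations": {"sentiment": "positive"}, "annotator": "user_1"}\n'
    '{"id": "doc_002", "annotations": {"sentiment": "negative"}, "annotator": "user_1"}\n',
    encoding="utf-8",
)

# Count sentiment labels without loading the whole file into memory.
label_counts = Counter(
    record["annotations"]["sentiment"] for record in iter_jsonl(sample)
)
print(label_counts)
```

Reading through a generator keeps memory use flat regardless of file size, which is the main reason to prefer JSONL for large exports.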

CSV

A tabular format for spreadsheet analysis.

csv

id,annotator,sentiment,confidence,timestamp
doc_001,user_1,positive,4,2024-01-15T10:30:00Z
doc_002,user_1,negative,2,2024-01-15T10:31:00Z

yaml

export_annotation_format: "csv"

TSV

Tab-separated values.

yaml

export_annotation_format: "tsv"

Export CLI

New in v2.2.0

The export CLI converts Potato annotations to specialized formats with a single command.

bash

# List available export formats
python -m potato.export --list-formats
 
# Export to COCO format
python -m potato.export --config config.yaml --format coco --output ./export/
 
# Export to YOLO format
python -m potato.export --config config.yaml --format yolo --output ./export/
 
# Export with options
python -m potato.export --config config.yaml --format coco --output ./export/ \
    --option split_ratio=0.8 --option include_unlabeled=false

CLI Options

Option	Description
`--config`, `-c`	Path to the Potato YAML configuration file
`--format`, `-f`	Export format (coco, yolo, pascal_voc, conll_2003, conll_u, mask, parquet, eaf, textgrid)
`--output`, `-o`	Output directory (default: ./export_output)
`--option`	Format-specific option as key=value (repeatable)
`--list-formats`	List the available formats and exit
`--verbose`, `-v`	Enable verbose logging
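When exports are triggered from a script, the options in the table can be assembled into an argument list for `subprocess.run`. The following is a minimal sketch: `build_export_command` is a hypothetical helper, not part of Potato, and it uses only the flags listed above.

```python
import sys


def build_export_command(config, fmt, output="./export_output",
                         options=None, verbose=False):
    """Build the argument list for `python -m potato.export` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "potato.export",
           "--config", config, "--format", fmt, "--output", output]
    # Each format-specific option becomes its own repeated --option key=value pair.
    for key, value in (options or {}).items():
        cmd += ["--option", f"{key}={value}"]
    if verbose:
        cmd.append("--verbose")
    return cmd


cmd = build_export_command(
    "config.yaml", "coco", "./export/",
    options={"split_ratio": 0.8, "include_unlabeled": "false"},
)
print(" ".join(cmd[1:]))
# To run the export: subprocess.run(cmd, check=True)
```

Passing a list rather than a shell string avoids quoting problems when paths contain spaces.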

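Supported Export Formats

Format	ID	Best for
COCO	`coco`	Object detection, instance segmentation
YOLO	`yolo`	YOLO model training
Pascal VOC	`pascal_voc`	Object detection (XML)
CoNLL-2003	`conll_2003`	NER, sequence labeling
CoNLL-U	`conll_u`	POS tagging, dependency parsing
Segmentation masks	`mask`	Semantic/instance segmentation
Apache Parquet	`parquet`	Columnar storage for large datasets
ELAN EAF	`eaf`	Audio/video annotation timelines
Praat TextGrid	`textgrid`	Phonetic/speech annotation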

Format Compatibility Matrix

Annotation type	COCO	YOLO	Pascal VOC	CoNLL-2003	CoNLL-U	Mask
Bounding box	Yes	Yes	Yes	-	-	-
Polygon	Yes	-	-	-	-	Yes
Keypoint	Yes	-	-	-	-	-
Text span	-	-	-	Yes	Yes	-
Classification	Partial	-	-	-	-	-
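The matrix can be encoded as a small lookup table so that a script can pick a valid `--format` value before running an export. This is only a sketch: the annotation-type keys (`bounding_box`, `text_span`, and so on) are names chosen here for illustration, and the values simply restate the table above.

```python
# Compatibility matrix from the table: annotation type -> {format ID: support level}.
# "partial" marks the one cell listed as partially supported.
COMPATIBILITY = {
    "bounding_box": {"coco": "yes", "yolo": "yes", "pascal_voc": "yes"},
    "polygon": {"coco": "yes", "mask": "yes"},
    "keypoint": {"coco": "yes"},
    "text_span": {"conll_2003": "yes", "conll_u": "yes"},
    "classification": {"coco": "partial"},
}


def formats_for(annotation_type, include_partial=False):
    """Return the sorted format IDs that can export the given annotation type."""
    support = COMPATIBILITY.get(annotation_type, {})
    return sorted(
        fmt for fmt, level in support.items()
        if level == "yes" or (include_partial and level == "partial")
    )


print(formats_for("bounding_box"))
print(formats_for("classification", include_partial=True))
```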

Programmatic Export

Use the export registry directly from Python:

python

from potato.export.registry import export_registry
from potato.export.cli import build_export_context
 
context = build_export_context("path/to/config.yaml")
result = export_registry.export("coco", context, "./output/")
 
if result.success:
    print(f"Exported {len(result.files_written)} files")

Custom Exporters

Create a custom exporter by subclassing BaseExporter:

python

from potato.export.base import BaseExporter, ExportContext, ExportResult
 
class MyExporter(BaseExporter):
    format_name = "my_format"
    description = "My custom export format"
    file_extensions = [".myformat"]
 
    def can_export(self, context: ExportContext) -> tuple:
        has_spans = any(ann.get("spans") for ann in context.annotations)
        if not has_spans:
            return False, "No span annotations found"
        return True, None
 
    def export(self, context: ExportContext, output_path: str,
               options: dict = None) -> ExportResult:
        # Perform the export
        return ExportResult(
            success=True,
            format_name=self.format_name,
            files_written=["output.myformat"],
            stats={"annotations": len(context.annotations)}
        )
 
from potato.export.registry import export_registry
export_registry.register(MyExporter())

NLP Export Formats

CoNLL Format

A standard format for sequence labeling tasks (NER, POS tagging).

text

The     O
quick   B-ADJ
brown   I-ADJ
fox     B-NOUN
jumps   B-VERB

An example conversion script that reads Potato span annotations and writes them in CoNLL format:

python

import json
from pathlib import Path
 
def convert_to_conll(annotations_dir, output_file, scheme_name="entities"):
    """Convert Potato span annotations to CoNLL format."""
    with open(output_file, "w") as out:
        for file in sorted(Path(annotations_dir).glob("*.json")):
            with open(file) as f:
                data = json.load(f)
 
            for item_id, item_data in data.items():
                text = item_data.get("text", "")
                tokens = text.split()
                labels = ["O"] * len(tokens)
 
                spans = item_data.get("annotations", {}).get(scheme_name, [])
                for span in spans:
                    # Assumes "start" and "end" are token indices (end exclusive)
                    start_tok = span["start"]
                    end_tok = span["end"]
                    label = span.get("label", "ENT")
                    for i in range(start_tok, min(end_tok, len(tokens))):
                        prefix = "B" if i == start_tok else "I"
                        labels[i] = f"{prefix}-{label}"
 
                for token, label in zip(tokens, labels):
                    out.write(f"{token}\t{label}\n")
                out.write("\n")
 
# Usage
convert_to_conll("annotation_output/", "annotations.conll", "entities")
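The script above treats `start` and `end` as token indices. If your spans carry character offsets instead (an assumption to check against your own output files), they need to be mapped onto the whitespace tokens first. The helper below is a hypothetical sketch of that mapping; it is not part of Potato.

```python
import re


def char_span_to_token_range(text, start_char, end_char):
    """Map a character span to a half-open range of whitespace-token indices.

    Returns (start_tok, end_tok) covering every token that overlaps the span,
    or None if the span touches no token.
    """
    # Whitespace-delimited tokens, the same segmentation text.split()
    # produces for ordinary text.
    token_spans = [m.span() for m in re.finditer(r"\S+", text)]
    hits = [
        i for i, (tok_start, tok_end) in enumerate(token_spans)
        if tok_start < end_char and tok_end > start_char
    ]
    if not hits:
        return None
    return hits[0], hits[-1] + 1


text = "John Smith works at Google"
print(char_span_to_token_range(text, 0, 10))   # characters of "John Smith"
print(char_span_to_token_range(text, 20, 26))  # characters of "Google"
```

A span that cuts through the middle of a token is widened to the whole token here; depending on the task, you may prefer to reject such spans instead.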

IOB2 Format

Inside-Outside-Beginning tagging for entity recognition.

text

John    B-PER
Smith   I-PER
works   O
at      O
Google  B-ORG

spaCy Format

An example conversion script that reads Potato output and builds a spaCy DocBin for NER training:

python

import json
import spacy
from spacy.tokens import DocBin
from pathlib import Path
 
def convert_to_spacy(annotations_dir, output_file, scheme_name="entities"):
    """Convert Potato span annotations to spaCy DocBin format."""
    nlp = spacy.blank("en")
    doc_bin = DocBin()
 
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
 
        for item_id, item_data in data.items():
            text = item_data.get("text", "")
            doc = nlp.make_doc(text)
            ents = []
 
            spans = item_data.get("annotations", {}).get(scheme_name, [])
            for span in spans:
                char_span = doc.char_span(
                    span["start_offset"], span["end_offset"],
                    label=span["label"]
                )
                if char_span is not None:
                    ents.append(char_span)
            doc.ents = ents
            doc_bin.add(doc)
 
    doc_bin.to_disk(output_file)
 
# Usage
convert_to_spacy("annotation_output/", "train.spacy", "entities")

The output can be used directly with spacy train:

bash

python -m spacy train config.cfg --paths.train ./train.spacy

HuggingFace Datasets

An example conversion script that uses the datasets library to turn Potato output into a HuggingFace Dataset:

python

import json
from pathlib import Path
from datasets import Dataset, DatasetDict
 
def convert_to_huggingface(annotations_dir, output_dir, scheme_names):
    """Convert Potato annotations to a HuggingFace Dataset."""
    records = []
 
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
 
        for item_id, item_data in data.items():
            record = {"id": item_id, "text": item_data.get("text", "")}
            annotations = item_data.get("annotations", {})
            for scheme in scheme_names:
                record[scheme] = annotations.get(scheme)
            records.append(record)
 
    dataset = Dataset.from_list(records)
    dataset.save_to_disk(output_dir)
    print(f"Saved {len(records)} examples to {output_dir}")
 
# Usage
convert_to_huggingface("annotation_output/", "hf_dataset/", ["sentiment", "entities"])

Load it in your training script:

python

from datasets import load_from_disk
 
dataset = load_from_disk("hf_dataset/")

Computer Vision Export Formats

COCO Format

The standard format for object detection and segmentation.

json

{
  "images": [
    {"id": 1, "file_name": "image_001.jpg", "width": 640, "height": 480}
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [100, 150, 200, 300],
      "area": 60000,
      "segmentation": [[100, 150, 300, 150, 300, 450, 100, 450]]
    }
  ],
  "categories": [
    {"id": 1, "name": "person"}
  ]
}
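In the sample, `bbox` is `[x, y, width, height]` with the origin at the top-left corner, `area` is the box area, and `segmentation` lists the polygon vertices as a flat `[x1, y1, x2, y2, ...]` sequence. As a quick consistency check, the sketch below derives the area and the rectangular polygon from the box; the function names are illustrative.

```python
def bbox_to_polygon(bbox):
    """Return the corner polygon of an [x, y, w, h] box as a flat coordinate list."""
    x, y, w, h = bbox
    # Corners in order: top-left, top-right, bottom-right, bottom-left.
    return [x, y, x + w, y, x + w, y + h, x, y + h]


def bbox_area(bbox):
    """Area of an [x, y, w, h] box in square pixels."""
    _, _, w, h = bbox
    return w * h


bbox = [100, 150, 200, 300]
print(bbox_to_polygon(bbox))  # [100, 150, 300, 150, 300, 450, 100, 450]
print(bbox_area(bbox))        # 60000
```

Both results match the `segmentation` and `area` fields of the sample annotation.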

An example conversion script that reads Potato bounding box annotations and writes COCO JSON:

python

import json
from pathlib import Path
from PIL import Image
 
def convert_to_coco(annotations_dir, images_dir, output_file, scheme_name="objects"):
    """Convert Potato bounding box annotations to COCO format."""
    coco = {"images": [], "annotations": [], "categories": []}
    category_map = {}
    ann_id = 1
    img_idx = 0  # running counter so image IDs stay unique across annotator files

    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)

        for item_id, item_data in data.items():
            img_idx += 1
            # Get image dimensions from the file, falling back to stored metadata
            img_path = Path(images_dir) / item_data.get("filename", f"{item_id}.jpg")
            if img_path.exists():
                with Image.open(img_path) as img:
                    w, h = img.size
            else:
                w, h = item_data.get("width", 0), item_data.get("height", 0)
 
            coco["images"].append({
                "id": img_idx,
                "file_name": img_path.name,
                "width": w, "height": h
            })
 
            bboxes = item_data.get("annotations", {}).get(scheme_name, [])
            for bbox in bboxes:
                label = bbox["label"]
                if label not in category_map:
                    cat_id = len(category_map) + 1
                    category_map[label] = cat_id
                    coco["categories"].append({"id": cat_id, "name": label})
 
                x, y = bbox["x"], bbox["y"]
                bw, bh = bbox["width"], bbox["height"]
                coco["annotations"].append({
                    "id": ann_id, "image_id": img_idx,
                    "category_id": category_map[label],
                    "bbox": [x, y, bw, bh],
                    "area": bw * bh, "iscrowd": 0
                })
                ann_id += 1
 
    with open(output_file, "w") as f:
        json.dump(coco, f, indent=2)
 
# Usage
convert_to_coco("annotation_output/", "images/", "coco_annotations.json", "objects")

YOLO Format

Produces one text file per image for YOLO training.

text

# class_id center_x center_y width height (normalized 0-1)
0 0.5 0.5 0.3 0.4
1 0.2 0.3 0.1 0.2
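Each value after the class ID is a fraction of the image width or height, and the first two give the box center rather than its corner. To make the convention concrete, the sketch below converts one label line back to a pixel-space `[x, y, width, height]` box; the 640x480 image size and the function name are assumptions for illustration.

```python
def yolo_to_pixel_bbox(line, img_w, img_h):
    """Convert one YOLO label line to (class_id, x, y, width, height) in pixels.

    x and y are the top-left corner, matching the COCO-style [x, y, w, h] layout.
    """
    class_id, cx, cy, w, h = line.split()
    box_w = float(w) * img_w
    box_h = float(h) * img_h
    x = float(cx) * img_w - box_w / 2
    y = float(cy) * img_h - box_h / 2
    return int(class_id), x, y, box_w, box_h


# First sample line, for a hypothetical 640x480 image.
class_id, x, y, box_w, box_h = yolo_to_pixel_bbox("0 0.5 0.5 0.3 0.4", 640, 480)
print(class_id, round(x), round(y), round(box_w), round(box_h))  # 0 224 144 192 192
```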

An example conversion script that writes YOLO-format label files from Potato annotations:

python

import json
from pathlib import Path
from PIL import Image
 
def convert_to_yolo(annotations_dir, images_dir, output_dir, scheme_name="objects",
                    class_names=None):
    """Convert Potato bounding box annotations to YOLO format."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    class_names = class_names or []
 
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
 
        for item_id, item_data in data.items():
            filename = item_data.get("filename", f"{item_id}.jpg")
            img_path = Path(images_dir) / filename
            if img_path.exists():
                img = Image.open(img_path)
                img_w, img_h = img.size
            else:
                img_w = item_data.get("width", 1)
                img_h = item_data.get("height", 1)
 
            bboxes = item_data.get("annotations", {}).get(scheme_name, [])
            label_file = Path(output_dir) / (Path(filename).stem + ".txt")
 
            with open(label_file, "w") as out:
                for bbox in bboxes:
                    label = bbox["label"]
                    if label not in class_names:
                        # Register unseen labels instead of silently mapping them to class 0
                        class_names.append(label)
                    class_id = class_names.index(label)
                    cx = (bbox["x"] + bbox["width"] / 2) / img_w
                    cy = (bbox["y"] + bbox["height"] / 2) / img_h
                    nw = bbox["width"] / img_w
                    nh = bbox["height"] / img_h
                    out.write(f"{class_id} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}\n")
 
# Usage
convert_to_yolo(
    "annotation_output/", "images/", "yolo_labels/",
    "objects", class_names=["person", "car", "dog"]
)

Pascal VOC Format

An XML format used by many detection frameworks.

xml

<annotation>
  <filename>image_001.jpg</filename>
  <size>
    <width>640</width>
    <height>480</height>
  </size>
  <object>
    <name>person</name>
    <bndbox>
      <xmin>100</xmin>
      <ymin>150</ymin>
      <xmax>300</xmax>
      <ymax>450</ymax>
    </bndbox>
  </object>
</annotation>
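Unlike COCO, Pascal VOC stores each box as two corners (`xmin`, `ymin`, `xmax`, `ymax`) rather than a corner plus width and height. The sketch below reads the sample back with the standard library's `xml.etree.ElementTree`; the `parse_voc` name is illustrative.

```python
import xml.etree.ElementTree as ET

VOC_XML = """<annotation>
  <filename>image_001.jpg</filename>
  <size><width>640</width><height>480</height></size>
  <object>
    <name>person</name>
    <bndbox>
      <xmin>100</xmin><ymin>150</ymin><xmax>300</xmax><ymax>450</ymax>
    </bndbox>
  </object>
</annotation>"""


def parse_voc(xml_text):
    """Return (filename, [(label, xmin, ymin, xmax, ymax), ...]) from VOC XML."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bnd = obj.find("bndbox")
        coords = [int(bnd.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((obj.findtext("name"), *coords))
    return root.findtext("filename"), boxes


filename, boxes = parse_voc(VOC_XML)
print(filename, boxes)
```

The box width is `xmax - xmin` (200 here) and the height is `ymax - ymin` (300), which is the same box as the COCO sample earlier on this page.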

An example conversion script that writes Pascal VOC XML files from Potato annotations:

python

import json
import xml.etree.ElementTree as ET
from pathlib import Path
from PIL import Image
 
def convert_to_voc(annotations_dir, images_dir, output_dir, scheme_name="objects"):
    """Convert Potato bounding box annotations to Pascal VOC XML format."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
 
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
 
        for item_id, item_data in data.items():
            filename = item_data.get("filename", f"{item_id}.jpg")
            img_path = Path(images_dir) / filename
            if img_path.exists():
                img = Image.open(img_path)
                w, h = img.size
            else:
                w = item_data.get("width", 0)
                h = item_data.get("height", 0)
 
            root = ET.Element("annotation")
            ET.SubElement(root, "filename").text = filename
            size_el = ET.SubElement(root, "size")
            ET.SubElement(size_el, "width").text = str(w)
            ET.SubElement(size_el, "height").text = str(h)
            ET.SubElement(size_el, "depth").text = "3"
 
            bboxes = item_data.get("annotations", {}).get(scheme_name, [])
            for bbox in bboxes:
                obj = ET.SubElement(root, "object")
                ET.SubElement(obj, "name").text = bbox["label"]
                bndbox = ET.SubElement(obj, "bndbox")
                ET.SubElement(bndbox, "xmin").text = str(int(bbox["x"]))
                ET.SubElement(bndbox, "ymin").text = str(int(bbox["y"]))
                ET.SubElement(bndbox, "xmax").text = str(int(bbox["x"] + bbox["width"]))
                ET.SubElement(bndbox, "ymax").text = str(int(bbox["y"] + bbox["height"]))
 
            tree = ET.ElementTree(root)
            xml_file = Path(output_dir) / (Path(filename).stem + ".xml")
            tree.write(xml_file, encoding="unicode", xml_declaration=True)
 
# Usage
convert_to_voc("annotation_output/", "images/", "voc_annotations/", "objects")

Custom Export Scripts

Basic Export Script

python

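import json
from pathlib import Path

def export_annotations(input_dir, output_file):
    """Combine all annotator files into a single export."""
    all_annotations = []

    for file in sorted(Path(input_dir).glob("*.json")):
        with open(file) as f:
            # Assumes each file holds a list of annotation records
            all_annotations.extend(json.load(f))

    # Deduplicate per (item, annotator) so other annotators' work is kept.
    # ISO 8601 UTC timestamps sort lexicographically, so the latest record wins.
    by_key = {}
    for ann in all_annotations:
        key = (ann["id"], ann.get("annotator"))
        previous = by_key.get(key)
        if previous is None or ann.get("timestamp", "") >= previous.get("timestamp", ""):
            by_key[key] = ann

    with open(output_file, "w") as f:
        json.dump(list(by_key.values()), f, indent=2)

# Usage
export_annotations("output/", "combined_annotations.json")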

Aggregating Multiple Annotators

python

from collections import Counter
 
def aggregate_labels(annotations_dir, scheme_name):
    """Majority vote aggregation for classification tasks."""
    from pathlib import Path
    import json
 
    # Collect all labels per item
    item_labels = {}
 
    for file in Path(annotations_dir).glob("*.json"):
        with open(file) as f:
            for ann in json.load(f):
                item_id = ann["id"]
                label = ann["annotations"].get(scheme_name)
 
                if item_id not in item_labels:
                    item_labels[item_id] = []
                item_labels[item_id].append(label)
 
    # Majority vote
    aggregated = {}
    for item_id, labels in item_labels.items():
        counter = Counter(labels)
        aggregated[item_id] = counter.most_common(1)[0][0]
 
    return aggregated

Computing Inter-Annotator Agreement

python

import json
from sklearn.metrics import cohen_kappa_score

def load_annotations(path, scheme_name):
    """Hypothetical helper: map item ID to label, assuming a list of annotation records."""
    with open(path) as f:
        return {ann["id"]: ann["annotations"].get(scheme_name) for ann in json.load(f)}

def compute_agreement(annotations_dir, scheme_name):
    """Compute Cohen's Kappa for overlapping annotations."""
    # Load annotations from two annotators
    ann1 = load_annotations(f"{annotations_dir}/user_1.json", scheme_name)
    ann2 = load_annotations(f"{annotations_dir}/user_2.json", scheme_name)

    # Find overlapping items (sorted for a stable order)
    common_ids = sorted(set(ann1) & set(ann2))

    labels1 = [ann1[item_id] for item_id in common_ids]
    labels2 = [ann2[item_id] for item_id in common_ids]

    return cohen_kappa_score(labels1, labels2)
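Cohen's kappa corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the fraction of items both annotators labeled identically and p_e is the chance agreement implied by each annotator's label frequencies. The pure-Python sketch below works through that formula on made-up labels; returning 1.0 when p_e equals 1 is a convention chosen here to avoid a division by zero.

```python
from collections import Counter


def cohen_kappa(labels1, labels2):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels1) != len(labels2) or not labels1:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels1)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels1, labels2)) / n
    # Chance agreement: sum over labels of the product of marginal rates.
    c1, c2 = Counter(labels1), Counter(labels2)
    p_e = sum(c1[label] * c2[label] for label in c1) / (n * n)
    if p_e == 1:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Made-up labels for illustration only.
user_1 = ["pos", "pos", "neg", "neg"]
user_2 = ["pos", "neg", "neg", "neg"]
print(cohen_kappa(user_1, user_2))  # 0.5
```

Here p_o = 3/4 and p_e = (2*1 + 2*3) / 16 = 0.5, so kappa = 0.25 / 0.5 = 0.5.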

Best Practices

1. Export Regularly

Set up automated exports for backup and analysis.

python

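# Add to your workflow
import time
from datetime import date
import schedule  # third-party package

def daily_export():
    export_annotations("output/", f"exports/annotations_{date.today()}.json")

schedule.every().day.at("18:00").do(daily_export)

while True:  # schedule only runs jobs when polled
    schedule.run_pending()
    time.sleep(60)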

2. Include Metadata

Preserve context in your exports.

python

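from datetime import datetime

# `annotations` is assumed to be the list of annotation records being exported
export_data = {
    "metadata": {
        "task_name": "Sentiment Analysis",
        "exported_at": datetime.now().isoformat(),
        "total_annotations": len(annotations),
        "annotators": sorted({a["annotator"] for a in annotations})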
    },
    "annotations": annotations
}

3. Validate Exports

Verify export integrity.

python

import json

def validate_export(export_file, original_count):
    with open(export_file) as f:
        exported = json.load(f)
 
    assert len(exported) == original_count, "Missing annotations"
    assert all("id" in a for a in exported), "Missing IDs"
    print(f"Export validated: {len(exported)} annotations")

4. Version Your Exports

Use timestamps or version numbers.

text

exports/
  annotations_v1_2024-01-15.json
  annotations_v2_2024-01-20.json
  annotations_final_2024-01-25.json
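File names like these are easy to generate consistently from a version number and a date. A minimal sketch, with `versioned_export_path` as a hypothetical helper name:

```python
from datetime import date
from pathlib import Path


def versioned_export_path(export_dir, version, when=None):
    """Build a path such as exports/annotations_v1_2024-01-15.json."""
    when = when or date.today()  # default to today's date
    return Path(export_dir) / f"annotations_v{version}_{when.isoformat()}.json"


print(versioned_export_path("exports", 1, date(2024, 1, 15)))
```

ISO dates sort chronologically as plain strings, so a directory listing stays in export order within each version.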

Integration Examples

Training a HuggingFace Model

python

import json

from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer

# Load exported data
with open("aggregated_annotations.json") as f:
    data = json.load(f)

# Map labels to integer IDs, which the classification head expects
label2id = {label: i for i, label in enumerate(sorted({item["sentiment"] for item in data}))}

# Create dataset
dataset = Dataset.from_list([
    {"text": item["text"], "label": label2id[item["sentiment"]]}
    for item in data
])

# Train model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(label2id)
)
 
# ... continue with training

Training spaCy NER

python

import json
import spacy
from spacy.tokens import DocBin
 
# Load exported spans
with open("ner_annotations.json") as f:
    data = json.load(f)
 
nlp = spacy.blank("en")
doc_bin = DocBin()
 
for item in data:
    doc = nlp.make_doc(item["text"])
    ents = []
    for span in item["entities"]:
        ent = doc.char_span(span["start"], span["end"], label=span["label"])
        if ent is not None:  # char_span returns None for misaligned offsets
            ents.append(ent)
    doc.ents = ents
    doc_bin.add(doc)
 
doc_bin.to_disk("./train.spacy")

YOLO Training

bash

# After exporting to YOLO format
yolo train data=dataset.yaml model=yolov8n.pt epochs=100

dataset.yaml:

yaml

train: ./images/train
val: ./images/val
nc: 3
names: ['person', 'car', 'dog']