Diese Seite ist in Ihrer Sprache noch nicht verfügbar. Englische Version wird angezeigt.

Export Formats

ML frameworks और analysis tools के लिए annotations को विभिन्न formats में export करें।

Export Formats

Potato दो स्तरों का export प्रदान करता है:

Native Export - Annotations कॉन्फ़िगर के अनुसार JSON/JSONL/CSV/TSV format में स्वचालित रूप से सहेजी जाती हैं
Export CLI (v2.2.0 में नया) - python -m potato.export annotations को specialized formats (COCO, YOLO, Pascal VOC, CoNLL-2003, CoNLL-U, Mask PNG) में convert करता है

यह पृष्ठ built-in formats और export CLI दोनों को cover करता है, साथ ही सामान्य targets के लिए उदाहरण conversion scripts भी।

बुनियादी Export Formats

JSON

Default output format। प्रत्येक annotator का काम एक JSON file के रूप में सहेजा जाता है:

json

{
  "id": "doc_001",
  "annotations": {
    "sentiment": "positive",
    "confidence": 4
  },
  "annotator": "user_1",
  "timestamp": "2024-01-15T10:30:00Z"
}

YAML में configure करें:

yaml

output_annotation_format: "json"
output_annotation_dir: "output/"

JSON Lines (JSONL)

प्रति पंक्ति एक annotation, streaming और बड़े datasets के लिए आदर्श:

jsonl

{"id": "doc_001", "annotations": {"sentiment": "positive"}, "annotator": "user_1"}
{"id": "doc_002", "annotations": {"sentiment": "negative"}, "annotator": "user_1"}

yaml

output_annotation_format: "jsonl"

CSV

Spreadsheet विश्लेषण के लिए Tabular format:

csv

id,annotator,sentiment,confidence,timestamp
doc_001,user_1,positive,4,2024-01-15T10:30:00Z
doc_002,user_1,negative,2,2024-01-15T10:31:00Z

yaml

output_annotation_format: "csv"

TSV

Tab-separated values:

yaml

output_annotation_format: "tsv"

Export CLI

v2.2.0 में नया

Export CLI एक single command से Potato annotations को specialized formats में convert करता है:

bash

# List available export formats
python -m potato.export --list-formats
 
# Export to COCO format
python -m potato.export --config config.yaml --format coco --output ./export/
 
# Export to YOLO format
python -m potato.export --config config.yaml --format yolo --output ./export/
 
# Export with options
python -m potato.export --config config.yaml --format coco --output ./export/ \
    --option split_ratio=0.8 --option include_unlabeled=false

CLI Options

Option	विवरण
`--config`, `-c`	Potato YAML config file का path
`--format`, `-f`	Export format (coco, yolo, pascal_voc, conll_2003, conll_u, mask)
`--output`, `-o`	Output directory (default: ./export_output)
`--option`	Format-specific option key=value के रूप में (repeatable)
`--list-formats`	उपलब्ध formats सूचीबद्ध करें और बाहर निकलें
`--verbose`, `-v`	Verbose logging सक्षम करें

समर्थित Export Formats

Format	ID	सर्वश्रेष्ठ उपयोग
COCO	`coco`	Object detection, instance segmentation
YOLO	`yolo`	YOLO model training
Pascal VOC	`pascal_voc`	Object detection (XML)
CoNLL-2003	`conll_2003`	NER, sequence labeling
CoNLL-U	`conll_u`	POS tagging, dependency parsing
Segmentation Masks	`mask`	Semantic/instance segmentation

Format Compatibility Matrix

Annotation Type	COCO	YOLO	Pascal VOC	CoNLL-2003	CoNLL-U	Mask
Bounding boxes	हाँ	हाँ	हाँ	-	-	-
Polygons	हाँ	-	-	-	-	हाँ
Keypoints	हाँ	-	-	-	-	-
Text spans	-	-	-	हाँ	हाँ	-
Classifications	आंशिक	-	-	-	-	-

Programmatic Export

Python में export registry का सीधे उपयोग करें:

python

from potato.export.registry import export_registry
from potato.export.cli import build_export_context
 
context = build_export_context("path/to/config.yaml")
result = export_registry.export("coco", context, "./output/")
 
if result.success:
    print(f"Exported {len(result.files_written)} files")

Custom Exporters

BaseExporter को subclass करके custom exporters बनाएँ:

python

from potato.export.base import BaseExporter, ExportContext, ExportResult
 
class MyExporter(BaseExporter):
    format_name = "my_format"
    description = "My custom export format"
    file_extensions = [".myformat"]
 
    def can_export(self, context: ExportContext) -> tuple:
        has_spans = any(ann.get("spans") for ann in context.annotations)
        if not has_spans:
            return False, "No span annotations found"
        return True, None
 
    def export(self, context: ExportContext, output_path: str,
               options: dict = None) -> ExportResult:
        # Perform the export
        return ExportResult(
            success=True,
            format_name=self.format_name,
            files_written=["output.myformat"],
            stats={"annotations": len(context.annotations)}
        )
 
from potato.export.registry import export_registry
export_registry.register(MyExporter())

NLP Export Formats

CoNLL Format

Sequence labeling tasks (NER, POS tagging) के लिए Standard format:

text

The     O
quick   B-ADJ
brown   I-ADJ
fox     B-NOUN
jumps   B-VERB

Potato span annotations पढ़कर CoNLL format लिखने वाला उदाहरण conversion script:

python

import json
from pathlib import Path
 
def convert_to_conll(annotations_dir, output_file, scheme_name="entities"):
    """Convert Potato span annotations to CoNLL format."""
    with open(output_file, "w") as out:
        for file in sorted(Path(annotations_dir).glob("*.json")):
            with open(file) as f:
                data = json.load(f)
 
            for item_id, item_data in data.items():
                text = item_data.get("text", "")
                tokens = text.split()
                labels = ["O"] * len(tokens)
 
                spans = item_data.get("annotations", {}).get(scheme_name, [])
                for span in spans:
                    start_tok = span.get("start")
                    end_tok = span.get("end")
                    label = span.get("label", "ENT")
                    for i in range(start_tok, min(end_tok, len(tokens))):
                        prefix = "B" if i == start_tok else "I"
                        labels[i] = f"{prefix}-{label}"
 
                for token, label in zip(tokens, labels):
                    out.write(f"{token}\t{label}\n")
                out.write("\n")
 
# Usage
convert_to_conll("annotation_output/", "annotations.conll", "entities")

IOB2 Format

Entity recognition के लिए Inside-Outside-Beginning tagging:

text

John    B-PER
Smith   I-PER
works   O
at      O
Google  B-ORG

spaCy Format

Potato output पढ़कर NER training के लिए spaCy DocBin बनाने वाला उदाहरण conversion script:

python

import json
import spacy
from spacy.tokens import DocBin
from pathlib import Path
 
def convert_to_spacy(annotations_dir, output_file, scheme_name="entities"):
    """Convert Potato span annotations to spaCy DocBin format."""
    nlp = spacy.blank("en")
    doc_bin = DocBin()
 
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
 
        for item_id, item_data in data.items():
            text = item_data.get("text", "")
            doc = nlp.make_doc(text)
            ents = []
 
            spans = item_data.get("annotations", {}).get(scheme_name, [])
            for span in spans:
                char_span = doc.char_span(
                    span["start_offset"], span["end_offset"],
                    label=span["label"]
                )
                if char_span is not None:
                    ents.append(char_span)
            doc.ents = ents
            doc_bin.add(doc)
 
    doc_bin.to_disk(output_file)
 
# Usage
convert_to_spacy("annotation_output/", "train.spacy", "entities")

Output को सीधे spacy train के साथ उपयोग किया जा सकता है:

bash

python -m spacy train config.cfg --paths.train ./annotations.spacy

HuggingFace Datasets

Potato output को HuggingFace Dataset में convert करने के लिए datasets library का उपयोग करने वाला उदाहरण conversion script:

python

import json
from pathlib import Path
from datasets import Dataset, DatasetDict
 
def convert_to_huggingface(annotations_dir, output_dir, scheme_names):
    """Convert Potato annotations to a HuggingFace Dataset."""
    records = []
 
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
 
        for item_id, item_data in data.items():
            record = {"id": item_id, "text": item_data.get("text", "")}
            annotations = item_data.get("annotations", {})
            for scheme in scheme_names:
                record[scheme] = annotations.get(scheme)
            records.append(record)
 
    dataset = Dataset.from_list(records)
    dataset.save_to_disk(output_dir)
    print(f"Saved {len(records)} examples to {output_dir}")
 
# Usage
convert_to_huggingface("annotation_output/", "hf_dataset/", ["sentiment", "entities"])

अपने training script में load करें:

python

from datasets import load_from_disk
 
dataset = load_from_disk("hf_dataset/")

Computer Vision Export Formats

COCO Format

Object detection और segmentation के लिए Standard format:

json

{
  "images": [
    {"id": 1, "file_name": "image_001.jpg", "width": 640, "height": 480}
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [100, 150, 200, 300],
      "area": 60000,
      "segmentation": [[100, 150, 300, 150, 300, 450, 100, 450]]
    }
  ],
  "categories": [
    {"id": 1, "name": "person"}
  ]
}

Potato bounding box annotations पढ़कर COCO JSON लिखने वाला उदाहरण conversion script:

python

import json
from pathlib import Path
from PIL import Image
 
def convert_to_coco(annotations_dir, images_dir, output_file, scheme_name="objects"):
    """Convert Potato bounding box annotations to COCO format."""
    coco = {"images": [], "annotations": [], "categories": []}
    category_map = {}
    ann_id = 1
 
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
 
        for img_idx, (item_id, item_data) in enumerate(data.items(), start=1):
            img_path = Path(images_dir) / item_data.get("filename", f"{item_id}.jpg")
            if img_path.exists():
                img = Image.open(img_path)
                w, h = img.size
            else:
                w, h = item_data.get("width", 0), item_data.get("height", 0)
 
            coco["images"].append({
                "id": img_idx,
                "file_name": img_path.name,
                "width": w, "height": h
            })
 
            bboxes = item_data.get("annotations", {}).get(scheme_name, [])
            for bbox in bboxes:
                label = bbox["label"]
                if label not in category_map:
                    cat_id = len(category_map) + 1
                    category_map[label] = cat_id
                    coco["categories"].append({"id": cat_id, "name": label})
 
                x, y = bbox["x"], bbox["y"]
                bw, bh = bbox["width"], bbox["height"]
                coco["annotations"].append({
                    "id": ann_id, "image_id": img_idx,
                    "category_id": category_map[label],
                    "bbox": [x, y, bw, bh],
                    "area": bw * bh, "iscrowd": 0
                })
                ann_id += 1
 
    with open(output_file, "w") as f:
        json.dump(coco, f, indent=2)
 
# Usage
convert_to_coco("annotation_output/", "images/", "coco_annotations.json", "objects")

YOLO Format

YOLO training के लिए प्रति image एक text file:

text

# class_id center_x center_y width height (normalized 0-1)
0 0.5 0.5 0.3 0.4
1 0.2 0.3 0.1 0.2

Potato annotations से YOLO-format label files लिखने वाला उदाहरण conversion script:

python

import json
from pathlib import Path
from PIL import Image
 
def convert_to_yolo(annotations_dir, images_dir, output_dir, scheme_name="objects",
                    class_names=None):
    """Convert Potato bounding box annotations to YOLO format."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    class_names = class_names or []
 
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
 
        for item_id, item_data in data.items():
            filename = item_data.get("filename", f"{item_id}.jpg")
            img_path = Path(images_dir) / filename
            if img_path.exists():
                img = Image.open(img_path)
                img_w, img_h = img.size
            else:
                img_w = item_data.get("width", 1)
                img_h = item_data.get("height", 1)
 
            bboxes = item_data.get("annotations", {}).get(scheme_name, [])
            label_file = Path(output_dir) / (Path(filename).stem + ".txt")
 
            with open(label_file, "w") as out:
                for bbox in bboxes:
                    label = bbox["label"]
                    class_id = class_names.index(label) if label in class_names else 0
                    cx = (bbox["x"] + bbox["width"] / 2) / img_w
                    cy = (bbox["y"] + bbox["height"] / 2) / img_h
                    nw = bbox["width"] / img_w
                    nh = bbox["height"] / img_h
                    out.write(f"{class_id} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}\n")
 
# Usage
convert_to_yolo(
    "annotation_output/", "images/", "yolo_labels/",
    "objects", class_names=["person", "car", "dog"]
)

Pascal VOC Format

कई detection frameworks द्वारा उपयोग किया जाने वाला XML format:

xml

<annotation>
  <filename>image_001.jpg</filename>
  <size>
    <width>640</width>
    <height>480</height>
  </size>
  <object>
    <name>person</name>
    <bndbox>
      <xmin>100</xmin>
      <ymin>150</ymin>
      <xmax>300</xmax>
      <ymax>450</ymax>
    </bndbox>
  </object>
</annotation>

Potato annotations से Pascal VOC XML files लिखने वाला उदाहरण conversion script:

python

import json
import xml.etree.ElementTree as ET
from pathlib import Path
from PIL import Image
 
def convert_to_voc(annotations_dir, images_dir, output_dir, scheme_name="objects"):
    """Convert Potato bounding box annotations to Pascal VOC XML format."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
 
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
 
        for item_id, item_data in data.items():
            filename = item_data.get("filename", f"{item_id}.jpg")
            img_path = Path(images_dir) / filename
            if img_path.exists():
                img = Image.open(img_path)
                w, h = img.size
            else:
                w = item_data.get("width", 0)
                h = item_data.get("height", 0)
 
            root = ET.Element("annotation")
            ET.SubElement(root, "filename").text = filename
            size_el = ET.SubElement(root, "size")
            ET.SubElement(size_el, "width").text = str(w)
            ET.SubElement(size_el, "height").text = str(h)
            ET.SubElement(size_el, "depth").text = "3"
 
            bboxes = item_data.get("annotations", {}).get(scheme_name, [])
            for bbox in bboxes:
                obj = ET.SubElement(root, "object")
                ET.SubElement(obj, "name").text = bbox["label"]
                bndbox = ET.SubElement(obj, "bndbox")
                ET.SubElement(bndbox, "xmin").text = str(int(bbox["x"]))
                ET.SubElement(bndbox, "ymin").text = str(int(bbox["y"]))
                ET.SubElement(bndbox, "xmax").text = str(int(bbox["x"] + bbox["width"]))
                ET.SubElement(bndbox, "ymax").text = str(int(bbox["y"] + bbox["height"]))
 
            tree = ET.ElementTree(root)
            xml_file = Path(output_dir) / (Path(filename).stem + ".xml")
            tree.write(xml_file, encoding="unicode", xml_declaration=True)
 
# Usage
convert_to_voc("annotation_output/", "images/", "voc_annotations/", "objects")

Custom Export Scripts

बुनियादी Export Script

python

import json
import os
from pathlib import Path
 
def export_annotations(input_dir, output_file, format="json"):
    """Combine all annotator files into a single export."""
    all_annotations = []
 
    for file in Path(input_dir).glob("*.json"):
        with open(file) as f:
            data = json.load(f)
            all_annotations.extend(data)
 
    # Deduplicate by ID (keep latest)
    by_id = {}
    for ann in all_annotations:
        by_id[ann["id"]] = ann
 
    with open(output_file, "w") as f:
        json.dump(list(by_id.values()), f, indent=2)
 
# Usage
export_annotations("output/", "combined_annotations.json")

कई Annotators का Aggregation

python

from collections import Counter
 
def aggregate_labels(annotations_dir, scheme_name):
    """Majority vote aggregation for classification tasks."""
    from pathlib import Path
    import json
 
    item_labels = {}
 
    for file in Path(annotations_dir).glob("*.json"):
        with open(file) as f:
            for ann in json.load(f):
                item_id = ann["id"]
                label = ann["annotations"].get(scheme_name)
 
                if item_id not in item_labels:
                    item_labels[item_id] = []
                item_labels[item_id].append(label)
 
    aggregated = {}
    for item_id, labels in item_labels.items():
        counter = Counter(labels)
        aggregated[item_id] = counter.most_common(1)[0][0]
 
    return aggregated

Inter-Annotator Agreement की गणना

python

from sklearn.metrics import cohen_kappa_score
import numpy as np
 
def compute_agreement(annotations_dir, scheme_name):
    """Compute Cohen's Kappa for overlapping annotations."""
    ann1 = load_annotations(f"{annotations_dir}/user_1.json")
    ann2 = load_annotations(f"{annotations_dir}/user_2.json")
 
    common_ids = set(ann1.keys()) & set(ann2.keys())
 
    labels1 = [ann1[id][scheme_name] for id in common_ids]
    labels2 = [ann2[id][scheme_name] for id in common_ids]
 
    kappa = cohen_kappa_score(labels1, labels2)
    return kappa

सर्वोत्तम प्रथाएँ

1. नियमित रूप से Export करें

Backup और विश्लेषण के लिए स्वचालित exports सेट करें:

python

import schedule
 
def daily_export():
    export_annotations("output/", f"exports/annotations_{date.today()}.json")
 
schedule.every().day.at("18:00").do(daily_export)

2. Metadata शामिल करें

Exports में context सुरक्षित रखें:

python

export_data = {
    "metadata": {
        "task_name": "Sentiment Analysis",
        "exported_at": datetime.now().isoformat(),
        "total_annotations": len(annotations),
        "annotators": list(set(a["annotator"] for a in annotations))
    },
    "annotations": annotations
}

3. Exports Validate करें

Export integrity जांचें:

python

def validate_export(export_file, original_count):
    with open(export_file) as f:
        exported = json.load(f)
 
    assert len(exported) == original_count, "Missing annotations"
    assert all("id" in a for a in exported), "Missing IDs"
    print(f"Export validated: {len(exported)} annotations")

4. अपने Exports Version करें

Timestamps या version numbers का उपयोग करें:

text

exports/
  annotations_v1_2024-01-15.json
  annotations_v2_2024-01-20.json
  annotations_final_2024-01-25.json

Integration उदाहरण

HuggingFace Model Training

python

from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer
 
with open("aggregated_annotations.json") as f:
    data = json.load(f)
 
dataset = Dataset.from_list([
    {"text": item["text"], "label": item["sentiment"]}
    for item in data
])
 
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
 
# ... continue with training

spaCy NER Training

python

import spacy
from spacy.tokens import DocBin
 
with open("ner_annotations.json") as f:
    data = json.load(f)
 
nlp = spacy.blank("en")
doc_bin = DocBin()
 
for item in data:
    doc = nlp.make_doc(item["text"])
    ents = []
    for span in item["entities"]:
        ent = doc.char_span(span["start"], span["end"], label=span["label"])
        if ent:
            ents.append(ent)
    doc.ents = ents
    doc_bin.add(doc)
 
doc_bin.to_disk("./train.spacy")

YOLO Training

bash

# After exporting to YOLO format
yolo train data=dataset.yaml model=yolov8n.pt epochs=100

dataset.yaml:

yaml

train: ./images/train
val: ./images/val
nc: 3
names: ['person', 'car', 'dog']