Export Formats

Export annotations to a variety of formats for machine learning frameworks and analysis tools.

Potato provides two levels of export:

Native export: annotations are saved automatically as JSON, JSONL, CSV, or TSV, depending on the configuration
Export CLI (new in v2.2.0): python -m potato.export converts annotations to specialized formats (COCO, YOLO, Pascal VOC, CoNLL-2003, CoNLL-U, Mask PNG)

This page covers both the built-in formats and the export CLI, along with example conversion scripts for common targets.

Basic Export Formats

JSON

The default output format. Each annotator's work is saved as a JSON file:

json

{
  "id": "doc_001",
  "annotations": {
    "sentiment": "positive",
    "confidence": 4
  },
  "annotator": "user_1",
  "timestamp": "2024-01-15T10:30:00Z"
}

Configuration in YAML:

yaml

output_annotation_format: "json"
output_annotation_dir: "output/"

JSON Lines (JSONL)

One annotation per line, ideal for streaming and large datasets:

jsonl

{"id": "doc_001", "annotations": {"sentiment": "positive"}, "annotator": "user_1"}
{"id": "doc_002", "annotations": {"sentiment": "negative"}, "annotator": "user_1"}

yaml

output_annotation_format: "jsonl"
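
Because each line is an independent JSON object, a JSONL file can be consumed one record at a time. The sketch below illustrates that with the standard library only; the helper name iter_jsonl and the file path in the comment are hypothetical, and the record layout is assumed to match the sample above.

```python
import json
from typing import Iterable, Iterator

def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one annotation record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# With a real file (hypothetical path), pass the open handle directly:
#   with open("output/user_1.jsonl") as f:
#       for record in iter_jsonl(f): ...
sample = [
    '{"id": "doc_001", "annotations": {"sentiment": "positive"}, "annotator": "user_1"}',
    '{"id": "doc_002", "annotations": {"sentiment": "negative"}, "annotator": "user_1"}',
]
records = list(iter_jsonl(sample))
```

Since the generator holds only one line in memory at a time, the same loop works for files too large to load whole.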

CSV

A tabular format for spreadsheet analysis:

csv

id,annotator,sentiment,confidence,timestamp
doc_001,user_1,positive,4,2024-01-15T10:30:00Z
doc_002,user_1,negative,2,2024-01-15T10:31:00Z

yaml

output_annotation_format: "csv"

TSV

Tab-separated values:

yaml

output_annotation_format: "tsv"
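
Both tabular formats can be read back with Python's csv module; only the delimiter differs. This is a minimal sketch: read_table is a hypothetical helper, and the TSV columns are assumed to mirror the CSV sample above, since this page does not show a TSV sample.

```python
import csv
import io

def read_table(handle, delimiter=","):
    """Parse delimited rows into dicts keyed by the header row."""
    return list(csv.DictReader(handle, delimiter=delimiter))

csv_text = (
    "id,annotator,sentiment,confidence,timestamp\n"
    "doc_001,user_1,positive,4,2024-01-15T10:30:00Z\n"
)
rows = read_table(io.StringIO(csv_text))

# Same data with tabs instead of commas (assumed TSV layout)
tsv_rows = read_table(io.StringIO(csv_text.replace(",", "\t")), delimiter="\t")
```

Note that csv.DictReader returns every value as a string, so numeric columns such as confidence need an explicit int() conversion.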

Export CLI

New in v2.2.0

The export CLI converts Potato annotations to specialized formats with a single command:

bash

# List available export formats
python -m potato.export --list-formats
 
# Export to COCO format
python -m potato.export --config config.yaml --format coco --output ./export/
 
# Export to YOLO format
python -m potato.export --config config.yaml --format yolo --output ./export/
 
# Export with options
python -m potato.export --config config.yaml --format coco --output ./export/ \
    --option split_ratio=0.8 --option include_unlabeled=false

CLI Options

Option	Description
`--config`, `-c`	Path to the Potato YAML configuration file
`--format`, `-f`	Export format (coco, yolo, pascal_voc, conll_2003, conll_u, mask)
`--output`, `-o`	Output directory (default: ./export_output)
`--option`	Format-specific option in key=value form (repeatable)
`--list-formats`	List the available formats and exit
`--verbose`, `-v`	Enable verbose logging
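
When the export runs from a pipeline rather than a shell, it can help to assemble the command as an argument list. The sketch below uses only the flags from the table above; build_export_command is a hypothetical helper, not part of Potato.

```python
import sys

def build_export_command(config, fmt, output="./export_output", options=None):
    """Build the argv list for `python -m potato.export`."""
    cmd = [sys.executable, "-m", "potato.export",
           "--config", config, "--format", fmt, "--output", output]
    # Each format-specific option becomes its own repeated --option flag
    for key, value in (options or {}).items():
        cmd += ["--option", f"{key}={value}"]
    return cmd

cmd = build_export_command(
    "config.yaml", "coco", "./export/",
    {"split_ratio": 0.8, "include_unlabeled": "false"},
)
# To execute it: subprocess.run(cmd, check=True)
```

Passing a list instead of a shell string avoids quoting problems when paths contain spaces.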

Supported Export Formats

Format	Identifier	Best for
COCO	`coco`	Object detection, instance segmentation
YOLO	`yolo`	Training YOLO models
Pascal VOC	`pascal_voc`	Object detection (XML)
CoNLL-2003	`conll_2003`	Named entity recognition, sequence labeling
CoNLL-U	`conll_u`	Part-of-speech tagging, dependency parsing
Segmentation masks	`mask`	Semantic/instance segmentation

Format Compatibility Matrix

Annotation type	COCO	YOLO	Pascal VOC	CoNLL-2003	CoNLL-U	Mask
Bounding boxes	Yes	Yes	Yes	-	-	-
Polygons	Yes	-	-	-	-	Yes
Keypoints	Yes	-	-	-	-	-
Text spans	-	-	-	Yes	Yes	-
Classifications	Partial	-	-	-	-	-
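
The matrix can be encoded as a lookup table to pick a format that covers every annotation type in a project. This is an illustrative sketch only: the dictionary keys are hypothetical names for the rows above, and "Partial" support is treated as supported.

```python
# Rows of the compatibility matrix, keyed by hypothetical annotation-type names
SUPPORT = {
    "bounding_boxes": {"coco", "yolo", "pascal_voc"},
    "polygons": {"coco", "mask"},
    "keypoints": {"coco"},
    "text_spans": {"conll_2003", "conll_u"},
    "classifications": {"coco"},  # partial support only
}

def formats_for(annotation_types):
    """Return the formats that support every requested annotation type."""
    sets = [SUPPORT[t] for t in annotation_types]
    return sorted(set.intersection(*sets)) if sets else []

both = formats_for(["bounding_boxes", "polygons"])
```

An empty result means no single format covers the combination, so the project needs more than one export.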

Programmatic Export

Use the export registry directly from Python:

python

from potato.export.registry import export_registry
from potato.export.cli import build_export_context
 
context = build_export_context("path/to/config.yaml")
result = export_registry.export("coco", context, "./output/")
 
if result.success:
    print(f"Exported {len(result.files_written)} files")

Custom Exporters

Create a custom exporter by subclassing BaseExporter:

python

from potato.export.base import BaseExporter, ExportContext, ExportResult
 
class MyExporter(BaseExporter):
    format_name = "my_format"
    description = "My custom export format"
    file_extensions = [".myformat"]
 
    def can_export(self, context: ExportContext) -> tuple:
        has_spans = any(ann.get("spans") for ann in context.annotations)
        if not has_spans:
            return False, "No span annotations found"
        return True, None
 
    def export(self, context: ExportContext, output_path: str,
               options: dict = None) -> ExportResult:
        # Perform the export
        return ExportResult(
            success=True,
            format_name=self.format_name,
            files_written=["output.myformat"],
            stats={"annotations": len(context.annotations)}
        )
 
from potato.export.registry import export_registry
export_registry.register(MyExporter())

NLP Export Formats

CoNLL Format

The standard format for sequence labeling tasks (named entity recognition, part-of-speech tagging):

text

The     O
quick   B-ADJ
brown   I-ADJ
fox     B-NOUN
jumps   B-VERB

A conversion script that reads Potato span annotations and writes CoNLL format:

python

import json
from pathlib import Path
 
def convert_to_conll(annotations_dir, output_file, scheme_name="entities"):
    """Convert Potato span annotations to CoNLL format."""
    with open(output_file, "w") as out:
        for file in sorted(Path(annotations_dir).glob("*.json")):
            with open(file) as f:
                data = json.load(f)
 
            for item_id, item_data in data.items():
                text = item_data.get("text", "")
                tokens = text.split()
                labels = ["O"] * len(tokens)
 
                spans = item_data.get("annotations", {}).get(scheme_name, [])
                for span in spans:
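                    # start/end are assumed to be token indices (end exclusive)
                    start_tok = span.get("start")
                    end_tok = span.get("end")
                    if start_tok is None or end_tok is None:
                        continue  # skip spans without usable boundaries
                    label = span.get("label", "ENT")
                    for i in range(start_tok, min(end_tok, len(tokens))):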
                        prefix = "B" if i == start_tok else "I"
                        labels[i] = f"{prefix}-{label}"
 
                for token, label in zip(tokens, labels):
                    out.write(f"{token}\t{label}\n")
                out.write("\n")
 
# Usage
convert_to_conll("annotation_output/", "annotations.conll", "entities")
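
To check the written file, the output can be parsed back into sentences. The sketch below assumes the layout the script above produces: one tab-separated token/label pair per line and a blank line between items. read_conll is a hypothetical helper.

```python
def read_conll(lines):
    """Parse tab-separated token/label lines into a list of sentences."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.split("\t")
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences

sample = ["John\tB-PER\n", "Smith\tI-PER\n", "\n", "Google\tB-ORG\n"]
sentences = read_conll(sample)
```

With a real file, pass the open file handle in place of the sample list.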

IOB2 Format

Inside-Outside-Beginning tagging for entity recognition:

text

John    B-PER
Smith   I-PER
works   O
at      O
Google  B-ORG
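
In IOB2, every entity starts with a B- tag, so an I- tag is only valid when it continues a B- or I- tag of the same type. A small validator along these lines can catch malformed sequences before training; is_valid_iob2 is a hypothetical helper, not part of Potato.

```python
def is_valid_iob2(tags):
    """True if every I-X tag continues a preceding B-X or I-X tag."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            # prev[2:] strips the "B-" or "I-" prefix to compare entity types
            if prev == "O" or prev[2:] != tag[2:]:
                return False
        prev = tag
    return True

ok = is_valid_iob2(["B-PER", "I-PER", "O", "O", "B-ORG"])
bad = is_valid_iob2(["O", "I-PER"])
```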

spaCy Format

A conversion script that reads Potato output and builds a spaCy DocBin for NER training:

python

import json
import spacy
from spacy.tokens import DocBin
from pathlib import Path
 
def convert_to_spacy(annotations_dir, output_file, scheme_name="entities"):
    """Convert Potato span annotations to spaCy DocBin format."""
    nlp = spacy.blank("en")
    doc_bin = DocBin()
 
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
 
        for item_id, item_data in data.items():
            text = item_data.get("text", "")
            doc = nlp.make_doc(text)
            ents = []
 
            spans = item_data.get("annotations", {}).get(scheme_name, [])
            for span in spans:
                char_span = doc.char_span(
                    span["start_offset"], span["end_offset"],
                    label=span["label"]
                )
                if char_span is not None:
                    ents.append(char_span)
            doc.ents = ents
            doc_bin.add(doc)
 
    doc_bin.to_disk(output_file)
 
# Usage
convert_to_spacy("annotation_output/", "train.spacy", "entities")

The output can be used directly with spacy train:

bash

python -m spacy train config.cfg --paths.train ./train.spacy

HuggingFace Datasets

A conversion script that uses the datasets library to turn Potato output into a HuggingFace dataset:

python

import json
from pathlib import Path
from datasets import Dataset, DatasetDict
 
def convert_to_huggingface(annotations_dir, output_dir, scheme_names):
    """Convert Potato annotations to a HuggingFace Dataset."""
    records = []
 
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
 
        for item_id, item_data in data.items():
            record = {"id": item_id, "text": item_data.get("text", "")}
            annotations = item_data.get("annotations", {})
            for scheme in scheme_names:
                record[scheme] = annotations.get(scheme)
            records.append(record)
 
    dataset = Dataset.from_list(records)
    dataset.save_to_disk(output_dir)
    print(f"Saved {len(records)} examples to {output_dir}")
 
# Usage
convert_to_huggingface("annotation_output/", "hf_dataset/", ["sentiment", "entities"])

Load it in your training script:

python

from datasets import load_from_disk
 
dataset = load_from_disk("hf_dataset/")

Computer Vision Export Formats

COCO Format

The standard format for object detection and segmentation:

json

{
  "images": [
    {"id": 1, "file_name": "image_001.jpg", "width": 640, "height": 480}
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [100, 150, 200, 300],
      "area": 60000,
      "segmentation": [[100, 150, 300, 150, 300, 450, 100, 450]]
    }
  ],
  "categories": [
    {"id": 1, "name": "person"}
  ]
}
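
A COCO file links its three lists by ID, and each bbox is [x, y, width, height] measured from the top-left corner. The sketch below checks those links on a loaded dictionary; check_coco is a hypothetical helper and covers only the fields shown in the sample above.

```python
def check_coco(coco):
    """Return a list of problems found in a COCO-style dict (empty if none)."""
    image_ids = {img["id"] for img in coco["images"]}
    category_ids = {cat["id"] for cat in coco["categories"]}
    problems = []
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]  # [x, y, width, height]
        if ann["image_id"] not in image_ids:
            problems.append(f"annotation {ann['id']}: unknown image_id")
        if ann["category_id"] not in category_ids:
            problems.append(f"annotation {ann['id']}: unknown category_id")
        if w <= 0 or h <= 0:
            problems.append(f"annotation {ann['id']}: non-positive box size")
    return problems

coco = {
    "images": [{"id": 1, "file_name": "image_001.jpg", "width": 640, "height": 480}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1,
                     "bbox": [100, 150, 200, 300], "area": 60000}],
    "categories": [{"id": 1, "name": "person"}],
}
problems = check_coco(coco)
```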

A conversion script that reads Potato bounding box annotations and writes COCO JSON:

python

import json
from pathlib import Path
from PIL import Image
 
def convert_to_coco(annotations_dir, images_dir, output_file, scheme_name="objects"):
    """Convert Potato bounding box annotations to COCO format."""
    coco = {"images": [], "annotations": [], "categories": []}
    category_map = {}
    ann_id = 1
 
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
 
        for item_id, item_data in data.items():
            # Number images across all files so image IDs stay unique
            img_idx = len(coco["images"]) + 1
            # Get image dimensions
            img_path = Path(images_dir) / item_data.get("filename", f"{item_id}.jpg")
            if img_path.exists():
                img = Image.open(img_path)
                w, h = img.size
            else:
                w, h = item_data.get("width", 0), item_data.get("height", 0)
 
            coco["images"].append({
                "id": img_idx,
                "file_name": img_path.name,
                "width": w, "height": h
            })
 
            bboxes = item_data.get("annotations", {}).get(scheme_name, [])
            for bbox in bboxes:
                label = bbox["label"]
                if label not in category_map:
                    cat_id = len(category_map) + 1
                    category_map[label] = cat_id
                    coco["categories"].append({"id": cat_id, "name": label})
 
                x, y = bbox["x"], bbox["y"]
                bw, bh = bbox["width"], bbox["height"]
                coco["annotations"].append({
                    "id": ann_id, "image_id": img_idx,
                    "category_id": category_map[label],
                    "bbox": [x, y, bw, bh],
                    "area": bw * bh, "iscrowd": 0
                })
                ann_id += 1
 
    with open(output_file, "w") as f:
        json.dump(coco, f, indent=2)
 
# Usage
convert_to_coco("annotation_output/", "images/", "coco_annotations.json", "objects")

YOLO Format

One text file per image, for YOLO training:

text

# class_id center_x center_y width height (normalized 0-1)
0 0.5 0.5 0.3 0.4
1 0.2 0.3 0.1 0.2
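
Each value after the class ID is a fraction of the image size, and the first two give the box center rather than a corner. As a worked example, the sketch below converts the first line above back to a pixel box for an assumed 640x480 image; yolo_to_pixel_box is a hypothetical helper.

```python
def yolo_to_pixel_box(cx, cy, w, h, img_w, img_h):
    """Convert normalized center-format values to (x_min, y_min, width, height) in pixels."""
    box_w, box_h = w * img_w, h * img_h
    # Shift from the box center to its top-left corner
    return (cx * img_w - box_w / 2, cy * img_h - box_h / 2, box_w, box_h)

# "0 0.5 0.5 0.3 0.4" on a 640x480 image (image size is an assumption)
box = yolo_to_pixel_box(0.5, 0.5, 0.3, 0.4, 640, 480)
```

The result is approximately (224, 144, 192, 192): a 192x192 pixel box centered at (320, 240).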

A conversion script that writes YOLO-format label files from Potato annotations:

python

import json
from pathlib import Path
from PIL import Image
 
def convert_to_yolo(annotations_dir, images_dir, output_dir, scheme_name="objects",
                    class_names=None):
    """Convert Potato bounding box annotations to YOLO format."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    class_names = class_names or []
 
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
 
        for item_id, item_data in data.items():
            filename = item_data.get("filename", f"{item_id}.jpg")
            img_path = Path(images_dir) / filename
            if img_path.exists():
                img = Image.open(img_path)
                img_w, img_h = img.size
            else:
                img_w = item_data.get("width", 1)
                img_h = item_data.get("height", 1)
 
            bboxes = item_data.get("annotations", {}).get(scheme_name, [])
            label_file = Path(output_dir) / (Path(filename).stem + ".txt")
 
            with open(label_file, "w") as out:
                for bbox in bboxes:
                    label = bbox["label"]
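                    # NOTE: labels missing from class_names silently fall back to class 0,
                    # so pass a complete class_names list to avoid mislabeled boxes
                    class_id = class_names.index(label) if label in class_names else 0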
                    cx = (bbox["x"] + bbox["width"] / 2) / img_w
                    cy = (bbox["y"] + bbox["height"] / 2) / img_h
                    nw = bbox["width"] / img_w
                    nh = bbox["height"] / img_h
                    out.write(f"{class_id} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}\n")
 
# Usage
convert_to_yolo(
    "annotation_output/", "images/", "yolo_labels/",
    "objects", class_names=["person", "car", "dog"]
)

Pascal VOC Format

An XML format used by many detection frameworks:

xml

<annotation>
  <filename>image_001.jpg</filename>
  <size>
    <width>640</width>
    <height>480</height>
  </size>
  <object>
    <name>person</name>
    <bndbox>
      <xmin>100</xmin>
      <ymin>150</ymin>
      <xmax>300</xmax>
      <ymax>450</ymax>
    </bndbox>
  </object>
</annotation>

A conversion script that writes Pascal VOC XML files from Potato annotations:

python

import json
import xml.etree.ElementTree as ET
from pathlib import Path
from PIL import Image
 
def convert_to_voc(annotations_dir, images_dir, output_dir, scheme_name="objects"):
    """Convert Potato bounding box annotations to Pascal VOC XML format."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
 
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
 
        for item_id, item_data in data.items():
            filename = item_data.get("filename", f"{item_id}.jpg")
            img_path = Path(images_dir) / filename
            if img_path.exists():
                img = Image.open(img_path)
                w, h = img.size
            else:
                w = item_data.get("width", 0)
                h = item_data.get("height", 0)
 
            root = ET.Element("annotation")
            ET.SubElement(root, "filename").text = filename
            size_el = ET.SubElement(root, "size")
            ET.SubElement(size_el, "width").text = str(w)
            ET.SubElement(size_el, "height").text = str(h)
            ET.SubElement(size_el, "depth").text = "3"
 
            bboxes = item_data.get("annotations", {}).get(scheme_name, [])
            for bbox in bboxes:
                obj = ET.SubElement(root, "object")
                ET.SubElement(obj, "name").text = bbox["label"]
                bndbox = ET.SubElement(obj, "bndbox")
                ET.SubElement(bndbox, "xmin").text = str(int(bbox["x"]))
                ET.SubElement(bndbox, "ymin").text = str(int(bbox["y"]))
                ET.SubElement(bndbox, "xmax").text = str(int(bbox["x"] + bbox["width"]))
                ET.SubElement(bndbox, "ymax").text = str(int(bbox["y"] + bbox["height"]))
 
            tree = ET.ElementTree(root)
            xml_file = Path(output_dir) / (Path(filename).stem + ".xml")
            tree.write(xml_file, encoding="unicode", xml_declaration=True)
 
# Usage
convert_to_voc("annotation_output/", "images/", "voc_annotations/", "objects")
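
Pascal VOC stores corner coordinates (xmin, ymin, xmax, ymax), while COCO stores [x, y, width, height]. The sketch below reads a VOC annotation and converts each box to the COCO layout, which is a quick way to cross-check the two exports; voc_boxes is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def voc_boxes(xml_text):
    """Return (name, [x, y, width, height]) for each <object> in a VOC annotation."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.findall("object"):
        bb = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            int(bb.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((obj.findtext("name"), [xmin, ymin, xmax - xmin, ymax - ymin]))
    return boxes

xml_text = """<annotation>
  <object>
    <name>person</name>
    <bndbox><xmin>100</xmin><ymin>150</ymin><xmax>300</xmax><ymax>450</ymax></bndbox>
  </object>
</annotation>"""
boxes = voc_boxes(xml_text)
```

For the sample values this yields [100, 150, 200, 300], the same box as in the COCO example earlier on this page.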

Custom Export Scripts

Basic Export Script

python

import json
import os
from pathlib import Path
 
def export_annotations(input_dir, output_file, format="json"):
    """Combine all annotator files into a single export."""
    all_annotations = []
 
    for file in Path(input_dir).glob("*.json"):
        with open(file) as f:
            data = json.load(f)
            all_annotations.extend(data)
 
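    # Deduplicate by (item ID, annotator), keeping the last record seen,
    # so that different annotators' labels for the same item are preserved
    by_id = {}
    for ann in all_annotations:
        by_id[(ann["id"], ann.get("annotator"))] = ann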
 
    with open(output_file, "w") as f:
        json.dump(list(by_id.values()), f, indent=2)
 
# Usage
export_annotations("output/", "combined_annotations.json")

Aggregating Multiple Annotators

python

from collections import Counter
 
def aggregate_labels(annotations_dir, scheme_name):
    """Majority vote aggregation for classification tasks."""
    from pathlib import Path
    import json
 
    # Collect all labels per item
    item_labels = {}
 
    for file in Path(annotations_dir).glob("*.json"):
        with open(file) as f:
            for ann in json.load(f):
                item_id = ann["id"]
                label = ann["annotations"].get(scheme_name)
 
                if item_id not in item_labels:
                    item_labels[item_id] = []
                item_labels[item_id].append(label)
 
    # Majority vote
    aggregated = {}
    for item_id, labels in item_labels.items():
        counter = Counter(labels)
        aggregated[item_id] = counter.most_common(1)[0][0]
 
    return aggregated
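
Counter.most_common orders equal counts by first occurrence, so a plain majority vote resolves ties by whichever label was read first. One possible alternative, sketched below, returns None on a tie so the item can be sent for adjudication; majority_or_none is a hypothetical helper.

```python
from collections import Counter

def majority_or_none(labels):
    """Return the most common label, or None when the top count is tied."""
    counts = Counter(labels).most_common(2)
    if not counts:
        return None
    if len(counts) == 2 and counts[0][1] == counts[1][1]:
        return None  # tie: flag for adjudication instead of guessing
    return counts[0][0]

winner = majority_or_none(["positive", "positive", "negative"])
tied = majority_or_none(["positive", "negative"])
```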

Computing Inter-Annotator Agreement

python

import json

from sklearn.metrics import cohen_kappa_score

def load_annotations(path):
    """Load one annotator's file as {item_id: annotations}.

    Assumes a JSON list of records with "id" and "annotations" keys.
    """
    with open(path) as f:
        return {ann["id"]: ann["annotations"] for ann in json.load(f)}

def compute_agreement(annotations_dir, scheme_name):
    """Compute Cohen's Kappa for overlapping annotations."""
    # Load annotations from two annotators
    ann1 = load_annotations(f"{annotations_dir}/user_1.json")
    ann2 = load_annotations(f"{annotations_dir}/user_2.json")

    # Find overlapping items (sorted for a stable order)
    common_ids = sorted(set(ann1) & set(ann2))

    labels1 = [ann1[item_id][scheme_name] for item_id in common_ids]
    labels2 = [ann2[item_id][scheme_name] for item_id in common_ids]

    return cohen_kappa_score(labels1, labels2)
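
For reference, Cohen's Kappa compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The dependency-free sketch below spells out that formula for two equal-length label lists. It is an illustration, not a replacement for scikit-learn, and it does not handle the degenerate case p_e = 1.

```python
from collections import Counter

def cohen_kappa(labels1, labels2):
    """Cohen's kappa, (p_o - p_e) / (1 - p_e), for two equal-length label lists."""
    n = len(labels1)
    # Observed agreement: fraction of items with identical labels
    p_o = sum(a == b for a, b in zip(labels1, labels2)) / n
    # Chance agreement: product of each annotator's label frequencies
    c1, c2 = Counter(labels1), Counter(labels2)
    p_e = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (p_o - p_e) / (1 - p_e)

kappa = cohen_kappa(["a", "a", "b", "b"], ["a", "a", "b", "a"])
```

Here p_o = 0.75 and p_e = 0.5, so kappa is 0.5.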

Best Practices

1. Export Regularly

Set up automated exports for backup and analysis:

python

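# Add to your workflow (requires the third-party `schedule` package)
import time
from datetime import date

import schedule

def daily_export():
    # export_annotations is the function from the basic export script
    export_annotations("output/", f"exports/annotations_{date.today()}.json")

schedule.every().day.at("18:00").do(daily_export)

while True:  # schedule only runs jobs when polled
    schedule.run_pending()
    time.sleep(60)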

2. Include Metadata

Preserve context in your exports:

python

from datetime import datetime

# `annotations` is the combined list of annotation records to export
export_data = {
    "metadata": {
        "task_name": "Sentiment Analysis",
        "exported_at": datetime.now().isoformat(),
        "total_annotations": len(annotations),
        "annotators": list(set(a["annotator"] for a in annotations))
    },
    "annotations": annotations
}

3. Validate Exports

Check export integrity:

python

import json

def validate_export(export_file, original_count):
    with open(export_file) as f:
        exported = json.load(f)
 
    assert len(exported) == original_count, "Missing annotations"
    assert all("id" in a for a in exported), "Missing IDs"
    print(f"Export validated: {len(exported)} annotations")

4. Version Your Exports

Use timestamps or version numbers:

text

exports/
  annotations_v1_2024-01-15.json
  annotations_v2_2024-01-20.json
  annotations_final_2024-01-25.json
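
File names in this pattern can be generated rather than typed by hand. The sketch below builds the version-plus-date form shown above; versioned_export_path is a hypothetical helper, and PurePosixPath is used only so the result looks the same on every platform.

```python
from datetime import date
from pathlib import PurePosixPath

def versioned_export_path(directory, version, day=None):
    """Build a path like exports/annotations_v1_2024-01-15.json."""
    day = day or date.today()
    return PurePosixPath(directory) / f"annotations_v{version}_{day.isoformat()}.json"

path = versioned_export_path("exports", 1, date(2024, 1, 15))
```

ISO dates sort lexicographically, so a directory listing shows the exports in chronological order within each version.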

Integration Examples

Training a HuggingFace Model

python

import json

from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer
 
# Load exported data
with open("aggregated_annotations.json") as f:
    data = json.load(f)
 
# Create dataset
dataset = Dataset.from_list([
    {"text": item["text"], "label": item["sentiment"]}
    for item in data
])
 
# Train model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
 
# ... continue with training
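
Sequence classification models are normally trained on integer class IDs, while the exported sentiment values in this page's examples are strings. One way to bridge that, sketched below in plain Python, is to build a label-to-ID mapping before creating the dataset; encode_labels is a hypothetical helper, and the three sample labels are assumptions chosen to match num_labels=3.

```python
def encode_labels(items, label_key="sentiment"):
    """Map string labels to integer IDs in sorted order; returns (rows, label2id)."""
    names = sorted({item[label_key] for item in items})
    label2id = {name: i for i, name in enumerate(names)}
    rows = [{"text": item["text"], "label": label2id[item[label_key]]} for item in items]
    return rows, label2id

data = [
    {"text": "great", "sentiment": "positive"},
    {"text": "bad", "sentiment": "negative"},
    {"text": "fine", "sentiment": "neutral"},
]
rows, label2id = encode_labels(data)
```

The resulting rows can be passed to Dataset.from_list in place of the raw records, and label2id can be kept alongside the model to decode predictions.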

Training spaCy NER

python

import json

import spacy
from spacy.tokens import DocBin
 
# Load exported spans
with open("ner_annotations.json") as f:
    data = json.load(f)
 
nlp = spacy.blank("en")
doc_bin = DocBin()
 
for item in data:
    doc = nlp.make_doc(item["text"])
    ents = []
    for span in item["entities"]:
        ent = doc.char_span(span["start"], span["end"], label=span["label"])
        if ent:
            ents.append(ent)
    doc.ents = ents
    doc_bin.add(doc)
 
doc_bin.to_disk("./train.spacy")

Training YOLO

bash

# After exporting to YOLO format
yolo train data=dataset.yaml model=yolov8n.pt epochs=100

dataset.yaml:

yaml

train: ./images/train
val: ./images/val
nc: 3
names: ['person', 'car', 'dog']
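
The class order in dataset.yaml has to match the class IDs written to the label files, so it can be safer to generate the file from the same list of class names used during conversion. The sketch below renders the four keys shown above as text; render_dataset_yaml is a hypothetical helper and does not escape class names that contain quote characters.

```python
def render_dataset_yaml(train_dir, val_dir, class_names):
    """Render the four-key dataset.yaml layout as text."""
    names = ", ".join(f"'{name}'" for name in class_names)
    return (
        f"train: {train_dir}\n"
        f"val: {val_dir}\n"
        f"nc: {len(class_names)}\n"
        f"names: [{names}]\n"
    )

text = render_dataset_yaml("./images/train", "./images/val", ["person", "car", "dog"])
# To save it: Path("dataset.yaml").write_text(text)
```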