Export Formats
Export annotations to a variety of formats for machine learning frameworks and analysis tools.
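Potato provides export functionality at two levels:
- Native export: annotations are saved automatically in JSON, JSONL, CSV, or TSV format, according to your configuration
- Export CLI (new in v2.2.0): `python -m potato.export` converts annotations to specialized formats (COCO, YOLO, Pascal VOC, CoNLL-2003, CoNLL-U, Mask PNG)

This page covers the built-in formats and the export CLI, along with example conversion scripts for common targets.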
Basic export formats

JSON
The default output format. Each annotator's work is saved as a JSON file:
```json
{
  "id": "doc_001",
  "annotations": {
    "sentiment": "positive",
    "confidence": 4
  },
  "annotator": "user_1",
  "timestamp": "2024-01-15T10:30:00Z"
}
```
Configure it in YAML:
```yaml
output_annotation_format: "json"
output_annotation_dir: "output/"
```

JSON Lines (JSONL)
One annotation per line, well suited to streaming and large datasets:
```jsonl
{"id": "doc_001", "annotations": {"sentiment": "positive"}, "annotator": "user_1"}
{"id": "doc_002", "annotations": {"sentiment": "negative"}, "annotator": "user_1"}
```
```yaml
output_annotation_format: "jsonl"
```

CSV
A tabular format for spreadsheet analysis:
```csv
id,annotator,sentiment,confidence,timestamp
doc_001,user_1,positive,4,2024-01-15T10:30:00Z
doc_002,user_1,negative,2,2024-01-15T10:31:00Z
```
```yaml
output_annotation_format: "csv"
```

TSV
Tab-separated values:
```yaml
output_annotation_format: "tsv"
```

Export CLI
New in v2.2.0

The export CLI converts Potato annotations to specialized formats with a single command:
```bash
# List available export formats
python -m potato.export --list-formats

# Export to COCO format
python -m potato.export --config config.yaml --format coco --output ./export/

# Export to YOLO format
python -m potato.export --config config.yaml --format yolo --output ./export/

# Export with options
python -m potato.export --config config.yaml --format coco --output ./export/ \
    --option split_ratio=0.8 --option include_unlabeled=false
```

CLI options
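| Option | Description |
|---|---|
| --config, -c | Path to the Potato YAML configuration file |
| --format, -f | Export format (coco, yolo, pascal_voc, conll_2003, conll_u, mask) |
| --output, -o | Output directory (default: ./export_output) |
| --option | Format-specific option as key=value (repeatable) |
| --list-formats | List available formats and exit |
| --verbose, -v | Enable verbose logging |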
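
As a rough illustration of how repeated `--option key=value` arguments can be collected into a dictionary, here is a minimal sketch built on the standard `argparse` module. The names `parse_option` and `build_parser` are hypothetical, and this is not Potato's actual implementation; it only mirrors some of the flags in the table above and leaves every option value as a string.

```python
import argparse


def parse_option(raw):
    """Split one 'key=value' string into a (key, value) pair."""
    key, sep, value = raw.partition("=")
    if not sep or not key:
        raise argparse.ArgumentTypeError(f"expected key=value, got {raw!r}")
    return key, value


def build_parser():
    # Hypothetical parser that mimics the documented flags; not Potato's own code.
    parser = argparse.ArgumentParser(prog="export-sketch")
    parser.add_argument("--config", "-c")
    parser.add_argument("--format", "-f")
    parser.add_argument("--output", "-o", default="./export_output")
    parser.add_argument("--option", action="append", type=parse_option, default=[])
    return parser


args = build_parser().parse_args(
    ["-c", "config.yaml", "-f", "coco",
     "--option", "split_ratio=0.8", "--option", "include_unlabeled=false"]
)
options = dict(args.option)  # repeated flags become one dictionary
print(options)
```

Values such as `0.8` or `false` arrive as strings in this sketch, so an exporter would still have to convert them to the types it expects.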
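Supported export formats
| Format | ID | Best for |
|---|---|---|
| COCO | coco | Object detection, instance segmentation |
| YOLO | yolo | YOLO model training |
| Pascal VOC | pascal_voc | Object detection (XML) |
| CoNLL-2003 | conll_2003 | Named entity recognition, sequence labeling |
| CoNLL-U | conll_u | Part-of-speech tagging, dependency parsing |
| Segmentation masks | mask | Semantic/instance segmentation |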
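
When the same project needs several of these formats, the documented flags can be assembled in a small loop. The sketch below only builds the command lines and does not run them; `build_export_command` is a hypothetical helper, and writing each format to its own subdirectory is a choice made here, not something the CLI requires.

```python
import shlex


def build_export_command(fmt, config="config.yaml", output_root="./export"):
    """Return the argv list for one export run, using only the documented flags."""
    return ["python", "-m", "potato.export",
            "--config", config, "--format", fmt,
            "--output", f"{output_root}/{fmt}"]


commands = [build_export_command(fmt) for fmt in ("coco", "yolo", "pascal_voc")]
for cmd in commands:
    # To execute, pass cmd to subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```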
Format compatibility matrix
| Annotation type | COCO | YOLO | Pascal VOC | CoNLL-2003 | CoNLL-U | Mask |
|---|---|---|---|---|---|---|
| Bounding boxes | Yes | Yes | Yes | - | - | - |
| Polygons | Yes | - | - | - | - | Yes |
| Keypoints | Yes | - | - | - | - | - |
| Text spans | - | - | - | Yes | Yes | - |
| Classification | Partial | - | - | - | - | - |
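
The matrix can also be encoded as a lookup table, which is handy for checking a project before starting an export. This is a minimal sketch: the annotation-type keys (`bounding_box`, `polygon`, and so on) are names invented for this example, while the format IDs are the ones listed in the supported formats table.

```python
# Annotation-type keys are hypothetical; format IDs come from the table above.
COMPATIBILITY = {
    "bounding_box": {"coco", "yolo", "pascal_voc"},
    "polygon": {"coco", "mask"},
    "keypoint": {"coco"},
    "text_span": {"conll_2003", "conll_u"},
    "classification": {"coco"},  # listed as partial support in the matrix
}


def compatible_formats(annotation_types):
    """Return the format IDs that support every annotation type given."""
    sets = [COMPATIBILITY[t] for t in annotation_types]
    return sorted(set.intersection(*sets)) if sets else []


print(compatible_formats(["bounding_box", "polygon"]))
print(compatible_formats(["bounding_box", "text_span"]))
```

An empty result, as in the second call, means no single format covers all requested annotation types, so the project would need more than one export.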
Programmatic export
Use the export registry directly from Python:
```python
from potato.export.registry import export_registry
from potato.export.cli import build_export_context

context = build_export_context("path/to/config.yaml")
result = export_registry.export("coco", context, "./output/")
if result.success:
    print(f"Exported {len(result.files_written)} files")
```

Custom exporters
Create a custom exporter by subclassing BaseExporter:
```python
from potato.export.base import BaseExporter, ExportContext, ExportResult


class MyExporter(BaseExporter):
    format_name = "my_format"
    description = "My custom export format"
    file_extensions = [".myformat"]

    def can_export(self, context: ExportContext) -> tuple:
        # Returns (ok, reason); reason explains why the export is not possible
        has_spans = any(ann.get("spans") for ann in context.annotations)
        if not has_spans:
            return False, "No span annotations found"
        return True, None

    def export(self, context: ExportContext, output_path: str,
               options: dict = None) -> ExportResult:
        # Perform the export
        return ExportResult(
            success=True,
            format_name=self.format_name,
            files_written=["output.myformat"],
            stats={"annotations": len(context.annotations)}
        )


# Register the exporter so it is available by its format_name
from potato.export.registry import export_registry
export_registry.register(MyExporter())
```

NLP export formats
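CoNLL format
The standard format for sequence labeling tasks (NER, POS tagging):
```text
The O
quick B-ADJ
brown I-ADJ
fox B-NOUN
jumps B-VERB
```
An example conversion script that reads Potato span annotations and writes CoNLL format: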
```python
import json
from pathlib import Path


def convert_to_conll(annotations_dir, output_file, scheme_name="entities"):
    """Convert Potato span annotations to CoNLL format."""
    with open(output_file, "w", encoding="utf-8") as out:
        for file in sorted(Path(annotations_dir).glob("*.json")):
            with open(file, encoding="utf-8") as f:
                data = json.load(f)
            for item_id, item_data in data.items():
                text = item_data.get("text", "")
                tokens = text.split()  # whitespace tokenization
                labels = ["O"] * len(tokens)
                spans = item_data.get("annotations", {}).get(scheme_name, [])
                for span in spans:
                    # "start"/"end" are treated as token indices (end exclusive)
                    start_tok = span.get("start")
                    end_tok = span.get("end")
                    if start_tok is None or end_tok is None:
                        continue  # skip spans without usable boundaries
                    label = span.get("label", "ENT")
                    for i in range(start_tok, min(end_tok, len(tokens))):
                        prefix = "B" if i == start_tok else "I"
                        labels[i] = f"{prefix}-{label}"
                for token, label in zip(tokens, labels):
                    out.write(f"{token}\t{label}\n")
                out.write("\n")  # blank line separates items


# Usage
convert_to_conll("annotation_output/", "annotations.conll", "entities")
```

IOB2 format
Inside-outside-beginning (IOB2) tagging for entity recognition:
```text
John B-PER
Smith I-PER
works O
at O
Google B-ORG
```
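
To go from IOB2 tags like those above back to entity spans, the tags can be collapsed in a single pass. The following is a minimal sketch; `iob2_to_spans` is a hypothetical helper, not part of Potato, and it treats a stray `I-` tag with no open entity as the start of a new entity, which is one of several possible conventions.

```python
def iob2_to_spans(tags):
    """Collapse IOB2 tags into (label, start, end) token spans; end is exclusive."""
    spans = []
    start = label = None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i))
            start = label = None
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            # A stray I- tag with no open entity is treated as a new entity.
            start, label = i, tag[2:]
    return spans


tags = ["B-PER", "I-PER", "O", "O", "B-ORG"]
print(iob2_to_spans(tags))
```

For the five-token example, this yields one `PER` span covering tokens 0 and 1 and one `ORG` span covering token 4.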
spaCy format
An example conversion script that reads Potato output and creates a spaCy DocBin for NER training:
```python
import json
from pathlib import Path

import spacy
from spacy.tokens import DocBin


def convert_to_spacy(annotations_dir, output_file, scheme_name="entities"):
    """Convert Potato span annotations to spaCy DocBin format."""
    nlp = spacy.blank("en")
    doc_bin = DocBin()
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
        for item_id, item_data in data.items():
            text = item_data.get("text", "")
            doc = nlp.make_doc(text)
            ents = []
            spans = item_data.get("annotations", {}).get(scheme_name, [])
            for span in spans:
                # Offsets are character positions; char_span returns None
                # when they do not align with token boundaries
                char_span = doc.char_span(
                    span["start_offset"], span["end_offset"],
                    label=span["label"]
                )
                if char_span is not None:
                    ents.append(char_span)
            doc.ents = ents
            doc_bin.add(doc)
    doc_bin.to_disk(output_file)


# Usage
convert_to_spacy("annotation_output/", "train.spacy", "entities")
```
The output can be used directly with spacy train:
```bash
python -m spacy train config.cfg --paths.train ./train.spacy
```

HuggingFace Datasets
An example conversion script that uses the datasets library to turn Potato output into a HuggingFace Dataset:
```python
import json
from pathlib import Path

from datasets import Dataset


def convert_to_huggingface(annotations_dir, output_dir, scheme_names):
    """Convert Potato annotations to a HuggingFace Dataset."""
    records = []
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
        for item_id, item_data in data.items():
            record = {"id": item_id, "text": item_data.get("text", "")}
            annotations = item_data.get("annotations", {})
            for scheme in scheme_names:
                # Missing schemes are stored as None
                record[scheme] = annotations.get(scheme)
            records.append(record)
    dataset = Dataset.from_list(records)
    dataset.save_to_disk(output_dir)
    print(f"Saved {len(records)} examples to {output_dir}")


# Usage
convert_to_huggingface("annotation_output/", "hf_dataset/", ["sentiment", "entities"])
```
Load it in your training script:
```python
from datasets import load_from_disk

dataset = load_from_disk("hf_dataset/")
```

Computer vision export formats
COCO format
The standard format for object detection and segmentation:
```json
{
  "images": [
    {"id": 1, "file_name": "image_001.jpg", "width": 640, "height": 480}
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [100, 150, 200, 300],
      "area": 60000,
      "segmentation": [[100, 150, 300, 150, 300, 450, 100, 450]]
    }
  ],
  "categories": [
    {"id": 1, "name": "person"}
  ]
}
```
An example conversion script that reads Potato bounding box annotations and writes COCO JSON:
```python
import json
from pathlib import Path

from PIL import Image


def convert_to_coco(annotations_dir, images_dir, output_file, scheme_name="objects"):
    """Convert Potato bounding box annotations to COCO format."""
    coco = {"images": [], "annotations": [], "categories": []}
    category_map = {}
    image_id = 0  # running counter so image IDs stay unique across files
    ann_id = 1
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
        for item_id, item_data in data.items():
            image_id += 1
            # Get image dimensions from the file, falling back to stored values
            img_path = Path(images_dir) / item_data.get("filename", f"{item_id}.jpg")
            if img_path.exists():
                with Image.open(img_path) as img:
                    w, h = img.size
            else:
                w, h = item_data.get("width", 0), item_data.get("height", 0)
            coco["images"].append({
                "id": image_id,
                "file_name": img_path.name,
                "width": w, "height": h
            })
            bboxes = item_data.get("annotations", {}).get(scheme_name, [])
            for bbox in bboxes:
                label = bbox["label"]
                if label not in category_map:
                    cat_id = len(category_map) + 1
                    category_map[label] = cat_id
                    coco["categories"].append({"id": cat_id, "name": label})
                x, y = bbox["x"], bbox["y"]
                bw, bh = bbox["width"], bbox["height"]
                coco["annotations"].append({
                    "id": ann_id, "image_id": image_id,
                    "category_id": category_map[label],
                    "bbox": [x, y, bw, bh],  # COCO uses [x_min, y_min, width, height]
                    "area": bw * bh, "iscrowd": 0
                })
                ann_id += 1
    with open(output_file, "w") as f:
        json.dump(coco, f, indent=2)


# Usage
convert_to_coco("annotation_output/", "images/", "coco_annotations.json", "objects")
```

YOLO format
One text file per image for YOLO training:
```text
# class_id center_x center_y width height (normalized 0-1)
0 0.5 0.5 0.3 0.4
1 0.2 0.3 0.1 0.2
```
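
To make the normalization concrete, the sketch below converts one label line back to a pixel-space box for a given image size. `yolo_to_pixels` is a hypothetical helper written for this page, and the 640x480 image size is an assumed example.

```python
def yolo_to_pixels(line, img_w, img_h):
    """Convert one YOLO label line to (class_id, x_min, y_min, width, height) in pixels."""
    class_id, cx, cy, w, h = line.split()
    w_px, h_px = float(w) * img_w, float(h) * img_h
    # The stored coordinates are the box center, so shift by half the size
    x_min = float(cx) * img_w - w_px / 2
    y_min = float(cy) * img_h - h_px / 2
    return int(class_id), x_min, y_min, w_px, h_px


print(yolo_to_pixels("0 0.5 0.5 0.3 0.4", 640, 480))
```

For the first sample line on a 640x480 image, the box is about 192 pixels wide and 192 pixels high, with its top-left corner near (224, 144).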
An example conversion script that writes YOLO-format label files from Potato annotations:
```python
import json
from pathlib import Path

from PIL import Image


def convert_to_yolo(annotations_dir, images_dir, output_dir, scheme_name="objects",
                    class_names=None):
    """Convert Potato bounding box annotations to YOLO format."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    class_names = list(class_names or [])  # copy so the caller's list is untouched
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
        for item_id, item_data in data.items():
            filename = item_data.get("filename", f"{item_id}.jpg")
            img_path = Path(images_dir) / filename
            if img_path.exists():
                with Image.open(img_path) as img:
                    img_w, img_h = img.size
            else:
                img_w = item_data.get("width", 1)
                img_h = item_data.get("height", 1)
            bboxes = item_data.get("annotations", {}).get(scheme_name, [])
            label_file = Path(output_dir) / (Path(filename).stem + ".txt")
            with open(label_file, "w") as out:
                for bbox in bboxes:
                    label = bbox["label"]
                    # Unknown labels get a new class ID instead of silently becoming class 0
                    if label not in class_names:
                        class_names.append(label)
                    class_id = class_names.index(label)
                    # Convert top-left pixel box to normalized center format
                    cx = (bbox["x"] + bbox["width"] / 2) / img_w
                    cy = (bbox["y"] + bbox["height"] / 2) / img_h
                    nw = bbox["width"] / img_w
                    nh = bbox["height"] / img_h
                    out.write(f"{class_id} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}\n")
    return class_names  # final class list, in class ID order


# Usage
convert_to_yolo(
    "annotation_output/", "images/", "yolo_labels/",
    "objects", class_names=["person", "car", "dog"]
)
```

Pascal VOC format
An XML format used by many detection frameworks:
```xml
<annotation>
  <filename>image_001.jpg</filename>
  <size>
    <width>640</width>
    <height>480</height>
  </size>
  <object>
    <name>person</name>
    <bndbox>
      <xmin>100</xmin>
      <ymin>150</ymin>
      <xmax>300</xmax>
      <ymax>450</ymax>
    </bndbox>
  </object>
</annotation>
```
An example conversion script that writes Pascal VOC XML files from Potato annotations:
```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path

from PIL import Image


def convert_to_voc(annotations_dir, images_dir, output_dir, scheme_name="objects"):
    """Convert Potato bounding box annotations to Pascal VOC XML format."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for file in sorted(Path(annotations_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)
        for item_id, item_data in data.items():
            filename = item_data.get("filename", f"{item_id}.jpg")
            img_path = Path(images_dir) / filename
            if img_path.exists():
                with Image.open(img_path) as img:
                    w, h = img.size
            else:
                w = item_data.get("width", 0)
                h = item_data.get("height", 0)
            root = ET.Element("annotation")
            ET.SubElement(root, "filename").text = filename
            size_el = ET.SubElement(root, "size")
            ET.SubElement(size_el, "width").text = str(w)
            ET.SubElement(size_el, "height").text = str(h)
            ET.SubElement(size_el, "depth").text = "3"  # assumes 3-channel images
            bboxes = item_data.get("annotations", {}).get(scheme_name, [])
            for bbox in bboxes:
                obj = ET.SubElement(root, "object")
                ET.SubElement(obj, "name").text = bbox["label"]
                bndbox = ET.SubElement(obj, "bndbox")
                # VOC stores corner coordinates rather than width/height
                ET.SubElement(bndbox, "xmin").text = str(int(bbox["x"]))
                ET.SubElement(bndbox, "ymin").text = str(int(bbox["y"]))
                ET.SubElement(bndbox, "xmax").text = str(int(bbox["x"] + bbox["width"]))
                ET.SubElement(bndbox, "ymax").text = str(int(bbox["y"] + bbox["height"]))
            tree = ET.ElementTree(root)
            xml_file = Path(output_dir) / (Path(filename).stem + ".xml")
            tree.write(xml_file, encoding="utf-8", xml_declaration=True)


# Usage
convert_to_voc("annotation_output/", "images/", "voc_annotations/", "objects")
```

Custom export scripts
Basic export script
```python
import json
from pathlib import Path


def export_annotations(input_dir, output_file, format="json"):
    """Combine all annotator files into a single export (written as JSON)."""
    all_annotations = []
    # Sort so the "latest" record per ID is deterministic
    for file in sorted(Path(input_dir).glob("*.json")):
        with open(file) as f:
            data = json.load(f)  # expects a list of annotation records
        all_annotations.extend(data)
    # Deduplicate by ID (the last record read wins)
    by_id = {}
    for ann in all_annotations:
        by_id[ann["id"]] = ann
    with open(output_file, "w") as f:
        json.dump(list(by_id.values()), f, indent=2)


# Usage
export_annotations("output/", "combined_annotations.json")
```

Multi-annotator aggregation
```python
import json
from collections import Counter
from pathlib import Path


def aggregate_labels(annotations_dir, scheme_name):
    """Majority vote aggregation for classification tasks."""
    # Collect all labels per item
    item_labels = {}
    for file in Path(annotations_dir).glob("*.json"):
        with open(file) as f:
            for ann in json.load(f):
                item_id = ann["id"]
                label = ann["annotations"].get(scheme_name)
                item_labels.setdefault(item_id, []).append(label)
    # Majority vote (on a tie, the label encountered first wins)
    aggregated = {}
    for item_id, labels in item_labels.items():
        counter = Counter(labels)
        aggregated[item_id] = counter.most_common(1)[0][0]
    return aggregated
```

Computing inter-annotator agreement
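```python
import json

from sklearn.metrics import cohen_kappa_score


def load_annotations(path):
    """Load one annotator file into {item_id: annotations}.

    Illustrative helper: assumes the file holds a list of records with
    "id" and "annotations" keys, as in the aggregation script above.
    """
    with open(path) as f:
        return {ann["id"]: ann["annotations"] for ann in json.load(f)}


def compute_agreement(annotations_dir, scheme_name):
    """Compute Cohen's Kappa for overlapping annotations."""
    # Load annotations from two annotators
    ann1 = load_annotations(f"{annotations_dir}/user_1.json")
    ann2 = load_annotations(f"{annotations_dir}/user_2.json")
    # Find overlapping items, in a fixed order so the label lists line up
    common_ids = sorted(set(ann1) & set(ann2))
    labels1 = [ann1[item_id][scheme_name] for item_id in common_ids]
    labels2 = [ann2[item_id][scheme_name] for item_id in common_ids]
    return cohen_kappa_score(labels1, labels2)
```

Best practices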
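1. Export regularly
Set up automated exports for backup and analysis:
```python
# Add to your workflow (export_annotations is the basic export script above)
import time
from datetime import date
import schedule

def daily_export():
    export_annotations("output/", f"exports/annotations_{date.today()}.json")

schedule.every().day.at("18:00").do(daily_export)
while True:  # keep the process alive so scheduled jobs actually run
    schedule.run_pending()
    time.sleep(60)
```

2. Include metadata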
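Preserve context in your exports:
```python
from datetime import datetime

# `annotations` is the list of annotation records being exported
export_data = {
    "metadata": {
        "task_name": "Sentiment Analysis",
        "exported_at": datetime.now().isoformat(),
        "total_annotations": len(annotations),
        "annotators": list(set(a["annotator"] for a in annotations))
    },
    "annotations": annotations
}
```

3. Validate exports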
Check that the export is complete:
```python
import json


def validate_export(export_file, original_count):
    with open(export_file) as f:
        exported = json.load(f)
    assert len(exported) == original_count, "Missing annotations"
    assert all("id" in a for a in exported), "Missing IDs"
    print(f"Export validated: {len(exported)} annotations")
```

4. Version your exports
Use timestamps or version numbers:
```text
exports/
  annotations_v1_2024-01-15.json
  annotations_v2_2024-01-20.json
  annotations_final_2024-01-25.json
```
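
A naming scheme like this is easy to generate rather than type by hand. Here is a minimal sketch that follows the pattern in the listing; `versioned_export_path` is a hypothetical helper, and the `annotations_v<N>_<date>.json` pattern is simply taken from the file names above.

```python
from datetime import date
from pathlib import Path


def versioned_export_path(export_dir, version, day=None):
    """Build a path such as exports/annotations_v1_2024-01-15.json."""
    day = day or date.today()  # default to today's date
    return Path(export_dir) / f"annotations_v{version}_{day.isoformat()}.json"


print(versioned_export_path("exports", 1, date(2024, 1, 15)))
```

Using ISO dates keeps the files in chronological order when they are sorted by name.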
Integration examples

Training a HuggingFace model
```python
import json

from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer

# Load exported data
with open("aggregated_annotations.json") as f:
    data = json.load(f)

# Sequence classification expects integer labels; the label names here are
# an assumed example and should match your annotation scheme
label2id = {"negative": 0, "neutral": 1, "positive": 2}

# Create dataset
dataset = Dataset.from_list([
    {"text": item["text"], "label": label2id[item["sentiment"]]}
    for item in data
])

# Train model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
# ... continue with training
```

Training spaCy NER
```python
import json

import spacy
from spacy.tokens import DocBin

# Load exported spans
with open("ner_annotations.json") as f:
    data = json.load(f)

nlp = spacy.blank("en")
doc_bin = DocBin()
for item in data:
    doc = nlp.make_doc(item["text"])
    ents = []
    for span in item["entities"]:
        # char_span returns None if the offsets do not align with tokens
        ent = doc.char_span(span["start"], span["end"], label=span["label"])
        if ent is not None:
            ents.append(ent)
    doc.ents = ents
    doc_bin.add(doc)
doc_bin.to_disk("./train.spacy")
```

YOLO training
```bash
# After exporting to YOLO format
yolo train data=dataset.yaml model=yolov8n.pt epochs=100
```
dataset.yaml:
```yaml
train: ./images/train
val: ./images/val
nc: 3
names: ['person', 'car', 'dog']
```
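
Before starting a training run, it can help to check that the label files agree with the class count declared in dataset.yaml. The sketch below checks a single label line; `check_yolo_line` is a hypothetical helper, and the rules it applies (five fields, class ID below `nc`, coordinates within 0 to 1) follow the label layout described in the YOLO format section.

```python
def check_yolo_line(line, num_classes):
    """Return a list of problems found in one YOLO label line (empty if none)."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    try:
        class_id = int(parts[0])
        coords = [float(p) for p in parts[1:]]
    except ValueError:
        return ["non-numeric field"]
    problems = []
    if not 0 <= class_id < num_classes:
        problems.append(f"class id {class_id} outside 0..{num_classes - 1}")
    if any(not 0.0 <= c <= 1.0 for c in coords):
        problems.append("coordinate outside [0, 1]")
    return problems


print(check_yolo_line("0 0.5 0.5 0.3 0.4", num_classes=3))
print(check_yolo_line("5 0.5 0.5 1.3 0.4", num_classes=3))
```

Running this over every line of every label file, with `num_classes` set to the `nc` value, catches class IDs that the `names` list cannot resolve.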