Hugging Face Datasets 是共享和加载机器学习数据集的标准格式。本指南展示如何使用 Python 脚本将 Potato 标注转换为模型训练和数据集共享的格式。

为什么选择 Hugging Face 格式？

标准格式：兼容所有 HF 工具
高效存储：Arrow 格式实现快速加载
便于分享：直接推送到 Hub
训练就绪：与 Transformers 直接集成

使用 Python 进行基本导出

Potato 以 JSONL 格式保存标注。你可以使用 datasets 库将其转换为 Hugging Face 数据集。

加载 Potato 标注

python

import json
from datasets import Dataset
 
# Load Potato annotation output
annotations = []
with open("annotation_output/annotated_instances.jsonl", "r") as f:
    for line in f:
        annotations.append(json.loads(line))
 
# Convert to Hugging Face Dataset
dataset = Dataset.from_list([
    {
        "text": ann["text"],
        "label": ann["label_annotations"]["sentiment"]["label"]
    }
    for ann in annotations
])
 
# Save locally
dataset.save_to_disk("my_dataset")
 
# Or push to Hub
dataset.push_to_hub("username/my-dataset")

创建训练/测试分割

python

from sklearn.model_selection import train_test_split
 
# Split annotations
train_data, temp_data = train_test_split(annotations, test_size=0.2, random_state=42)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)
 
# Create datasets
train_dataset = Dataset.from_list(train_data)
val_dataset = Dataset.from_list(val_data)
test_dataset = Dataset.from_list(test_data)
 
# Combine into DatasetDict
from datasets import DatasetDict
dataset = DatasetDict({
    "train": train_dataset,
    "validation": val_dataset,
    "test": test_dataset
})

特定任务导出

文本分类

python

from datasets import Dataset, ClassLabel
 
# Load and process sentiment annotations
dataset = Dataset.from_dict({
    "text": [ann["text"] for ann in annotations],
    "label": [ann["label_annotations"]["sentiment"]["label"] for ann in annotations]
})
 
# Define label mapping
dataset = dataset.cast_column(
    "label",
    ClassLabel(names=["Positive", "Negative", "Neutral"])
)

命名实体识别

python

# Convert span annotations to IOB format
def convert_to_iob(text, spans):
    tokens = text.split()
    labels = ["O"] * len(tokens)
 
    for span in spans:
        # Map character offsets to token indices
        start_token, end_token = char_to_token(text, span["start"], span["end"])
        labels[start_token] = f"B-{span['annotation']}"
        for i in range(start_token + 1, end_token):
            labels[i] = f"I-{span['annotation']}"
 
    return tokens, labels
 
# Potato stores span annotations in span_annotations field
dataset = Dataset.from_dict({
    "tokens": [convert_to_iob(a["text"], a.get("span_annotations", {}).get("entities", []))[0] for a in annotations],
    "ner_tags": [convert_to_iob(a["text"], a.get("span_annotations", {}).get("entities", []))[1] for a in annotations]
})

音频分类

python

from datasets import Audio
 
# For audio annotation tasks
dataset = Dataset.from_dict({
    "audio": [ann["audio"] for ann in annotations],
    "label": [ann["label_annotations"]["emotion"]["label"] for ann in annotations]
})
 
# Cast to Audio feature
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

图像分类

python

from datasets import Image
 
# For image annotation tasks
dataset = Dataset.from_dict({
    "image": [ann["image"] for ann in annotations],
    "label": [ann["label_annotations"]["category"]["label"] for ann in annotations]
})
 
dataset = dataset.cast_column("image", Image())

多标注者导出

当每个项目有多个标注者时，可以使用不同的格式导出：

python

# Long format (one row per annotation)
# Each annotator's work is saved in a separate file: annotator_{id}.jsonl
import glob
 
records = []
for filepath in glob.glob("annotation_output/annotator_*.jsonl"):
    annotator_id = filepath.split("_")[-1].replace(".jsonl", "")
    with open(filepath) as f:
        for line in f:
            ann = json.loads(line)
            records.append({
                "id": ann["id"],
                "text": ann["text"],
                "label": ann["label_annotations"]["sentiment"]["label"],
                "annotator": annotator_id
            })
 
dataset = Dataset.from_list(records)
 
# Or aggregate annotations per item
from collections import defaultdict
from statistics import mode
 
items = defaultdict(list)
for record in records:
    items[record["id"]].append(record)
 
aggregated = []
for item_id, anns in items.items():
    labels = [a["label"] for a in anns]
    aggregated.append({
        "id": item_id,
        "text": anns[0]["text"],
        "label": mode(labels),  # Majority vote
        "num_annotators": len(labels)
    })
 
dataset = Dataset.from_list(aggregated)

推送到 Hugging Face Hub

python

from huggingface_hub import login
 
# Login (or use HF_TOKEN env var)
login()
 
# Push dataset
dataset.push_to_hub(
    "username/my-sentiment-dataset",
    private=False,
    token=None  # Uses cached token
)
 
# With dataset card
dataset.push_to_hub(
    "username/my-sentiment-dataset",
    commit_message="Initial upload of sentiment annotations",
)

数据集卡片

为你的数据集创建 README.md：

markdown

---
license: cc-by-4.0
task_categories:
  - text-classification
language:
  - en
size_categories:
  - 1K<n<10K
---
 
# My Sentiment Dataset
 
## Dataset Description
 
Sentiment annotations collected using [Potato](https://potato.iro.umich.edu).
 
## Dataset Structure
 
- **train**: 8,000 examples
- **validation**: 1,000 examples
- **test**: 1,000 examples
 
### Labels
 
- Positive
- Negative
- Neutral
 
## Annotation Process
 
Annotated by 3 workers per item on Prolific.
Inter-annotator agreement (Fleiss' Kappa): 0.75
 
## Citation
 
@article{...}

加载你的数据集

python

from datasets import load_dataset
 
# From Hub
dataset = load_dataset("username/my-sentiment-dataset")
 
# From local
dataset = load_dataset("my_dataset/")
 
# Use for training
from transformers import Trainer
 
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    ...
)

最佳实践

包含元数据：来源、标注过程、一致性信息
记录标签：清晰的标签定义
版本化数据集：跟踪随时间的变化
添加引用：注明标注方法
明确许可：指定使用条款

完整导出文档请参阅 /docs/core-concepts/data-formats。