Guides4 min read
将标注导出到 Hugging Face Datasets
如何将 Potato 标注转换为 Hugging Face 数据集格式,用于模型训练和数据集共享。
Potato Team
Hugging Face Datasets 是共享和加载机器学习数据集的标准格式。本指南展示如何使用 Python 脚本将 Potato 标注转换为模型训练和数据集共享的格式。
为什么选择 Hugging Face 格式?
- 标准格式:兼容所有 HF 工具
- 高效存储:Arrow 格式实现快速加载
- 便于分享:直接推送到 Hub
- 训练就绪:与 Transformers 直接集成
使用 Python 进行基本导出
Potato 以 JSONL 格式保存标注。你可以使用 datasets 库将其转换为 Hugging Face 数据集。
加载 Potato 标注
python
import json
from datasets import Dataset
# Load Potato annotation output
annotations = []
with open("annotation_output/annotated_instances.jsonl", "r") as f:
for line in f:
annotations.append(json.loads(line))
# Convert to Hugging Face Dataset
dataset = Dataset.from_list([
{
"text": ann["text"],
"label": ann["label_annotations"]["sentiment"]["label"]
}
for ann in annotations
])
# Save locally
dataset.save_to_disk("my_dataset")
# Or push to Hub
dataset.push_to_hub("username/my-dataset")创建训练/测试分割
python
from sklearn.model_selection import train_test_split
# Split annotations
train_data, temp_data = train_test_split(annotations, test_size=0.2, random_state=42)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)
# Create datasets
train_dataset = Dataset.from_list(train_data)
val_dataset = Dataset.from_list(val_data)
test_dataset = Dataset.from_list(test_data)
# Combine into DatasetDict
from datasets import DatasetDict
dataset = DatasetDict({
"train": train_dataset,
"validation": val_dataset,
"test": test_dataset
})特定任务导出
文本分类
python
from datasets import Dataset, ClassLabel
# Load and process sentiment annotations
dataset = Dataset.from_dict({
"text": [ann["text"] for ann in annotations],
"label": [ann["label_annotations"]["sentiment"]["label"] for ann in annotations]
})
# Define label mapping
dataset = dataset.cast_column(
"label",
ClassLabel(names=["Positive", "Negative", "Neutral"])
)命名实体识别
python
# Convert span annotations to IOB format
def convert_to_iob(text, spans):
tokens = text.split()
labels = ["O"] * len(tokens)
for span in spans:
# Map character offsets to token indices
start_token, end_token = char_to_token(text, span["start"], span["end"])
labels[start_token] = f"B-{span['annotation']}"
for i in range(start_token + 1, end_token):
labels[i] = f"I-{span['annotation']}"
return tokens, labels
# Potato stores span annotations in span_annotations field
dataset = Dataset.from_dict({
"tokens": [convert_to_iob(a["text"], a.get("span_annotations", {}).get("entities", []))[0] for a in annotations],
"ner_tags": [convert_to_iob(a["text"], a.get("span_annotations", {}).get("entities", []))[1] for a in annotations]
})音频分类
python
from datasets import Audio
# For audio annotation tasks
dataset = Dataset.from_dict({
"audio": [ann["audio"] for ann in annotations],
"label": [ann["label_annotations"]["emotion"]["label"] for ann in annotations]
})
# Cast to Audio feature
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))图像分类
python
from datasets import Image
# For image annotation tasks
dataset = Dataset.from_dict({
"image": [ann["image"] for ann in annotations],
"label": [ann["label_annotations"]["category"]["label"] for ann in annotations]
})
dataset = dataset.cast_column("image", Image())多标注者导出
当每个项目有多个标注者时,可以使用不同的格式导出:
python
# Long format (one row per annotation)
# Each annotator's work is saved in a separate file: annotator_{id}.jsonl
import glob
records = []
for filepath in glob.glob("annotation_output/annotator_*.jsonl"):
annotator_id = filepath.split("_")[-1].replace(".jsonl", "")
with open(filepath) as f:
for line in f:
ann = json.loads(line)
records.append({
"id": ann["id"],
"text": ann["text"],
"label": ann["label_annotations"]["sentiment"]["label"],
"annotator": annotator_id
})
dataset = Dataset.from_list(records)
# Or aggregate annotations per item
from collections import defaultdict
from statistics import mode
items = defaultdict(list)
for record in records:
items[record["id"]].append(record)
aggregated = []
for item_id, anns in items.items():
labels = [a["label"] for a in anns]
aggregated.append({
"id": item_id,
"text": anns[0]["text"],
"label": mode(labels), # Majority vote
"num_annotators": len(labels)
})
dataset = Dataset.from_list(aggregated)推送到 Hugging Face Hub
python
from huggingface_hub import login
# Login (or use HF_TOKEN env var)
login()
# Push dataset
dataset.push_to_hub(
"username/my-sentiment-dataset",
private=False,
token=None # Uses cached token
)
# With dataset card
dataset.push_to_hub(
"username/my-sentiment-dataset",
commit_message="Initial upload of sentiment annotations",
)数据集卡片
为你的数据集创建 README.md:
markdown
---
license: cc-by-4.0
task_categories:
- text-classification
language:
- en
size_categories:
- 1K<n<10K
---
# My Sentiment Dataset
## Dataset Description
Sentiment annotations collected using [Potato](https://potato.iro.umich.edu).
## Dataset Structure
- **train**: 8,000 examples
- **validation**: 1,000 examples
- **test**: 1,000 examples
### Labels
- Positive
- Negative
- Neutral
## Annotation Process
Annotated by 3 workers per item on Prolific.
Inter-annotator agreement (Fleiss' Kappa): 0.75
## Citation
@article{...}加载你的数据集
python
from datasets import load_dataset
# From Hub
dataset = load_dataset("username/my-sentiment-dataset")
# From local
dataset = load_dataset("my_dataset/")
# Use for training
from transformers import Trainer
trainer = Trainer(
model=model,
train_dataset=dataset["train"],
eval_dataset=dataset["validation"],
...
)最佳实践
- 包含元数据:来源、标注过程、一致性信息
- 记录标签:清晰的标签定义
- 版本化数据集:跟踪随时间的变化
- 添加引用:注明标注方法
- 明确许可:指定使用条款
完整导出文档请参阅 /docs/core-concepts/data-formats。