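If you are training a model, your data will most likely need to end up as a Hugging Face dataset before long, because that is the format the rest of the ecosystem expects. This guide shows how to convert Potato's annotation output into that format with a few short Python scripts, whether you train locally or share the dataset on the Hub.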

Why bother with the Hugging Face format

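The tools already understand this format, so you do not have to write glue code for each library. Datasets are stored as Arrow, so they load quickly even as they grow. Sharing takes a single push_to_hub call. And Trainer accepts these datasets directly, so no separate format-conversion step is needed before training (you still tokenize the text as usual).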

Basic export with Python

Potato writes annotations as JSONL. The datasets library converts them into a Hugging Face dataset.

Loading Potato annotations

python

import json
from datasets import Dataset
 
# Load Potato annotation output
annotations = []
with open("annotation_output/annotated_instances.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        if line.strip():  # skip blank lines
            annotations.append(json.loads(line))
 
# Convert to Hugging Face Dataset
dataset = Dataset.from_list([
    {
        "text": ann["text"],
        "label": ann["label_annotations"]["sentiment"]["label"]
    }
    for ann in annotations
])
 
# Save locally
dataset.save_to_disk("my_dataset")
 
# Or push to Hub
dataset.push_to_hub("username/my-dataset")
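
Real annotation output often contains instances that nobody finished labeling, and the list comprehension above raises a KeyError on them. A minimal sketch, assuming the same label_annotations / sentiment / label layout used above (the helper name extract_rows is hypothetical), that keeps only complete records and counts the rest:

```python
import json

def extract_rows(lines, scheme="sentiment"):
    """Return (rows, skipped) from an iterable of JSONL lines.

    Assumes each record looks like the ones in this guide:
    {"text": ..., "label_annotations": {scheme: {"label": ...}}}.
    """
    rows, skipped = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        label = record.get("label_annotations", {}).get(scheme, {}).get("label")
        text = record.get("text")
        if text is None or label is None:
            skipped += 1  # unfinished or malformed instance
            continue
        rows.append({"text": text, "label": label})
    return rows, skipped


sample = [
    '{"text": "Great product", "label_annotations": {"sentiment": {"label": "Positive"}}}',
    '{"text": "Not annotated yet"}',
    '',
]
rows, skipped = extract_rows(sample)
print(rows, skipped)
```

The resulting rows list can be passed to Dataset.from_list in place of the inline comprehension.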

Creating train/validation/test splits

python

from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split

# Hold out 20%, then halve it: 80% train, 10% validation, 10% test
train_data, temp_data = train_test_split(annotations, test_size=0.2, random_state=42)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)

# Create datasets
train_dataset = Dataset.from_list(train_data)
val_dataset = Dataset.from_list(val_data)
test_dataset = Dataset.from_list(test_data)

# Combine into DatasetDict
dataset = DatasetDict({
    "train": train_dataset,
    "validation": val_dataset,
    "test": test_dataset
})
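
The two train_test_split calls hold out 20% and then halve it, which gives roughly 80/10/10. If you would rather not depend on scikit-learn, the same proportions can be sketched with the standard library. Here split_80_10_10 is a hypothetical helper, and its shuffle order will not match scikit-learn's.

```python
import random

def split_80_10_10(items, seed=42):
    """Shuffle a copy of `items` and cut it into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10   # first 80%
    n_val = len(shuffled) * 9 // 10     # next 10%
    return shuffled[:n_train], shuffled[n_train:n_val], shuffled[n_val:]

train_data, val_data, test_data = split_80_10_10(range(100))
print(len(train_data), len(val_data), len(test_data))  # 80 10 10
```

A fixed seed keeps the split reproducible across runs, which matters once you publish the dataset and others compare results on the same test set.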

Task-specific exports

Text classification

python

from datasets import Dataset, ClassLabel
 
# Load and process sentiment annotations
dataset = Dataset.from_dict({
    "text": [ann["text"] for ann in annotations],
    "label": [ann["label_annotations"]["sentiment"]["label"] for ann in annotations]
})
 
# Define label mapping
dataset = dataset.cast_column(
    "label",
    ClassLabel(names=["Positive", "Negative", "Neutral"])
)
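
Most classification models also need the label-to-integer mapping in their configuration. A small sketch of building it in the same order as the ClassLabel names above; the variable names are illustrative only.

```python
label_names = ["Positive", "Negative", "Neutral"]  # same order as the ClassLabel names

label2id = {name: i for i, name in enumerate(label_names)}
id2label = {i: name for name, i in label2id.items()}

raw_labels = ["Neutral", "Positive", "Negative", "Positive"]
encoded = [label2id[name] for name in raw_labels]
print(encoded)  # [2, 0, 1, 0]
```

Keeping one shared label_names list avoids the dataset and the model disagreeing about which integer means which label.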

Named entity recognition

python

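# Assumption: tokens are whitespace-separated, and spans are character offsets [start, end)
def char_to_token(text, start, end):
    """Return (first, last + 1) indices of the whitespace tokens overlapping the span, or None."""
    covered, pos = [], 0
    for i, token in enumerate(text.split()):
        tok_start = text.index(token, pos)
        pos = tok_start + len(token)
        if tok_start < end and pos > start:
            covered.append(i)
    return (covered[0], covered[-1] + 1) if covered else None

# Convert span annotations to IOB format
def convert_to_iob(text, spans):
    tokens = text.split()
    labels = ["O"] * len(tokens)

    for span in spans:
        # Map character offsets to token indices; skip spans that touch no token
        token_range = char_to_token(text, span["start"], span["end"])
        if token_range is None:
            continue
        start_token, end_token = token_range
        labels[start_token] = f"B-{span['annotation']}"
        for i in range(start_token + 1, end_token):
            labels[i] = f"I-{span['annotation']}"

    return tokens, labels

# Potato stores span annotations in span_annotations field
converted = [
    convert_to_iob(a["text"], a.get("span_annotations", {}).get("entities", []))
    for a in annotations
]
dataset = Dataset.from_dict({
    "tokens": [tokens for tokens, _ in converted],
    "ner_tags": [labels for _, labels in converted]
})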
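
Token-classification models expect integer tag ids rather than strings. A hedged sketch that derives an IOB tag vocabulary from a list of entity types and encodes one tag sequence; the entity types PER and LOC are made up for illustration and should be replaced with the ones in your annotation scheme.

```python
entity_types = ["PER", "LOC"]  # hypothetical entity types

# "O" first, then a B-/I- pair for every entity type
tag_names = ["O"] + [f"{prefix}-{etype}" for etype in entity_types for prefix in ("B", "I")]
tag2id = {tag: i for i, tag in enumerate(tag_names)}

labels = ["B-PER", "I-PER", "O", "O", "B-LOC"]
ner_tag_ids = [tag2id[tag] for tag in labels]
print(tag_names)    # ['O', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC']
print(ner_tag_ids)  # [1, 2, 0, 0, 3]
```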

Audio classification

python

from datasets import Audio, Dataset
 
# For audio annotation tasks
dataset = Dataset.from_dict({
    "audio": [ann["audio"] for ann in annotations],
    "label": [ann["label_annotations"]["emotion"]["label"] for ann in annotations]
})
 
# Cast to Audio feature
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

Image classification

python

from datasets import Dataset, Image
 
# For image annotation tasks
dataset = Dataset.from_dict({
    "image": [ann["image"] for ann in annotations],
    "label": [ann["label_annotations"]["category"]["label"] for ann in annotations]
})
 
dataset = dataset.cast_column("image", Image())

Multi-annotator exports

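When each item has several annotators, you can export the data in more than one shape: one row per annotation, or one aggregated row per item.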

python

# Long format (one row per annotation)
# Each annotator's work is saved in a separate file: annotator_{id}.jsonl
import glob
import json
from pathlib import Path

records = []
for filepath in glob.glob("annotation_output/annotator_*.jsonl"):
    # Strip the "annotator_" prefix so ids containing underscores survive
    annotator_id = Path(filepath).stem[len("annotator_"):]
    with open(filepath) as f:
        for line in f:
            ann = json.loads(line)
            records.append({
                "id": ann["id"],
                "text": ann["text"],
                "label": ann["label_annotations"]["sentiment"]["label"],
                "annotator": annotator_id
            })
 
dataset = Dataset.from_list(records)
 
# Or aggregate annotations per item
from collections import defaultdict
from statistics import mode
 
items = defaultdict(list)
for record in records:
    items[record["id"]].append(record)
 
aggregated = []
for item_id, anns in items.items():
    labels = [a["label"] for a in anns]
    aggregated.append({
        "id": item_id,
        "text": anns[0]["text"],
        "label": mode(labels),  # Majority vote
        "num_annotators": len(labels)
    })
 
dataset = Dataset.from_list(aggregated)
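
Note that statistics.mode returns the first label it encountered when two labels tie (Python 3.8 and later; earlier versions raise StatisticsError). A sketch of a tie-aware vote using collections.Counter follows. The helper majority_vote and the choice to return None on ties are assumptions for illustration, not Potato behavior, and the function expects a non-empty list.

```python
from collections import Counter

def majority_vote(labels):
    """Return (label, agreement), with label=None when the top count is tied."""
    counts = Counter(labels).most_common()
    top_label, top_count = counts[0]
    tied = len(counts) > 1 and counts[1][1] == top_count
    agreement = top_count / len(labels)  # share of annotators behind the top label
    return (None if tied else top_label), agreement

print(majority_vote(["Positive", "Positive", "Neutral"]))
print(majority_vote(["Positive", "Negative"]))  # (None, 0.5)
```

Items that come back as None can be routed to adjudication or dropped, depending on how strict the final dataset needs to be.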

Pushing to the Hugging Face Hub

python

from huggingface_hub import login
 
# Login (or use HF_TOKEN env var)
login()
 
# Push dataset
dataset.push_to_hub(
    "username/my-sentiment-dataset",
    private=False,
    token=None  # Uses cached token
)
 
# With a custom commit message
dataset.push_to_hub(
    "username/my-sentiment-dataset",
    commit_message="Initial upload of sentiment annotations",
)

Dataset card

Create a README.md for the dataset.

markdown

---
license: cc-by-4.0
task_categories:
  - text-classification
language:
  - en
size_categories:
  - 1K<n<10K
---
 
# My Sentiment Dataset
 
## Dataset Description
 
Sentiment annotations collected using [Potato](https://potato.iro.umich.edu).
 
## Dataset Structure
 
- **train**: 8,000 examples
- **validation**: 1,000 examples
- **test**: 1,000 examples
 
### Labels
 
- Positive
- Negative
- Neutral
 
## Annotation Process
 
Annotated by 3 workers per item on Prolific.
Inter-annotator agreement (Fleiss' Kappa): 0.75
 
## Citation
 
@article{...}
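
The split sizes and label list in a card go stale easily when they are written by hand. A minimal sketch that renders those two sections from plain Python values; render_card_sections is a hypothetical helper, and the numbers are the ones from the example card above.

```python
def render_card_sections(split_sizes, label_names):
    """Render the 'Dataset Structure' and 'Labels' sections as Markdown."""
    lines = ["## Dataset Structure", ""]
    for split, size in split_sizes.items():
        lines.append(f"- **{split}**: {size:,} examples")
    lines += ["", "### Labels", ""]
    lines += [f"- {name}" for name in label_names]
    return "\n".join(lines) + "\n"

card_body = render_card_sections(
    {"train": 8000, "validation": 1000, "test": 1000},
    ["Positive", "Negative", "Neutral"],
)
print(card_body)
```

With a real DatasetDict, the sizes could come from {name: len(split) for name, split in dataset.items()}.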

Loading the dataset

python

from datasets import load_dataset, load_from_disk

# From Hub
dataset = load_dataset("username/my-sentiment-dataset")

# From local (a directory written by save_to_disk)
dataset = load_from_disk("my_dataset")
 
# Use for training (model is assumed to be defined already; tokenize the text column first)
from transformers import Trainer
 
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    ...
)

A few habits worth keeping

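Write down where the data came from, how it was annotated, and what agreement you measured, because people who reuse the dataset will ask. Define each label in plain language rather than assuming its name is self-explanatory. Version the dataset so that changes between releases are traceable. Cite the source of your annotation methodology. And state the license clearly from the start, so nobody has to guess whether they may use the data.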
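
For the agreement figure, Fleiss' kappa is the statistic quoted in the example card. A minimal standard-library sketch, assuming every item has the same number of ratings (at least two); the function name fleiss_kappa and the decision to return 1.0 when only one label ever appears are choices made here for illustration.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item label lists.

    Assumes every item was labeled by the same number of annotators (>= 2).
    """
    n = len(ratings[0])
    if n < 2 or any(len(item) != n for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    totals = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Share of annotator pairs that agree on this item
        per_item_agreement.append(
            (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
        )
    observed = sum(per_item_agreement) / len(ratings)
    # Chance agreement from the overall label proportions
    expected = sum((c / (len(ratings) * n)) ** 2 for c in totals.values())
    if expected == 1:
        return 1.0  # only one label ever used; kappa is otherwise undefined
    return (observed - expected) / (1 - expected)

perfect = fleiss_kappa([["Positive"] * 3, ["Negative"] * 3])
print(perfect)  # 1.0
```

The per-item lists can be built from the long-format records by grouping labels on the item id.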

For the export options that Potato provides directly, see the Hugging Face export documentation.

The full export documentation is at /docs/core-concepts/data-formats.