# Exporting Annotations to Hugging Face Datasets

Source: https://www.potatoannotator.com/blog/exporting-to-huggingface

If you are training models, the data probably needs to end up as a Hugging Face dataset sooner or later. It is what the rest of the ecosystem expects. This guide shows how to turn Potato's annotation output into that format with a few short Python scripts, whether you are training locally or sharing the dataset on the Hub.

## Why bother with the Hugging Face format

It is the format the tooling already speaks, so you don't write glue code for every library. Datasets are stored as Arrow and memory-mapped from disk, which keeps loading fast even when they get large. Sharing is a single `push_to_hub` call. And `Trainer` reads it directly, so there is no extra conversion step before training.

## Basic export with Python

Potato writes annotations as JSONL. The `datasets` library turns that into a Hugging Face dataset.

### Loading Potato annotations

```python
import json
from datasets import Dataset

# Load Potato annotation output
annotations = []
with open("annotation_output/annotated_instances.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        annotations.append(json.loads(line))

# Convert to Hugging Face Dataset
dataset = Dataset.from_list([
    {
        "text": ann["text"],
        "label": ann["label_annotations"]["sentiment"]["label"]
    }
    for ann in annotations
])

# Save locally
dataset.save_to_disk("my_dataset")

# Or push to Hub
dataset.push_to_hub("username/my-dataset")
```
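
The list comprehension above raises `KeyError` on any record that has no `sentiment` label, for example an item an annotator skipped. Below is a minimal sketch of a more defensive loader. It assumes the same record layout as above (`text` plus `label_annotations[scheme]["label"]`); the names `flatten_record` and `load_flat_records` and the `scheme` parameter are illustrative and not part of Potato.

```python
import json


def flatten_record(ann, scheme="sentiment"):
    """Return {"text", "label"} for one record, or None if it is incomplete.

    Assumes the layout used above: ann["label_annotations"][scheme]["label"].
    """
    text = ann.get("text")
    label_info = (ann.get("label_annotations") or {}).get(scheme) or {}
    label = label_info.get("label")
    if text is None or label is None:
        return None
    return {"text": text, "label": label}


def load_flat_records(lines, scheme="sentiment"):
    """Parse JSONL lines, skipping blank lines and incomplete records."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = flatten_record(json.loads(line), scheme)
        if row is not None:
            rows.append(row)
    return rows
```

Passing an open file handle as `lines` works because files iterate line by line; the returned list can go straight into `Dataset.from_list`.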

### Creating Train/Test Splits

```python
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split

# Split annotations 80/10/10: hold out 20%, then halve it into validation and test
train_data, temp_data = train_test_split(annotations, test_size=0.2, random_state=42)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)

# Create datasets from the raw annotation records
train_dataset = Dataset.from_list(train_data)
val_dataset = Dataset.from_list(val_data)
test_dataset = Dataset.from_list(test_data)

# Combine into DatasetDict
dataset = DatasetDict({
    "train": train_dataset,
    "validation": val_dataset,
    "test": test_dataset
})
```
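
A seeded random split is reproducible only as long as the input list stays the same; once more annotations arrive, items can move between splits. If the same item appears in several rows, its copies can also land on both sides. One alternative is to derive the split from a hash of the item id. The sketch below assumes each record carries an `id` field, as the multi-annotator records later in this guide do. The names `assign_split` and `split_by_id` are hypothetical, and the resulting fractions are approximate rather than exact.

```python
import hashlib


def assign_split(item_id, val_fraction=0.1, test_fraction=0.1):
    """Map an item id to "train", "validation", or "test" deterministically."""
    digest = hashlib.sha256(str(item_id).encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "validation"
    return "train"


def split_by_id(records, id_key="id"):
    """Group records into splits so all rows of one item share a split."""
    splits = {"train": [], "validation": [], "test": []}
    for record in records:
        splits[assign_split(record[id_key])].append(record)
    return splits
```

Each list in the returned dictionary can be passed to `Dataset.from_list` and combined into a `DatasetDict` exactly as above.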

## Task-Specific Exports

### Text Classification

```python
from datasets import Dataset, ClassLabel

# Load and process sentiment annotations
dataset = Dataset.from_dict({
    "text": [ann["text"] for ann in annotations],
    "label": [ann["label_annotations"]["sentiment"]["label"] for ann in annotations]
})

# Define label mapping
dataset = dataset.cast_column(
    "label",
    ClassLabel(names=["Positive", "Negative", "Neutral"])
)
```
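
`ClassLabel` assigns integer ids in the order the names are listed, and the cast is expected to fail if a row holds a string outside that list. The sketch below builds the matching `label2id` and `id2label` dictionaries (the form model configs in `transformers` typically accept) and checks for stray label strings before casting. `find_unknown_labels` is an illustrative helper, not a library function.

```python
# Same names, same order as the ClassLabel definition above
LABEL_NAMES = ["Positive", "Negative", "Neutral"]

label2id = {name: i for i, name in enumerate(LABEL_NAMES)}
id2label = {i: name for name, i in label2id.items()}


def find_unknown_labels(labels, known=LABEL_NAMES):
    """Return the sorted label strings that are not in the declared names."""
    return sorted(set(labels) - set(known))
```

Running `find_unknown_labels(dataset["label"])` before `cast_column` surfaces casing differences or typos such as `neutral` while the column is still plain strings.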

### Named Entity Recognition

```python
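from datasets import Dataset

# Illustrative helper (not part of Potato): map a character span to token indices.
# Assumes whitespace tokenization and an exclusive span end offset.
def char_to_token(text, start, end):
    overlapping, pos = [], 0
    for i, token in enumerate(text.split()):
        tok_start = text.index(token, pos)
        pos = tok_start + len(token)
        if tok_start < end and pos > start:  # token overlaps [start, end)
            overlapping.append(i)
    return (overlapping[0], overlapping[-1] + 1) if overlapping else None

# Convert span annotations to IOB format
def convert_to_iob(text, spans):
    tokens = text.split()
    labels = ["O"] * len(tokens)

    for span in spans:
        token_range = char_to_token(text, span["start"], span["end"])
        if token_range is None:  # span covers no token, e.g. whitespace only
            continue
        start_token, end_token = token_range
        labels[start_token] = f"B-{span['annotation']}"
        for i in range(start_token + 1, end_token):
            labels[i] = f"I-{span['annotation']}"

    return tokens, labels

# Potato stores span annotations in span_annotations field
converted = [
    convert_to_iob(a["text"], a.get("span_annotations", {}).get("entities", []))
    for a in annotations
]
dataset = Dataset.from_dict({
    "tokens": [tokens for tokens, _ in converted],
    "ner_tags": [labels for _, labels in converted]
})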
```
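
The `ner_tags` column built above holds strings such as `B-PER`, while token-classification training code usually expects integer ids. Below is a small sketch of that encoding step. The helper names `build_tag_list` and `encode_tags` are hypothetical, and the entity types in the usage note are made-up examples.

```python
def build_tag_list(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the entity types, sorted by name."""
    tags = ["O"]
    for entity in sorted(set(entity_types)):
        tags.extend([f"B-{entity}", f"I-{entity}"])
    return tags


def encode_tags(tag_sequences, tag_list):
    """Replace each string tag with its index in tag_list."""
    tag2id = {tag: i for i, tag in enumerate(tag_list)}
    return [[tag2id[tag] for tag in seq] for seq in tag_sequences]
```

For entity types `PER` and `LOC`, `build_tag_list` yields `["O", "B-LOC", "I-LOC", "B-PER", "I-PER"]`. The same list can serve as the `names` of a `Sequence(ClassLabel(names=...))` feature so the ids stay tied to readable tag names.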

### Audio Classification

```python
from datasets import Audio, Dataset

# For audio annotation tasks
dataset = Dataset.from_dict({
    "audio": [ann["audio"] for ann in annotations],
    "label": [ann["label_annotations"]["emotion"]["label"] for ann in annotations]
})

# Cast to Audio feature
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
```

### Image Classification

```python
from datasets import Dataset, Image

# For image annotation tasks
dataset = Dataset.from_dict({
    "image": [ann["image"] for ann in annotations],
    "label": [ann["label_annotations"]["category"]["label"] for ann in annotations]
})

dataset = dataset.cast_column("image", Image())
```
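
For both audio and image columns, decoding generally happens when a row is accessed rather than at cast time, so a wrong path may only surface in the middle of training. The sketch below checks the files up front. It assumes the `audio` or `image` field holds a file path, possibly relative to some base directory; `split_by_file_presence` is an illustrative name.

```python
from pathlib import Path


def split_by_file_presence(records, path_key, base_dir="."):
    """Partition records into (present, missing) by whether their file exists."""
    present, missing = [], []
    for record in records:
        # An absolute path in the record overrides base_dir.
        path = Path(base_dir) / record[path_key]
        (present if path.is_file() else missing).append(record)
    return present, missing
```

Calling `split_by_file_presence(annotations, "image")` before building the dataset gives a list of records to fix or drop.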

## Multi-Annotator Export

When several annotators label each item, you can export either one row per annotation (long format) or one aggregated row per item:

```python
import glob
import json
from collections import defaultdict
from pathlib import Path
from statistics import mode

from datasets import Dataset

# Long format (one row per annotation)
# Each annotator's work is saved in a separate file: annotator_{id}.jsonl
records = []
for filepath in glob.glob("annotation_output/annotator_*.jsonl"):
    # Keep everything after the "annotator_" prefix, so ids containing "_" survive
    annotator_id = Path(filepath).stem.split("annotator_", 1)[1]
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            ann = json.loads(line)
            records.append({
                "id": ann["id"],
                "text": ann["text"],
                "label": ann["label_annotations"]["sentiment"]["label"],
                "annotator": annotator_id
            })

dataset = Dataset.from_list(records)

# Or aggregate annotations per item
items = defaultdict(list)
for record in records:
    items[record["id"]].append(record)

aggregated = []
for item_id, anns in items.items():
    labels = [a["label"] for a in anns]
    aggregated.append({
        "id": item_id,
        "text": anns[0]["text"],
        "label": mode(labels),  # Majority vote; ties go to the first label seen (Python 3.8+)
        "num_annotators": len(labels)
    })

dataset = Dataset.from_list(aggregated)
```
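
`statistics.mode` resolves ties silently: since Python 3.8 it returns the first mode encountered, and earlier versions raised `StatisticsError`. With an even number of annotators it can be useful to make ties explicit so they can be reviewed or dropped. A minimal sketch follows; `majority_vote` is an illustrative helper, and returning `None` on a tie is one possible policy rather than a convention of Potato or `datasets`.

```python
from collections import Counter


def majority_vote(labels):
    """Return (label, is_tie); the label is None when the top count is shared."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels).most_common()
    top_label, top_count = counts[0]
    tied = len(counts) > 1 and counts[1][1] == top_count
    return (None, True) if tied else (top_label, False)
```

In the aggregation loop above, `label, is_tie = majority_vote(labels)` can replace `mode(labels)`, with tied items either skipped or exported under an extra `is_tie` column.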

## Pushing to Hugging Face Hub

```python
from huggingface_hub import login

# Login (or use HF_TOKEN env var)
login()

# Push dataset
dataset.push_to_hub(
    "username/my-sentiment-dataset",
    private=False,
    token=None  # Uses cached token
)

# With a custom commit message (this does not create a dataset card)
dataset.push_to_hub(
    "username/my-sentiment-dataset",
    commit_message="Initial upload of sentiment annotations",
)
```

### Dataset Card

Create a `README.md` in the dataset repository; the Hub renders it as the dataset card:

```markdown
---
license: cc-by-4.0
task_categories:
  - text-classification
language:
  - en
size_categories:
  - 1K<n<10K
---

# My Sentiment Dataset

## Dataset Description

Sentiment annotations collected using [Potato](https://potato.iro.umich.edu).

## Dataset Structure

- **train**: 8,000 examples
- **validation**: 1,000 examples
- **test**: 1,000 examples

### Labels

- Positive
- Negative
- Neutral

## Annotation Process

Annotated by 3 workers per item on Prolific.
Inter-annotator agreement (Fleiss' Kappa): 0.75

## Citation

@article{...}
```
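
The split counts in a card like the one above go stale easily when typed by hand. The sketch below renders that section from the actual split sizes. `structure_section` is a hypothetical helper, and the output format simply mirrors the example card.

```python
def structure_section(split_sizes):
    """Render the '## Dataset Structure' bullet list from {split_name: count}."""
    lines = ["## Dataset Structure", ""]
    for split, count in split_sizes.items():
        lines.append(f"- **{split}**: {count:,} examples")
    return "\n".join(lines)
```

With a `DatasetDict`, the input can be built as `{name: len(split) for name, split in dataset.items()}`.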

## Loading Your Dataset

```python
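from datasets import load_dataset, load_from_disk
from transformers import Trainer

# From Hub
dataset = load_dataset("username/my-sentiment-dataset")

# From a local directory written by save_to_disk
dataset = load_from_disk("my_dataset")

# Use for training (`model` is defined elsewhere; `dataset` must be a
# DatasetDict with "train" and "validation" splits)
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    # ... remaining Trainer arguments
)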
```

## A few habits worth keeping

Write down where the data came from, how it was annotated, and what agreement you measured, since the people who reuse the dataset will ask. Define each label in plain words rather than assuming the name is self-explanatory. Version the dataset so you can tell what changed between releases. Credit the annotation methodology. And state the license up front, so nobody has to guess whether they're allowed to use it.
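
For the agreement figure, one common choice when every item has the same number of annotators is Fleiss' kappa. The sketch below is a plain-Python version of the standard formula, offered as a starting point rather than a vetted statistics routine. `fleiss_kappa` is an illustrative function, and it assumes a list of per-item label lists of equal length.

```python
from collections import Counter


def fleiss_kappa(ratings_per_item):
    """Fleiss' kappa for per-item label lists with equal numbers of ratings."""
    n = len(ratings_per_item[0])  # ratings per item
    if n < 2 or any(len(r) != n for r in ratings_per_item):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    num_items = len(ratings_per_item)
    category_totals = Counter()
    agreement_sum = 0.0
    for ratings in ratings_per_item:
        counts = Counter(ratings)
        category_totals.update(counts)
        # Share of agreeing rater pairs for this item
        agreement_sum += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
    p_bar = agreement_sum / num_items  # observed agreement
    # Agreement expected by chance from the overall category proportions
    p_e = sum((t / (num_items * n)) ** 2 for t in category_totals.values())
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings use one category")
    return (p_bar - p_e) / (1 - p_e)
```

The input can be built from the `items` dictionary in the multi-annotator section as `[[a["label"] for a in anns] for anns in items.values()]`, after dropping items that have fewer ratings than the rest.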

For the export options Potato provides directly, see the [Hugging Face export documentation](https://github.com/davidjurgens/potato/blob/master/docs/data-export/huggingface_export.md).

---

*Full export documentation at [/docs/core-concepts/data-formats](/docs/core-concepts/data-formats).*
