# Parquet Export

Export annotations to Apache Parquet format for efficient large-scale data processing.

*New in v2.3.0*
Apache Parquet is a columnar storage format optimized for analytical workloads. It offers significant advantages over JSON and CSV for large annotation datasets: smaller file sizes (typically 5-10x compression), faster reads for column-subset queries, and native support in virtually every data science tool (pandas, DuckDB, PyArrow, Spark, Polars, Hugging Face Datasets).
Potato can export annotations directly to Parquet format, producing three structured files that cover all annotation types.
## Enabling Parquet Export

### As Primary Output Format

```yaml
output_annotation_dir: "output/"
output_annotation_format: "parquet"
```

### As Secondary Export (Keep JSON Primary)

```yaml
output_annotation_dir: "output/"
output_annotation_format: "jsonl"

parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  auto_export: true  # export after each annotation session
```

### On-Demand via CLI

```bash
python -m potato.export parquet --config config.yaml --output ./parquet_output/
```

## Output Files
Parquet export produces three files, each representing a different level of the annotation data.
### 1. annotations.parquet
The primary output file. One row per (instance, annotator, schema) combination.
| Column | Type | Description |
|---|---|---|
| `instance_id` | string | Instance identifier |
| `annotator` | string | Annotator username |
| `schema_name` | string | Annotation schema name |
| `value` | string | Annotation value (JSON-encoded for complex types) |
| `timestamp` | timestamp | When the annotation was created |
| `duration_ms` | int64 | Time spent on this instance (milliseconds) |
| `session_id` | string | Annotation session identifier |
For simple annotation types (`radio`, `likert`, `text`), `value` contains the raw value. For complex types (`multiselect`, spans, events), `value` contains a JSON string.
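A sketch of handling that mixed encoding in pandas (the sample rows and the `decode_value` helper are invented for illustration; column names follow the table above):

```python
import json

import pandas as pd

# Invented sample rows mirroring the annotations.parquet columns
annotations = pd.DataFrame({
    "instance_id": ["doc_1", "doc_2"],
    "schema_name": ["sentiment", "topics"],
    "value": ["positive", '["sports", "politics"]'],
})

def decode_value(raw):
    """Parse JSON for complex types; fall through for plain strings.

    Note: purely numeric strings also parse as JSON, so check the
    schema type first if that distinction matters for your data.
    """
    try:
        return json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return raw

annotations["decoded"] = annotations["value"].map(decode_value)
print(annotations["decoded"].tolist())  # → ['positive', ['sports', 'politics']]
```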
### 2. spans.parquet

Produced for span-based annotation types (`span`, `span_link`, `event_annotation`, `coreference`). One row per annotated span.
| Column | Type | Description |
|---|---|---|
| `instance_id` | string | Instance identifier |
| `annotator` | string | Annotator username |
| `schema_name` | string | Annotation schema name |
| `span_id` | string | Unique span identifier |
| `text` | string | Span text content |
| `start_offset` | int32 | Character start offset |
| `end_offset` | int32 | Character end offset |
| `label` | string | Span label |
| `field` | string | Source field (for multi-field span annotation) |
| `links` | string | JSON-encoded link data (for `span_link`) |
| `attributes` | string | JSON-encoded additional attributes |
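One useful consistency check with these columns is that each span's offsets slice back to its stored text. A minimal sketch with invented sample rows (real data would come from `spans.parquet` and `items.parquet`):

```python
import pandas as pd

# Invented sample rows using the spans.parquet / items.parquet columns
items = pd.DataFrame({
    "instance_id": ["doc_1"],
    "text": ["Alice works at Acme Corp."],
})
spans = pd.DataFrame({
    "instance_id": ["doc_1", "doc_1"],
    "text": ["Alice", "Acme Corp"],
    "start_offset": [0, 15],
    "end_offset": [5, 24],
    "label": ["PERSON", "ORGANIZATION"],
})

# Join each span to its source document text, then check every offset pair
merged = spans.merge(items, on="instance_id", suffixes=("_span", "_item"))
merged["ok"] = [
    row.text_item[row.start_offset:row.end_offset] == row.text_span
    for row in merged.itertuples()
]
print(merged["ok"].all())  # → True
```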
### 3. items.parquet
Metadata about each instance in the dataset. One row per instance.
| Column | Type | Description |
|---|---|---|
| `instance_id` | string | Instance identifier |
| `text` | string | Primary text content |
| `annotation_count` | int32 | Number of annotations received |
| `annotators` | string | JSON list of annotator usernames |
| `status` | string | Instance status (pending, in_progress, complete) |
| `metadata` | string | JSON-encoded instance metadata |
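Since `annotators` is a JSON-encoded list, it needs a decode step before analysis. A sketch of a quick progress summary (sample rows invented; columns follow the table above):

```python
import json

import pandas as pd

# Invented sample rows mirroring the items.parquet columns
items = pd.DataFrame({
    "instance_id": ["doc_1", "doc_2", "doc_3"],
    "annotation_count": [2, 1, 0],
    "annotators": ['["alice", "bob"]', '["alice"]', "[]"],
    "status": ["complete", "in_progress", "pending"],
})

# JSON-decode the annotator lists into real Python lists
items["annotators"] = items["annotators"].map(json.loads)

# How many instances are in each status
print(items["status"].value_counts().to_dict())

# Distinct annotators who have touched any instance
print(sorted({a for lst in items["annotators"] for a in lst}))  # → ['alice', 'bob']
```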
## Compression Options

```yaml
parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  compression: snappy     # snappy (default), gzip, zstd, lz4, brotli, none
  row_group_size: 50000   # rows per row group (affects read performance)
  use_dictionary: true    # dictionary encoding for string columns
  write_statistics: true  # column statistics for query optimization
```

### Compression Comparison
| Algorithm | Compression Ratio | Write Speed | Read Speed | Best For |
|---|---|---|---|---|
| `snappy` | Moderate | Fast | Fast | General use (default) |
| `gzip` | High | Slow | Moderate | Archival, small files |
| `zstd` | High | Fast | Fast | Best balance of size and speed |
| `lz4` | Low | Very Fast | Very Fast | Speed-critical workloads |
| `brotli` | Very High | Very Slow | Moderate | Maximum compression |
| `none` | None | Fastest | Fastest | Debugging |
For most annotation projects, the default `snappy` compression is a good choice. For large datasets where file size matters, use `zstd`.
## Loading Parquet Data

### pandas

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

annotations = pd.read_parquet("output/parquet/annotations.parquet")
spans = pd.read_parquet("output/parquet/spans.parquet")
items = pd.read_parquet("output/parquet/items.parquet")

# Filter to a specific schema
sentiment = annotations[annotations["schema_name"] == "sentiment"]

# Compute inter-annotator agreement between the first two annotators,
# keeping only instances both have labeled
pivot = sentiment.pivot(index="instance_id", columns="annotator", values="value").dropna()
kappa = cohen_kappa_score(pivot.iloc[:, 0], pivot.iloc[:, 1])
```

### DuckDB
```sql
-- Direct query without loading into memory
SELECT instance_id, value, COUNT(*) AS annotator_count
FROM 'output/parquet/annotations.parquet'
WHERE schema_name = 'sentiment'
GROUP BY instance_id, value
ORDER BY annotator_count DESC;

-- Join annotations with items
SELECT a.instance_id, i.text, a.value, a.annotator
FROM 'output/parquet/annotations.parquet' a
JOIN 'output/parquet/items.parquet' i
  ON a.instance_id = i.instance_id
WHERE a.schema_name = 'sentiment';
```

### PyArrow
```python
import pyarrow.parquet as pq

# Read specific columns only (fast for wide tables)
table = pq.read_table(
    "output/parquet/annotations.parquet",
    columns=["instance_id", "value", "annotator"],
)

# Convert to pandas
df = table.to_pandas()

# Inspect row-group metadata (useful before reading selectively)
parquet_file = pq.ParquetFile("output/parquet/annotations.parquet")
print(f"Row groups: {parquet_file.metadata.num_row_groups}")
print(f"Total rows: {parquet_file.metadata.num_rows}")
```

### Hugging Face Datasets
```python
from datasets import load_dataset

# Load directly from Parquet files
dataset = load_dataset("parquet", data_files={
    "annotations": "output/parquet/annotations.parquet",
    "spans": "output/parquet/spans.parquet",
    "items": "output/parquet/items.parquet",
})

# Access as a regular HF dataset
print(dataset["annotations"][0])

# Push to Hugging Face Hub
dataset["annotations"].push_to_hub("my-org/my-annotations", split="train")
```

### Polars
```python
import polars as pl

annotations = pl.read_parquet("output/parquet/annotations.parquet")

# Fast aggregation
label_counts = (
    annotations
    .filter(pl.col("schema_name") == "sentiment")
    .group_by("value")
    .agg(pl.len().alias("count"))
    .sort("count", descending=True)
)
print(label_counts)
```

## Incremental Export
For long-running annotation projects, enable incremental export to avoid re-exporting the entire dataset each time:
```yaml
parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  incremental: true
  partition_by: date  # date, annotator, or none
```

With `partition_by: date`, Parquet files are organized into date-partitioned directories:
```
output/parquet/
  annotations/
    date=2026-03-01/part-0.parquet
    date=2026-03-02/part-0.parquet
    date=2026-03-03/part-0.parquet
  spans/
    date=2026-03-01/part-0.parquet
  items/
    part-0.parquet
```
Partitioned datasets can be read as a single logical table by all major tools:
```python
import pandas as pd

# pandas reads partitioned directories automatically
df = pd.read_parquet("output/parquet/annotations/")

# DuckDB handles partitions natively:
# SELECT * FROM 'output/parquet/annotations/**/*.parquet'
```

## Configuration Reference
```yaml
parquet_export:
  enabled: true
  output_dir: "output/parquet/"

  # When to export
  auto_export: true         # export after each session (default: false)
  export_on_shutdown: true  # export when server stops (default: true)

  # File settings
  compression: snappy
  row_group_size: 50000
  use_dictionary: true
  write_statistics: true

  # Incremental settings
  incremental: false
  partition_by: none  # none, date, annotator

  # Schema-specific options
  flatten_complex_types: false  # flatten JSON values into columns
  include_raw_json: true        # include raw JSON alongside flattened columns

  # Span export
  export_spans: true  # generate spans.parquet
  export_items: true  # generate items.parquet
```

## Full Example
```yaml
task_name: "NER Annotation Project"
task_dir: "."

data_files:
  - "data/documents.jsonl"

item_properties:
  id_key: doc_id
  text_key: text

annotation_schemes:
  - annotation_type: span
    name: entities
    labels:
      - name: PERSON
        color: "#3b82f6"
      - name: ORGANIZATION
        color: "#22c55e"
      - name: LOCATION
        color: "#f59e0b"

output_annotation_dir: "output/"
output_annotation_format: "jsonl"

parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  compression: zstd
  auto_export: true
  export_spans: true
  export_items: true
```

After annotation, load and analyze:
```python
import pandas as pd

spans = pd.read_parquet("output/parquet/spans.parquet")

# Entity type distribution
print(spans["label"].value_counts())

# Average span length by type
spans["length"] = spans["end_offset"] - spans["start_offset"]
print(spans.groupby("label")["length"].mean())
```

## Further Reading
- Export Formats -- COCO, YOLO, CoNLL, and other export formats
- Remote Data Sources -- loading data from cloud storage
- Admin Dashboard -- monitoring export status
For implementation details, see the source documentation.