Understanding Potato Data Formats
A deep dive into JSON and JSONL data formats, with examples for text, image, audio, and multimodal annotation.
Potato Team
Potato uses JSON and JSONL formats for input data and output annotations. This guide covers format specifications, examples, and best practices for all data types.
Input Data Formats
JSON Lines (JSONL) - Recommended
One JSON object per line:
json
{"id": "001", "text": "First document text here."}
{"id": "002", "text": "Second document text here."}
{"id": "003", "text": "Third document text here."}
Advantages:
- Stream processing (memory efficient)
- Easy to append
- One corrupted line doesn't break the rest of the file
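The streaming and fault-tolerance points above can be sketched in a few lines of Python. This is a minimal illustration rather than Potato's own loader: the function name `iter_jsonl` and the policy of reporting and skipping bad lines are assumptions made for the example.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed object per non-empty line, skipping unparseable lines.

    Hypothetical helper for illustration; not part of Potato.
    """
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            # Report the corrupted line and keep going with the rest of the file.
            print(f"skipping line {line_number}: {err.msg}")


if __name__ == "__main__":
    # In-memory stand-in for a file; the second line is deliberately broken.
    sample = io.StringIO(
        '{"id": "001", "text": "First document text here."}\n'
        '{"id": "002", "text": broken\n'
        '{"id": "003", "text": "Third document text here."}\n'
    )
    for item in iter_jsonl(sample):
        print(item["id"])
```

Because the function is a generator, only one line is held in memory at a time, which is the property that makes JSONL practical for large files.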
JSON Array
Standard JSON array:
json
[
  {"id": "001", "text": "First document."},
  {"id": "002", "text": "Second document."},
  {"id": "003", "text": "Third document."}
]
Configuration:
yaml
data_files:
- data/items.json
Text Annotation Data
Basic Text
json
{"id": "doc_001", "text": "The product quality exceeded my expectations."}
With Metadata
json
{
  "id": "review_001",
  "text": "Great product, fast shipping!",
  "metadata": {
    "source": "amazon",
    "date": "2024-01-15",
    "author": "user123",
    "rating": 5
  }
}
With Pre-annotations
json
{
  "id": "ner_001",
  "text": "Apple announced new products in Cupertino.",
  "pre_annotations": {
    "entities": [
      {"start": 0, "end": 5, "label": "ORG", "text": "Apple"},
      {"start": 32, "end": 41, "label": "LOC", "text": "Cupertino"}
    ]
  }
}
Configuration:
yaml
data_files:
- data/texts.json
item_properties:
  id_key: id
  text_key: text
Image Annotation Data
Local Images
json
{
  "id": "img_001",
  "image_path": "/data/images/photo_001.jpg",
  "caption": "Street scene in Paris"
}
Remote Images
json
{
  "id": "img_002",
  "image_url": "https://example.com/images/photo.jpg"
}
With Bounding Boxes
json
{
  "id": "detection_001",
  "image_path": "/images/street.jpg",
  "pre_annotations": {
    "objects": [
      {"bbox": [100, 150, 200, 300], "label": "person"},
      {"bbox": [350, 200, 450, 280], "label": "car"}
    ]
  }
}
Configuration:
yaml
data_files:
- data/images.json
item_properties:
  id_key: id
  image_key: image_path  # or image_url
Audio Annotation Data
Local Audio
json
{
  "id": "audio_001",
  "audio_path": "/data/audio/recording.wav",
  "duration": 45.5,
  "transcript": "Hello, how are you today?"
}
With Segments
json
{
  "id": "audio_002",
  "audio_path": "/audio/meeting.mp3",
  "segments": [
    {"start": 0.0, "end": 5.5, "speaker": "Speaker1"},
    {"start": 5.5, "end": 12.0, "speaker": "Speaker2"}
  ]
}
Configuration:
yaml
data_files:
- data/audio.json
item_properties:
  audio_key: audio_path
  text_key: transcript
Multimodal Data
Text + Image
json
{
  "id": "mm_001",
  "text": "What is shown in this image?",
  "image_path": "/images/scene.jpg"
}
Text + Audio
json
{
  "id": "mm_002",
  "text": "Transcribe this audio:",
  "audio_path": "/audio/clip.wav",
  "reference_transcript": "Expected transcription here"
}
Output Annotation Format
Basic Output
json
{
  "id": "doc_001",
  "text": "Great product!",
  "annotations": {
    "sentiment": "Positive",
    "confidence": 5
  },
  "annotator": "user123",
  "timestamp": "2024-11-05T10:30:00Z"
}
Span Annotations
json
{
  "id": "ner_001",
  "text": "Apple CEO Tim Cook visited Paris.",
  "annotations": {
    "entities": [
      {"start": 0, "end": 5, "label": "ORG", "text": "Apple"},
      {"start": 10, "end": 18, "label": "PERSON", "text": "Tim Cook"},
      {"start": 27, "end": 32, "label": "LOC", "text": "Paris"}
    ]
  }
}
Multiple Annotators
json
{
  "id": "item_001",
  "text": "Sample text",
  "annotations": [
    {
      "annotator": "ann1",
      "labels": {"sentiment": "Positive"},
      "timestamp": "2024-11-05T10:00:00Z"
    },
    {
      "annotator": "ann2",
      "labels": {"sentiment": "Positive"},
      "timestamp": "2024-11-05T11:00:00Z"
    }
  ],
  "aggregated": {
    "sentiment": "Positive",
    "agreement": 1.0
  }
}
Configuration Reference
yaml
data_files:
- data/items.json
item_properties:
  id_key: id
  text_key: text
  image_key: image_path
  audio_key: audio_path
Best Practices
- Always include IDs: unique identifiers make each item trackable from input to output
- Use JSONL for large datasets: files can be streamed line by line instead of loaded whole
- Validate before loading: check JSON syntax so malformed files fail early
- Include metadata: source, date, and author help with debugging
- Use consistent field names: this simplifies downstream processing
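Some of these checks are easy to automate before a file is handed to Potato. The sketch below is one possible pre-flight validator, not a Potato feature: the function name `validate_items` is hypothetical, and it assumes items use the `id`, `text`, and `pre_annotations.entities` fields from the examples in this guide, with `start`/`end` as character offsets and an exclusive `end`.

```python
import json


def validate_items(lines):
    """Return a list of problems found in an iterable of JSONL lines.

    Hypothetical helper; field names follow this guide's examples.
    """
    problems = []
    seen_ids = set()
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            item = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append(f"line {line_number}: invalid JSON ({err.msg})")
            continue
        if not isinstance(item, dict):
            problems.append(f"line {line_number}: expected a JSON object")
            continue

        # Every item needs an ID, and IDs must be unique across the file.
        item_id = item.get("id")
        if item_id is None:
            problems.append(f"line {line_number}: missing id")
        elif item_id in seen_ids:
            problems.append(f"line {line_number}: duplicate id {item_id!r}")
        else:
            seen_ids.add(item_id)

        # Each pre-annotated span should slice the text to its recorded string.
        text = item.get("text", "")
        entities = item.get("pre_annotations", {}).get("entities", [])
        for entity in entities:
            start, end = entity.get("start"), entity.get("end")
            if not (isinstance(start, int) and isinstance(end, int)):
                problems.append(f"line {line_number}: span without integer offsets")
            elif text[start:end] != entity.get("text"):
                problems.append(
                    f"line {line_number}: offsets {start}-{end} give "
                    f"{text[start:end]!r}, expected {entity.get('text')!r}"
                )
    return problems
```

An open file object can be passed directly as `lines`. An empty result means none of these particular checks failed; it does not guarantee that the file matches a given Potato configuration.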
The full data format documentation is at /docs/core-concepts/data-formats.