Data Formats
Supported data formats and how to structure your annotation data.
Potato supports multiple data formats out of the box. This guide explains how to structure your data for annotation.
Supported Formats
| Format | Extension | Description |
|---|---|---|
| JSON | .json | Array of objects |
| JSON Lines | .jsonl | One JSON object per line |
| CSV | .csv | Comma-separated values |
| TSV | .tsv | Tab-separated values |
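As a rough illustration of how the four layouts in the table differ, the sketch below reads any of them into a list of dicts by file extension. It is not Potato's own loader; `load_items` is a hypothetical helper name, and UTF-8 encoding is an assumption.

```python
import csv
import json
from pathlib import Path

SUPPORTED_EXTENSIONS = {".json", ".jsonl", ".csv", ".tsv"}


def load_items(path):
    """Read one data file into a list of dicts, chosen by file extension.

    Hypothetical helper for offline inspection, not part of Potato.
    """
    path = Path(path)
    ext = path.suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported extension: {ext}")
    with path.open(encoding="utf-8", newline="") as f:
        if ext == ".json":
            # The whole file is a single array of objects.
            return json.load(f)
        if ext == ".jsonl":
            # One object per line; blank lines are skipped.
            return [json.loads(line) for line in f if line.strip()]
        # CSV and TSV differ only in the delimiter; the header row names the fields.
        delimiter = "\t" if ext == ".tsv" else ","
        return list(csv.DictReader(f, delimiter=delimiter))
```

Note that values read from CSV or TSV are always strings, while JSON and JSON Lines keep numbers and nested objects as they are.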
JSON Format
The most common format. Your data should be an array of objects:
[
  {
    "id": "doc_001",
    "text": "This is the first document to annotate.",
    "source": "twitter",
    "date": "2024-01-15"
  },
  {
    "id": "doc_002",
    "text": "This is the second document.",
    "source": "reddit",
    "date": "2024-01-16"
  }
]
JSON Lines Format
Each line is a separate JSON object. Useful for large datasets:
{"id": "doc_001", "text": "First document"}
{"id": "doc_002", "text": "Second document"}
{"id": "doc_003", "text": "Third document"}
CSV/TSV Format
Tabular data with headers:
id,text,source
doc_001,"This is the first document",twitter
doc_002,"This is the second document",reddit
Configuration
Basic Setup
Configure data files and field mappings in your YAML:
data_files:
  - "data/documents.json"
item_properties:
  id_key: id       # Field name for unique ID
  text_key: text   # Field name for content to annotate
Multiple Data Files
Combine multiple data sources:
data_files:
  - "data/batch_1.json"
  - "data/batch_2.json"
  - "data/batch_3.jsonl"
Files are processed in order and combined.
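Because the combined set must still have unique IDs, it can help to check for collisions across batches before starting the server. The following is a minimal sketch of ordered combination with a duplicate check; `combine_batches` is a hypothetical helper, and the way it reports collisions is an assumption rather than Potato's behavior.

```python
def combine_batches(batches, id_key="id"):
    """Concatenate item lists in the given order, rejecting repeated IDs.

    `batches` is a list of (source_name, items) pairs, e.g. one pair per data file.
    Hypothetical helper, not part of Potato.
    """
    combined = []
    seen = {}  # ID -> name of the source where it first appeared
    for source, items in batches:
        for item in items:
            item_id = item[id_key]
            if item_id in seen:
                raise ValueError(
                    f"Duplicate ID {item_id!r} in {source} "
                    f"(first seen in {seen[item_id]})"
                )
            seen[item_id] = source
            combined.append(item)
    return combined


batches = [
    ("data/batch_1.json", [{"id": "doc_001", "text": "First document"}]),
    ("data/batch_2.json", [{"id": "doc_002", "text": "Second document"}]),
]
print([item["id"] for item in combine_batches(batches)])
# prints ['doc_001', 'doc_002']
```

Tracking the first source of each ID makes the error message point at both files involved, which is usually what you need to fix the collision.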
Data Types
Plain Text
Simple text content:
{
  "id": "1",
  "text": "The product arrived quickly and works great!"
}
Media Files
Reference images, videos, or audio:
{
  "id": "1",
  "image_path": "images/photo_001.jpg"
}
item_properties:
  id_key: id
  image_key: image_path
Dialogue/Lists
Lists are automatically displayed horizontally:
{
  "id": "1",
  "text": "Option A,Option B,Option C"
}
Text Pairs
For comparison tasks:
{
  "id": "pair_001",
  "text": {
    "A": "Response from Model A",
    "B": "Response from Model B"
  }
}
HTML Files
Reference HTML files stored in folders:
{
  "id": "1",
  "html_file": "html/document_001.html"
}
Contextual Annotation
Include context alongside the main text:
{
  "id": "1",
  "text": "This is great!",
  "context": "Previous message: How do you like the new feature?"
}
item_properties:
  id_key: id
  text_key: text
  context_key: context
Display Configuration
List Display Options
Control how lists and dictionaries are displayed:
list_as_text:
  # Add prefixes to items
  text_prefix: "A"         # A., B., C. (or "1" for 1., 2., 3.)
  # Display orientation
  horizontal: true         # Side-by-side (false for vertical)
  # Randomization
  randomize_values: true   # Shuffle list items
  randomize_keys: true     # Shuffle dictionary keys
HTML Content
Enable HTML rendering in text:
html_content: true
{
  "id": "1",
  "text": "<p>This is <strong>formatted</strong> text.</p>"
}
Output Configuration
Output Directory
Specify where annotations are saved:
output_annotation_dir: "output/"
Output Format
Choose the output format:
output_annotation_format: "json"   # json, jsonl, csv, or tsv
Output Structure
Annotations include document ID and responses:
{
  "id": "doc_001",
  "user": "annotator_1",
  "annotations": {
    "sentiment": "Positive",
    "confidence": 4
  },
  "timestamp": "2024-01-15T10:30:00Z"
}
Special Data Types
Best-Worst Scaling
For ranking tasks, use comma-separated items:
{
  "id": "1",
  "text": "Item A,Item B,Item C,Item D"
}
Custom Arguments
Include extra fields for display or filtering:
{
  "id": "1",
  "text": "Document content",
  "category": "news",
  "priority": "high",
  "custom_field": "any value"
}
Database Backend
For large datasets, use MySQL:
database:
  type: mysql
  host: localhost
  database: potato_db
  user: ${DB_USER}
  password: ${DB_PASSWORD}
Potato automatically creates required tables on first startup.
Data Validation
Potato validates your data on startup:
- Missing ID field - All items need unique identifiers
- Missing text field - Items need content to annotate
- Duplicate IDs - All IDs must be unique
- File not found - Verify data file paths
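The "File not found" case in particular is cheap to catch before launching. The sketch below checks a list of configured paths offline; `check_data_files` is a hypothetical helper, and resolving relative paths against a `base_dir` argument is an assumption about how your project is laid out, not a statement of how Potato resolves them.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".json", ".jsonl", ".csv", ".tsv"}


def check_data_files(paths, base_dir="."):
    """Return a list of problems; an empty list means every path looks usable.

    Hypothetical pre-flight check, not part of Potato.
    """
    problems = []
    for raw in paths:
        path = Path(base_dir) / raw
        if not path.is_file():
            problems.append(f"File not found: {raw}")
        elif path.suffix.lower() not in SUPPORTED_EXTENSIONS:
            problems.append(f"Unsupported extension: {raw}")
    return problems
```

Returning a list instead of raising on the first problem lets you fix every bad path in one pass.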
Complete Example
task_name: "Document Classification"
task_dir: "."
port: 8000
# Data configuration
data_files:
  - "data/documents.json"
item_properties:
  id_key: id
  text_key: text
  context_key: metadata
# Display settings
list_as_text:
  text_prefix: "1"
  horizontal: false
# Output
output_annotation_dir: "output/"
output_annotation_format: "json"
# Annotation scheme
annotation_schemes:
  - annotation_type: radio
    name: category
    description: "Select the document category"
    labels:
      - News
      - Opinion
      - Tutorial
      - Other
allow_all_users: true
Best Practices
1. Use Meaningful IDs
Makes tracking and debugging easier:
{"id": "twitter_2024_001", "text": "..."}
2. Keep Text Concise
Long texts slow down annotation. Consider:
- Truncating to key portions
- Providing summaries
- Using scroll containers
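For the truncation option, one possible preprocessing step is to shorten each text at a word boundary before writing the data file. This is a sketch only: `truncate_text`, the 500-character default, and the `"..."` suffix are illustrative choices, not Potato settings.

```python
def truncate_text(text, max_chars=500, suffix="..."):
    """Shorten text to at most max_chars characters, cutting at a word boundary.

    Hypothetical preprocessing helper, not part of Potato.
    """
    if max_chars <= len(suffix):
        raise ValueError("max_chars must be longer than the suffix")
    if len(text) <= max_chars:
        return text
    # Reserve room for the suffix, then back up to the last space
    # so a word is not split in half.
    cut = text[: max_chars - len(suffix)]
    if " " in cut:
        cut = cut[: cut.rindex(" ")]
    return cut.rstrip() + suffix


print(truncate_text("The product arrived quickly and works great!", max_chars=20))
# prints The product...
```

If annotators may need the original, consider keeping the untruncated text in a separate field of the same item so nothing is lost.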
3. Include Metadata
Helps with filtering and analysis:
{
  "id": "1",
  "text": "Content",
  "source": "twitter",
  "date": "2024-01-15",
  "language": "en"
}
4. Validate Before Loading
Check your data offline:
import json
with open('data.json') as f:
    data = json.load(f)
# Check for required fields
for item in data:
    assert 'id' in item, f"Missing id: {item}"
    assert 'text' in item, f"Missing text: {item}"
# Check for duplicates
ids = [item['id'] for item in data]
assert len(ids) == len(set(ids)), "Duplicate IDs found"
print(f"Validated {len(data)} items")
5. Back Up Original Data
Keep raw data separate from annotations for reproducibility.
6. Use JSON Lines for Large Files
More memory-efficient than JSON arrays:
# Convert JSON array to JSON Lines
jq -c '.[]' data.json > data.jsonl
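If jq is not installed, the same conversion can be sketched with Python's standard library. `json_array_to_jsonl` is a hypothetical helper name; like the jq command, it reads the whole array once, so the memory benefit comes from later reads of the `.jsonl` file.

```python
import json


def json_array_to_jsonl(src_path, dst_path):
    """Write each element of a JSON array as one compact line; return the item count.

    Hypothetical helper, not part of Potato.
    """
    with open(src_path, encoding="utf-8") as src:
        items = json.load(src)  # loads the whole array into memory once
    with open(dst_path, "w", encoding="utf-8") as dst:
        for item in items:
            # ensure_ascii=False keeps non-ASCII text readable in the output.
            dst.write(json.dumps(item, ensure_ascii=False) + "\n")
    return len(items)
```

Calling `json_array_to_jsonl("data.json", "data.jsonl")` mirrors the file names used in the shell command.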