Data Formats

Supported data formats and how to structure your annotation data.

Potato supports multiple data formats out of the box. This guide explains how to structure your data for annotation.

Supported Formats

Format	Extension	Description
JSON	`.json`	Array of objects
JSON Lines	`.jsonl`	One JSON object per line
CSV	`.csv`	Comma-separated values
TSV	`.tsv`	Tab-separated values

JSON Format

JSON is the most common format. Your data should be an array of objects:

json

[
  {
    "id": "doc_001",
    "text": "This is the first document to annotate.",
    "source": "twitter",
    "date": "2024-01-15"
  },
  {
    "id": "doc_002",
    "text": "This is the second document.",
    "source": "reddit",
    "date": "2024-01-16"
  }
]

JSON Lines Format

Each line is a separate JSON object, which is useful for large datasets:

jsonl

{"id": "doc_001", "text": "First document"}
{"id": "doc_002", "text": "Second document"}
{"id": "doc_003", "text": "Third document"}
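To inspect a JSON Lines file outside Potato, it can be read one line at a time. The following is a minimal sketch; `iter_jsonl` is a hypothetical helper name, not part of Potato, and an in-memory stream stands in for a real file.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed object per non-empty line of a JSON Lines stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, such as a trailing newline
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Invalid JSON on line {line_number}: {exc}") from exc


# In-memory sample standing in for a .jsonl file (hypothetical data)
sample = io.StringIO(
    '{"id": "doc_001", "text": "First document"}\n'
    '{"id": "doc_002", "text": "Second document"}\n'
)
items = list(iter_jsonl(sample))
print([item["id"] for item in items])
```

Because the generator holds only one line at a time, the same function works on an open file handle for datasets too large to load at once.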

CSV/TSV Format

Tabular data with headers:

csv

id,text,source
doc_001,"This is the first document",twitter
doc_002,"This is the second document",reddit
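The header row names the fields, and quoting lets a value contain the delimiter. The sketch below shows how Python's standard `csv` module interprets such a file; it illustrates the format and is not a description of how Potato parses it. `read_delimited` is a hypothetical helper name.

```python
import csv
import io


def read_delimited(stream, delimiter=","):
    """Parse a delimited stream with a header row into a list of dicts."""
    return list(csv.DictReader(stream, delimiter=delimiter))


# Hypothetical sample: the quoted text field contains a comma
csv_sample = io.StringIO(
    "id,text,source\n"
    'doc_001,"This is the first, quoted document",twitter\n'
)
rows = read_delimited(csv_sample)
print(rows[0]["text"])

# The same helper handles TSV by switching the delimiter
tsv_sample = io.StringIO("id\ttext\ndoc_002\tSecond document\n")
tsv_rows = read_delimited(tsv_sample, delimiter="\t")
print(tsv_rows[0]["id"])
```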

Configuration

Basic Setup

Configure data files and field mappings in your YAML:

yaml

data_files:
  - "data/documents.json"
 
item_properties:
  id_key: id      # Field name for unique ID
  text_key: text  # Field name for content to annotate

Multiple Data Files

Combine multiple data sources:

yaml

data_files:
  - "data/batch_1.json"
  - "data/batch_2.json"
  - "data/batch_3.jsonl"

Files are processed in order and combined.
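Because the files end up in one combined dataset, an ID that appears in two files would presumably conflict with the unique-ID requirement. A minimal offline check might look like the sketch below; `find_cross_file_duplicates` is a hypothetical helper, not part of Potato, and the items are assumed to be already loaded into memory.

```python
from collections import defaultdict


def find_cross_file_duplicates(batches):
    """Return {item_id: [file names]} for IDs that occur more than once.

    `batches` maps a file name to the list of item dicts loaded from it.
    """
    seen = defaultdict(list)
    for file_name, items in batches.items():
        for item in items:
            seen[item["id"]].append(file_name)
    return {item_id: files for item_id, files in seen.items() if len(files) > 1}


# Hypothetical, already-loaded batches
batches = {
    "data/batch_1.json": [
        {"id": "doc_001", "text": "a"},
        {"id": "doc_002", "text": "b"},
    ],
    "data/batch_2.json": [{"id": "doc_002", "text": "c"}],
}
duplicates = find_cross_file_duplicates(batches)
print(duplicates)
```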

Data Types

Plain Text

Simple text content:

json

{
  "id": "1",
  "text": "The product arrived quickly and works great!"
}

Media Files

Reference images, videos, or audio:

json

{
  "id": "1",
  "image_path": "images/photo_001.jpg"
}

yaml

item_properties:
  id_key: id
  image_key: image_path

Dialogue/Lists

Lists are displayed horizontally by default; the `list_as_text` options under Display Configuration control the orientation:

json

{
  "id": "1",
  "text": "Option A,Option B,Option C"
}

Text Pairs

For comparison tasks:

json

{
  "id": "pair_001",
  "text": {
    "A": "Response from Model A",
    "B": "Response from Model B"
  }
}
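Pair items like the one above are often generated from two parallel lists of outputs. The following sketch assumes that setup; `build_pairs` and the ID prefix are hypothetical and not part of Potato.

```python
def build_pairs(responses_a, responses_b, prefix="pair"):
    """Zip two equally long lists of responses into text-pair items."""
    if len(responses_a) != len(responses_b):
        raise ValueError("Both response lists must have the same length")
    return [
        {"id": f"{prefix}_{index:03d}", "text": {"A": a, "B": b}}
        for index, (a, b) in enumerate(zip(responses_a, responses_b), start=1)
    ]


pairs = build_pairs(
    ["Response from Model A"],
    ["Response from Model B"],
)
print(pairs[0]["id"])
```

Writing `pairs` out with `json.dump` would yield a file in the JSON array format described earlier.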

HTML Files

Reference HTML files stored in folders:

json

{
  "id": "1",
  "html_file": "html/document_001.html"
}

Contextual Annotation

Include context alongside the main text:

json

{
  "id": "1",
  "text": "This is great!",
  "context": "Previous message: How do you like the new feature?"
}

yaml

item_properties:
  id_key: id
  text_key: text
  context_key: context

Display Configuration

List Display Options

Control how lists and dictionaries are displayed:

yaml

list_as_text:
  # Add prefixes to items
  text_prefix: "A"  # A., B., C. (or "1" for 1., 2., 3.)
 
  # Display orientation
  horizontal: true  # Side-by-side (false for vertical)
 
  # Randomization
  randomize_values: true   # Shuffle list items
  randomize_keys: true     # Shuffle dictionary keys

HTML Content

Enable HTML rendering in text:

yaml

html_content: true

json

{
  "id": "1",
  "text": "<p>This is <strong>formatted</strong> text.</p>"
}
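If the source text is plain text rather than markup, characters such as `<` and `&` may need escaping before HTML rendering is enabled. The sketch below assumes that Potato passes the text field through as HTML when `html_content` is true; `to_html_text` is a hypothetical helper.

```python
import html


def to_html_text(raw):
    """Escape plain text and wrap it in a paragraph element."""
    return f"<p>{html.escape(raw)}</p>"


item = {"id": "1", "text": to_html_text("Is x < y & y > z?")}
print(item["text"])
```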

Output Configuration

Output Directory

Specify where annotations are saved:

yaml

output_annotation_dir: "output/"

Output Format

Choose the output format:

yaml

output_annotation_format: "json"  # json, jsonl, csv, or tsv

Output Structure

Each annotation record includes the document ID, the annotator, the responses, and a timestamp:

json

{
  "id": "doc_001",
  "user": "annotator_1",
  "annotations": {
    "sentiment": "Positive",
    "confidence": 4
  },
  "timestamp": "2024-01-15T10:30:00Z"
}
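Records of this shape are straightforward to summarize after annotation. The sketch below counts the labels chosen for one scheme; it assumes the records have already been loaded as a list of dicts, since the on-disk file layout is not specified here, and `tally_labels` is a hypothetical helper.

```python
from collections import Counter


def tally_labels(records, scheme):
    """Count the values recorded for one annotation scheme."""
    return Counter(
        record["annotations"][scheme]
        for record in records
        if scheme in record.get("annotations", {})
    )


# Hypothetical records following the structure shown above
records = [
    {"id": "doc_001", "user": "annotator_1", "annotations": {"sentiment": "Positive"}},
    {"id": "doc_001", "user": "annotator_2", "annotations": {"sentiment": "Negative"}},
    {"id": "doc_002", "user": "annotator_1", "annotations": {"sentiment": "Positive"}},
]
counts = tally_labels(records, "sentiment")
print(counts.most_common())
```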

Special Data Types

Best-Worst Scaling

For ranking tasks, use comma-separated items:

json

{
  "id": "1",
  "text": "Item A,Item B,Item C,Item D"
}
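Since the items are separated by commas, an item that itself contains a comma would presumably be split in two. The following sketch builds the text field from a list and rejects such items; `to_bws_text` is a hypothetical helper, and the splitting behavior is an assumption based on the format shown above.

```python
def to_bws_text(items):
    """Join items into one comma-separated string, rejecting embedded commas."""
    for item in items:
        if "," in item:
            raise ValueError(f"Item contains a comma: {item!r}")
    return ",".join(items)


record = {"id": "1", "text": to_bws_text(["Item A", "Item B", "Item C", "Item D"])}
print(record["text"])
```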

Custom Arguments

Include extra fields for display or filtering:

json

{
  "id": "1",
  "text": "Document content",
  "category": "news",
  "priority": "high",
  "custom_field": "any value"
}

Database Backend

For large datasets, use MySQL:

yaml

database:
  type: mysql
  host: localhost
  database: potato_db
  user: ${DB_USER}
  password: ${DB_PASSWORD}

Potato automatically creates required tables on first startup.

Data Validation

Potato validates your data on startup:

Missing ID field - All items need unique identifiers
Missing text field - Items need content to annotate
Duplicate IDs - All IDs must be unique
File not found - Verify data file paths

Complete Example

yaml

task_name: "Document Classification"
task_dir: "."
port: 8000
 
# Data configuration
data_files:
  - "data/documents.json"
 
item_properties:
  id_key: id
  text_key: text
  context_key: metadata
 
# Display settings
list_as_text:
  text_prefix: "1"
  horizontal: false
 
# Output
output_annotation_dir: "output/"
output_annotation_format: "json"
 
# Annotation scheme
annotation_schemes:
  - annotation_type: radio
    name: category
    description: "Select the document category"
    labels:
      - News
      - Opinion
      - Tutorial
      - Other
 
allow_all_users: true

Best Practices

1. Use Meaningful IDs

Makes tracking and debugging easier:

json

{"id": "twitter_2024_001", "text": "..."}

2. Keep Text Concise

Long texts slow down annotation. Consider:

Truncating to key portions
Providing summaries
Using scroll containers
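Truncation can be done while preparing the data file. This is a minimal sketch of one approach, cutting at a word boundary; `truncate` and its default limit are hypothetical choices, not Potato settings.

```python
def truncate(text, max_chars=500, suffix="..."):
    """Shorten text to at most max_chars, cutting at the last space if possible."""
    if len(text) <= max_chars:
        return text
    cut = text[: max_chars - len(suffix)]
    head, separator, _ = cut.rpartition(" ")
    # Fall back to a hard cut when the slice contains no space at all
    return (head if separator else cut).rstrip() + suffix


print(truncate("one two three four", max_chars=12))
```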

3. Include Metadata

Helps with filtering and analysis:

json

{
  "id": "1",
  "text": "Content",
  "source": "twitter",
  "date": "2024-01-15",
  "language": "en"
}

4. Validate Before Loading

Check your data offline:

python

import json

# Load the full array; an explicit encoding avoids platform-dependent defaults
with open('data.json', encoding='utf-8') as f:
    data = json.load(f)

# Check for required fields
for item in data:
    assert 'id' in item, f"Missing id: {item}"
    assert 'text' in item, f"Missing text: {item}"

# Check for duplicates
ids = [item['id'] for item in data]
assert len(ids) == len(set(ids)), "Duplicate IDs found"

print(f"Validated {len(data)} items")

5. Back Up Original Data

Keep raw data separate from annotations for reproducibility.

6. Use JSON Lines for Large Files

JSON Lines can be read one record at a time, which makes it more memory-efficient than a JSON array:

bash

# Convert JSON array to JSON Lines
jq -c '.[]' data.json > data.jsonl
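Where `jq` is not available, the same conversion can be sketched in Python. `json_array_to_jsonl` is a hypothetical helper; it loads the whole array once, so it suits a one-time conversion, and the temporary files below only stand in for real data.

```python
import json
import tempfile
from pathlib import Path


def json_array_to_jsonl(src, dst):
    """Write each element of a JSON array file as one line of a JSON Lines file."""
    with open(src, encoding="utf-8") as f:
        items = json.load(f)
    if not isinstance(items, list):
        raise ValueError("Expected the source file to contain a JSON array")
    with open(dst, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
    return len(items)


# Demonstration on temporary files with hypothetical content
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "data.json"
    dst = Path(tmp) / "data.jsonl"
    src.write_text(
        '[{"id": "doc_001", "text": "First"}, {"id": "doc_002", "text": "Second"}]',
        encoding="utf-8",
    )
    count = json_array_to_jsonl(src, dst)
    lines = dst.read_text(encoding="utf-8").splitlines()

print(count, len(lines))
```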
