Data Formats

Supported data formats and how to structure your annotation data.

Potato supports multiple data formats out of the box. This guide explains how to structure your data for annotation.

Supported Formats

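| Format     | Extension | Description              |
|------------|-----------|--------------------------|
| JSON       | .json     | Array of objects         |
| JSON Lines | .jsonl    | One JSON object per line |
| CSV        | .csv      | Comma-separated values   |
| TSV        | .tsv      | Tab-separated values     |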

JSON Format

JSON is the most common format. Your data should be an array of objects:

[
  {
    "id": "doc_001",
    "text": "This is the first document to annotate.",
    "source": "twitter",
    "date": "2024-01-15"
  },
  {
    "id": "doc_002",
    "text": "This is the second document.",
    "source": "reddit",
    "date": "2024-01-16"
  }
]

JSON Lines Format

Each line is a separate JSON object. This is useful for large datasets because records can be read one at a time:

{"id": "doc_001", "text": "First document"}
{"id": "doc_002", "text": "Second document"}
{"id": "doc_003", "text": "Third document"}
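
If you want to inspect a JSON Lines file before loading it, a reader along these lines may help. This is a minimal sketch rather than Potato's own loader; the function name `iter_jsonl` is hypothetical, and the in-memory sample stands in for a real file such as `data/documents.jsonl`.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one record per non-empty line, without loading the whole file."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"Invalid JSON on line {line_number}: {err}") from err


# In-memory stand-in for open("data/documents.jsonl", encoding="utf-8")
sample = io.StringIO(
    '{"id": "doc_001", "text": "First document"}\n'
    '{"id": "doc_002", "text": "Second document"}\n'
)
records = list(iter_jsonl(sample))
print([record["id"] for record in records])
```

Because the function is a generator, only one line is held in memory at a time when you iterate over it directly instead of calling `list()`.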

CSV/TSV Format

Tabular data with a header row. Quote any field that contains the delimiter:

id,text,source
doc_001,"This is the first document",twitter
doc_002,"This is the second document",reddit
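
To check how a CSV or TSV file will be split into fields, you can parse it with Python's standard `csv` module. The sketch below is an illustration only; the helper name `csv_to_records` is hypothetical, and it makes no claim about how Potato itself parses tabular files.

```python
import csv
import io
import json


def csv_to_records(stream, delimiter=","):
    """Read a headered CSV/TSV stream into a list of dicts keyed by column name."""
    reader = csv.DictReader(stream, delimiter=delimiter)
    return [dict(row) for row in reader]


# In-memory stand-in for open("data.csv", newline="", encoding="utf-8")
sample = io.StringIO(
    'id,text,source\n'
    'doc_001,"This is the first document",twitter\n'
    'doc_002,"Text with a comma, inside quotes",reddit\n'
)
records = csv_to_records(sample)
print(json.dumps(records[1]))
```

For TSV, pass `delimiter="\t"`. When reading from disk, open the file with `newline=""` so the `csv` module handles quoted line breaks itself.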

Configuration

Basic Setup

Configure data files and field mappings in your YAML:

data_files:
  - "data/documents.json"
 
item_properties:
  id_key: id      # Field name for unique ID
  text_key: text  # Field name for content to annotate

Multiple Data Files

Combine multiple data sources:

data_files:
  - "data/batch_1.json"
  - "data/batch_2.json"
  - "data/batch_3.jsonl"

Files are processed in order and combined.
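
Since IDs must be unique, it can be worth checking for collisions across files before you combine them. The following is a rough sketch of that check, not Potato's loading logic; `load_items` and `find_duplicate_ids` are hypothetical helpers, and it assumes uniqueness is enforced across all files together. It only handles `.json` and `.jsonl`.

```python
import json
import tempfile
from pathlib import Path


def load_items(paths):
    """Load .json (array) and .jsonl files in order and concatenate their items."""
    items = []
    for path in map(Path, paths):
        text = path.read_text(encoding="utf-8")
        if path.suffix == ".jsonl":
            items.extend(json.loads(line) for line in text.splitlines() if line.strip())
        elif path.suffix == ".json":
            items.extend(json.loads(text))
        else:
            raise ValueError(f"Unsupported extension: {path.suffix}")
    return items


def find_duplicate_ids(items, id_key="id"):
    """Return the sorted IDs that appear more than once across the combined items."""
    seen, duplicates = set(), set()
    for item in items:
        (duplicates if item[id_key] in seen else seen).add(item[id_key])
    return sorted(duplicates)


# Two temporary files stand in for data/batch_1.json and data/batch_3.jsonl
with tempfile.TemporaryDirectory() as tmp:
    batch_1 = Path(tmp) / "batch_1.json"
    batch_3 = Path(tmp) / "batch_3.jsonl"
    batch_1.write_text('[{"id": "a", "text": "one"}]', encoding="utf-8")
    batch_3.write_text('{"id": "b", "text": "two"}\n{"id": "a", "text": "dup"}\n', encoding="utf-8")
    combined = load_items([batch_1, batch_3])

print([item["id"] for item in combined])
print(find_duplicate_ids(combined))
```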

Data Types

Plain Text

Simple text content:

{
  "id": "1",
  "text": "The product arrived quickly and works great!"
}

Media Files

Reference images, videos, or audio:

{
  "id": "1",
  "image_path": "images/photo_001.jpg"
}

Map the media field in your YAML:

item_properties:
  id_key: id
  image_key: image_path

Dialogue/Lists

Lists are automatically displayed horizontally:

{
  "id": "1",
  "text": "Option A,Option B,Option C"
}

Text Pairs

For comparison tasks:

{
  "id": "pair_001",
  "text": {
    "A": "Response from Model A",
    "B": "Response from Model B"
  }
}
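
If your model outputs live in two parallel lists, pair items in the shape shown above could be generated like this. It is a sketch under assumptions: the helper `make_pairs` and the `pair_NNN` ID scheme are illustrative choices, not something Potato requires.

```python
import json


def make_pairs(responses_a, responses_b, prefix="pair"):
    """Zip two equally long response lists into comparison items keyed A and B."""
    if len(responses_a) != len(responses_b):
        raise ValueError("Both lists must have the same length")
    return [
        {"id": f"{prefix}_{index:03d}", "text": {"A": a, "B": b}}
        for index, (a, b) in enumerate(zip(responses_a, responses_b), start=1)
    ]


pairs = make_pairs(["Response from Model A"], ["Response from Model B"])
print(json.dumps(pairs, indent=2))
```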

HTML Files

Reference HTML files stored in folders:

{
  "id": "1",
  "html_file": "html/document_001.html"
}

Contextual Annotation

Include context alongside the main text:

{
  "id": "1",
  "text": "This is great!",
  "context": "Previous message: How do you like the new feature?"
}

Map the context field in your YAML:

item_properties:
  id_key: id
  text_key: text
  context_key: context
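
For conversational data, the context field can be derived from the preceding message. The sketch below assumes a flat list of messages in order; `with_previous_message` is a hypothetical helper, and giving the first message an empty context string is an assumption, since this guide does not say how a missing context is handled.

```python
def with_previous_message(messages):
    """Build items whose context is the message that came just before."""
    items = []
    previous = ""  # assumption: the first message gets an empty context
    for index, message in enumerate(messages, start=1):
        context = f"Previous message: {previous}" if previous else ""
        items.append({"id": str(index), "text": message, "context": context})
        previous = message
    return items


thread = ["How do you like the new feature?", "This is great!"]
items = with_previous_message(thread)
print(items[1])
```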

Display Configuration

List Display Options

Control how lists and dictionaries are displayed:

list_as_text:
  # Add prefixes to items
  text_prefix: "A"  # A., B., C. (or "1" for 1., 2., 3.)
 
  # Display orientation
  horizontal: true  # Side-by-side (false for vertical)
 
  # Randomization
  randomize_values: true   # Shuffle list items
  randomize_keys: true     # Shuffle dictionary keys

HTML Content

Enable HTML rendering in text:

html_content: true

Example item with markup in the text field:

{
  "id": "1",
  "text": "<p>This is <strong>formatted</strong> text.</p>"
}
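
When HTML rendering is on, characters such as `<` and `&` in raw source text will presumably be interpreted as markup, so you may want to escape them before adding your own tags. The sketch below uses Python's standard `html.escape`; the helper `to_html_item` is hypothetical.

```python
import html


def to_html_item(item_id, raw_text, emphasized=None):
    """Escape raw text, then wrap it (and one optional phrase) in markup."""
    safe = html.escape(raw_text)
    if emphasized:
        safe_phrase = html.escape(emphasized)
        # Bold only the first occurrence of the phrase
        safe = safe.replace(safe_phrase, f"<strong>{safe_phrase}</strong>", 1)
    return {"id": item_id, "text": f"<p>{safe}</p>"}


print(to_html_item("1", "This is formatted text.", emphasized="formatted"))
print(to_html_item("2", "a < b & c"))
```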

Output Configuration

Output Directory

Specify where annotations are saved:

output_annotation_dir: "output/"

Output Format

Choose the output format:

output_annotation_format: "json"  # json, jsonl, csv, or tsv

Output Structure

Each annotation record includes the document ID, the annotator, the responses, and a timestamp:

{
  "id": "doc_001",
  "user": "annotator_1",
  "annotations": {
    "sentiment": "Positive",
    "confidence": 4
  },
  "timestamp": "2024-01-15T10:30:00Z"
}
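
Records in this shape are straightforward to aggregate. As an illustration, the sketch below computes a majority label per document from a list of such records. The helper `majority_labels` is hypothetical, and how the records are laid out on disk under the output directory is not covered here, so the example works on an in-memory list.

```python
from collections import Counter, defaultdict


def majority_labels(records, scheme):
    """Return the most frequent label per document for one annotation scheme."""
    votes = defaultdict(Counter)
    for record in records:
        label = record["annotations"].get(scheme)
        if label is not None:
            votes[record["id"]][label] += 1
    # On a tie, most_common keeps the label that was counted first
    return {doc_id: counter.most_common(1)[0][0] for doc_id, counter in votes.items()}


records = [
    {"id": "doc_001", "user": "annotator_1", "annotations": {"sentiment": "Positive", "confidence": 4}},
    {"id": "doc_001", "user": "annotator_2", "annotations": {"sentiment": "Positive"}},
    {"id": "doc_001", "user": "annotator_3", "annotations": {"sentiment": "Negative"}},
    {"id": "doc_002", "user": "annotator_1", "annotations": {"sentiment": "Neutral"}},
]
print(majority_labels(records, "sentiment"))
```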

Special Data Types

Best-Worst Scaling

For ranking tasks, use comma-separated items:

{
  "id": "1",
  "text": "Item A,Item B,Item C,Item D"
}
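
One way to produce such items is to draw tuples from a larger pool. The sketch below uses plain random sampling, which is simpler than the balanced designs often used for best-worst scaling; `make_bws_items` and its parameters are hypothetical. It rejects entries containing commas, since the comma is the separator in this format.

```python
import random


def make_bws_items(pool, tuple_size=4, n_tuples=3, seed=0):
    """Sample n_tuples tuples of distinct entries and join each with commas."""
    if any("," in entry for entry in pool):
        raise ValueError("Entries must not contain commas, which separate items")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return [
        {"id": str(index), "text": ",".join(rng.sample(pool, tuple_size))}
        for index in range(1, n_tuples + 1)
    ]


pool = ["Item A", "Item B", "Item C", "Item D", "Item E", "Item F"]
bws_items = make_bws_items(pool)
for item in bws_items:
    print(item)
```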

Custom Arguments

Include extra fields for display or filtering:

{
  "id": "1",
  "text": "Document content",
  "category": "news",
  "priority": "high",
  "custom_field": "any value"
}

Database Backend

For large datasets, use MySQL:

database:
  type: mysql
  host: localhost
  database: potato_db
  user: ${DB_USER}
  password: ${DB_PASSWORD}

Potato automatically creates required tables on first startup.

Data Validation

Potato validates your data on startup and reports the following problems:

  • Missing ID field - All items need unique identifiers
  • Missing text field - Items need content to annotate
  • Duplicate IDs - All IDs must be unique
  • File not found - Verify data file paths

Complete Example

task_name: "Document Classification"
task_dir: "."
port: 8000
 
# Data configuration
data_files:
  - "data/documents.json"
 
item_properties:
  id_key: id
  text_key: text
  context_key: metadata
 
# Display settings
list_as_text:
  text_prefix: "1"
  horizontal: false
 
# Output
output_annotation_dir: "output/"
output_annotation_format: "json"
 
# Annotation scheme
annotation_schemes:
  - annotation_type: radio
    name: category
    description: "Select the document category"
    labels:
      - News
      - Opinion
      - Tutorial
      - Other
 
allow_all_users: true

Best Practices

1. Use Meaningful IDs

Meaningful IDs make tracking and debugging easier:

{"id": "twitter_2024_001", "text": "..."}
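
If your raw records have no IDs yet, a scheme like the one above can be generated rather than typed by hand. This is only a sketch; `assign_ids` and the `source_year_NNN` pattern are illustrative choices.

```python
def assign_ids(records, source, year, width=3):
    """Return copies of the records with IDs like 'twitter_2024_001'."""
    return [
        {**record, "id": f"{source}_{year}_{index:0{width}d}"}
        for index, record in enumerate(records, start=1)
    ]


tweets = [{"text": "..."}, {"text": "..."}]
labeled = assign_ids(tweets, source="twitter", year=2024)
print([item["id"] for item in labeled])
```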

2. Keep Text Concise

Long texts slow down annotation. Consider:

  • Truncating to key portions
  • Providing summaries
  • Using scroll containers
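
Truncation can be done as a preprocessing step before the data reaches Potato. The sketch below uses `textwrap.shorten` from the standard library, which cuts at word boundaries and also collapses runs of whitespace (including line breaks) into single spaces. The helper `truncate_item` and the 200-character default are assumptions for illustration.

```python
import textwrap


def truncate_item(item, max_chars=200, text_key="text"):
    """Return a copy of the item whose text fits in max_chars, cut at a word boundary."""
    shortened = dict(item)  # leave the original record untouched
    shortened[text_key] = textwrap.shorten(
        item[text_key], width=max_chars, placeholder=" ..."
    )
    return shortened


long_item = {"id": "1", "text": "word " * 100}
short_item = truncate_item(long_item, max_chars=50)
print(short_item["text"])
```

Keep the untruncated records alongside the shortened ones so the full text remains available for later analysis.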

3. Include Metadata

Metadata helps with filtering and analysis:

{
  "id": "1",
  "text": "Content",
  "source": "twitter",
  "date": "2024-01-15",
  "language": "en"
}

4. Validate Before Loading

Check your data offline:

import json

# Load the whole array (for JSON Lines, parse one line at a time instead)
with open('data.json', encoding='utf-8') as f:
    data = json.load(f)

# Check for required fields
for item in data:
    assert 'id' in item, f"Missing id: {item}"
    assert 'text' in item, f"Missing text: {item}"

# Check for duplicates
ids = [item['id'] for item in data]
assert len(ids) == len(set(ids)), "Duplicate IDs found"

print(f"Validated {len(data)} items")

5. Backup Original Data

Keep raw data separate from annotations for reproducibility.

6. Use JSON Lines for Large Files

JSON Lines files can be read one record at a time, which makes them more memory-efficient than JSON arrays, which must be parsed whole:

# Convert JSON array to JSON Lines
jq -c '.[]' data.json > data.jsonl
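
If `jq` is not available, the same conversion can be sketched in Python. Note that this version still loads the source array into memory once; the benefit comes when the resulting file is read later. The function name `json_array_to_jsonl` is hypothetical, and the in-memory streams stand in for real files.

```python
import io
import json


def json_array_to_jsonl(source, destination):
    """Write each element of a JSON array as one JSON line; return the item count."""
    items = json.load(source)
    if not isinstance(items, list):
        raise ValueError("Expected a top-level JSON array")
    for item in items:
        # ensure_ascii=False keeps non-ASCII text readable in the output
        destination.write(json.dumps(item, ensure_ascii=False) + "\n")
    return len(items)


# Stand-ins for open("data.json") and open("data.jsonl", "w")
source = io.StringIO('[{"id": "doc_001", "text": "First"}, {"id": "doc_002", "text": "Second"}]')
destination = io.StringIO()
count = json_array_to_jsonl(source, destination)
print(count)
print(destination.getvalue(), end="")
```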