# Data Formats

Source: https://www.potatoannotator.com/docs/core-concepts/data-formats

Potato supports multiple data formats out of the box. This guide explains how to structure your data for annotation.

## Supported Formats

| Format | Extension | Description |
|--------|-----------|-------------|
| JSON | `.json` | Array of objects |
| JSON Lines | `.jsonl` | One JSON object per line |
| CSV | `.csv` | Comma-separated values |
| TSV | `.tsv` | Tab-separated values |

## JSON Format

JSON is the most common format. The file should contain a single array of objects, one object per item to annotate:

```json
[
  {
    "id": "doc_001",
    "text": "This is the first document to annotate.",
    "source": "twitter",
    "date": "2024-01-15"
  },
  {
    "id": "doc_002",
    "text": "This is the second document.",
    "source": "reddit",
    "date": "2024-01-16"
  }
]
```

## JSON Lines Format

Each line is a separate JSON object. Useful for large datasets:

```jsonl
{"id": "doc_001", "text": "First document"}
{"id": "doc_002", "text": "Second document"}
{"id": "doc_003", "text": "Third document"}
```
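
To illustrate why this layout suits large datasets, here is a minimal sketch that reads records one line at a time instead of parsing the whole file at once. The helper name `iter_jsonl` is hypothetical and not part of Potato, and skipping blank lines is an assumption of this sketch.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed object per non-empty line (hypothetical helper, not a Potato API)."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored rather than treated as errors
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"Invalid JSON on line {line_number}: {err}") from err


# In practice `stream` would be an open file; StringIO keeps the example self-contained.
sample = io.StringIO(
    '{"id": "doc_001", "text": "First document"}\n'
    '{"id": "doc_002", "text": "Second document"}\n'
)
ids = [item["id"] for item in iter_jsonl(sample)]
print(ids)
```

Because the function is a generator, only the current line is held in memory while iterating.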

## CSV/TSV Format

Tabular data with a header row. TSV files use the same layout with tabs instead of commas:

```csv
id,text,source
doc_001,"This is the first document",twitter
doc_002,"This is the second document",reddit
```
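
As a rough sketch of how a header row maps onto the same key/value shape as the JSON examples, the snippet below parses CSV and TSV text with Python's standard `csv` module. The helper name `read_table` is hypothetical, and this is not necessarily how Potato parses these files internally.

```python
import csv
import io
from typing import TextIO


def read_table(stream: TextIO, delimiter: str = ",") -> list[dict]:
    """Parse header-first tabular text into a list of dicts (hypothetical helper)."""
    reader = csv.DictReader(stream, delimiter=delimiter)
    return [dict(row) for row in reader]


csv_text = (
    "id,text,source\n"
    'doc_001,"This is the first document",twitter\n'
    'doc_002,"This is the second document",reddit\n'
)
rows = read_table(io.StringIO(csv_text))
print(rows[0])

# The same function handles TSV when given a tab delimiter.
tsv_rows = read_table(io.StringIO("id\ttext\ndoc_001\tFirst document\n"), delimiter="\t")
print(tsv_rows[0])
```

Note that every value read this way is a string, including numeric-looking IDs.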

## Configuration

### Basic Setup

Configure data files and field mappings in your YAML:

```yaml
data_files:
  - "data/documents.json"

item_properties:
  id_key: id      # Field name for unique ID
  text_key: text  # Field name for content to annotate
```

### Multiple Data Files

Combine multiple data sources:

```yaml
data_files:
  - "data/batch_1.json"
  - "data/batch_2.json"
  - "data/batch_3.jsonl"
```

Files are processed in order and combined.
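
The sketch below shows one way "processed in order and combined" could work: each file is parsed according to its extension and the items are concatenated in the order the files are listed. The functions `load_file` and `load_all` are hypothetical, only `.json` and `.jsonl` are handled, and Potato's actual loader may differ.

```python
import json
import tempfile
from pathlib import Path


def load_file(path: Path) -> list[dict]:
    """Load one data file, picking a parser from its extension (illustrative only)."""
    suffix = path.suffix.lower()
    with path.open(encoding="utf-8") as f:
        if suffix == ".json":
            return json.load(f)
        if suffix == ".jsonl":
            return [json.loads(line) for line in f if line.strip()]
    raise ValueError(f"Unsupported data file extension in this sketch: {path.suffix}")


def load_all(paths: list[Path]) -> list[dict]:
    """Concatenate items from every file, preserving the order of `paths`."""
    items: list[dict] = []
    for path in paths:
        items.extend(load_file(path))
    return items


# Two temporary files stand in for data/batch_1.json and data/batch_3.jsonl.
with tempfile.TemporaryDirectory() as tmp:
    batch_1 = Path(tmp) / "batch_1.json"
    batch_3 = Path(tmp) / "batch_3.jsonl"
    batch_1.write_text('[{"id": "doc_001", "text": "First"}]', encoding="utf-8")
    batch_3.write_text('{"id": "doc_002", "text": "Second"}\n', encoding="utf-8")
    combined = load_all([batch_1, batch_3])

print([item["id"] for item in combined])
```

Since the combined list is what gets validated, IDs need to be unique across all files, not just within each one.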

## Data Types

### Plain Text

Simple text content:

```json
{
  "id": "1",
  "text": "The product arrived quickly and works great!"
}
```

### Media Files

Reference images, videos, or audio:

```json
{
  "id": "1",
  "image_path": "images/photo_001.jpg"
}
```

Map the image field in `item_properties`:

```yaml
item_properties:
  id_key: id
  image_key: image_path
```

### Dialogue/Lists

List items are displayed horizontally unless configured otherwise (see List Display Options):

```json
{
  "id": "1",
  "text": "Option A,Option B,Option C"
}
```

### Text Pairs

For comparison tasks:

```json
{
  "id": "pair_001",
  "text": {
    "A": "Response from Model A",
    "B": "Response from Model B"
  }
}
```
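
If the two sides of each pair come from separate lists of model outputs, a small script can assemble items in the shape shown above. This is a sketch only: `make_pair_items` and the `pair_NNN` ID scheme are hypothetical choices, not something Potato requires.

```python
import json


def make_pair_items(responses_a: list[str], responses_b: list[str]) -> list[dict]:
    """Zip two equally long response lists into text-pair items (hypothetical helper)."""
    if len(responses_a) != len(responses_b):
        raise ValueError("Both response lists must have the same length")
    return [
        {"id": f"pair_{index:03d}", "text": {"A": a, "B": b}}
        for index, (a, b) in enumerate(zip(responses_a, responses_b), start=1)
    ]


pair_items = make_pair_items(["Response from Model A"], ["Response from Model B"])
print(json.dumps(pair_items, indent=2))
```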

### HTML Files

Reference HTML files stored in folders:

```json
{
  "id": "1",
  "html_file": "html/document_001.html"
}
```

### Contextual Annotation

Include context alongside the main text:

```json
{
  "id": "1",
  "text": "This is great!",
  "context": "Previous message: How do you like the new feature?"
}
```

Map the context field with `context_key`:

```yaml
item_properties:
  id_key: id
  text_key: text
  context_key: context
```

## Display Configuration

### List Display Options

Control how lists and dictionaries are displayed:

```yaml
list_as_text:
  # Add prefixes to items
  text_prefix: "A"  # A., B., C. (or "1" for 1., 2., 3.)

  # Display orientation
  horizontal: true  # Side-by-side (false for vertical)

  # Randomization
  randomize_values: true   # Shuffle list items
  randomize_keys: true     # Shuffle dictionary keys
```

### HTML Content

Enable HTML rendering in text:

```yaml
html_content: true
```

With this enabled, markup in the text field is rendered:

```json
{
  "id": "1",
  "text": "<p>This is <strong>formatted</strong> text.</p>"
}
```

## Output Configuration

### Output Directory

Specify where annotations are saved:

```yaml
output_annotation_dir: "output/"
```

### Output Format

Choose the output format:

```yaml
output_annotation_format: "json"  # json, jsonl, csv, or tsv
```

### Output Structure

Each saved record includes the document ID, the annotator, the responses, and a timestamp:

```json
{
  "id": "doc_001",
  "user": "annotator_1",
  "annotations": {
    "sentiment": "Positive",
    "confidence": 4
  },
  "timestamp": "2024-01-15T10:30:00Z"
}
```
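
Records in this shape are straightforward to summarize. The sketch below tallies the labels chosen for one scheme, assuming records follow the structure shown above. The helper `count_labels` is hypothetical and the sample records are made up for illustration.

```python
from collections import Counter

# Made-up sample records in the structure shown above.
records = [
    {"id": "doc_001", "user": "annotator_1", "annotations": {"sentiment": "Positive", "confidence": 4}},
    {"id": "doc_001", "user": "annotator_2", "annotations": {"sentiment": "Negative", "confidence": 2}},
    {"id": "doc_002", "user": "annotator_1", "annotations": {"sentiment": "Positive", "confidence": 5}},
]


def count_labels(records: list[dict], scheme: str) -> Counter:
    """Count how often each value was chosen for `scheme` (hypothetical helper)."""
    return Counter(
        record["annotations"][scheme]
        for record in records
        if scheme in record.get("annotations", {})
    )


sentiment_counts = count_labels(records, "sentiment")
print(sentiment_counts)
```

Records that lack the requested scheme are skipped rather than raising an error.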

## Special Data Types

### Best-Worst Scaling

For ranking tasks, use comma-separated items:

```json
{
  "id": "1",
  "text": "Item A,Item B,Item C,Item D"
}
```
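
A minimal sketch of producing such items from a flat list of candidates follows. It splits the candidates into non-overlapping groups, which is a simplification rather than a balanced best-worst design, and it rejects candidates containing commas because the comma acts as the separator. The helper `make_bws_items` and the tuple size of 4 are assumptions of this sketch.

```python
def make_bws_items(candidates: list[str], tuple_size: int = 4) -> list[dict]:
    """Group candidates into comma-separated items (hypothetical helper)."""
    for candidate in candidates:
        if "," in candidate:
            raise ValueError(f"Candidate contains the separator character: {candidate!r}")

    items = []
    # Step through full groups only; leftover candidates that do not fill a group are dropped.
    for start in range(0, len(candidates) - tuple_size + 1, tuple_size):
        group = candidates[start:start + tuple_size]
        items.append({"id": str(len(items) + 1), "text": ",".join(group)})
    return items


bws_items = make_bws_items(["Item A", "Item B", "Item C", "Item D"])
print(bws_items)
```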

### Custom Arguments

Include extra fields for display or filtering:

```json
{
  "id": "1",
  "text": "Document content",
  "category": "news",
  "priority": "high",
  "custom_field": "any value"
}
```

## Database Backend

For large datasets, use MySQL:

```yaml
database:
  type: mysql
  host: localhost
  database: potato_db
  user: ${DB_USER}
  password: ${DB_PASSWORD}
```

Potato automatically creates required tables on first startup.

## Data Validation

Potato validates your data on startup:

- **Missing ID field** - All items need unique identifiers
- **Missing text field** - Items need content to annotate
- **Duplicate IDs** - All IDs must be unique
- **File not found** - Verify data file paths

## Complete Example

```yaml
task_name: "Document Classification"
task_dir: "."
port: 8000

# Data configuration
data_files:
  - "data/documents.json"

item_properties:
  id_key: id
  text_key: text
  context_key: metadata

# Display settings
list_as_text:
  text_prefix: "1"
  horizontal: false

# Output
output_annotation_dir: "output/"
output_annotation_format: "json"

# Annotation scheme
annotation_schemes:
  - annotation_type: radio
    name: category
    description: "Select the document category"
    labels:
      - News
      - Opinion
      - Tutorial
      - Other

allow_all_users: true
```

## Best Practices

### 1. Use Meaningful IDs

Makes tracking and debugging easier:

```json
{"id": "twitter_2024_001", "text": "..."}
```

### 2. Keep Text Concise

Long texts slow down annotation. Consider:
- Truncating to key portions
- Providing summaries
- Using scroll containers

### 3. Include Metadata

Helps with filtering and analysis:

```json
{
  "id": "1",
  "text": "Content",
  "source": "twitter",
  "date": "2024-01-15",
  "language": "en"
}
```

### 4. Validate Before Loading

Check your data offline:

```python
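import json
from collections import Counter

with open("data.json", encoding="utf-8") as f:
    data = json.load(f)

# Check that every item has the required fields.
# Explicit exceptions are used because `assert` is skipped under `python -O`.
for index, item in enumerate(data):
    for key in ("id", "text"):
        if key not in item:
            raise ValueError(f"Item {index} is missing '{key}': {item}")

# Check for duplicate IDs and report which ones are repeated
counts = Counter(item["id"] for item in data)
duplicates = [item_id for item_id, n in counts.items() if n > 1]
if duplicates:
    raise ValueError(f"Duplicate IDs found: {duplicates}")

print(f"Validated {len(data)} items")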
```

### 5. Backup Original Data

Keep raw data separate from annotations for reproducibility.

### 6. Use JSON Lines for Large Files

A JSON Lines file can be read one record at a time, so it does not have to be parsed into memory as a whole the way a JSON array does:

```bash
# Convert a JSON array to JSON Lines (one compact object per line)
jq -c '.[]' data.json > data.jsonl
```
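
If `jq` is not available, the same conversion can be sketched in Python. The function name `json_array_to_jsonl` is hypothetical. Note that this version loads the whole array once to convert it, so the memory benefit applies to later reads of the `.jsonl` file.

```python
import json
import tempfile
from pathlib import Path


def json_array_to_jsonl(src: Path, dst: Path) -> int:
    """Write each element of the JSON array in `src` as one line of `dst`; return the count."""
    with src.open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("Expected a top-level JSON array")
    with dst.open("w", encoding="utf-8") as out:
        for item in data:
            # Compact separators mirror the one-object-per-line output of `jq -c`.
            out.write(json.dumps(item, ensure_ascii=False, separators=(",", ":")) + "\n")
    return len(data)


# Temporary files stand in for data.json and data.jsonl.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "data.json"
    dst = Path(tmp) / "data.jsonl"
    src.write_text(
        '[{"id": "doc_001", "text": "First"}, {"id": "doc_002", "text": "Second"}]',
        encoding="utf-8",
    )
    written = json_array_to_jsonl(src, dst)
    lines = dst.read_text(encoding="utf-8").splitlines()

print(written, lines)
```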

## Further Reading

- [Data Directory Loading](/docs/deployment/data-directory) - Load from directories with live watching
- [Dialogue Annotation](/docs/annotation-types/dialogue-annotation) - Multi-item data display
- [Export Formats](/docs/features/export-formats) - Output format options

For implementation details, see the [source documentation](https://github.com/davidjurgens/potato/blob/main/docs/data_formats.md).
