# Understanding Potato Data Formats

Source: https://www.potatoannotator.com/blog/data-format-guide

Potato reads input data and writes annotations in JSON and JSONL. There isn't much to it, but the details matter once your dataset grows or you start mixing text with images and audio. This guide walks through the formats with working examples for each data type.

## Input Data Formats

### JSON Lines (JSONL)

One JSON object per line. This is the format to reach for first:

```json
{"id": "001", "text": "First document text here."}
{"id": "002", "text": "Second document text here."}
{"id": "003", "text": "Third document text here."}
```

Why JSONL: you can stream it line by line instead of loading the whole file into memory, appending a new record is just adding a line, and one malformed line won't take down the rest of the file.
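To make the streaming point concrete, here is a minimal Python sketch that reads JSONL one line at a time and skips malformed lines instead of aborting. The helper name `iter_jsonl` and the path `data/items.jsonl` are illustrative assumptions, not part of Potato.

```python
import json
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, skipping malformed lines."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            # Report the bad line and move on; the other lines stay usable.
            print(f"line {line_number}: skipped ({err.msg})")
            continue
        yield record


# Any iterable of strings works, including an open file:
# with open("data/items.jsonl", encoding="utf-8") as f:  # hypothetical path
#     for record in iter_jsonl(f):
#         ...
sample = [
    '{"id": "001", "text": "First document text here."}',
    '{"id": "002", "text": ',  # truncated on purpose
    '{"id": "003", "text": "Third document text here."}',
]
ids = [record["id"] for record in iter_jsonl(sample)]
print(ids)  # ['001', '003']
```

Because the function is a generator, only one line is held in memory at a time, however large the file is.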

### JSON Array

A standard JSON array also works, with every item held in a single top-level list:

```json
[
  {"id": "001", "text": "First document."},
  {"id": "002", "text": "Second document."},
  {"id": "003", "text": "Third document."}
]
```

**Configuration:**

```yaml
data_files:
  - data/items.json
```
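If an array file later grows large enough that JSONL is the better fit, the conversion is short. The sketch below is one way to do it in Python; the function name and file paths are hypothetical and not something Potato provides.

```python
import json
from pathlib import Path


def json_array_to_jsonl(src: Path, dst: Path) -> int:
    """Rewrite a JSON array file as JSONL; return the number of records written."""
    items = json.loads(src.read_text(encoding="utf-8"))
    if not isinstance(items, list):
        raise ValueError(f"{src} does not contain a JSON array")
    with dst.open("w", encoding="utf-8") as out:
        for item in items:
            # ensure_ascii=False keeps non-ASCII text readable in the output.
            out.write(json.dumps(item, ensure_ascii=False) + "\n")
    return len(items)


# Example call (hypothetical paths):
# json_array_to_jsonl(Path("data/items.json"), Path("data/items.jsonl"))
```

Note that this loads the whole array once, so it is a one-time migration step rather than something to run on every load.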

## Text Annotation Data

### Basic Text

```json
{"id": "doc_001", "text": "The product quality exceeded my expectations."}
```

### With Metadata

```json
{
  "id": "review_001",
  "text": "Great product, fast shipping!",
  "metadata": {
    "source": "amazon",
    "date": "2024-01-15",
    "author": "user123",
    "rating": 5
  }
}
```

### With Pre-annotations

```json
{
  "id": "ner_001",
  "text": "Apple announced new products in Cupertino.",
  "pre_annotations": {
    "entities": [
      {"start": 0, "end": 5, "label": "ORG", "text": "Apple"},
      {"start": 32, "end": 41, "label": "LOC", "text": "Cupertino"}
    ]
  }
}
```
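Span offsets are easy to get wrong by one character, so it can help to check them before loading. The sketch below assumes offsets are 0-based character positions with an exclusive `end`, which is what the `"Apple"` span (0 to 5) suggests; `check_spans` and the demo item are illustrative, not part of Potato.

```python
def check_spans(item: dict, key: str = "entities") -> list:
    """Return a message for every span whose offsets do not match its text."""
    text = item["text"]
    problems = []
    for span in item.get("pre_annotations", {}).get(key, []):
        start, end = span["start"], span["end"]
        actual = text[start:end]  # end is exclusive, as in Python slicing
        if actual != span["text"]:
            problems.append(
                f"{item['id']}: [{start}, {end}) is {actual!r}, expected {span['text']!r}"
            )
    return problems


# Hypothetical item: the first span is correct, the second is off by one.
demo = {
    "id": "demo_001",
    "text": "Tim Cook visited Paris.",
    "pre_annotations": {
        "entities": [
            {"start": 0, "end": 8, "label": "PERSON", "text": "Tim Cook"},
            {"start": 16, "end": 21, "label": "LOC", "text": "Paris"},
        ]
    },
}
for problem in check_spans(demo):
    print(problem)
```

Storing the covered `text` alongside each span is what makes this check possible, which is a good reason to keep that field.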

**Configuration:**

```yaml
data_files:
  - data/texts.json

item_properties:
  id_key: id
  text_key: text
```

## Image Annotation Data

### Local Images

```json
{
  "id": "img_001",
  "image_path": "/data/images/photo_001.jpg",
  "caption": "Street scene in Paris"
}
```

### Remote Images

```json
{
  "id": "img_002",
  "image_url": "https://example.com/images/photo.jpg"
}
```

### With Bounding Boxes

```json
{
  "id": "detection_001",
  "image_path": "/images/street.jpg",
  "pre_annotations": {
    "objects": [
      {"bbox": [100, 150, 200, 300], "label": "person"},
      {"bbox": [350, 200, 450, 280], "label": "car"}
    ]
  }
}
```
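The guide does not state which coordinate convention `bbox` uses. The sketch below assumes `[x_min, y_min, x_max, y_max]` in pixels, which fits both example boxes but is an assumption; if your data uses `[x, y, width, height]`, the checks need adjusting. `check_bbox` is a hypothetical helper.

```python
def check_bbox(bbox: list) -> list:
    """Validate a box under the ASSUMED order [x_min, y_min, x_max, y_max]."""
    if len(bbox) != 4:
        return [f"expected 4 numbers, got {len(bbox)}"]
    x_min, y_min, x_max, y_max = bbox
    problems = []
    if min(bbox) < 0:
        problems.append("negative coordinate")
    if x_min >= x_max:
        problems.append(f"x_min {x_min} is not left of x_max {x_max}")
    if y_min >= y_max:
        problems.append(f"y_min {y_min} is not above y_max {y_max}")
    return problems


print(check_bbox([100, 150, 200, 300]))  # []
print(check_bbox([200, 150, 100, 300]))  # one problem: x_min is not left of x_max
```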

**Configuration:**

```yaml
data_files:
  - data/images.json

item_properties:
  id_key: id
  image_key: image_path  # or image_url
```

## Audio Annotation Data

### Local Audio

```json
{
  "id": "audio_001",
  "audio_path": "/data/audio/recording.wav",
  "duration": 45.5,
  "transcript": "Hello, how are you today?"
}
```

### With Segments

```json
{
  "id": "audio_002",
  "audio_path": "/audio/meeting.mp3",
  "segments": [
    {"start": 0.0, "end": 5.5, "speaker": "Speaker1"},
    {"start": 5.5, "end": 12.0, "speaker": "Speaker2"}
  ]
}
```
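Segment lists can be sanity-checked the same way. This sketch assumes times are in seconds, that segments are listed in order, and that overlapping segments are unwanted; none of that is stated by Potato, and overlapping speech may be legitimate in your data. `check_segments` is a hypothetical helper.

```python
from typing import Optional


def check_segments(segments: list, duration: Optional[float] = None) -> list:
    """Flag empty, overlapping, or out-of-range segments (times in seconds)."""
    problems = []
    previous_end = 0.0
    for i, seg in enumerate(segments):
        start, end = seg["start"], seg["end"]
        if start >= end:
            problems.append(f"segment {i}: start {start} is not before end {end}")
        if start < previous_end:
            problems.append(f"segment {i}: overlaps an earlier segment")
        if duration is not None and end > duration:
            problems.append(f"segment {i}: ends after the clip ({end} > {duration})")
        previous_end = max(previous_end, end)
    return problems


segments = [
    {"start": 0.0, "end": 5.5, "speaker": "Speaker1"},
    {"start": 5.5, "end": 12.0, "speaker": "Speaker2"},
]
print(check_segments(segments))  # []
```

A segment that starts exactly where the previous one ends, as in the example, is treated as adjacent rather than overlapping.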

**Configuration:**

```yaml
data_files:
  - data/audio.json

item_properties:
  audio_key: audio_path
  text_key: transcript
```

## Multimodal Data

### Text + Image

```json
{
  "id": "mm_001",
  "text": "What is shown in this image?",
  "image_path": "/images/scene.jpg"
}
```

### Text + Audio

```json
{
  "id": "mm_002",
  "text": "Transcribe this audio:",
  "audio_path": "/audio/clip.wav",
  "reference_transcript": "Expected transcription here"
}
```

## Output Annotation Format

### Basic Output

```json
{
  "id": "doc_001",
  "text": "Great product!",
  "annotations": {
    "sentiment": "Positive",
    "confidence": 5
  },
  "annotator": "user123",
  "timestamp": "2024-11-05T10:30:00Z"
}
```

### Span Annotations

```json
{
  "id": "ner_001",
  "text": "Apple CEO Tim Cook visited Paris.",
  "annotations": {
    "entities": [
      {"start": 0, "end": 5, "label": "ORG", "text": "Apple"},
      {"start": 10, "end": 18, "label": "PERSON", "text": "Tim Cook"},
      {"start": 27, "end": 32, "label": "LOC", "text": "Paris"}
    ]
  }
}
```

### Multiple Annotators

```json
{
  "id": "item_001",
  "text": "Sample text",
  "annotations": [
    {
      "annotator": "ann1",
      "labels": {"sentiment": "Positive"},
      "timestamp": "2024-11-05T10:00:00Z"
    },
    {
      "annotator": "ann2",
      "labels": {"sentiment": "Positive"},
      "timestamp": "2024-11-05T11:00:00Z"
    }
  ],
  "aggregated": {
    "sentiment": "Positive",
    "agreement": 1.0
  }
}
```
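The `aggregated` block can be reproduced from the per-annotator entries. The guide does not say how `agreement` is computed, so the sketch below makes an assumption: majority vote for the label, and agreement as the fraction of annotators who chose that label. `aggregate` is a hypothetical helper, not Potato's implementation.

```python
from collections import Counter


def aggregate(annotations: list, scheme: str) -> dict:
    """Majority label for one scheme, plus the share of annotators who chose it."""
    labels = [a["labels"][scheme] for a in annotations if scheme in a["labels"]]
    if not labels:
        raise ValueError(f"no labels for scheme {scheme!r}")
    # On a tie, Counter.most_common keeps the label that was seen first.
    label, votes = Counter(labels).most_common(1)[0]
    return {scheme: label, "agreement": votes / len(labels)}


annotations = [
    {"annotator": "ann1", "labels": {"sentiment": "Positive"}},
    {"annotator": "ann2", "labels": {"sentiment": "Positive"}},
]
print(aggregate(annotations, "sentiment"))
# {'sentiment': 'Positive', 'agreement': 1.0}
```

For real projects a chance-corrected measure such as Cohen's or Fleiss' kappa is usually more informative than a raw fraction.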

## Configuration Reference

```yaml
data_files:
  - data/items.json

item_properties:
  id_key: id
  text_key: text
  image_key: image_path
  audio_key: audio_path
```

## A Few Habits That Save Headaches

Give every instance a unique ID; you will want it the moment you need to trace an annotation back to its source. Reach for JSONL once the file gets big. Validate the JSON before you load it, because a single stray comma fails late and confusingly. Keep metadata like source, date, and author around even if you don't use it now, since it makes debugging far easier later. And pick field names and stick to them, so downstream scripts don't need a special case for every file.
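A couple of these habits are easy to automate. The following sketch checks a list of already-parsed items for missing or duplicate IDs and for missing required fields; the function name and the default keys (`id`, `text`) are assumptions that mirror the examples in this guide.

```python
from collections import Counter


def check_items(items: list, id_key: str = "id", required: tuple = ("text",)) -> list:
    """Report missing or duplicate IDs and missing required fields."""
    problems = []
    counts = Counter(item.get(id_key) for item in items)
    for value, n in counts.items():
        if value is None:
            problems.append(f"{n} item(s) have no {id_key!r}")
        elif n > 1:
            problems.append(f"duplicate {id_key} {value!r} appears {n} times")
    for position, item in enumerate(items):
        for key in required:
            if key not in item:
                problems.append(f"item {position} is missing {key!r}")
    return problems


items = [
    {"id": "001", "text": "First document."},
    {"id": "001", "text": "Second document."},  # duplicate ID
    {"id": "002"},                               # no text field
]
for problem in check_items(items):
    print(problem)
```

Running a check like this before starting the annotation server turns a confusing late failure into an explicit list of things to fix.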

For the full list of supported keys, see the [data format documentation](https://github.com/davidjurgens/potato/blob/master/docs/configuration/data_format.md).

---

*Full data format documentation at [/docs/core-concepts/data-formats](/docs/core-concepts/data-formats).*
