Data Directory Loading

Configure Potato to load annotation instances from a folder, with support for glob patterns, live watching for new files, and recursive subdirectory scanning with filters.

Loading Data from a Directory

Instead of specifying individual data files with data_files, you can point Potato to a directory containing your data files. All supported files will be loaded automatically, with optional live watching for new or modified files.

Overview

Directory loading is useful when:

You have many data files and don't want to list them individually
You want to dynamically add new files while the server is running
You're receiving data from an external process that writes to a shared directory
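If an external process writes into the watched directory, it may help to make each file appear all at once, so a scan never reads a half-written file. The following is a minimal sketch of that idea, not part of Potato itself; the function name `write_batch_atomically` and the `.tmp` suffix are illustrative assumptions. It relies on the fact that only files with supported extensions are loaded.

```python
import json
import os
import tempfile
from pathlib import Path


def write_batch_atomically(directory, filename, instances):
    """Write instances as JSONL so the final file appears in one step."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first. The ".tmp" suffix (an assumption of
    # this sketch) is not a supported data extension, so a scan ignores it.
    fd, tmp_name = tempfile.mkstemp(suffix=".tmp", dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for instance in instances:
            handle.write(json.dumps(instance) + "\n")
    final_path = directory / filename
    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(tmp_name, final_path)
    return final_path
```

A producer could call `write_batch_atomically("./data/incoming", "batch_004.jsonl", rows)` each time a new batch is ready.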

Configuration

Basic Usage (Static Loading)

To load all files from a directory at startup:

yaml

# Load all supported files from this directory
data_directory: "./data/incoming"
 
# data_files can be empty when using data_directory
data_files: []
 
# Required: item_properties must still be configured
item_properties:
  id_key: "id"
  text_key: "text"

Live Directory Watching

To automatically detect and load new or modified files while the server is running:

yaml

data_directory: "./data/incoming"
data_files: []
 
# Enable live watching (default: false)
watch_data_directory: true
 
# Optional: how often to check for changes in seconds (default: 5.0)
watch_poll_interval: 10.0
 
item_properties:
  id_key: "id"
  text_key: "text"

Configuration Options

Option	Type	Default	Description
`data_directory`	string	-	Path to the directory containing data files
`watch_data_directory`	boolean	`false`	Whether to watch for new/modified files
`watch_poll_interval`	number	`5.0`	Seconds between directory scans (min: 1.0)
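
The defaults and the 1.0 second minimum in the table can be illustrated with a small reader for an already-parsed config dictionary. This is a sketch only: `DirectoryOptions` and `read_directory_options` are hypothetical names, and rejecting a too-small interval with an error (rather than clamping it) is an assumption, since the behavior is not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DirectoryOptions:
    data_directory: Optional[str] = None
    watch_data_directory: bool = False
    watch_poll_interval: float = 5.0


def read_directory_options(config: dict) -> DirectoryOptions:
    """Apply the documented defaults and enforce the documented minimum."""
    options = DirectoryOptions(
        data_directory=config.get("data_directory"),
        watch_data_directory=bool(config.get("watch_data_directory", False)),
        watch_poll_interval=float(config.get("watch_poll_interval", 5.0)),
    )
    # Assumption: values below the minimum are treated as a config error.
    if options.watch_poll_interval < 1.0:
        raise ValueError("watch_poll_interval must be at least 1.0 seconds")
    return options
```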

Supported File Formats

The directory watcher supports the same formats as data_files:

JSON (.json) - One JSON object per line, or a JSON array
JSONL (.jsonl) - JSON Lines format, one object per line
CSV (.csv) - Comma-separated values with header row
TSV (.tsv) - Tab-separated values with header row
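
To make the format rules concrete, here is a rough sketch of parsing a file according to its extension. It is not Potato's own parser; `parse_data_file` is a hypothetical helper, and treating a `.json` file as an array only when its content starts with `[` is an assumption of this sketch.

```python
import csv
import json
from pathlib import Path

SUPPORTED_EXTENSIONS = {".json", ".jsonl", ".csv", ".tsv"}


def parse_data_file(path):
    """Return a list of instance dicts parsed from one data file."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    if suffix in {".csv", ".tsv"}:
        delimiter = "," if suffix == ".csv" else "\t"
        # The header row supplies the keys of each instance.
        with path.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle, delimiter=delimiter))
    text = path.read_text(encoding="utf-8")
    if suffix == ".json" and text.lstrip().startswith("["):
        # A .json file may hold a single JSON array of objects.
        return json.loads(text)
    # Otherwise: one JSON object per non-empty line.
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```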

How It Works

At Startup

All files with supported extensions in data_directory are scanned
Each file is parsed according to its extension
Instances are added to the annotation queue
If watch_data_directory is enabled, a background thread starts watching
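
The startup steps above can be sketched roughly as follows. The names `find_data_files` and `load_directory` are hypothetical, the per-file parser is passed in as a parameter, and scanning only the top level of the directory in sorted order is an assumption of this sketch rather than documented behavior.

```python
from pathlib import Path
from typing import Callable, List

SUPPORTED_EXTENSIONS = {".json", ".jsonl", ".csv", ".tsv"}


def find_data_files(directory) -> List[Path]:
    """List files with a supported extension, in a stable (sorted) order."""
    return sorted(
        entry
        for entry in Path(directory).iterdir()
        if entry.is_file() and entry.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def load_directory(directory, parse_file: Callable[[Path], List[dict]]) -> List[dict]:
    """Parse every supported file and collect the instances into one queue."""
    queue: List[dict] = []
    for path in find_data_files(directory):
        queue.extend(parse_file(path))
    return queue
```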

During Runtime (when watching is enabled)

Every watch_poll_interval seconds, the directory is scanned
New files are parsed and their instances are added
Modified files are re-parsed:
- New instances are added
- Existing instances are updated (annotations are preserved)
Removed files: instances remain in the system (to preserve annotations)
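
One common way to implement this kind of polling is to compare file modification times between scans. The sketch below shows that technique under stated assumptions: `snapshot`, `diff_snapshots`, and `watch` are hypothetical names, and whether Potato detects changes by modification time is not confirmed here.

```python
import threading
from pathlib import Path

SUPPORTED_EXTENSIONS = {".json", ".jsonl", ".csv", ".tsv"}


def snapshot(directory):
    """Map each supported file name to its last-modified time in nanoseconds."""
    return {
        entry.name: entry.stat().st_mtime_ns
        for entry in Path(directory).iterdir()
        if entry.is_file() and entry.suffix.lower() in SUPPORTED_EXTENSIONS
    }


def diff_snapshots(previous, current):
    """Return (new, modified, removed) file names between two snapshots."""
    new = sorted(name for name in current if name not in previous)
    modified = sorted(
        name for name in current
        if name in previous and current[name] != previous[name]
    )
    removed = sorted(name for name in previous if name not in current)
    return new, modified, removed


def watch(directory, on_change, stop_event, poll_interval=5.0):
    """Call on_change(new, modified) once per interval until stop_event is set."""
    previous = snapshot(directory)
    # Event.wait returns False on timeout, so the body runs once per interval.
    while not stop_event.wait(poll_interval):
        current = snapshot(directory)
        new, modified, _removed = diff_snapshots(previous, current)
        if new or modified:
            on_change(new, modified)
        # Removed files are deliberately ignored: their instances stay loaded.
        previous = current
```

Running `watch` in a `threading.Thread(daemon=True)` and setting the `threading.Event` on shutdown mirrors the background-thread behavior described above.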

Example Directory Structure

text

my_project/
├── configs/
│   └── config.yaml
└── data/
    └── incoming/
        ├── batch_001.jsonl
        ├── batch_002.jsonl
        └── new_data.json    # Added while server is running

Example Data Files

JSONL Format (`batch_001.jsonl`)

json

{"id": "item_001", "text": "First document to annotate."}
{"id": "item_002", "text": "Second document to annotate."}
{"id": "item_003", "text": "Third document to annotate."}

JSON Format (`batch_002.json`)

json

[
  {"id": "item_004", "text": "Fourth document."},
  {"id": "item_005", "text": "Fifth document."}
]

CSV Format (`batch_003.csv`)

csv

id,text,category
item_006,Sixth document to annotate.,news
item_007,Seventh document to annotate.,blog

Combining with data_files

You can use both data_directory and data_files together:

yaml

# Load specific files first
data_files:
  - "data/important_batch.jsonl"
 
# Then load everything from the directory
data_directory: "./data/incoming"
watch_data_directory: true

Instance Updates

When a file is modified while watching is enabled:

New instances (new IDs) are added to the annotation queue
Existing instances (same IDs) are updated, but annotations are preserved
Removed instances remain in the system to preserve annotations

This means annotators won't lose their work if you update a data file.
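
These update rules amount to a merge keyed by instance ID. Below is a minimal sketch assuming instances are held in a dictionary keyed by ID and annotations are stored separately under the same IDs; `apply_file_update` is a hypothetical name and this storage layout is an assumption, not a description of Potato's internals.

```python
def apply_file_update(instances, parsed, id_key="id"):
    """Merge re-parsed instances into the store; return (added, updated) counts."""
    added = 0
    updated = 0
    for item in parsed:
        item_id = item[id_key]
        if item_id not in instances:
            instances[item_id] = item
            added += 1
        elif instances[item_id] != item:
            # Same ID, new content: replace the instance data only.
            instances[item_id] = item
            updated += 1
    # IDs missing from `parsed` are left in place, so nothing is ever deleted.
    return added, updated
```

Because annotations live in a separate store and this function never touches it, replacing an instance's content leaves any existing annotations for that ID intact.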

Error Handling

Files that fail to parse are logged and skipped (other files still load)
Instances missing the id_key field are skipped with a warning
Instances missing the text_key field are still loaded, with a warning
Directory permission errors are logged
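
The rules above can be sketched as two small helpers. The names `filter_valid_instances` and `load_files_safely`, the logger name, and the log message wording are all hypothetical; which exception types count as a parse failure is also an assumption of this sketch.

```python
import logging

logger = logging.getLogger("directory_loader_sketch")


def filter_valid_instances(parsed, id_key="id", text_key="text"):
    """Drop instances without an ID; keep (but warn about) ones without text."""
    kept = []
    for index, item in enumerate(parsed):
        if id_key not in item:
            logger.warning("Skipping instance %d: missing %r", index, id_key)
            continue
        if text_key not in item:
            logger.warning("Instance %r has no %r field", item[id_key], text_key)
        kept.append(item)
    return kept


def load_files_safely(paths, parse_file):
    """Parse each file; a failure in one file does not stop the others."""
    instances = []
    for path in paths:
        try:
            instances.extend(filter_valid_instances(parse_file(path)))
        except (OSError, ValueError) as error:
            # json.JSONDecodeError is a ValueError; PermissionError is an OSError.
            logger.error("Skipping %s: %s", path, error)
    return instances
```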

Performance Considerations

Poll interval: Higher values reduce CPU usage but delay detection of new files
Large directories: Files are scanned each interval; consider organizing files into subdirectories
Large files: Files are fully re-parsed when modified; consider using smaller batch files
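
For the last point, one way to prepare smaller batch files is to split a large JSONL file before placing it in the data directory. This is an illustrative sketch only; `split_jsonl`, the `batch_NNN.jsonl` naming, and the default batch size are assumptions.

```python
from itertools import islice
from pathlib import Path


def split_jsonl(source, output_dir, batch_size=1000):
    """Split one JSONL file into numbered files of at most batch_size lines."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with Path(source).open(encoding="utf-8") as handle:
        # Skip blank lines and make sure every kept line ends with a newline.
        rows = (
            line if line.endswith("\n") else line + "\n"
            for line in handle
            if line.strip()
        )
        index = 1
        while True:
            batch = list(islice(rows, batch_size))
            if not batch:
                break
            target = output_dir / f"batch_{index:03d}.jsonl"
            target.write_text("".join(batch), encoding="utf-8")
            written.append(target)
            index += 1
    return written
```

With smaller files, a later change to one batch only requires re-parsing that batch rather than the whole dataset.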

Logging

The directory watcher logs its activity at INFO level:

text

INFO: Loaded 150 instances from data_directory: ./data/incoming
INFO: Directory watching enabled (poll interval: 5.0s)
INFO: Directory scan: 25 instances added, 0 updated
INFO: Directory watcher stopped

Enable DEBUG logging to see individual file processing details.
