Data Directory Loading
Load annotation instances from a directory with optional live watching.
Loading Data from a Directory
Instead of specifying individual data files with data_files, you can point Potato to a directory containing your data files. All supported files will be loaded automatically, with optional live watching for new or modified files.
Overview
This is useful when:
- You have many data files and don't want to list them individually
- You want to dynamically add new files while the server is running
- You're receiving data from an external process that writes to a shared directory
Configuration
Basic Usage (Static Loading)
To load all files from a directory at startup:
# Load all supported files from this directory
data_directory: "./data/incoming"
# data_files can be empty when using data_directory
data_files: []
# Required: item_properties must still be configured
item_properties:
id_key: "id"
text_key: "text"Live Directory Watching
To automatically detect and load new or modified files while the server is running:
data_directory: "./data/incoming"
data_files: []
# Enable live watching (default: false)
watch_data_directory: true
# Optional: how often to check for changes in seconds (default: 5.0)
watch_poll_interval: 10.0
item_properties:
id_key: "id"
text_key: "text"Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
data_directory | string | - | Path to the directory containing data files |
watch_data_directory | boolean | false | Whether to watch for new/modified files |
watch_poll_interval | number | 5.0 | Seconds between directory scans (min: 1.0) |
Supported File Formats
The directory watcher supports the same formats as data_files:
- JSON (
.json) - One JSON object per line, or a JSON array - JSONL (
.jsonl) - JSON Lines format, one object per line - CSV (
.csv) - Comma-separated values with header row - TSV (
.tsv) - Tab-separated values with header row
How It Works
At Startup
- All files with supported extensions in
data_directoryare scanned - Each file is parsed according to its extension
- Instances are added to the annotation queue
- If
watch_data_directoryis enabled, a background thread starts watching
During Runtime (when watching is enabled)
- Every
watch_poll_intervalseconds, the directory is scanned - New files are parsed and their instances are added
- Modified files are re-parsed:
- New instances are added
- Existing instances are updated (annotations are preserved)
- Removed files: instances remain in the system (to preserve annotations)
Example Directory Structure
my_project/
├── configs/
│ └── config.yaml
└── data/
└── incoming/
├── batch_001.jsonl
├── batch_002.jsonl
└── new_data.json # Added while server is running
Example Data Files
JSONL Format (batch_001.jsonl)
{"id": "item_001", "text": "First document to annotate."}
{"id": "item_002", "text": "Second document to annotate."}
{"id": "item_003", "text": "Third document to annotate."}JSON Format (batch_002.json)
[
{"id": "item_004", "text": "Fourth document."},
{"id": "item_005", "text": "Fifth document."}
]CSV Format (batch_003.csv)
id,text,category
item_006,Sixth document to annotate.,news
item_007,Seventh document to annotate.,blogCombining with data_files
You can use both data_directory and data_files together:
# Load specific files first
data_files:
- "data/important_batch.jsonl"
# Then load everything from the directory
data_directory: "./data/incoming"
watch_data_directory: trueInstance Updates
When a file is modified while watching is enabled:
- New instances (new IDs) are added to the annotation queue
- Existing instances (same IDs) are updated, but annotations are preserved
- Removed instances remain in the system to preserve annotations
This means annotators won't lose their work if you update a data file.
Error Handling
- Files that fail to parse are logged and skipped (other files still load)
- Missing
id_keyin an instance: that instance is skipped with a warning - Missing
text_keyin an instance: instance loads with a warning - Directory permissions errors are logged
Performance Considerations
- Poll interval: Higher values reduce CPU usage but delay detection of new files
- Large directories: Files are scanned each interval; consider organizing files into subdirectories
- Large files: Files are fully re-parsed when modified; consider using smaller batch files
Logging
The directory watcher logs its activity at INFO level:
INFO: Loaded 150 instances from data_directory: ./data/incoming
INFO: Directory watching enabled (poll interval: 5.0s)
INFO: Directory scan: 25 instances added, 0 updated
INFO: Directory watcher stopped
Enable DEBUG logging to see individual file processing details.
Further Reading
- Data Formats - Supported data formats
- Configuration Basics - General configuration
For implementation details, see the source documentation.