# Remote Data Sources

Load annotation data from URLs, cloud storage, databases, and more.
Potato supports loading annotation data from various remote sources beyond local files, including URLs, cloud storage services, databases, and Hugging Face datasets.
## Overview
The data sources system provides:
- Multiple source types: URLs, Google Drive, Dropbox, S3, Hugging Face, Google Sheets, SQL databases
- Partial loading: Load data in chunks for large datasets
- Incremental loading: Auto-load more data as annotation progresses
- Caching: Cache remote files locally to avoid repeated downloads
- Secure credentials: Environment variable substitution for secrets
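The environment-variable substitution mentioned in the last point replaces `${VAR_NAME}` placeholders in config values with values from the environment. A minimal sketch of the idea (the function name `substitute_env` is ours, not Potato's actual implementation):

```python
import os
import re

def substitute_env(value: str) -> str:
    """Replace ${VAR_NAME} placeholders with values from the environment."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)

# Example: resolve a bearer token from the environment
os.environ["API_TOKEN"] = "secret123"
print(substitute_env("Bearer ${API_TOKEN}"))  # Bearer secret123
```

Unset variables resolve to an empty string in this sketch; a real loader might instead raise an error so missing secrets fail loudly.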
## Configuration
Add `data_sources` to your `config.yaml`:
```yaml
data_sources:
  - type: file
    path: "data/annotations.jsonl"
  - type: url
    url: "https://example.com/data.jsonl"
```

## Source Types
### Local File
```yaml
data_sources:
  - type: file
    path: "data/annotations.jsonl"
```

### HTTP/HTTPS URL
```yaml
data_sources:
  - type: url
    url: "https://example.com/data.jsonl"
    headers:
      Authorization: "Bearer ${API_TOKEN}"
    max_size_mb: 100
    timeout_seconds: 30
    block_private_ips: true  # SSRF protection
```

### Amazon S3
```yaml
data_sources:
  - type: s3
    bucket: "my-annotation-data"
    key: "datasets/items.jsonl"
    region: "us-east-1"
    access_key_id: "${AWS_ACCESS_KEY_ID}"
    secret_access_key: "${AWS_SECRET_ACCESS_KEY}"
```

Requires: `pip install boto3`
### Google Drive
```yaml
data_sources:
  - type: google_drive
    url: "https://drive.google.com/file/d/xxx/view?usp=sharing"
```

For private files, use `credentials_file` with a service account. Requires: `pip install google-api-python-client google-auth`
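For public files, a loader like this typically converts the shareable link into a direct-download URL by extracting the file ID. A sketch of that common conversion (the helper name and example ID are illustrative, not Potato's API):

```python
import re

def drive_direct_url(share_url: str) -> str:
    """Convert a Google Drive share link (.../file/d/<ID>/view...) into a
    direct-download URL."""
    match = re.search(r"/file/d/([^/]+)", share_url)
    if not match:
        raise ValueError(f"Unrecognized Drive URL: {share_url}")
    return f"https://drive.google.com/uc?export=download&id={match.group(1)}"

print(drive_direct_url("https://drive.google.com/file/d/abc123/view?usp=sharing"))
# https://drive.google.com/uc?export=download&id=abc123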
### Dropbox
```yaml
data_sources:
  - type: dropbox
    url: "https://www.dropbox.com/s/xxx/file.jsonl?dl=0"
```

For private files, use `access_token: "${DROPBOX_TOKEN}"`. Requires: `pip install dropbox`
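Public Dropbox share links ending in `?dl=0` render a preview page; switching the flag to `dl=1` yields the file itself. A one-line sketch of the rewrite a loader would apply (helper name hypothetical):

```python
def dropbox_direct_url(share_url: str) -> str:
    """Rewrite a Dropbox share link (?dl=0) into a direct-download link (?dl=1)."""
    return share_url.replace("dl=0", "dl=1")

print(dropbox_direct_url("https://www.dropbox.com/s/xxx/file.jsonl?dl=0"))
# https://www.dropbox.com/s/xxx/file.jsonl?dl=1
```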
### Hugging Face Datasets
```yaml
data_sources:
  - type: huggingface
    dataset: "squad"
    split: "train"
    token: "${HF_TOKEN}"  # For private datasets
    id_field: "id"
    text_field: "context"
```

Requires: `pip install datasets`
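`id_field` and `text_field` name the dataset columns to use as the item ID and the text to annotate. The remapping presumably looks something like this (a sketch with plain dicts standing in for dataset rows; `to_annotation_items` is a hypothetical name):

```python
def to_annotation_items(rows, id_field="id", text_field="context"):
    """Map dataset rows to an {id, text} item shape, dropping other columns."""
    return [{"id": row[id_field], "text": row[text_field]} for row in rows]

# Rows shaped like SQuAD's "train" split
rows = [{"id": "q1", "context": "The quick brown fox...", "question": "Who?"}]
print(to_annotation_items(rows))
# [{'id': 'q1', 'text': 'The quick brown fox...'}]
```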
### Google Sheets
```yaml
data_sources:
  - type: google_sheets
    spreadsheet_id: "1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms"
    sheet_name: "Sheet1"
    credentials_file: "credentials/service_account.json"
```

Requires: `pip install google-api-python-client google-auth`
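The Sheets API returns a sheet as a list of rows; converting that into records usually means treating the first row as a header. A sketch of that step (not the actual loader):

```python
def rows_to_records(rows):
    """Treat the first row as the header and zip the remaining rows into dicts."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]

rows = [["id", "text"], ["1", "First item"], ["2", "Second item"]]
print(rows_to_records(rows))
# [{'id': '1', 'text': 'First item'}, {'id': '2', 'text': 'Second item'}]
```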
### SQL Database
```yaml
data_sources:
  - type: database
    connection_string: "${DATABASE_URL}"
    query: "SELECT id, text, metadata FROM items WHERE status = 'pending'"
```

Or with individual parameters:
```yaml
data_sources:
  - type: database
    dialect: postgresql
    host: "localhost"
    port: 5432
    database: "annotations"
    username: "${DB_USER}"
    password: "${DB_PASSWORD}"
    table: "items"
```

Requires: `pip install sqlalchemy` plus a driver: `psycopg2-binary` (PostgreSQL) or `pymysql` (MySQL)
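The individual parameters are equivalent to spelling out a SQLAlchemy-style connection URL yourself. A rough sketch of how they combine (plain string formatting, with `urllib.parse.quote_plus` guarding against special characters in the password; the helper name is ours):

```python
from urllib.parse import quote_plus

def build_connection_url(dialect, username, password, host, port, database):
    """Assemble a SQLAlchemy-style URL: dialect://user:pass@host:port/db."""
    return f"{dialect}://{username}:{quote_plus(password)}@{host}:{port}/{database}"

print(build_connection_url("postgresql", "annotator", "p@ss", "localhost", 5432, "annotations"))
# postgresql://annotator:p%40ss@localhost:5432/annotations
```

SQLAlchemy also ships `sqlalchemy.engine.URL.create` for this; the string version is shown here only to make the mapping from the YAML keys explicit.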
## Partial/Incremental Loading
For large datasets, enable partial loading:
```yaml
partial_loading:
  enabled: true
  initial_count: 1000
  batch_size: 500
  auto_load_threshold: 0.8  # Auto-load when 80% annotated
```

## Caching
Remote sources are cached locally:
```yaml
data_cache:
  enabled: true
  cache_dir: ".potato_cache"
  ttl_seconds: 3600  # 1 hour
  max_size_mb: 500
```

## Credential Management
Use environment variables for sensitive values:
```yaml
data_sources:
  - type: url
    url: "https://api.example.com/data"
    headers:
      Authorization: "Bearer ${API_TOKEN}"

credentials:
  env_substitution: true
  env_file: ".env"
```

## Multiple Sources
Combine data from multiple sources:
```yaml
data_sources:
  - type: file
    path: "data/base.jsonl"
  - type: url
    url: "https://example.com/extra.jsonl"
  - type: s3
    bucket: "my-bucket"
    key: "annotations/batch1.jsonl"
```

## Backward Compatibility
The `data_files` configuration continues to work alongside `data_sources`:
```yaml
data_files:
  - "data/existing.jsonl"

data_sources:
  - type: url
    url: "https://example.com/additional.jsonl"
```

## Security
- URL sources block private/internal IP addresses by default (SSRF protection)
- Never commit credentials to version control
- Use `${VAR_NAME}` syntax for secrets
- Store `.env` files outside your repository
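The private-IP blocking can be approximated with the standard-library `ipaddress` module. A sketch of the kind of check `block_private_ips` implies (the function name is ours, not Potato's):

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_private_target(url: str) -> bool:
    """Return True if the URL's host resolves to a private or loopback address."""
    host = urlparse(url).hostname
    addr = ipaddress.ip_address(socket.gethostbyname(host))
    return addr.is_private or addr.is_loopback

print(is_private_target("http://127.0.0.1/data.jsonl"))  # True
print(is_private_target("http://10.0.0.5/data.jsonl"))   # True
```

A production check would also re-validate after redirects, since a public URL can redirect to an internal address.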
## Further Reading
- Data Formats - Input data format reference
- Admin Dashboard - Monitor data source status
For implementation details, see the source documentation.