
Remote Data Sources

Load annotation data from URLs, cloud storage, databases, and more.

Potato supports loading annotation data from various remote sources beyond local files, including URLs, cloud storage services, databases, and Hugging Face datasets.

Overview

The data sources system provides:

  • Multiple source types: URLs, Google Drive, Dropbox, S3, Hugging Face, Google Sheets, SQL databases
  • Partial loading: Load data in chunks for large datasets
  • Incremental loading: Auto-load more data as annotation progresses
  • Caching: Cache remote files locally to avoid repeated downloads
  • Secure credentials: Environment variable substitution for secrets

Configuration

Add `data_sources` to your `config.yaml`:

```yaml
data_sources:
  - type: file
    path: "data/annotations.jsonl"

  - type: url
    url: "https://example.com/data.jsonl"
```

Source Types

Local File

```yaml
data_sources:
  - type: file
    path: "data/annotations.jsonl"
```

HTTP/HTTPS URL

```yaml
data_sources:
  - type: url
    url: "https://example.com/data.jsonl"
    headers:
      Authorization: "Bearer ${API_TOKEN}"
    max_size_mb: 100
    timeout_seconds: 30
    block_private_ips: true    # SSRF protection
```
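Potato's actual SSRF check lives in its source; as a rough stdlib-only illustration of what blocking private and internal addresses can look like, here is a sketch (the function name is hypothetical, and a real implementation would also resolve hostnames to IPs before fetching):

```python
import ipaddress
from urllib.parse import urlparse

def is_blocked_url(url: str) -> bool:
    """Return True if the URL's host is a private/internal IP (SSRF risk).

    Hypothetical helper: only checks literal IP hosts; hostnames would
    need DNS resolution before the same check could be applied.
    """
    host = urlparse(url).hostname or ""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # not a literal IP; cannot decide without resolving it
    return ip.is_private or ip.is_loopback or ip.is_link_local

print(is_blocked_url("http://127.0.0.1/admin"))      # True
print(is_blocked_url("http://10.0.0.5/data.jsonl"))  # True
print(is_blocked_url("https://example.com/data.jsonl"))  # False
```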

Amazon S3

```yaml
data_sources:
  - type: s3
    bucket: "my-annotation-data"
    key: "datasets/items.jsonl"
    region: "us-east-1"
    access_key_id: "${AWS_ACCESS_KEY_ID}"
    secret_access_key: "${AWS_SECRET_ACCESS_KEY}"
```

Requires: `pip install boto3`

Google Drive

```yaml
data_sources:
  - type: google_drive
    url: "https://drive.google.com/file/d/xxx/view?usp=sharing"
```

For private files, use `credentials_file` with a service account. Requires: `pip install google-api-python-client google-auth`

Dropbox

```yaml
data_sources:
  - type: dropbox
    url: "https://www.dropbox.com/s/xxx/file.jsonl?dl=0"
```

For private files, set `access_token: "${DROPBOX_TOKEN}"`. Requires: `pip install dropbox`

Hugging Face Datasets

```yaml
data_sources:
  - type: huggingface
    dataset: "squad"
    split: "train"
    token: "${HF_TOKEN}"        # For private datasets
    id_field: "id"
    text_field: "context"
```

Requires: `pip install datasets`

Google Sheets

```yaml
data_sources:
  - type: google_sheets
    spreadsheet_id: "1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms"
    sheet_name: "Sheet1"
    credentials_file: "credentials/service_account.json"
```

Requires: `pip install google-api-python-client google-auth`

SQL Database

```yaml
data_sources:
  - type: database
    connection_string: "${DATABASE_URL}"
    query: "SELECT id, text, metadata FROM items WHERE status = 'pending'"
```

Or with individual parameters:

```yaml
data_sources:
  - type: database
    dialect: postgresql
    host: "localhost"
    port: 5432
    database: "annotations"
    username: "${DB_USER}"
    password: "${DB_PASSWORD}"
    table: "items"
```

Requires: `pip install sqlalchemy`, plus a database driver such as `psycopg2-binary` (PostgreSQL) or `pymysql` (MySQL)
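Conceptually, the database source runs the configured query and maps each row to an annotation item keyed by column name. A minimal sketch of that idea using only the standard library's `sqlite3` (Potato itself goes through SQLAlchemy; the `load_items` helper is hypothetical):

```python
import sqlite3

def load_items(conn, query):
    """Run a query and turn each row into an item dict keyed by column name."""
    conn.row_factory = sqlite3.Row
    return [dict(row) for row in conn.execute(query)]

# Build a tiny in-memory table mirroring the example query above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT, text TEXT, status TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)",
                 [("a1", "First sentence.", "pending"),
                  ("a2", "Second sentence.", "done")])

items = load_items(conn, "SELECT id, text FROM items WHERE status = 'pending'")
print(items)  # [{'id': 'a1', 'text': 'First sentence.'}]
```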

Partial/Incremental Loading

For large datasets, enable partial loading:

```yaml
partial_loading:
  enabled: true
  initial_count: 1000
  batch_size: 500
  auto_load_threshold: 0.8     # Auto-load when 80% annotated
```
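The auto-load trigger is simple arithmetic: once the annotated fraction of the currently loaded items reaches the threshold, the next batch is fetched. A sketch of that check (the function name is an assumption, not Potato's API):

```python
def should_load_more(annotated: int, loaded: int, threshold: float = 0.8) -> bool:
    """True once the annotated fraction of loaded items reaches the threshold."""
    return loaded > 0 and annotated / loaded >= threshold

# With initial_count: 1000 and auto_load_threshold: 0.8, the next batch of
# batch_size items is requested at the 800th annotation.
print(should_load_more(799, 1000))  # False
print(should_load_more(800, 1000))  # True
```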

Caching

Remote sources are cached locally:

```yaml
data_cache:
  enabled: true
  cache_dir: ".potato_cache"
  ttl_seconds: 3600            # 1 hour
  max_size_mb: 500
```
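Conceptually, a cached download is reused until its age exceeds `ttl_seconds`, after which it is fetched again. A minimal sketch of the freshness check (file layout and helper name are assumptions, not Potato's actual cache format):

```python
import tempfile
import time
from pathlib import Path

def is_fresh(cache_path: Path, ttl_seconds: int) -> bool:
    """A cached file is fresh if it exists and is younger than the TTL."""
    if not cache_path.exists():
        return False
    age = time.time() - cache_path.stat().st_mtime
    return age < ttl_seconds

# Example: write a cache entry, then check it against two TTLs.
entry = Path(tempfile.mkdtemp()) / "data.jsonl"
entry.write_text('{"id": "1", "text": "hello"}\n')
print(is_fresh(entry, ttl_seconds=3600))  # True  (just written)
print(is_fresh(entry, ttl_seconds=0))     # False (TTL already expired)
```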

Credential Management

Use environment variables for sensitive values:

```yaml
data_sources:
  - type: url
    url: "https://api.example.com/data"
    headers:
      Authorization: "Bearer ${API_TOKEN}"

credentials:
  env_substitution: true
  env_file: ".env"
```
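The `${VAR_NAME}` placeholders are resolved from the environment before the config is used. A rough, regex-based sketch of that substitution (a simplification; the function name is hypothetical):

```python
import os
import re

def substitute_env(value: str) -> str:
    """Replace ${VAR} placeholders with environment values; unset vars raise."""
    def lookup(match):
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"environment variable {name} is not set")
        return os.environ[name]
    return re.sub(r"\$\{(\w+)\}", lookup, value)

os.environ["API_TOKEN"] = "s3cret"
print(substitute_env("Bearer ${API_TOKEN}"))  # Bearer s3cret
```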

Multiple Sources

Combine data from multiple sources:

```yaml
data_sources:
  - type: file
    path: "data/base.jsonl"

  - type: url
    url: "https://example.com/extra.jsonl"

  - type: s3
    bucket: "my-bucket"
    key: "annotations/batch1.jsonl"
```

Backward Compatibility

The `data_files` configuration continues to work alongside `data_sources`:

```yaml
data_files:
  - "data/existing.jsonl"

data_sources:
  - type: url
    url: "https://example.com/additional.jsonl"
```

Security

  • URL sources block private/internal IP addresses by default (SSRF protection)
  • Never commit credentials to version control
  • Use `${VAR_NAME}` syntax for secrets
  • Store .env files outside your repository

Further Reading

For implementation details, see the source documentation.