
Remote Data Sources

Load annotation data from URLs, cloud storage, databases, and more.

Potato supports loading annotation data from various remote sources beyond local files, including URLs, cloud storage services, databases, and Hugging Face datasets.

Overview

The data sources system provides:

  • Multiple source types: URLs, Google Drive, Dropbox, S3, Hugging Face, Google Sheets, SQL databases
  • Partial loading: Load data in chunks for large datasets
  • Incremental loading: Auto-load more data as annotation progresses
  • Caching: Cache remote files locally to avoid repeated downloads
  • Secure credentials: Environment variable substitution for secrets

Configuration

Add `data_sources` to your `config.yaml`:

```yaml
data_sources:
  - type: file
    path: "data/annotations.jsonl"

  - type: url
    url: "https://example.com/data.jsonl"
```

Source Types

Local File

```yaml
data_sources:
  - type: file
    path: "data/annotations.jsonl"
```

HTTP/HTTPS URL

```yaml
data_sources:
  - type: url
    url: "https://example.com/data.jsonl"
    headers:
      Authorization: "Bearer ${API_TOKEN}"
    max_size_mb: 100
    timeout_seconds: 30
    block_private_ips: true    # SSRF protection
```
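Potato's actual SSRF check lives in its source; as a rough stdlib-only illustration of what blocking private and internal addresses can look like, here is a sketch (the function name is hypothetical, and a real implementation would also resolve hostnames to IPs before fetching):

```python
import ipaddress
from urllib.parse import urlparse

def is_blocked_url(url: str) -> bool:
    """Return True if the URL's host is a private/internal IP (SSRF risk).

    Hypothetical helper: only checks literal IP hosts; hostnames would
    need DNS resolution before the same check could be applied.
    """
    host = urlparse(url).hostname or ""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # not a literal IP; cannot decide without resolving it
    return ip.is_private or ip.is_loopback or ip.is_link_local

print(is_blocked_url("http://127.0.0.1/admin"))      # True
print(is_blocked_url("http://10.0.0.5/data.jsonl"))  # True
print(is_blocked_url("https://example.com/data.jsonl"))  # False
```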

Amazon S3

```yaml
data_sources:
  - type: s3
    bucket: "my-annotation-data"
    key: "datasets/items.jsonl"
    region: "us-east-1"
    access_key_id: "${AWS_ACCESS_KEY_ID}"
    secret_access_key: "${AWS_SECRET_ACCESS_KEY}"
```

Requires: `pip install boto3`

Google Drive

```yaml
data_sources:
  - type: google_drive
    url: "https://drive.google.com/file/d/xxx/view?usp=sharing"
```

For private files, use `credentials_file` with a service account. Requires: `pip install google-api-python-client google-auth`

Dropbox

```yaml
data_sources:
  - type: dropbox
    url: "https://www.dropbox.com/s/xxx/file.jsonl?dl=0"
```

For private files, set `access_token: "${DROPBOX_TOKEN}"`. Requires: `pip install dropbox`

Hugging Face Datasets

```yaml
data_sources:
  - type: huggingface
    dataset: "squad"
    split: "train"
    token: "${HF_TOKEN}"        # For private datasets
    id_field: "id"
    text_field: "context"
```

Requires: `pip install datasets`

Google Sheets

```yaml
data_sources:
  - type: google_sheets
    spreadsheet_id: "1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms"
    sheet_name: "Sheet1"
    credentials_file: "credentials/service_account.json"
```

Requires: `pip install google-api-python-client google-auth`

SQL Database

```yaml
data_sources:
  - type: database
    connection_string: "${DATABASE_URL}"
    query: "SELECT id, text, metadata FROM items WHERE status = 'pending'"
```

Or with individual parameters:

```yaml
data_sources:
  - type: database
    dialect: postgresql
    host: "localhost"
    port: 5432
    database: "annotations"
    username: "${DB_USER}"
    password: "${DB_PASSWORD}"
    table: "items"
```

Requires: `pip install sqlalchemy`, plus a database driver such as `psycopg2-binary` (PostgreSQL) or `pymysql` (MySQL)
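Conceptually, the database source runs the configured query and maps each row to an annotation item keyed by column name. A minimal sketch of that idea using only the standard library's `sqlite3` (Potato itself goes through SQLAlchemy; the `load_items` helper is hypothetical):

```python
import sqlite3

def load_items(conn, query):
    """Run a query and turn each row into an item dict keyed by column name."""
    conn.row_factory = sqlite3.Row
    return [dict(row) for row in conn.execute(query)]

# Build a tiny in-memory table mirroring the example query above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT, text TEXT, status TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)",
                 [("a1", "First sentence.", "pending"),
                  ("a2", "Second sentence.", "done")])

items = load_items(conn, "SELECT id, text FROM items WHERE status = 'pending'")
print(items)  # [{'id': 'a1', 'text': 'First sentence.'}]
```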

Partial/Incremental Loading

For large datasets, enable partial loading:

```yaml
partial_loading:
  enabled: true
  initial_count: 1000
  batch_size: 500
  auto_load_threshold: 0.8     # Auto-load when 80% annotated
```
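The auto-load trigger is simple arithmetic: once the annotated fraction of the currently loaded items reaches the threshold, the next batch is fetched. A sketch of that check (the function name is an assumption, not Potato's API):

```python
def should_load_more(annotated: int, loaded: int, threshold: float = 0.8) -> bool:
    """True once the annotated fraction of loaded items reaches the threshold."""
    return loaded > 0 and annotated / loaded >= threshold

# With initial_count: 1000 and auto_load_threshold: 0.8, the next batch of
# batch_size items is requested at the 800th annotation.
print(should_load_more(799, 1000))  # False
print(should_load_more(800, 1000))  # True
```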

Caching

Remote sources are cached locally:

```yaml
data_cache:
  enabled: true
  cache_dir: ".potato_cache"
  ttl_seconds: 3600            # 1 hour
  max_size_mb: 500
```
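Conceptually, a cached download is reused until its age exceeds `ttl_seconds`, after which it is fetched again. A minimal sketch of the freshness check (file layout and helper name are assumptions, not Potato's actual cache format):

```python
import tempfile
import time
from pathlib import Path

def is_fresh(cache_path: Path, ttl_seconds: int) -> bool:
    """A cached file is fresh if it exists and is younger than the TTL."""
    if not cache_path.exists():
        return False
    age = time.time() - cache_path.stat().st_mtime
    return age < ttl_seconds

# Example: write a cache entry, then check it against two TTLs.
entry = Path(tempfile.mkdtemp()) / "data.jsonl"
entry.write_text('{"id": "1", "text": "hello"}\n')
print(is_fresh(entry, ttl_seconds=3600))  # True  (just written)
print(is_fresh(entry, ttl_seconds=0))     # False (TTL already expired)
```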

Credential Management

Use environment variables for sensitive values:

```yaml
data_sources:
  - type: url
    url: "https://api.example.com/data"
    headers:
      Authorization: "Bearer ${API_TOKEN}"

credentials:
  env_substitution: true
  env_file: ".env"
```
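The `${VAR_NAME}` placeholders are resolved from the environment before the config is used. A rough, regex-based sketch of that substitution (a simplification; the function name is hypothetical):

```python
import os
import re

def substitute_env(value: str) -> str:
    """Replace ${VAR} placeholders with environment values; unset vars raise."""
    def lookup(match):
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"environment variable {name} is not set")
        return os.environ[name]
    return re.sub(r"\$\{(\w+)\}", lookup, value)

os.environ["API_TOKEN"] = "s3cret"
print(substitute_env("Bearer ${API_TOKEN}"))  # Bearer s3cret
```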

Multiple Sources

Combine data from multiple sources:

```yaml
data_sources:
  - type: file
    path: "data/base.jsonl"

  - type: url
    url: "https://example.com/extra.jsonl"

  - type: s3
    bucket: "my-bucket"
    key: "annotations/batch1.jsonl"
```

Backward Compatibility

The `data_files` configuration continues to work alongside `data_sources`:

```yaml
data_files:
  - "data/existing.jsonl"

data_sources:
  - type: url
    url: "https://example.com/additional.jsonl"
```

Security

  • URL sources block private/internal IP addresses by default (SSRF protection)
  • Never commit credentials to version control
  • Use `${VAR_NAME}` syntax for secrets
  • Store .env files outside your repository

Further Reading

For implementation details, see the source documentation.