# Remote Data Sources

Source: https://www.potatoannotator.com/docs/features/remote-data-sources

Potato supports loading annotation data from various remote sources beyond local files, including URLs, cloud storage services, databases, and Hugging Face datasets.

## Overview

The data sources system provides:

- **Multiple source types**: URLs, Google Drive, Dropbox, S3, Hugging Face, Google Sheets, SQL databases
- **Partial loading**: Load data in chunks for large datasets
- **Incremental loading**: Auto-load more data as annotation progresses
- **Caching**: Cache remote files locally to avoid repeated downloads
- **Secure credentials**: Environment variable substitution for secrets

## Configuration

Add a `data_sources` list to your `config.yaml`. Each entry declares a `type` and the options for that source:

```yaml
data_sources:
  - type: file
    path: "data/annotations.jsonl"

  - type: url
    url: "https://example.com/data.jsonl"
```

## Source Types

### Local File

```yaml
data_sources:
  - type: file
    path: "data/annotations.jsonl"
```

### HTTP/HTTPS URL

```yaml
data_sources:
  - type: url
    url: "https://example.com/data.jsonl"
    headers:
      Authorization: "Bearer ${API_TOKEN}"
    max_size_mb: 100
    timeout_seconds: 30
    block_private_ips: true    # SSRF protection
```
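The `block_private_ips` option guards against server-side request forgery (SSRF), where a configured URL points at an internal service. The following is a minimal sketch of that kind of check, not Potato's actual implementation; the function names `is_blocked_ip` and `url_targets_private_host` are hypothetical and only Python's standard library is used.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_blocked_ip(address: str) -> bool:
    """Return True if the IP literal is not a public, routable address."""
    ip = ipaddress.ip_address(address)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def url_targets_private_host(url: str) -> bool:
    """Resolve the URL's host and report whether any address is blocked.

    Hypothetical helper: a real loader would also need to re-check the
    address after redirects.
    """
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"URL has no host: {url!r}")
    # getaddrinfo accepts hostnames and IP literals alike.
    infos = socket.getaddrinfo(host, None)
    # Drop any IPv6 scope suffix (e.g. "%eth0") before parsing.
    return any(is_blocked_ip(info[4][0].split("%")[0]) for info in infos)
```

Checking the resolved address rather than the hostname string matters because a public-looking domain name can resolve to an internal IP.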

### Amazon S3

```yaml
data_sources:
  - type: s3
    bucket: "my-annotation-data"
    key: "datasets/items.jsonl"
    region: "us-east-1"
    access_key_id: "${AWS_ACCESS_KEY_ID}"
    secret_access_key: "${AWS_SECRET_ACCESS_KEY}"
```

Requires: `pip install boto3`

### Google Drive

```yaml
data_sources:
  - type: google_drive
    url: "https://drive.google.com/file/d/xxx/view?usp=sharing"
```

For private files, set `credentials_file` to the path of a service account key file.

Requires: `pip install google-api-python-client google-auth`
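A Google Drive sharing link of the form shown above embeds the file ID between `/file/d/` and `/view`. As an illustration only, the ID could be extracted like this; `extract_drive_file_id` is a hypothetical helper, and whether Potato parses the link this way is an assumption.

```python
import re
from typing import Optional

# Matches the ID segment in links such as
# https://drive.google.com/file/d/<FILE_ID>/view?usp=sharing
_DRIVE_FILE_ID = re.compile(r"/file/d/([A-Za-z0-9_-]+)")


def extract_drive_file_id(share_url: str) -> Optional[str]:
    """Return the file ID from a Drive sharing link, or None if absent."""
    match = _DRIVE_FILE_ID.search(share_url)
    return match.group(1) if match else None
```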

### Dropbox

```yaml
data_sources:
  - type: dropbox
    url: "https://www.dropbox.com/s/xxx/file.jsonl?dl=0"
```

For private files, use `access_token: "${DROPBOX_TOKEN}"`.

Requires: `pip install dropbox`

### Hugging Face Datasets

```yaml
data_sources:
  - type: huggingface
    dataset: "squad"
    split: "train"
    token: "${HF_TOKEN}"        # For private datasets
    id_field: "id"
    text_field: "context"
```

Requires: `pip install datasets`

### Google Sheets

```yaml
data_sources:
  - type: google_sheets
    spreadsheet_id: "1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms"
    sheet_name: "Sheet1"
    credentials_file: "credentials/service_account.json"
```

Requires: `pip install google-api-python-client google-auth`

### SQL Database

```yaml
data_sources:
  - type: database
    connection_string: "${DATABASE_URL}"
    query: "SELECT id, text, metadata FROM items WHERE status = 'pending'"
```

Or with individual parameters:

```yaml
data_sources:
  - type: database
    dialect: postgresql
    host: "localhost"
    port: 5432
    database: "annotations"
    username: "${DB_USER}"
    password: "${DB_PASSWORD}"
    table: "items"
```

Requires: `pip install sqlalchemy psycopg2-binary` (PostgreSQL) or `pip install sqlalchemy pymysql` (MySQL)
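The two configuration styles carry the same information: SQLAlchemy connection URLs follow the pattern `dialect://username:password@host:port/database`. The sketch below shows how the individual parameters could be assembled into such a string. `build_connection_string` is a hypothetical helper, and how Potato combines these fields internally is an assumption.

```python
from urllib.parse import quote_plus


def build_connection_string(
    dialect: str,
    host: str,
    port: int,
    database: str,
    username: str,
    password: str,
) -> str:
    """Assemble a SQLAlchemy-style URL from individual settings.

    The username and password are URL-encoded so that characters such
    as "@" or "/" do not break URL parsing.
    """
    user = quote_plus(username)
    secret = quote_plus(password)
    return f"{dialect}://{user}:{secret}@{host}:{port}/{database}"
```

For example, the parameters in the block above with a username of `annotator` and a password of `p@ss/word` would yield `postgresql://annotator:p%40ss%2Fword@localhost:5432/annotations`.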

## Partial/Incremental Loading

For large datasets, enable partial loading to load the data in chunks and fetch more automatically as annotation progresses:

```yaml
partial_loading:
  enabled: true
  initial_count: 1000
  batch_size: 500
  auto_load_threshold: 0.8     # Auto-load when 80% annotated
```
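To make the threshold concrete, here is a small simulation of when extra batches would be requested. It assumes `auto_load_threshold` is the fraction of currently loaded items that have been annotated; the source states only "80% annotated", so this interpretation, along with the names `should_load_more` and `simulate_triggers`, is an assumption rather than Potato's documented behavior.

```python
from typing import List


def should_load_more(annotated: int, loaded: int, threshold: float) -> bool:
    """True once the annotated share of loaded items reaches the threshold."""
    if loaded == 0:
        return True
    return annotated / loaded >= threshold


def simulate_triggers(
    total: int, initial_count: int, batch_size: int, threshold: float
) -> List[int]:
    """Return the annotated counts at which each extra batch is loaded."""
    loaded = min(initial_count, total)
    triggers = []
    for annotated in range(1, total + 1):
        if loaded < total and should_load_more(annotated, loaded, threshold):
            triggers.append(annotated)
            loaded = min(loaded + batch_size, total)
    return triggers
```

Under this interpretation, the settings above on a 2,500-item dataset would load further batches after 800, 1,200, and 1,600 annotations: each new batch raises the loaded total, so the 80% mark moves with it.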

## Caching

Files from remote sources are cached locally to avoid repeated downloads. Configure the cache with `data_cache`:

```yaml
data_cache:
  enabled: true
  cache_dir: ".potato_cache"
  ttl_seconds: 3600            # 1 hour
  max_size_mb: 500
```
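A time-to-live (TTL) cache typically treats a cached file as fresh while its age is below `ttl_seconds` and downloads it again afterwards. The sketch below illustrates that idea using the file's modification time; `is_cache_fresh` is a hypothetical helper, and Potato's actual expiry logic may differ.

```python
import os
import time
from typing import Optional


def is_cache_fresh(
    path: str, ttl_seconds: int, now: Optional[float] = None
) -> bool:
    """Return True if the cached file exists and is younger than the TTL.

    `now` can be injected for testing; it defaults to the current time.
    """
    if not os.path.exists(path):
        return False
    current = time.time() if now is None else now
    age_seconds = current - os.path.getmtime(path)
    return age_seconds < ttl_seconds
```

With `ttl_seconds: 3600`, a file downloaded 30 minutes ago would be reused, while one downloaded two hours ago would be fetched again.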

## Credential Management

Use environment variables for sensitive values:

```yaml
data_sources:
  - type: url
    url: "https://api.example.com/data"
    headers:
      Authorization: "Bearer ${API_TOKEN}"

credentials:
  env_substitution: true
  env_file: ".env"
```
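Conceptually, environment variable substitution replaces each `${VAR_NAME}` placeholder with the value of that variable before the configuration is used. A minimal sketch of the idea follows; `substitute_env` is a hypothetical function, and the choice to raise an error on an unset variable is an assumption about desirable behavior, not a description of Potato.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace every ${NAME} in `value` with the matching variable's value."""
    variables = os.environ if env is None else env

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"Environment variable {name} is not set")
        return variables[name]

    return _PLACEHOLDER.sub(replace, value)
```

Failing loudly on a missing variable avoids sending a literal `Bearer ${API_TOKEN}` header to the remote server.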

## Multiple Sources

Combine data from multiple sources:

```yaml
data_sources:
  - type: file
    path: "data/base.jsonl"

  - type: url
    url: "https://example.com/extra.jsonl"

  - type: s3
    bucket: "my-bucket"
    key: "annotations/batch1.jsonl"
```

## Backward Compatibility

The `data_files` configuration continues to work alongside `data_sources`:

```yaml
data_files:
  - "data/existing.jsonl"

data_sources:
  - type: url
    url: "https://example.com/additional.jsonl"
```

## Security

- URL sources block private/internal IP addresses by default (SSRF protection)
- Never commit credentials to version control
- Use `${VAR_NAME}` syntax for secrets
- Store `.env` files outside your repository

## Further Reading

- [Data Formats](/docs/core-concepts/data-formats) - Input data format reference
- [Admin Dashboard](/docs/features/admin-dashboard) - Monitor data source status

For implementation details, see the [source documentation](https://github.com/davidjurgens/potato/blob/main/docs/remote_data_sources.md).
