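Remote Data Sources

Load annotation data from URLs, cloud storage, databases, and more.

Beyond local files, Potato can load annotation data from a range of remote sources, including URLs, cloud storage services, databases, and Hugging Face datasets.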

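Overview

The data source system provides:

Multiple source types: URLs, Google Drive, Dropbox, S3, Hugging Face, Google Sheets, SQL databases
Partial loading: large datasets are loaded in chunks
Incremental loading: more data is loaded automatically as annotation progresses
Caching: remote files are cached locally to avoid repeated downloads
Secure credentials: environment variable substitution for sensitive values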

Configuration

Add data_sources to config.yaml:

yaml

data_sources:
  - type: file
    path: "data/annotations.jsonl"
 
  - type: url
    url: "https://example.com/data.jsonl"

Source Types

Local File

yaml

data_sources:
  - type: file
    path: "data/annotations.jsonl"

HTTP/HTTPS URL

yaml

data_sources:
  - type: url
    url: "https://example.com/data.jsonl"
    headers:
      Authorization: "Bearer ${API_TOKEN}"
    max_size_mb: 100
    timeout_seconds: 30
    block_private_ips: true    # SSRF protection

Amazon S3

yaml

data_sources:
  - type: s3
    bucket: "my-annotation-data"
    key: "datasets/items.jsonl"
    region: "us-east-1"
    access_key_id: "${AWS_ACCESS_KEY_ID}"
    secret_access_key: "${AWS_SECRET_ACCESS_KEY}"

Requires: pip install boto3

Google Drive

yaml

data_sources:
  - type: google_drive
    url: "https://drive.google.com/file/d/xxx/view?usp=sharing"

For private files, set credentials_file and use a service account. Requires: pip install google-api-python-client google-auth

Dropbox

yaml

data_sources:
  - type: dropbox
    url: "https://www.dropbox.com/s/xxx/file.jsonl?dl=0"

For private files, set access_token: "${DROPBOX_TOKEN}". Requires: pip install dropbox

Hugging Face Datasets

yaml

data_sources:
  - type: huggingface
    dataset: "squad"
    split: "train"
    token: "${HF_TOKEN}"        # For private datasets
    id_field: "id"
    text_field: "context"

Requires: pip install datasets
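
The id_field and text_field settings above name the dataset columns to use as the item ID and item text. A minimal sketch of that mapping, assuming each annotation item ends up as a dictionary with "id" and "text" keys (the item shape and the to_item helper are illustrative, not Potato's actual internals):

```python
def to_item(record, id_field="id", text_field="context"):
    """Map one dataset record to a hypothetical {"id", "text"} annotation item."""
    # Columns other than id_field and text_field are dropped in this sketch.
    return {"id": str(record[id_field]), "text": record[text_field]}


record = {"id": "5733be28", "context": "Some passage text.", "question": "Which?"}
item = to_item(record)
```

A record that lacks either configured column raises KeyError here, which surfaces a misconfigured field name early.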

Google Sheets

yaml

data_sources:
  - type: google_sheets
    spreadsheet_id: "1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms"
    sheet_name: "Sheet1"
    credentials_file: "credentials/service_account.json"

Requires: pip install google-api-python-client google-auth

SQL Database

yaml

data_sources:
  - type: database
    connection_string: "${DATABASE_URL}"
    query: "SELECT id, text, metadata FROM items WHERE status = 'pending'"

Or specify the connection with individual parameters:

yaml

data_sources:
  - type: database
    dialect: postgresql
    host: "localhost"
    port: 5432
    database: "annotations"
    username: "${DB_USER}"
    password: "${DB_PASSWORD}"
    table: "items"

Requires: pip install sqlalchemy, plus psycopg2-binary (PostgreSQL) or pymysql (MySQL)
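
The two configuration styles describe the same connection. As a rough sketch of how the individual parameters could be combined into a SQLAlchemy-style URL for connection_string (build_connection_url is a hypothetical helper, and whether Potato assembles the URL this way is an assumption):

```python
from urllib.parse import quote_plus


def build_connection_url(dialect, host, port, database, username, password):
    """Assemble a dialect://user:password@host:port/database URL."""
    # Percent-encode the credentials so characters such as "@" or "/"
    # inside a password cannot be mistaken for URL delimiters.
    user = quote_plus(username)
    secret = quote_plus(password)
    return f"{dialect}://{user}:{secret}@{host}:{port}/{database}"


url = build_connection_url(
    "postgresql", "localhost", 5432, "annotations", "alice", "p@ss/word"
)
```

In practice the username and password would come from environment variables such as DB_USER and DB_PASSWORD rather than literals.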

Partial/Incremental Loading

For large datasets, enable partial loading:

yaml

partial_loading:
  enabled: true
  initial_count: 1000
  batch_size: 500
  auto_load_threshold: 0.8     # Auto-load when 80% annotated
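
One way to read these settings: start with initial_count items, then fetch batch_size more whenever the annotated share reaches auto_load_threshold. The sketch below illustrates that arithmetic; it assumes the threshold is measured against the items loaded so far (not the full dataset), and both function names are hypothetical:

```python
def should_load_more(annotated, loaded, threshold=0.8):
    """True once the annotated share of loaded items reaches the threshold."""
    if loaded == 0:
        return True  # nothing loaded yet, so fetch the first batch
    return annotated / loaded >= threshold


def next_loaded_count(loaded, total, batch_size=500):
    """Item count after fetching one more batch, capped at the dataset size."""
    return min(loaded + batch_size, total)
```

With the values above, the first extra batch would be requested once 800 of the initial 1000 items are annotated, bringing the loaded count to 1500.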

Caching

Files from remote sources are cached locally:

yaml

data_cache:
  enabled: true
  cache_dir: ".potato_cache"
  ttl_seconds: 3600            # 1 hour
  max_size_mb: 500
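
To make ttl_seconds concrete, here is a minimal sketch of a time-based freshness check together with one possible way to derive a cache file name from a URL. Both helpers are hypothetical, and hashing the URL with SHA-256 is an assumption rather than Potato's documented cache layout:

```python
import hashlib
import os


def cache_path(url, cache_dir=".potato_cache"):
    """Derive a stable file path for a URL by hashing it."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest)


def is_fresh(fetched_at, now, ttl_seconds=3600):
    """True while the cached copy is younger than ttl_seconds."""
    return now - fetched_at < ttl_seconds
```

A loader would typically call is_fresh with the cached file's modification time and the current time, and download again only when it returns False.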

Credential Management

Use environment variables for sensitive values:

yaml

data_sources:
  - type: url
    url: "https://api.example.com/data"
    headers:
      Authorization: "Bearer ${API_TOKEN}"
 
credentials:
  env_substitution: true
  env_file: ".env"
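
The ${VAR_NAME} placeholders are replaced with environment variable values when the configuration is loaded. A minimal sketch of that substitution, assuming placeholders follow the usual identifier rules and that an unset variable should be treated as an error (substitute_env is a hypothetical helper, not Potato's API):

```python
import os
import re

# Matches ${VAR_NAME}; the exact grammar Potato accepts is an assumption.
_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(value, env=None):
    """Replace each ${VAR_NAME} in value with its entry from env."""
    env = os.environ if env is None else env

    def _lookup(match):
        name = match.group(1)
        if name not in env:
            # Fail loudly rather than send a literal "${API_TOKEN}" upstream.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _VAR_PATTERN.sub(_lookup, value)


header = substitute_env("Bearer ${API_TOKEN}", env={"API_TOKEN": "s3cr3t"})
```

Passing env explicitly keeps the function easy to test; in normal use it falls back to os.environ.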

Multiple Data Sources

Combine data from several sources:

yaml

data_sources:
  - type: file
    path: "data/base.jsonl"
 
  - type: url
    url: "https://example.com/extra.jsonl"
 
  - type: s3
    bucket: "my-bucket"
    key: "annotations/batch1.jsonl"

Backward Compatibility

The existing data_files setting continues to work alongside data_sources:

yaml

data_files:
  - "data/existing.jsonl"
 
data_sources:
  - type: url
    url: "https://example.com/additional.jsonl"

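Security

URL sources block private/internal IP addresses by default (SSRF protection)
Never commit credentials to version control
Use the ${VAR_NAME} syntax for sensitive values
Store .env files outside the repository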
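
To illustrate what blocking private and internal addresses involves, here is a minimal sketch using Python's standard ipaddress module. It only classifies an IP literal; is_blocked_address is a hypothetical helper, and the exact set of ranges Potato rejects is an assumption:

```python
import ipaddress


def is_blocked_address(address):
    """True if the IP literal falls in a range an SSRF guard would typically refuse."""
    ip = ipaddress.ip_address(address)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local      # includes 169.254.169.254 cloud metadata endpoints
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )
```

A complete guard would also resolve the URL's hostname and apply this check to every resolved address, since a public-looking hostname can point at an internal IP.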
