Remote Data Sources
Load annotation data from URLs, cloud storage, databases, and more.
Potato supports loading annotation data from a variety of remote sources beyond local files, including URLs, cloud storage services, databases, and Hugging Face datasets.
Overview
The data source system provides:
- Multiple source types: URLs, Google Drive, Dropbox, S3, Hugging Face, Google Sheets, SQL databases
- Partial loading: load large datasets in chunks
- Incremental loading: automatically load more data as annotation progresses
- Caching: cache remote files locally to avoid repeated downloads
- Secure credentials: environment-variable substitution for sensitive values
Configuration
Add a data_sources section to config.yaml:
```yaml
data_sources:
  - type: file
    path: "data/annotations.jsonl"
  - type: url
    url: "https://example.com/data.jsonl"
```
Source Types
Local Files
```yaml
data_sources:
  - type: file
    path: "data/annotations.jsonl"
```
HTTP/HTTPS URLs
```yaml
data_sources:
  - type: url
    url: "https://example.com/data.jsonl"
    headers:
      Authorization: "Bearer ${API_TOKEN}"
    max_size_mb: 100
    timeout_seconds: 30
    block_private_ips: true  # SSRF protection
```
Amazon S3
```yaml
data_sources:
  - type: s3
    bucket: "my-annotation-data"
    key: "datasets/items.jsonl"
    region: "us-east-1"
    access_key_id: "${AWS_ACCESS_KEY_ID}"
    secret_access_key: "${AWS_SECRET_ACCESS_KEY}"
```
Dependency: `pip install boto3`
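The `${AWS_ACCESS_KEY_ID}` and `${AWS_SECRET_ACCESS_KEY}` placeholders above are resolved from environment variables before the source is used. A minimal sketch of how such substitution might work (the function name is illustrative, not Potato's actual API):

```python
import os
import re

def substitute_env(value: str) -> str:
    """Replace ${VAR_NAME} placeholders with environment variable values.

    Raises KeyError when a referenced variable is unset, so missing
    credentials fail loudly instead of silently becoming empty strings.
    """
    def resolve(match: re.Match) -> str:
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"environment variable {name!r} is not set")
        return os.environ[name]

    return re.sub(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}", resolve, value)

# Demonstration only: set a value in this process's environment
os.environ["AWS_ACCESS_KEY_ID"] = "AKIAEXAMPLE"
print(substitute_env("${AWS_ACCESS_KEY_ID}"))  # AKIAEXAMPLE
```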
Google Drive
```yaml
data_sources:
  - type: google_drive
    url: "https://drive.google.com/file/d/xxx/view?usp=sharing"
```
For private files, use `credentials_file` with a service account. Dependency: `pip install google-api-python-client google-auth`
Dropbox
```yaml
data_sources:
  - type: dropbox
    url: "https://www.dropbox.com/s/xxx/file.jsonl?dl=0"
```
For private files, use `access_token: "${DROPBOX_TOKEN}"`. Dependency: `pip install dropbox`
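Dropbox share links end in `?dl=0`, which serves an interactive preview page; rewriting the parameter to `dl=1` returns the raw file, which is the usual way loaders fetch shared files. A sketch of that rewrite (a simplified illustration, not Potato's actual code):

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

def to_direct_download(url: str) -> str:
    """Rewrite a Dropbox share link so it returns the raw file (dl=1)."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query["dl"] = ["1"]  # force direct download instead of the preview page
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

print(to_direct_download("https://www.dropbox.com/s/xxx/file.jsonl?dl=0"))
# https://www.dropbox.com/s/xxx/file.jsonl?dl=1
```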
Hugging Face Datasets
```yaml
data_sources:
  - type: huggingface
    dataset: "squad"
    split: "train"
    token: "${HF_TOKEN}"  # For private datasets
    id_field: "id"
    text_field: "context"
```
Dependency: `pip install datasets`
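The `id_field` and `text_field` options tell the loader which dataset columns to treat as the item ID and display text. A sketch of that mapping over plain dict records (illustrative of the configuration's effect, not Potato's internal code):

```python
def records_to_items(records, id_field="id", text_field="text"):
    """Map dataset records to annotation items using configured field names."""
    items = []
    for record in records:
        items.append({
            "id": str(record[id_field]),   # IDs normalized to strings
            "text": record[text_field],
        })
    return items

# SQuAD-style records: the "context" column becomes the annotated text
rows = [{"id": "5733be28", "context": "The Normans were a people..."}]
print(records_to_items(rows, id_field="id", text_field="context"))
```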
Google Sheets
```yaml
data_sources:
  - type: google_sheets
    spreadsheet_id: "1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms"
    sheet_name: "Sheet1"
    credentials_file: "credentials/service_account.json"
```
Dependency: `pip install google-api-python-client google-auth`
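A sheet arrives from the API as a header row plus rows of cells, which the loader has to turn into keyed records. A minimal sketch of that conversion (an assumption about how the tabular data becomes items, not Potato's actual code):

```python
def rows_to_records(rows):
    """Convert a header row plus data rows into a list of dicts.

    Short rows are padded with empty strings, mirroring how the
    Sheets API omits trailing empty cells.
    """
    if not rows:
        return []
    header = rows[0]
    records = []
    for row in rows[1:]:
        padded = row + [""] * (len(header) - len(row))
        records.append(dict(zip(header, padded)))
    return records

rows = [["id", "text"], ["1", "first item"], ["2"]]
print(rows_to_records(rows))
# [{'id': '1', 'text': 'first item'}, {'id': '2', 'text': ''}]
```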
SQL Databases
```yaml
data_sources:
  - type: database
    connection_string: "${DATABASE_URL}"
    query: "SELECT id, text, metadata FROM items WHERE status = 'pending'"
```
Or use individual parameters:
```yaml
data_sources:
  - type: database
    dialect: postgresql
    host: "localhost"
    port: 5432
    database: "annotations"
    username: "${DB_USER}"
    password: "${DB_PASSWORD}"
    table: "items"
```
Dependencies: `pip install sqlalchemy`, plus `psycopg2-binary` (PostgreSQL) or `pymysql` (MySQL)
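The database source runs the configured query and turns each returned row into an annotation item. A self-contained sketch of that flow using SQLite from the standard library (the real loader would go through SQLAlchemy with the configured dialect; the table and column names follow the example config above):

```python
import sqlite3

# In-memory database standing in for the configured PostgreSQL/MySQL server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT, text TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [("a1", "first example", "pending"),
     ("a2", "second example", "done"),
     ("a3", "third example", "pending")],
)

# The configured query: only rows still awaiting annotation
cursor = conn.execute("SELECT id, text FROM items WHERE status = 'pending'")
items = [{"id": row[0], "text": row[1]} for row in cursor]
print(items)
# [{'id': 'a1', 'text': 'first example'}, {'id': 'a3', 'text': 'third example'}]
```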
Partial/Incremental Loading
For large datasets, enable partial loading:
```yaml
partial_loading:
  enabled: true
  initial_count: 1000
  batch_size: 500
  auto_load_threshold: 0.8  # Auto-load when 80% annotated
```
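With `auto_load_threshold: 0.8`, the next batch is fetched once 80% of the currently loaded items have been annotated. A sketch of that check (the function name is illustrative, not Potato's actual API):

```python
def should_load_more(annotated: int, loaded: int, threshold: float = 0.8) -> bool:
    """Return True when the annotated fraction reaches the auto-load threshold."""
    if loaded == 0:
        return False  # nothing loaded yet; nothing to compare against
    return annotated / loaded >= threshold

# With initial_count: 1000, the next batch of 500 triggers at item 800
print(should_load_more(799, 1000))  # False
print(should_load_more(800, 1000))  # True
```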
Caching
Remote sources are cached locally:
```yaml
data_cache:
  enabled: true
  cache_dir: ".potato_cache"
  ttl_seconds: 3600  # 1 hour
  max_size_mb: 500
```
Credential Management
Use environment variables for sensitive values:
```yaml
data_sources:
  - type: url
    url: "https://api.example.com/data"
    headers:
      Authorization: "Bearer ${API_TOKEN}"
credentials:
  env_substitution: true
  env_file: ".env"
```
Multiple Data Sources
Combine data from multiple sources:
```yaml
data_sources:
  - type: file
    path: "data/base.jsonl"
  - type: url
    url: "https://example.com/extra.jsonl"
  - type: s3
    bucket: "my-bucket"
    key: "annotations/batch1.jsonl"
```
Backward Compatibility
The data_files configuration continues to work alongside data_sources:
```yaml
data_files:
  - "data/existing.jsonl"
data_sources:
  - type: url
    url: "https://example.com/additional.jsonl"
```
Security
- URL sources block private/internal IP addresses by default (SSRF protection)
- Never commit credentials to version control
- Use the `${VAR_NAME}` syntax for sensitive values
- Store `.env` files outside the repository
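The `block_private_ips` protection has to resolve a URL's host and reject private, loopback, and link-local addresses before fetching. A simplified sketch using the standard `ipaddress` module (not Potato's actual implementation; production code must also guard against DNS rebinding between check and fetch):

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url: str) -> bool:
    """Reject URLs whose host resolves to a private or otherwise
    non-global address (a basic SSRF guard)."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False  # unresolvable hosts are treated as unsafe
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        if addr.is_private or addr.is_loopback or addr.is_link_local or addr.is_reserved:
            return False
    return True

print(is_safe_url("http://127.0.0.1/admin"))    # False
print(is_safe_url("http://192.168.1.10/data"))  # False
```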
Further Reading
For implementation details, see the source code documentation.