Remote Data Sources

Load annotation data from URLs, cloud storage, databases, and more.

Beyond local files, Potato supports loading annotation data from a variety of remote sources, including URLs, cloud storage services, databases, and Hugging Face datasets.

Overview

The data sources system provides:

Multiple source types: URLs, Google Drive, Dropbox, S3, Hugging Face, Google Sheets, SQL databases
Partial loading: Load data in chunks for large datasets
Incremental loading: Automatically load more data as annotation progresses
Caching: Cache remote files locally to avoid repeated downloads
Secure credentials: Environment variable substitution for secrets

Configuration

Add data_sources to your config.yaml:

yaml

data_sources:
  - type: file
    path: "data/annotations.jsonl"
 
  - type: url
    url: "https://example.com/data.jsonl"

Source Types

Local File

yaml

data_sources:
  - type: file
    path: "data/annotations.jsonl"

HTTP/HTTPS URL

yaml

data_sources:
  - type: url
    url: "https://example.com/data.jsonl"
    headers:
      Authorization: "Bearer ${API_TOKEN}"
    max_size_mb: 100
    timeout_seconds: 30
    block_private_ips: true    # SSRF protection
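
To illustrate what the block_private_ips option guards against, here is a minimal sketch of a private-address check using Python's standard ipaddress module. It is not Potato's actual implementation; the function name is_blocked_address is hypothetical, and a real check would first resolve the URL's hostname and test every resolved address.

```python
import ipaddress


def is_blocked_address(ip_text: str) -> bool:
    """Return True if an IP literal is not a public, routable address.

    Hypothetical helper for illustration only.
    """
    ip = ipaddress.ip_address(ip_text)
    return (
        ip.is_private        # e.g. 10.0.0.0/8, 192.168.0.0/16
        or ip.is_loopback    # e.g. 127.0.0.1, ::1
        or ip.is_link_local  # e.g. 169.254.169.254 (cloud metadata endpoints)
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )
```

A request to an address for which this returns True would be rejected before any data is fetched.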

Amazon S3

yaml

data_sources:
  - type: s3
    bucket: "my-annotation-data"
    key: "datasets/items.jsonl"
    region: "us-east-1"
    access_key_id: "${AWS_ACCESS_KEY_ID}"
    secret_access_key: "${AWS_SECRET_ACCESS_KEY}"

Requires: pip install boto3

Google Drive

yaml

data_sources:
  - type: google_drive
    url: "https://drive.google.com/file/d/xxx/view?usp=sharing"

For private files, use credentials_file with a service account. Requires: pip install google-api-python-client google-auth

Dropbox

yaml

data_sources:
  - type: dropbox
    url: "https://www.dropbox.com/s/xxx/file.jsonl?dl=0"

For private files, use access_token: "${DROPBOX_TOKEN}". Requires: pip install dropbox

Hugging Face Datasets

yaml

data_sources:
  - type: huggingface
    dataset: "squad"
    split: "train"
    token: "${HF_TOKEN}"        # For private datasets
    id_field: "id"
    text_field: "context"

Requires: pip install datasets

Google Sheets

yaml

data_sources:
  - type: google_sheets
    spreadsheet_id: "1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms"
    sheet_name: "Sheet1"
    credentials_file: "credentials/service_account.json"

Requires: pip install google-api-python-client google-auth

SQL Database

yaml

data_sources:
  - type: database
    connection_string: "${DATABASE_URL}"
    query: "SELECT id, text, metadata FROM items WHERE status = 'pending'"

Or with individual parameters:

yaml

data_sources:
  - type: database
    dialect: postgresql
    host: "localhost"
    port: 5432
    database: "annotations"
    username: "${DB_USER}"
    password: "${DB_PASSWORD}"
    table: "items"

Requires: pip install sqlalchemy psycopg2-binary (PostgreSQL) or pymysql (MySQL)
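
The two configuration styles are closely related: the individual parameters map naturally onto a SQLAlchemy-style database URL of the form dialect://username:password@host:port/database. The sketch below shows that mapping under the assumption that the parameters are combined this way; build_connection_url is a hypothetical helper, not part of Potato.

```python
from urllib.parse import quote_plus


def build_connection_url(dialect: str, host: str, port: int,
                         database: str, username: str, password: str) -> str:
    """Assemble a database URL from individual parameters (illustrative only)."""
    # Percent-encode credentials so characters such as '@' or '/'
    # cannot be mistaken for URL delimiters.
    user = quote_plus(username)
    secret = quote_plus(password)
    return f"{dialect}://{user}:{secret}@{host}:{port}/{database}"


url = build_connection_url(
    "postgresql", "localhost", 5432, "annotations", "alice", "p@ss/word"
)
```

Encoding the password matters whenever it contains reserved characters; an unencoded '@' would split the URL at the wrong place.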

Partial/Incremental Loading

For large datasets, enable partial loading:

yaml

partial_loading:
  enabled: true
  initial_count: 1000
  batch_size: 500
  auto_load_threshold: 0.8     # Auto-load when 80% annotated
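
The interaction between initial_count, batch_size, and auto_load_threshold can be sketched as follows. This is an assumption about the logic, not Potato's code: the function names are hypothetical, and whether the comparison is inclusive at exactly 80% is a guess.

```python
def should_load_more(annotated: int, loaded: int, threshold: float = 0.8) -> bool:
    """Decide whether the next batch should be fetched (illustrative only)."""
    if loaded == 0:
        return True  # nothing loaded yet, so load the initial chunk
    return annotated / loaded >= threshold


def next_batch_range(loaded: int, total: int, batch_size: int = 500) -> tuple[int, int]:
    """Return the half-open index range [start, end) of the next batch."""
    end = min(loaded + batch_size, total)
    return loaded, end
```

With the configuration above, 1000 items are loaded first; once 800 of them are annotated, should_load_more(800, 1000) becomes true and items 1000 to 1499 would be fetched.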

Caching

Remote sources are cached locally:

yaml

data_cache:
  enabled: true
  cache_dir: ".potato_cache"
  ttl_seconds: 3600            # 1 hour
  max_size_mb: 500
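
As a rough illustration of what ttl_seconds means, the sketch below treats a cached file as fresh while its age is below the TTL and stale afterwards. The helper is_cache_fresh is hypothetical, and the exact expiry rule Potato applies is an assumption here.

```python
import time
from typing import Optional


def is_cache_fresh(fetched_at: float, ttl_seconds: int = 3600,
                   now: Optional[float] = None) -> bool:
    """Return True while a cached entry is younger than its TTL (illustrative only).

    fetched_at and now are Unix timestamps in seconds; now defaults to the
    current time and is injectable to make the check easy to test.
    """
    current = time.time() if now is None else now
    return (current - fetched_at) < ttl_seconds
```

A stale entry would be downloaded again on the next access, while a fresh one is served from cache_dir.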

Credential Management

Use environment variables for sensitive values:

yaml

data_sources:
  - type: url
    url: "https://api.example.com/data"
    headers:
      Authorization: "Bearer ${API_TOKEN}"
 
credentials:
  env_substitution: true
  env_file: ".env"
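
The ${VAR_NAME} substitution can be pictured as a simple pattern replacement over each configured string. The sketch below is one possible approach, not Potato's implementation; substitute_env is a hypothetical name, and raising an error for an unset variable is an assumed policy.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME}, where NAME is a conventional environment variable name.
_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace each ${NAME} in value with env[NAME] (illustrative only)."""
    source = os.environ if env is None else env

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in source:
            # Assumption: failing loudly is safer than sending an empty secret.
            raise KeyError(f"environment variable {name} is not set")
        return source[name]

    return _VAR_PATTERN.sub(replace, value)


header = substitute_env("Bearer ${API_TOKEN}", {"API_TOKEN": "abc123"})
```

Passing a mapping explicitly, as in the last line, keeps the function easy to test without touching the real environment.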

Multiple Sources

Combine data from multiple sources:

yaml

data_sources:
  - type: file
    path: "data/base.jsonl"
 
  - type: url
    url: "https://example.com/extra.jsonl"
 
  - type: s3
    bucket: "my-bucket"
    key: "annotations/batch1.jsonl"
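
Conceptually, combining sources means reading each one in order and concatenating the items. The sketch below shows one way to do that; combine_sources is a hypothetical helper, and the rule that the first occurrence of a duplicate ID wins is an assumption, since this page does not say how Potato treats duplicates.

```python
from itertools import chain
from typing import Iterable


def combine_sources(*sources: Iterable[dict], id_field: str = "id") -> list[dict]:
    """Concatenate items from several sources in order (illustrative only)."""
    seen = set()
    combined = []
    for item in chain.from_iterable(sources):
        item_id = item[id_field]
        if item_id in seen:
            continue  # assumption: keep the first item seen for each ID
        seen.add(item_id)
        combined.append(item)
    return combined


base = [{"id": "a", "text": "first"}, {"id": "b", "text": "second"}]
extra = [{"id": "b", "text": "duplicate"}, {"id": "c", "text": "third"}]
merged = combine_sources(base, extra)
```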

Backward Compatibility

The data_files configuration continues to work alongside data_sources:

yaml

data_files:
  - "data/existing.jsonl"
 
data_sources:
  - type: url
    url: "https://example.com/additional.jsonl"

Security

URL sources block private/internal IP addresses by default (SSRF protection)
Never commit credentials to version control
Use the ${VAR_NAME} syntax for secrets
Keep .env files outside your repository

Further Reading

Data Formats - Input data format reference
Admin Dashboard - Monitor data source status

For implementation details, see the source documentation.