Potato uses the JSON and JSONL formats for both input data and output annotations. This guide covers format specifications, examples, and best practices for every data type.

Input data formats

JSON Lines (JSONL) - Recommended

One JSON object per line:

json

{"id": "001", "text": "First document text here."}
{"id": "002", "text": "Second document text here."}
{"id": "003", "text": "Third document text here."}

Advantages:

Stream processing (memory-efficient)
Easy to append to
A corrupted line does not break the rest of the file
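
The advantages listed above can be illustrated with a minimal reader. This is a sketch, not Potato's own loader: `iter_jsonl` is a hypothetical helper name, and skipping bad lines with a message on stderr is an assumed policy.

```python
import json
import sys
from typing import Iterator, TextIO


def iter_jsonl(handle: TextIO) -> Iterator[dict]:
    """Yield one parsed object per non-empty line, skipping lines that fail to parse."""
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            # Assumed policy: report the bad line and keep going, so one
            # corrupted record does not make the rest of the file unusable.
            print(f"line {line_number}: skipped ({err.msg})", file=sys.stderr)
```

Because the function is a generator, only one line is held in memory at a time, whatever the size of the file. It accepts any open text handle, for example `open("data/items.jsonl", encoding="utf-8")` (a hypothetical path).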

JSON array

A standard JSON array:

json

[
  {"id": "001", "text": "First document."},
  {"id": "002", "text": "Second document."},
  {"id": "003", "text": "Third document."}
]

Configuration:

yaml

data_files:
  - data/items.json
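
If a dataset already exists as a JSON array, converting it to JSONL is mechanical. The sketch below shows one way to do it; `array_to_jsonl` is a hypothetical helper, not part of Potato.

```python
import json


def array_to_jsonl(array_text: str) -> str:
    """Convert the text of a JSON array into JSONL, one compact object per line."""
    items = json.loads(array_text)
    if not isinstance(items, list):
        raise ValueError("expected a top-level JSON array")
    # Without an indent argument, json.dumps writes each item on a single line;
    # ensure_ascii=False keeps accented characters readable in the output.
    return "".join(json.dumps(item, ensure_ascii=False) + "\n" for item in items)
```

The whole array has to be parsed at once here, which is the memory cost that JSONL avoids on later reads.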

Text annotation data

Basic text

json

{"id": "doc_001", "text": "The product quality exceeded my expectations."}

With metadata

json

{
  "id": "review_001",
  "text": "Great product, fast shipping!",
  "metadata": {
    "source": "amazon",
    "date": "2024-01-15",
    "author": "user123",
    "rating": 5
  }
}

With pre-annotations

json

{
  "id": "ner_001",
  "text": "Apple announced new products in Cupertino.",
  "pre_annotations": {
    "entities": [
      {"start": 0, "end": 5, "label": "ORG", "text": "Apple"},
      {"start": 32, "end": 41, "label": "LOC", "text": "Cupertino"}
    ]
  }
}
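
Character offsets are easy to get wrong by one, so it can help to check that each entity's `start` and `end` reproduce its `text` field before loading the file. The following is a minimal sketch that assumes `end` is exclusive, as in Python slicing; `find_span_mismatches` is a hypothetical helper name.

```python
def find_span_mismatches(item: dict) -> list[str]:
    """Return one message per entity whose offsets do not reproduce its "text"."""
    problems = []
    text = item["text"]
    for entity in item.get("pre_annotations", {}).get("entities", []):
        start, end = entity["start"], entity["end"]
        covered = text[start:end]  # assumption: end offset is exclusive
        if covered != entity["text"]:
            problems.append(
                f'{entity["label"]} [{start}:{end}] covers {covered!r}, '
                f'expected {entity["text"]!r}'
            )
    return problems
```

An empty list means every span lines up with the text it claims to cover.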

Configuration:

yaml

data_files:
  - data/texts.json
 
item_properties:
  id_key: id
  text_key: text

Image annotation data

Local images

json

{
  "id": "img_001",
  "image_path": "/data/images/photo_001.jpg",
  "caption": "Street scene in Paris"
}

Remote images

json

{
  "id": "img_002",
  "image_url": "https://example.com/images/photo.jpg"
}

With bounding boxes

json

{
  "id": "detection_001",
  "image_path": "/images/street.jpg",
  "pre_annotations": {
    "objects": [
      {"bbox": [100, 150, 200, 300], "label": "person"},
      {"bbox": [350, 200, 450, 280], "label": "car"}
    ]
  }
}
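
This guide does not state which coordinate convention `bbox` uses. The sketch below assumes `[x_min, y_min, x_max, y_max]` in pixels, which is only one common convention; if the data instead stores `[x, y, width, height]`, the check does not apply. `bbox_size` is a hypothetical helper.

```python
def bbox_size(bbox: list[float]) -> tuple[float, float]:
    """Return (width, height), ASSUMING bbox is [x_min, y_min, x_max, y_max]."""
    x_min, y_min, x_max, y_max = bbox
    if x_max <= x_min or y_max <= y_min:
        # Under the assumed convention, this box has no area or is inverted.
        raise ValueError(f"degenerate or inverted box: {bbox}")
    return x_max - x_min, y_max - y_min
```

Under that assumption, `bbox_size([100, 150, 200, 300])` returns `(100, 150)`.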

Configuration:

yaml

data_files:
  - data/images.json
 
item_properties:
  id_key: id
  image_key: image_path  # or image_url

Audio annotation data

Local audio

json

{
  "id": "audio_001",
  "audio_path": "/data/audio/recording.wav",
  "duration": 45.5,
  "transcript": "Hello, how are you today?"
}

With segments

json

{
  "id": "audio_002",
  "audio_path": "/audio/meeting.mp3",
  "segments": [
    {"start": 0.0, "end": 5.5, "speaker": "Speaker1"},
    {"start": 5.5, "end": 12.0, "speaker": "Speaker2"}
  ]
}
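
Segment lists can be sanity-checked before annotation starts. This sketch assumes times are in seconds and that segments are not meant to overlap, which may not hold for recordings with crosstalk. Both function names are hypothetical.

```python
def check_segments(segments: list[dict]) -> list[str]:
    """Report segments with non-positive length or that start before the previous one ends.

    Indices in the messages refer to the segments sorted by start time.
    """
    problems = []
    previous_end = 0.0
    for index, seg in enumerate(sorted(segments, key=lambda s: s["start"])):
        if seg["end"] <= seg["start"]:
            problems.append(f"segment {index}: end {seg['end']} is not after start {seg['start']}")
        if seg["start"] < previous_end:
            problems.append(f"segment {index}: overlaps the previous segment")
        previous_end = max(previous_end, seg["end"])
    return problems


def speaking_time(segments: list[dict]) -> dict[str, float]:
    """Sum segment durations per speaker."""
    totals: dict[str, float] = {}
    for seg in segments:
        totals[seg["speaker"]] = totals.get(seg["speaker"], 0.0) + (seg["end"] - seg["start"])
    return totals
```

For the two segments shown above, `check_segments` returns an empty list and `speaking_time` gives 5.5 seconds for Speaker1 and 6.5 seconds for Speaker2.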

Configuration:

yaml

data_files:
  - data/audio.json
 
item_properties:
  audio_key: audio_path
  text_key: transcript

Multimodal data

Text + Image

json

{
  "id": "mm_001",
  "text": "What is shown in this image?",
  "image_path": "/images/scene.jpg"
}

Text + Audio

json

{
  "id": "mm_002",
  "text": "Transcribe this audio:",
  "audio_path": "/audio/clip.wav",
  "reference_transcript": "Expected transcription here"
}

Annotation output format

Basic output

json

{
  "id": "doc_001",
  "text": "Great product!",
  "annotations": {
    "sentiment": "Positive",
    "confidence": 5
  },
  "annotator": "user123",
  "timestamp": "2024-11-05T10:30:00Z"
}

Span annotations

json

{
  "id": "ner_001",
  "text": "Apple CEO Tim Cook visited Paris.",
  "annotations": {
    "entities": [
      {"start": 0, "end": 5, "label": "ORG", "text": "Apple"},
      {"start": 10, "end": 18, "label": "PERSON", "text": "Tim Cook"},
      {"start": 27, "end": 32, "label": "LOC", "text": "Paris"}
    ]
  }
}

Multiple annotators

json

{
  "id": "item_001",
  "text": "Sample text",
  "annotations": [
    {
      "annotator": "ann1",
      "labels": {"sentiment": "Positive"},
      "timestamp": "2024-11-05T10:00:00Z"
    },
    {
      "annotator": "ann2",
      "labels": {"sentiment": "Positive"},
      "timestamp": "2024-11-05T11:00:00Z"
    }
  ],
  "aggregated": {
    "sentiment": "Positive",
    "agreement": 1.0
  }
}
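
This guide does not define how the `aggregated` block is computed. As one plausible reading, the sketch below takes a majority vote and reports agreement as the fraction of annotators who chose the winning label. Both the method and the `aggregate` name are assumptions, not Potato's documented behaviour.

```python
from collections import Counter


def aggregate(annotations: list[dict], field: str) -> dict:
    """Majority label for `field` plus the fraction of annotators who chose it."""
    labels = [a["labels"][field] for a in annotations if field in a["labels"]]
    if not labels:
        raise ValueError(f"no annotations for field {field!r}")
    # On a tie, Counter.most_common keeps the label that was seen first.
    label, count = Counter(labels).most_common(1)[0]
    return {field: label, "agreement": count / len(labels)}
```

With the two annotators shown above, both choosing "Positive", this yields an agreement of 1.0. Chance-corrected measures such as Cohen's kappa are a common alternative when raw agreement is too optimistic.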

Configuration reference

yaml

data_files:
  - data/items.json
 
item_properties:
  id_key: id
  text_key: text
  image_key: image_path
  audio_key: audio_path

Best practices

Always include IDs: unique identifiers make every item traceable
Use JSONL for large datasets: it can be read as a stream, so memory use stays low
Validate before loading: check the JSON syntax
Include metadata: source, date, and author make debugging easier
Keep field names consistent: downstream processing is simpler
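
Several of these practices can be checked automatically before a file is loaded. The sketch below validates JSONL lines for syntax, required keys, and duplicate IDs. `validate_items` and its defaults (`id` as the ID key, `text` as the only required field) are illustrative assumptions.

```python
import json
from collections import Counter


def validate_items(
    lines: list[str], id_key: str = "id", required: tuple[str, ...] = ("text",)
) -> list[str]:
    """Check JSON syntax, required keys, and ID uniqueness; return readable messages."""
    errors = []
    ids: Counter = Counter()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            item = json.loads(line)
        except json.JSONDecodeError as err:
            errors.append(f"line {number}: invalid JSON ({err.msg})")
            continue
        if not isinstance(item, dict):
            errors.append(f"line {number}: expected a JSON object")
            continue
        for key in (id_key, *required):
            if key not in item:
                errors.append(f"line {number}: missing key {key!r}")
        if id_key in item:
            ids[str(item[id_key])] += 1  # compared as strings for simplicity
    errors.extend(
        f"duplicate id {value!r} ({count} times)" for value, count in ids.items() if count > 1
    )
    return errors
```

An empty result means the file passed all three checks. Duplicate IDs are reported after the per-line messages.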

Full documentation on data formats is available at /docs/core-concepts/data-formats.