Data Formats

Supported data formats and how to structure annotation data.

Potato supports several data formats out of the box. This guide explains how to structure your data for annotation.

Supported Formats

Format	Extension	Description
JSON	`.json`	Array of objects
JSON Lines	`.jsonl`	One JSON object per line
CSV	`.csv`	Comma-separated values
TSV	`.tsv`	Tab-separated values

JSON Format

The most common format. Your data should be an array of objects:

json

[
  {
    "id": "doc_001",
    "text": "This is the first document to annotate.",
    "source": "twitter",
    "date": "2024-01-15"
  },
  {
    "id": "doc_002",
    "text": "This is the second document.",
    "source": "reddit",
    "date": "2024-01-16"
  }
]

JSON Lines Format

Each line is a separate JSON object. Useful for large datasets:

jsonl

{"id": "doc_001", "text": "First document"}
{"id": "doc_002", "text": "Second document"}
{"id": "doc_003", "text": "Third document"}

CSV/TSV Format

Tabular data with a header row:

csv

id,text,source
doc_001,"This is the first document",twitter
doc_002,"This is the second document",reddit
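
Text fields that contain commas or tabs must be quoted, as in the rows above. A short sketch using Python's standard `csv` module shows how a header row maps to per-item fields; the helper name `read_delimited` and the sample row are assumptions for illustration only.

```python
import csv
import io


def read_delimited(stream, delimiter=","):
    """Parse CSV (or TSV with delimiter='\\t') into a list of dicts keyed by header."""
    return list(csv.DictReader(stream, delimiter=delimiter))


# In-memory stand-in for a .csv file; the quoted field contains a comma
sample = io.StringIO(
    'id,text,source\n'
    'doc_001,"This is the first document, with a comma",twitter\n'
)
rows = read_delimited(sample)
print(rows[0]["text"])
```

Passing `delimiter="\t"` to the same helper handles TSV input.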

Configuration

Basic Setup

Configure data files and field mappings in your YAML file:

yaml

data_files:
  - "data/documents.json"
 
item_properties:
  id_key: id      # Field name for unique ID
  text_key: text  # Field name for content to annotate

Multiple Data Files

Combine multiple data sources:

yaml

data_files:
  - "data/batch_1.json"
  - "data/batch_2.json"
  - "data/batch_3.jsonl"

Files are processed in order and merged.
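
The merge can be previewed offline, for example to confirm the combined item order. This is a rough sketch under the assumption that merging means simple concatenation in the listed order; the helpers `load_items` and `merge_files` are hypothetical and do not reflect Potato's internal code.

```python
import json
import tempfile
from pathlib import Path


def load_items(path):
    """Load a .json array or a .jsonl file into a list of dicts."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if path.suffix == ".json":
        return json.loads(text)
    raise ValueError(f"Unsupported extension: {path.suffix}")


def merge_files(paths):
    """Concatenate items from each file, keeping the order of `paths`."""
    merged = []
    for path in paths:
        merged.extend(load_items(path))
    return merged


# Demo with temporary files so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "batch_1.json"
    second = Path(tmp) / "batch_2.jsonl"
    first.write_text('[{"id": "a", "text": "one"}]', encoding="utf-8")
    second.write_text('{"id": "b", "text": "two"}\n', encoding="utf-8")
    merged = merge_files([first, second])

print([item["id"] for item in merged])
```

Running such a check across all batches is also a convenient place to look for IDs that repeat between files.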

Data Types

Plain Text

Plain text content:

json

{
  "id": "1",
  "text": "The product arrived quickly and works great!"
}

Media Files

References to images, video, or audio:

json

{
  "id": "1",
  "image_path": "images/photo_001.jpg"
}

yaml

item_properties:
  id_key: id
  image_key: image_path

Dialogue/Lists

Lists are automatically displayed horizontally:

json

{
  "id": "1",
  "text": "Option A,Option B,Option C"
}

Text Pairs

For comparison tasks:

json

{
  "id": "pair_001",
  "text": {
    "A": "Response from Model A",
    "B": "Response from Model B"
  }
}
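
Pair items like the one above are often generated from two parallel lists of outputs. The sketch below builds them in that shape; the function name `build_pairs` and the `pair_NNN` ID scheme are assumptions chosen to match the example, not a Potato requirement.

```python
import json


def build_pairs(responses_a, responses_b, prefix="pair"):
    """Zip two equally long lists into items with an A/B text dictionary."""
    if len(responses_a) != len(responses_b):
        raise ValueError("Both response lists must have the same length")
    return [
        {"id": f"{prefix}_{index:03d}", "text": {"A": a, "B": b}}
        for index, (a, b) in enumerate(zip(responses_a, responses_b), start=1)
    ]


pairs = build_pairs(
    ["Response from Model A"],
    ["Response from Model B"],
)
print(json.dumps(pairs, indent=2))
```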

HTML Files

References to HTML files in folders:

json

{
  "id": "1",
  "html_file": "html/document_001.html"
}

Contextual Annotation

Include context alongside the main text:

json

{
  "id": "1",
  "text": "This is great!",
  "context": "Previous message: How do you like the new feature?"
}

yaml

item_properties:
  id_key: id
  text_key: text
  context_key: context

Display Settings

List Display Options

Control how lists and dictionaries are displayed:

yaml

list_as_text:
  # Add prefixes to items
  text_prefix: "A"  # A., B., C. (or "1" for 1., 2., 3.)
 
  # Display orientation
  horizontal: true  # Side-by-side (false for vertical)
 
  # Randomization
  randomize_values: true   # Shuffle list items
  randomize_keys: true     # Shuffle dictionary keys

HTML Content

Enable HTML rendering in text:

yaml

html_content: true

json

{
  "id": "1",
  "text": "<p>This is <strong>formatted</strong> text.</p>"
}

Output Configuration

Output Directory

Specify where annotations are saved:

yaml

output_annotation_dir: "output/"

Output Format

Choose the output format:

yaml

output_annotation_format: "json"  # json, jsonl, csv, or tsv

Output Structure

Annotation records include the document ID and the annotator's responses:

json

{
  "id": "doc_001",
  "user": "annotator_1",
  "annotations": {
    "sentiment": "Positive",
    "confidence": 4
  },
  "timestamp": "2024-01-15T10:30:00Z"
}
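
Records in this shape are straightforward to aggregate after annotation. As an illustration only, the sketch below computes a majority label per document; it assumes records follow the structure shown above, and the function name `majority_labels` and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def majority_labels(records, scheme):
    """Return the most frequent label per document ID for one annotation scheme."""
    votes = defaultdict(Counter)
    for record in records:
        label = record["annotations"].get(scheme)
        if label is not None:
            votes[record["id"]][label] += 1
    # most_common(1) picks the label with the highest count for each document
    return {doc_id: counts.most_common(1)[0][0] for doc_id, counts in votes.items()}


# Hypothetical sample records in the output structure shown above
records = [
    {"id": "doc_001", "user": "annotator_1", "annotations": {"sentiment": "Positive", "confidence": 4}},
    {"id": "doc_001", "user": "annotator_2", "annotations": {"sentiment": "Positive", "confidence": 3}},
    {"id": "doc_001", "user": "annotator_3", "annotations": {"sentiment": "Negative", "confidence": 2}},
    {"id": "doc_002", "user": "annotator_1", "annotations": {"sentiment": "Negative", "confidence": 5}},
]
majority = majority_labels(records, "sentiment")
print(majority)
```

Ties are not handled specially here; a real analysis would need an explicit tie-breaking rule.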

Special Data Types

Best-Worst Scaling

For ranking tasks, use comma-separated items:

json

{
  "id": "1",
  "text": "Item A,Item B,Item C,Item D"
}
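
Best-worst scaling data usually consists of many small tuples drawn from a larger pool. The following sketch generates such comma-separated items with random sampling; the function name `make_bws_items`, the tuple size, and the sampling strategy are all assumptions, and a real study may prefer a balanced design over plain random draws.

```python
import random


def make_bws_items(pool, tuple_size=4, n_tuples=3, seed=0):
    """Draw random tuples from `pool` and join each into a comma-separated text."""
    for entry in pool:
        if "," in entry:
            # A comma inside an entry would be indistinguishable from a separator
            raise ValueError(f"Entry contains a comma: {entry!r}")
    rng = random.Random(seed)  # fixed seed keeps the generated file reproducible
    return [
        {"id": str(index), "text": ",".join(rng.sample(pool, tuple_size))}
        for index in range(1, n_tuples + 1)
    ]


pool = ["Item A", "Item B", "Item C", "Item D", "Item E", "Item F"]
bws_items = make_bws_items(pool)
for item in bws_items:
    print(item)
```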

Custom Arguments

Include additional fields for display or filtering:

json

{
  "id": "1",
  "text": "Document content",
  "category": "news",
  "priority": "high",
  "custom_field": "any value"
}

Database Backend

For large datasets, use MySQL:

yaml

database:
  type: mysql
  host: localhost
  database: potato_db
  user: ${DB_USER}
  password: ${DB_PASSWORD}

Potato automatically creates the required tables on first start.

Data Validation

Potato validates your data at startup:

Missing ID field - All items need unique identifiers
Missing text field - Items need content to annotate
Duplicate IDs - All IDs must be unique
File not found - Check the data file paths

Complete Example

yaml

task_name: "Document Classification"
task_dir: "."
port: 8000
 
# Data configuration
data_files:
  - "data/documents.json"
 
item_properties:
  id_key: id
  text_key: text
  context_key: metadata
 
# Display settings
list_as_text:
  text_prefix: "1"
  horizontal: false
 
# Output
output_annotation_dir: "output/"
output_annotation_format: "json"
 
# Annotation scheme
annotation_schemes:
  - annotation_type: radio
    name: category
    description: "Select the document category"
    labels:
      - News
      - Opinion
      - Tutorial
      - Other
 
allow_all_users: true

Best Practices

1. Use Meaningful IDs

They make tracking and debugging easier:

json

{"id": "twitter_2024_001", "text": "..."}

2. Keep Text Concise

Long texts slow down annotation. Consider:

Truncating to the relevant sections
Providing summaries
Using scroll containers

3. Include Metadata

It helps with filtering and analysis:

json

{
  "id": "1",
  "text": "Content",
  "source": "twitter",
  "date": "2024-01-15",
  "language": "en"
}

4. Validate Before Loading

Check your data offline:

python

import json
 
with open('data.json') as f:
    data = json.load(f)
 
# Check for required fields
for item in data:
    assert 'id' in item, f"Missing id: {item}"
    assert 'text' in item, f"Missing text: {item}"
 
# Check for duplicates
ids = [item['id'] for item in data]
assert len(ids) == len(set(ids)), "Duplicate IDs found"
 
print(f"Validated {len(data)} items")

5. Back Up Original Data

Keep raw data separate from annotations for reproducibility.

6. Use JSON Lines for Large Files

More memory-efficient than JSON arrays:

bash

# Convert JSON array to JSON Lines
jq -c '.[]' data.json > data.jsonl
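
Where `jq` is not installed, the same conversion can be sketched in Python. The helper name `json_array_to_jsonl` is hypothetical, and the demo uses in-memory streams in place of `data.json` and `data.jsonl`. Note that this version loads the whole array once, so it suits one-off conversions more than very large inputs.

```python
import io
import json


def json_array_to_jsonl(src, dst):
    """Read a JSON array from `src` and write one object per line to `dst`."""
    data = json.load(src)
    if not isinstance(data, list):
        raise ValueError("Expected a top-level JSON array")
    for item in data:
        # ensure_ascii=False keeps non-ASCII text readable in the output
        dst.write(json.dumps(item, ensure_ascii=False) + "\n")
    return len(data)


# In-memory stand-ins for the input and output files
source = io.StringIO('[{"id": "1"}, {"id": "2"}]')
target = io.StringIO()
written = json_array_to_jsonl(source, target)
print(target.getvalue(), end="")
```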

Further Reading

Data Directory Loading - Load from directories with live monitoring
Dialogue Annotation - Displaying multi-item data
Export Formats - Output format options

Implementation details can be found in the source documentation.