Data Formats

Supported data formats and how to structure your annotation data.

Potato supports several data formats out of the box. This guide explains how to structure your data for annotation.

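Supported Formats

| Format | Extension | Description |
|---|---|---|
| JSON | `.json` | Array of objects |
| JSON Lines | `.jsonl` | One JSON object per line |
| CSV | `.csv` | Comma-separated values |
| TSV | `.tsv` | Tab-separated values |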

JSON Format

The most common format. Structure your data as an array of objects:

json

[
  {
    "id": "doc_001",
    "text": "This is the first document to annotate.",
    "source": "twitter",
    "date": "2024-01-15"
  },
  {
    "id": "doc_002",
    "text": "This is the second document.",
    "source": "reddit",
    "date": "2024-01-16"
  }
]

JSON Lines Format

Each line is a self-contained JSON object. This format suits large datasets:

jsonl

{"id": "doc_001", "text": "First document"}
{"id": "doc_002", "text": "Second document"}
{"id": "doc_003", "text": "Third document"}
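To check a JSON Lines file offline, the records can be read one line at a time. The following is a minimal sketch rather than Potato's own loader; the helper name `iter_jsonl` and the in-memory sample are assumptions made for illustration.

```python
import io
import json
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed object per non-empty line of JSON Lines input."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Invalid JSON on line {line_number}: {exc}") from exc


# In-memory stand-in for an open .jsonl file (hypothetical sample data).
sample = io.StringIO(
    '{"id": "doc_001", "text": "First document"}\n'
    '{"id": "doc_002", "text": "Second document"}\n'
)
items = list(iter_jsonl(sample))
print([item["id"] for item in items])
```

Because the function is a generator, only one line is held in memory at a time, which is the property that makes the format practical for large datasets.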

CSV/TSV Format

Tabular data with a header row:

csv

id,text,source
doc_001,"This is the first document",twitter
doc_002,"This is the second document",reddit
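Quoting matters when the text itself contains the delimiter. As a sketch of how such a file can be inspected before loading, the standard `csv` module maps each row to the header names; the helper `read_table` and the second sample row are hypothetical.

```python
import csv
import io


def read_table(handle, delimiter=","):
    """Return the rows of a delimited file as dicts keyed by the header row."""
    return list(csv.DictReader(handle, delimiter=delimiter))


# In-memory stand-in for a .csv file; pass delimiter="\t" for .tsv input.
sample = io.StringIO(
    "id,text,source\n"
    'doc_001,"This is the first document",twitter\n'
    'doc_002,"Contains a comma, so it is quoted",reddit\n'
)
rows = read_table(sample)
print(rows[1]["text"])
```

Note that every value read this way is a string, including numeric-looking IDs.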

Configuration

Basic Setup

Configure data files and field mappings in YAML:

yaml

data_files:
  - "data/documents.json"
 
item_properties:
  id_key: id      # Field name for unique ID
  text_key: text  # Field name for content to annotate

Multiple Data Files

Combine multiple data sources:

yaml

data_files:
  - "data/batch_1.json"
  - "data/batch_2.json"
  - "data/batch_3.jsonl"

Files are processed in order and combined.
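Because the files end up in a single pool, an ID that appears in two batches would collide. The sketch below shows one way to check this offline before starting Potato. It assumes each batch has already been loaded into a list of dicts with an `id` field; the function name `combine_batches` is hypothetical, and Potato's actual loading logic may differ.

```python
from itertools import chain


def combine_batches(batches):
    """Concatenate batches in order and reject IDs that appear more than once."""
    combined = []
    seen = set()
    for item in chain.from_iterable(batches):
        if item["id"] in seen:
            raise ValueError(f"Duplicate id across files: {item['id']}")
        seen.add(item["id"])
        combined.append(item)
    return combined


# Hypothetical contents of two already-loaded data files.
batch_1 = [{"id": "doc_001", "text": "First"}, {"id": "doc_002", "text": "Second"}]
batch_2 = [{"id": "doc_003", "text": "Third"}]
combined = combine_batches([batch_1, batch_2])
print([item["id"] for item in combined])
```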

Data Types

Plain Text

Simple text content:

json

{
  "id": "1",
  "text": "The product arrived quickly and works great!"
}

Media Files

References to image, video, or audio files:

json

{
  "id": "1",
  "image_path": "images/photo_001.jpg"
}

yaml

item_properties:
  id_key: id
  image_key: image_path

Dialogue/Lists

Lists are automatically displayed side by side:

json

{
  "id": "1",
  "text": "Option A,Option B,Option C"
}

Text Pairs

For comparison tasks:

json

{
  "id": "pair_001",
  "text": {
    "A": "Response from Model A",
    "B": "Response from Model B"
  }
}
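Pair items like the one above are usually generated rather than written by hand. A minimal sketch, assuming the two sets of responses are available as equal-length lists; the helper `make_pair_items` and the ID prefix are hypothetical choices.

```python
def make_pair_items(responses_a, responses_b, prefix="pair"):
    """Build one pair item per aligned (A, B) response, with zero-padded IDs."""
    if len(responses_a) != len(responses_b):
        raise ValueError("Both response lists must have the same length")
    return [
        {"id": f"{prefix}_{index:03d}", "text": {"A": a, "B": b}}
        for index, (a, b) in enumerate(zip(responses_a, responses_b), start=1)
    ]


pairs = make_pair_items(
    ["Response from Model A", "Another response from Model A"],
    ["Response from Model B", "Another response from Model B"],
)
print(pairs[0])
```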

HTML Files

References to HTML files stored in a folder:

json

{
  "id": "1",
  "html_file": "html/document_001.html"
}

Annotation with Context

Include context alongside the main text:

json

{
  "id": "1",
  "text": "This is great!",
  "context": "Previous message: How do you like the new feature?"
}

yaml

item_properties:
  id_key: id
  text_key: text
  context_key: context

Display Settings

List Display Options

Control how lists and dictionaries are displayed:

yaml

list_as_text:
  # Add prefixes to items
  text_prefix: "A"  # A., B., C. (or "1" for 1., 2., 3.)
 
  # Display orientation
  horizontal: true  # Side-by-side (false for vertical)
 
  # Randomization
  randomize_values: true   # Shuffle list items
  randomize_keys: true     # Shuffle dictionary keys

HTML Content

Enable HTML rendering in text:

yaml

html_content: true

json

{
  "id": "1",
  "text": "<p>This is <strong>formatted</strong> text.</p>"
}

Output Settings

Output Directory

Specify where annotations are saved:

yaml

output_annotation_dir: "output/"

Output Format

Choose the output format:

yaml

output_annotation_format: "json"  # json, jsonl, csv, or tsv

Output Structure

Each annotation includes the document ID and the responses:

json

{
  "id": "doc_001",
  "user": "annotator_1",
  "annotations": {
    "sentiment": "Positive",
    "confidence": 4
  },
  "timestamp": "2024-01-15T10:30:00Z"
}
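Records in this shape are straightforward to summarize after annotation. The sketch below tallies the labels chosen for one scheme. It assumes the records have already been parsed into a list of dicts matching the structure above; the function name `count_labels` and the second and third sample records are hypothetical.

```python
from collections import Counter


def count_labels(records, scheme_name):
    """Count how often each label was chosen for one annotation scheme."""
    return Counter(
        record["annotations"][scheme_name]
        for record in records
        if scheme_name in record.get("annotations", {})
    )


records = [
    {"id": "doc_001", "user": "annotator_1",
     "annotations": {"sentiment": "Positive", "confidence": 4}},
    # The records below are made-up examples in the same shape.
    {"id": "doc_001", "user": "annotator_2",
     "annotations": {"sentiment": "Negative"}},
    {"id": "doc_002", "user": "annotator_1",
     "annotations": {"sentiment": "Positive"}},
]
print(count_labels(records, "sentiment"))
```

Records that lack the requested scheme are skipped rather than treated as errors, since not every annotator necessarily answers every question.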

Special Data Types

Best-Worst Scaling

For ranking tasks, use comma-separated items:

json

{
  "id": "1",
  "text": "Item A,Item B,Item C,Item D"
}
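One way to produce such items is to sample fixed-size tuples from a pool of candidates. This is only a sketch under stated assumptions: random sampling is a simple stand-in for whatever tuple design a study actually calls for, and the helper `make_bws_items`, the tuple size, and the candidate names are all hypothetical. Since the comma acts as the separator, candidates containing a comma are rejected.

```python
import random


def make_bws_items(candidates, tuple_size=4, n_tuples=10, seed=0):
    """Sample n_tuples tuples of distinct candidates as comma-separated items."""
    for candidate in candidates:
        if "," in candidate:
            raise ValueError(f"Candidate contains the separator: {candidate!r}")
    rng = random.Random(seed)  # fixed seed keeps the item file reproducible
    items = []
    for index in range(1, n_tuples + 1):
        chosen = rng.sample(candidates, tuple_size)  # no repeats within a tuple
        items.append({"id": str(index), "text": ",".join(chosen)})
    return items


pool = ["Item A", "Item B", "Item C", "Item D", "Item E", "Item F"]
bws_items = make_bws_items(pool, tuple_size=4, n_tuples=3)
print(bws_items[0])
```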

Custom Arguments

Include additional fields for display or filtering:

json

{
  "id": "1",
  "text": "Document content",
  "category": "news",
  "priority": "high",
  "custom_field": "any value"
}

Database Backend

Use MySQL for large datasets:

yaml

database:
  type: mysql
  host: localhost
  database: potato_db
  user: ${DB_USER}
  password: ${DB_PASSWORD}

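Potato automatically creates the required tables on first startup.

Data Validation

Potato validates your data at startup:

Missing ID field - every item needs a unique identifier
Missing text field - every item needs content to annotate
Duplicate IDs - all IDs must be unique
File not found - check the data file paths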

Complete Example

yaml

task_name: "Document Classification"
task_dir: "."
port: 8000
 
# Data configuration
data_files:
  - "data/documents.json"
 
item_properties:
  id_key: id
  text_key: text
  context_key: metadata
 
# Display settings
list_as_text:
  text_prefix: "1"
  horizontal: false
 
# Output
output_annotation_dir: "output/"
output_annotation_format: "json"
 
# Annotation scheme
annotation_schemes:
  - annotation_type: radio
    name: category
    description: "Select the document category"
    labels:
      - News
      - Opinion
      - Tutorial
      - Other
 
allow_all_users: true

Best Practices

1. Use Meaningful IDs

They make tracking and debugging easier:

json

{"id": "twitter_2024_001", "text": "..."}

2. Keep Text Concise

Long texts slow down annotation. Consider:

Truncating to the relevant portion
Providing a summary
Using a scrollable container
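The first option, truncation, can be applied offline before the data is loaded. A minimal sketch using the standard `textwrap.shorten`, which cuts at word boundaries and also collapses runs of whitespace; the 280-character limit, the `" [...]"` marker, and the helper names are assumptions for illustration.

```python
import textwrap


def truncate_text(text, max_chars=280):
    """Shorten text to at most max_chars, cutting at a word boundary."""
    return textwrap.shorten(text, width=max_chars, placeholder=" [...]")


def truncate_items(items, max_chars=280):
    """Return copies of the items with their 'text' field truncated."""
    return [
        {**item, "text": truncate_text(item["text"], max_chars)}
        for item in items
    ]


long_item = {"id": "1", "text": "word " * 200}
short = truncate_items([long_item], max_chars=50)[0]
print(short["text"])
```

The original items are left unmodified, so the full text remains available for later analysis.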

3. Include Metadata

Metadata helps with filtering and analysis:

json

{
  "id": "1",
  "text": "Content",
  "source": "twitter",
  "date": "2024-01-15",
  "language": "en"
}

4. Validate Before Loading

Check your data offline:

python

import json
 
with open('data.json', encoding='utf-8') as f:
    data = json.load(f)
 
# Check for required fields
for item in data:
    assert 'id' in item, f"Missing id: {item}"
    assert 'text' in item, f"Missing text: {item}"
 
# Check for duplicates
ids = [item['id'] for item in data]
assert len(ids) == len(set(ids)), "Duplicate IDs found"
 
print(f"Validated {len(data)} items")

5. Back Up the Original Data

Keep the raw data separate from the annotations for reproducibility.

6. Use JSON Lines for Large Files

More memory-efficient than a JSON array:

bash

# Convert JSON array to JSON Lines
jq -c '.[]' data.json > data.jsonl
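Where `jq` is not available, the same conversion can be sketched in Python. The function name `json_array_to_jsonl` is hypothetical. Note that this version loads the whole array into memory once, so it suits a one-time conversion rather than streaming very large inputs.

```python
import io
import json


def json_array_to_jsonl(src, dst):
    """Write each element of a JSON array in src as one line in dst."""
    data = json.load(src)
    if not isinstance(data, list):
        raise ValueError("Expected a top-level JSON array")
    for item in data:
        # ensure_ascii=False keeps non-ASCII text readable in the output
        dst.write(json.dumps(item, ensure_ascii=False) + "\n")
    return len(data)


# In-memory stand-ins for data.json and data.jsonl.
source = io.StringIO('[{"id": "doc_001", "text": "First"}, {"id": "doc_002", "text": "Second"}]')
target = io.StringIO()
written = json_array_to_jsonl(source, target)
print(written, "items written")
```

With real files, open both with `encoding="utf-8"` and pass the file objects in place of the `StringIO` buffers.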
