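Data Formats

Supported data formats and how to structure your annotation data.

Potato supports several data formats out of the box. This guide explains how to structure your annotation data.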

Supported Formats

Format	Extension	Description
JSON	`.json`	Array of objects
JSON Lines	`.jsonl`	One JSON object per line
CSV	`.csv`	Comma-separated values
TSV	`.tsv`	Tab-separated values

JSON Format

The most commonly used format. Your data should be an array of objects:

json

[
  {
    "id": "doc_001",
    "text": "This is the first document to annotate.",
    "source": "twitter",
    "date": "2024-01-15"
  },
  {
    "id": "doc_002",
    "text": "This is the second document.",
    "source": "reddit",
    "date": "2024-01-16"
  }
]

JSON Lines Format

Each line is a separate JSON object. Well suited to large datasets:

jsonl

{"id": "doc_001", "text": "First document"}
{"id": "doc_002", "text": "Second document"}
{"id": "doc_003", "text": "Third document"}

CSV/TSV Format

Tabular data with a header row:

csv

id,text,source
doc_001,"This is the first document",twitter
doc_002,"This is the second document",reddit
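As a rough illustration of how a header row maps onto item fields, the sketch below parses the same CSV with Python's standard `csv` module. It shows how Python reads such a file, not how Potato does internally; `read_table_items` is a hypothetical helper name.

```python
import csv
import io


def read_table_items(stream, delimiter=","):
    """Parse delimited text with a header row into a list of dicts."""
    reader = csv.DictReader(stream, delimiter=delimiter)
    return [dict(row) for row in reader]


# In-memory stand-in for a .csv file (same rows as the example above)
csv_sample = io.StringIO(
    "id,text,source\n"
    'doc_001,"This is the first document",twitter\n'
    'doc_002,"This is the second document",reddit\n'
)
rows = read_table_items(csv_sample)

# The same helper handles TSV when given a tab delimiter
tsv_sample = io.StringIO("id\ttext\ndoc_003\tA tab-separated row\n")
tsv_rows = read_table_items(tsv_sample, delimiter="\t")
print(rows[0]["text"], "|", tsv_rows[0]["text"])
```

Quoting a text field, as in the example, keeps commas inside the text from being read as column separators.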

Configuration

Basic Setup

Configure data files and field mappings in YAML:

yaml

data_files:
  - "data/documents.json"
 
item_properties:
  id_key: id      # Field name for unique ID
  text_key: text  # Field name for content to annotate

Multiple Data Files

Combine multiple data sources:

yaml

data_files:
  - "data/batch_1.json"
  - "data/batch_2.json"
  - "data/batch_3.jsonl"

Files are processed in the order listed and merged.
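Since the files end up in one merged set, an ID reused across batches would presumably collide. A minimal sketch of checking for this ahead of time, assuming the batches are already loaded as lists of dicts; `merge_batches` is a hypothetical helper, not part of Potato:

```python
from collections import Counter
from itertools import chain


def merge_batches(batches):
    """Concatenate batches in order and report IDs that occur more than once."""
    merged = list(chain.from_iterable(batches))
    counts = Counter(item["id"] for item in merged)
    duplicates = sorted(item_id for item_id, n in counts.items() if n > 1)
    return merged, duplicates


# Hypothetical batches; "doc_002" is deliberately repeated
batch_1 = [{"id": "doc_001", "text": "First"}, {"id": "doc_002", "text": "Second"}]
batch_2 = [{"id": "doc_002", "text": "Second again"}, {"id": "doc_003", "text": "Third"}]

merged, duplicates = merge_batches([batch_1, batch_2])
print(f"{len(merged)} items, duplicate IDs: {duplicates}")
```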

Data Types

Plain Text

Simple text content:

json

{
  "id": "1",
  "text": "The product arrived quickly and works great!"
}

Media Files

Reference images, video, or audio:

json

{
  "id": "1",
  "image_path": "images/photo_001.jpg"
}

yaml

item_properties:
  id_key: id
  image_key: image_path

Dialogue/Lists

Lists are automatically displayed horizontally:

json

{
  "id": "1",
  "text": "Option A,Option B,Option C"
}

Text Pairs

For comparison tasks:

json

{
  "id": "pair_001",
  "text": {
    "A": "Response from Model A",
    "B": "Response from Model B"
  }
}
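If the two sides of each pair come from separate lists, items in this shape can be generated with a few lines of Python. This is a sketch under that assumption; `make_pair_items` and the `pair` ID prefix are illustrative choices, not Potato conventions.

```python
import json


def make_pair_items(responses_a, responses_b, prefix="pair"):
    """Zip two equal-length lists into items with an A/B text dictionary."""
    if len(responses_a) != len(responses_b):
        raise ValueError("both lists must have the same length")
    return [
        {"id": f"{prefix}_{index:03d}", "text": {"A": a, "B": b}}
        for index, (a, b) in enumerate(zip(responses_a, responses_b), start=1)
    ]


# Hypothetical model outputs
pair_items = make_pair_items(
    ["Response from Model A", "Another answer from A"],
    ["Response from Model B", "Another answer from B"],
)
print(json.dumps(pair_items[0], indent=2))
```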

HTML Files

Reference HTML files stored in a folder:

json

{
  "id": "1",
  "html_file": "html/document_001.html"
}

Contextual Annotation

Include context alongside the main text:

json

{
  "id": "1",
  "text": "This is great!",
  "context": "Previous message: How do you like the new feature?"
}

yaml

item_properties:
  id_key: id
  text_key: text
  context_key: context

Display Configuration

List Display Options

Control how lists and dictionaries are displayed:

yaml

list_as_text:
  # Add prefixes to items
  text_prefix: "A"  # A., B., C. (or "1" for 1., 2., 3.)
 
  # Display orientation
  horizontal: true  # Side-by-side (false for vertical)
 
  # Randomization
  randomize_values: true   # Shuffle list items
  randomize_keys: true     # Shuffle dictionary keys

HTML Content

Enable HTML rendering in text:

yaml

html_content: true

json

{
  "id": "1",
  "text": "<p>This is <strong>formatted</strong> text.</p>"
}
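With HTML rendering enabled, characters such as `<` and `&` in raw source text would presumably be treated as markup. A minimal sketch of escaping raw text before wrapping it in tags, assuming items are prepared in Python; `to_html_item` is a hypothetical helper:

```python
import html


def to_html_item(item_id, raw_text):
    """Escape raw text, then wrap it in a paragraph tag for HTML display."""
    return {"id": item_id, "text": f"<p>{html.escape(raw_text)}</p>"}


html_item = to_html_item("1", "a < b & c")
print(html_item["text"])
```

Only the raw text is escaped; the wrapping `<p>` tag is added afterwards so that it still renders as markup.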

Output Configuration

Output Directory

Specify where annotations are saved:

yaml

output_annotation_dir: "output/"

Output Format

Choose the output format:

yaml

output_annotation_format: "json"  # json, jsonl, csv, or tsv

Output Structure

Each saved annotation records the document ID, the annotator, the responses, and a timestamp:

json

{
  "id": "doc_001",
  "user": "annotator_1",
  "annotations": {
    "sentiment": "Positive",
    "confidence": 4
  },
  "timestamp": "2024-01-15T10:30:00Z"
}
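Records in this shape are easy to aggregate downstream. The sketch below counts the labels per document for one scheme, assuming the records have already been loaded as dicts with the fields shown above; `tally_labels` and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def tally_labels(records, scheme):
    """Count, per document ID, how often each label was chosen for a scheme."""
    votes = defaultdict(Counter)
    for record in records:
        label = record["annotations"].get(scheme)
        if label is not None:  # skip records that did not answer this scheme
            votes[record["id"]][label] += 1
    return votes


# Hypothetical annotation records from three annotators
records = [
    {"id": "doc_001", "user": "annotator_1", "annotations": {"sentiment": "Positive", "confidence": 4}},
    {"id": "doc_001", "user": "annotator_2", "annotations": {"sentiment": "Positive", "confidence": 5}},
    {"id": "doc_001", "user": "annotator_3", "annotations": {"sentiment": "Negative", "confidence": 2}},
]
votes = tally_labels(records, "sentiment")
majority_label, majority_count = votes["doc_001"].most_common(1)[0]
print(majority_label, majority_count)
```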

Special Data Types

Best-Worst Scaling

For ranking tasks, use comma-separated items:

json

{
  "id": "1",
  "text": "Item A,Item B,Item C,Item D"
}
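Items in this comma-separated shape can be produced from a pool of candidates. The sketch below is the simplest possible grouping, where each candidate appears in exactly one tuple; real best-worst designs usually repeat items across tuples, and `make_bws_items`, the tuple size of 4, and the seed are all illustrative assumptions.

```python
import random


def make_bws_items(candidates, tuple_size=4, seed=0):
    """Shuffle candidates and join them into comma-separated tuples."""
    for candidate in candidates:
        if "," in candidate:
            # A comma inside an item would be indistinguishable from a separator
            raise ValueError(f"item contains a comma: {candidate!r}")
    shuffled = list(candidates)
    random.Random(seed).shuffle(shuffled)
    # Drop any leftover candidates that do not fill a complete tuple
    usable = len(shuffled) - len(shuffled) % tuple_size
    return [
        {"id": str(number), "text": ",".join(shuffled[start:start + tuple_size])}
        for number, start in enumerate(range(0, usable, tuple_size), start=1)
    ]


candidates = [f"Item {letter}" for letter in "ABCDEFGH"]
bws_items = make_bws_items(candidates)
print(len(bws_items), "tuples")
```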

Custom Parameters

Include extra fields for display or filtering:

json

{
  "id": "1",
  "text": "Document content",
  "category": "news",
  "priority": "high",
  "custom_field": "any value"
}

Database Backend

For large datasets, use a MySQL backend:

yaml

database:
  type: mysql
  host: localhost
  database: potato_db
  user: ${DB_USER}
  password: ${DB_PASSWORD}

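Potato automatically creates the required tables on first startup.

Data Validation

Potato validates your data at startup:

Missing ID field - every item needs a unique identifier
Missing text field - every item needs content to annotate
Duplicate IDs - all IDs must be unique
File not found - check the data file paths

Complete Example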

yaml

task_name: "Document Classification"
task_dir: "."
port: 8000
 
# Data configuration
data_files:
  - "data/documents.json"
 
item_properties:
  id_key: id
  text_key: text
  context_key: metadata
 
# Display settings
list_as_text:
  text_prefix: "1"
  horizontal: false
 
# Output
output_annotation_dir: "output/"
output_annotation_format: "json"
 
# Annotation scheme
annotation_schemes:
  - annotation_type: radio
    name: category
    description: "Select the document category"
    labels:
      - News
      - Opinion
      - Tutorial
      - Other
 
allow_all_users: true

Best Practices

1. Use Meaningful IDs

Meaningful IDs make tracking and debugging easier:

json

{"id": "twitter_2024_001", "text": "..."}

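2. Keep Text Concise

Overly long text slows down annotation. Consider:

Truncating to the key passages
Providing a summary
Using a scrollable container

3. Include Metadata

Metadata helps with filtering and analysis: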

json

{
  "id": "1",
  "text": "Content",
  "source": "twitter",
  "date": "2024-01-15",
  "language": "en"
}

4. Validate Before Loading

Check your data offline:

python

import json
 
with open('data.json') as f:
    data = json.load(f)
 
# Check for required fields
for item in data:
    assert 'id' in item, f"Missing id: {item}"
    assert 'text' in item, f"Missing text: {item}"
 
# Check for duplicates
ids = [item['id'] for item in data]
assert len(ids) == len(set(ids)), "Duplicate IDs found"
 
print(f"Validated {len(data)} items")

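5. Back Up Raw Data

Keep raw data separate from annotations to ensure reproducibility.

6. Use JSON Lines for Large Files

JSON Lines can be read one line at a time, so it is more memory-efficient than a JSON array: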

bash

# Convert JSON array to JSON Lines
cat data.json | jq -c '.[]' > data.jsonl
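Where `jq` is not available, the same conversion can be sketched in Python. Note that `json.load` reads the whole array into memory once, so this is a one-off conversion step rather than a streaming one; `json_array_to_jsonl` is a hypothetical helper, and in-memory streams stand in for real files.

```python
import io
import json


def json_array_to_jsonl(src, dst):
    """Write each element of a JSON array as one line; return the item count."""
    data = json.load(src)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array")
    for item in data:
        # ensure_ascii=False keeps non-ASCII text readable in the output
        dst.write(json.dumps(item, ensure_ascii=False) + "\n")
    return len(data)


# In-memory stand-ins for data.json and data.jsonl
source = io.StringIO('[{"id": "doc_001", "text": "First"}, {"id": "doc_002", "text": "Second"}]')
target = io.StringIO()
written = json_array_to_jsonl(source, target)
print(f"Wrote {written} lines")
```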

