Understanding Potato Data Formats
An in-depth look at the JSON and JSONL data formats, with examples for text, image, audio, and multimodal annotation.
Potato Team
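Potato uses the JSON and JSONL formats for input data and output annotations. This guide covers the format specifications, examples, and best practices for every data type.
Input Data Formats
JSON Lines (JSONL) - Recommended
One JSON object per line: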
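```json
{"id": "001", "text": "First document text here."}
{"id": "002", "text": "Second document text here."}
{"id": "003", "text": "Third document text here."}
```
Advantages:
- Streaming (memory-efficient): records can be processed one line at a time
- Easy to append new data
- A single corrupt line does not invalidate the rest of the file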
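The advantages above can be illustrated with a minimal Python sketch. It is not Potato's own loader; the helper names `iter_jsonl` and `append_record` are hypothetical and use only the standard library. The reader holds one line in memory at a time and skips a line that fails to parse instead of rejecting the whole file.

```python
import json
import sys
from typing import Any, Iterator


def iter_jsonl(path: str) -> Iterator[Any]:
    """Yield one parsed record per non-empty line (hypothetical helper).

    Lines that are not valid JSON are reported on stderr and skipped,
    so one corrupt line does not stop the rest of the file from loading.
    """
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                print(f"skipping line {line_no}: {err}", file=sys.stderr)


def append_record(path: str, record: dict) -> None:
    """Append one record as a single line; existing lines are not rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Calling `list(iter_jsonl("data/items.jsonl"))` would materialize every record, while a plain `for` loop over the generator keeps memory use flat. The path here is only an example.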
JSON Array
A standard JSON array:
```json
[
  {"id": "001", "text": "First document."},
  {"id": "002", "text": "Second document."},
  {"id": "003", "text": "Third document."}
]
```
Configuration:
```yaml
data_files:
  - data/items.json
```
Text Annotation Data
Basic text
```json
{"id": "doc_001", "text": "The product quality exceeded my expectations."}
```
With metadata
```json
{
  "id": "review_001",
  "text": "Great product, fast shipping!",
  "metadata": {
    "source": "amazon",
    "date": "2024-01-15",
    "author": "user123",
    "rating": 5
  }
}
```
With pre-annotations
```json
{
  "id": "ner_001",
  "text": "Apple announced new products in Cupertino.",
  "pre_annotations": {
    "entities": [
      {"start": 0, "end": 5, "label": "ORG", "text": "Apple"},
      {"start": 32, "end": 41, "label": "LOC", "text": "Cupertino"}
    ]
  }
}
```
Configuration:
```yaml
data_files:
  - data/texts.json
item_properties:
  id_key: id
  text_key: text
```
Image Annotation Data
Local images
```json
{
  "id": "img_001",
  "image_path": "/data/images/photo_001.jpg",
  "caption": "Street scene in Paris"
}
```
Remote images
```json
{
  "id": "img_002",
  "image_url": "https://example.com/images/photo.jpg"
}
```
With bounding boxes
```json
{
  "id": "detection_001",
  "image_path": "/images/street.jpg",
  "pre_annotations": {
    "objects": [
      {"bbox": [100, 150, 200, 300], "label": "person"},
      {"bbox": [350, 200, 450, 280], "label": "car"}
    ]
  }
}
```
Configuration:
```yaml
data_files:
  - data/images.json
item_properties:
  id_key: id
  image_key: image_path  # or image_url
```
Audio Annotation Data
Local audio
```json
{
  "id": "audio_001",
  "audio_path": "/data/audio/recording.wav",
  "duration": 45.5,
  "transcript": "Hello, how are you today?"
}
```
With segments
```json
{
  "id": "audio_002",
  "audio_path": "/audio/meeting.mp3",
  "segments": [
    {"start": 0.0, "end": 5.5, "speaker": "Speaker1"},
    {"start": 5.5, "end": 12.0, "speaker": "Speaker2"}
  ]
}
```
Configuration:
```yaml
data_files:
  - data/audio.json
item_properties:
  audio_key: audio_path
  text_key: transcript
```
Multimodal Data
Text + image
```json
{
  "id": "mm_001",
  "text": "What is shown in this image?",
  "image_path": "/images/scene.jpg"
}
```
Text + audio
```json
{
  "id": "mm_002",
  "text": "Transcribe this audio:",
  "audio_path": "/audio/clip.wav",
  "reference_transcript": "Expected transcription here"
}
```
Output Annotation Format
Basic output
```json
{
  "id": "doc_001",
  "text": "Great product!",
  "annotations": {
    "sentiment": "Positive",
    "confidence": 5
  },
  "annotator": "user123",
  "timestamp": "2024-11-05T10:30:00Z"
}
```
Span annotations
```json
{
  "id": "ner_001",
  "text": "Apple CEO Tim Cook visited Paris.",
  "annotations": {
    "entities": [
      {"start": 0, "end": 5, "label": "ORG", "text": "Apple"},
      {"start": 10, "end": 18, "label": "PERSON", "text": "Tim Cook"},
      {"start": 27, "end": 32, "label": "LOC", "text": "Paris"}
    ]
  }
}
```
Multiple annotators
```json
{
  "id": "item_001",
  "text": "Sample text",
  "annotations": [
    {
      "annotator": "ann1",
      "labels": {"sentiment": "Positive"},
      "timestamp": "2024-11-05T10:00:00Z"
    },
    {
      "annotator": "ann2",
      "labels": {"sentiment": "Positive"},
      "timestamp": "2024-11-05T11:00:00Z"
    }
  ],
  "aggregated": {
    "sentiment": "Positive",
    "agreement": 1.0
  }
}
```
Configuration reference
```yaml
data_files:
  - data/items.json
item_properties:
  id_key: id
  text_key: text
  image_key: image_path
  audio_key: audio_path
```
Best Practices
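- Always include an ID: a unique identifier makes every item traceable
- Use JSONL for large datasets: it is more memory-efficient
- Validate before loading: check the JSON syntax
- Include metadata: source, date, and author help with debugging
- Use consistent field names: this simplifies downstream processing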
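Several of these practices can be checked automatically before a file is handed to Potato. The following is a minimal sketch, not part of Potato itself: `validate_jsonl` and `check_spans` are hypothetical helper names. The span check assumes the `pre_annotations.entities` layout used in the examples above, with end-exclusive character offsets (so that `text[start:end]` equals the span's `text`).

```python
import json


def validate_jsonl(lines, id_key="id"):
    """Return a list of problems found in an iterable of JSONL lines.

    Checks JSON syntax, presence and uniqueness of the ID field, and
    that every record uses the same field names as the first valid one.
    """
    problems = []
    seen_ids = set()
    expected_fields = None
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append(f"line {line_no}: invalid JSON ({err.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {line_no}: expected a JSON object")
            continue
        if id_key not in record:
            problems.append(f"line {line_no}: missing '{id_key}'")
        else:
            record_id = str(record[id_key])  # str() keeps the value hashable
            if record_id in seen_ids:
                problems.append(f"line {line_no}: duplicate id {record_id!r}")
            seen_ids.add(record_id)
        fields = set(record)
        if expected_fields is None:
            expected_fields = fields
        elif fields != expected_fields:
            problems.append(
                f"line {line_no}: fields {sorted(fields)} differ from "
                f"the first record's {sorted(expected_fields)}"
            )
    return problems


def check_spans(record, text_key="text"):
    """Return entity spans whose offsets do not match their quoted text."""
    text = record.get(text_key, "")
    mismatches = []
    for ent in record.get("pre_annotations", {}).get("entities", []):
        if text[ent["start"]:ent["end"]] != ent["text"]:
            mismatches.append(ent)
    return mismatches
```

Passing an open file object to `validate_jsonl` works because files iterate line by line. An empty result from both functions means none of these particular checks failed; it does not guarantee that Potato will accept the file.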
For the complete data format documentation, see /docs/core-concepts/data-formats.