从 Label Studio 迁移到 Potato
手动转换 Label Studio 项目、模板和标注到 Potato 格式的分步指南。
从 Label Studio 迁移到 Potato
本指南帮助你手动将现有的 Label Studio 项目迁移到 Potato。迁移涉及手动转换配置和编写 Python 脚本来转换数据格式。
请注意,没有官方迁移工具——这是一个手动过程,需要了解两个平台。
为什么迁移?
Potato 在某些场景下具有优势:
- 研究导向:专为学术标注研究设计
- 众包集成:原生 Prolific 和 MTurk 集成
- 简洁性:基于 YAML 的配置,无需数据库
- 可定制:易于用 Python 扩展
- 轻量级:基于文件的存储,易于部署
迁移概览
迁移是一个手动过程,包含以下步骤:
- 手动将 Label Studio XML 模板转换为 Potato YAML 配置
- 编写 Python 脚本转换数据格式(JSON 到 JSONL)
- 编写脚本迁移现有标注(如有)
- 充分测试并验证转换后的数据
模板转换
文本分类
Label Studio XML:
<View>
<Text name="text" value="$text"/>
<Choices name="sentiment" toName="text" choice="single">
<Choice value="Positive"/>
<Choice value="Negative"/>
<Choice value="Neutral"/>
</Choices>
</View>Potato YAML:
annotation_task_name: "Sentiment Classification"
data_files:
- "data/items.jsonl"
item_properties:
id_key: id
text_key: text
annotation_schemes:
- annotation_type: radio
name: sentiment
description: "What is the sentiment?"
labels:
- name: positive
tooltip: "Positive sentiment"
- name: negative
tooltip: "Negative sentiment"
- name: neutral
tooltip: "Neutral sentiment"多标签分类
Label Studio XML:
<View>
<Text name="text" value="$text"/>
<Choices name="topics" toName="text" choice="multiple">
<Choice value="Politics"/>
<Choice value="Sports"/>
<Choice value="Technology"/>
<Choice value="Entertainment"/>
</Choices>
</View>Potato YAML:
annotation_schemes:
- annotation_type: multiselect
name: topics
description: "Select all relevant topics"
labels:
- name: politics
tooltip: "Politics content"
- name: sports
tooltip: "Sports content"
- name: technology
tooltip: "Technology content"
- name: entertainment
tooltip: "Entertainment content"命名实体识别
Label Studio XML:
<View>
<Labels name="entities" toName="text">
<Label value="PERSON" background="#FFC0CB"/>
<Label value="ORG" background="#90EE90"/>
<Label value="LOCATION" background="#ADD8E6"/>
</Labels>
<Text name="text" value="$text"/>
</View>Potato YAML:
annotation_schemes:
- annotation_type: span
name: entities
description: "Select entity spans in the text"
labels:
- name: PERSON
tooltip: "Person names"
- name: ORG
tooltip: "Organization names"
- name: LOCATION
tooltip: "Location names"注意:Potato 的 span 标注可能使用与 Label Studio 不同的高亮方式。请测试转换后的配置以确认显示效果符合需求。
图像分类
Label Studio XML:
<View>
<Image name="image" value="$image_url"/>
<Choices name="category" toName="image">
<Choice value="Cat"/>
<Choice value="Dog"/>
<Choice value="Other"/>
</Choices>
</View>Potato YAML:
data_files:
- "data/images.jsonl"
item_properties:
id_key: id
text_key: image_url
annotation_schemes:
- annotation_type: radio
name: category
description: "What animal is in the image?"
labels:
- name: cat
tooltip: "Cat"
- name: dog
tooltip: "Dog"
- name: other
tooltip: "Other animal"边界框标注
Label Studio XML:
<View>
<Image name="image" value="$image_url"/>
<RectangleLabels name="objects" toName="image">
<Label value="Car"/>
<Label value="Person"/>
<Label value="Bicycle"/>
</RectangleLabels>
</View>Potato YAML:
annotation_schemes:
- annotation_type: bounding_box
name: objects
description: "Draw boxes around objects"
labels:
- name: car
tooltip: "Car"
- name: person
tooltip: "Person"
- name: bicycle
tooltip: "Bicycle"注意:Potato 中的边界框支持可能与 Label Studio 有所不同。请查看文档了解当前功能。
评分量表
Label Studio XML:
<View>
<Text name="text" value="$text"/>
<Rating name="quality" toName="text" maxRating="5"/>
</View>Potato YAML:
annotation_schemes:
- annotation_type: likert
name: quality
description: "Rate the quality"
size: 5
labels:
- name: "1"
tooltip: "Poor"
- name: "2"
tooltip: "Below average"
- name: "3"
tooltip: "Average"
- name: "4"
tooltip: "Good"
- name: "5"
tooltip: "Excellent"数据格式转换
Label Studio JSON 到 Potato JSONL
Label Studio 格式:
[
{
"id": 1,
"data": {
"text": "This is great!",
"meta_info": "source1"
}
},
{
"id": 2,
"data": {
"text": "This is terrible.",
"meta_info": "source2"
}
}
]Potato JSONL 格式:
{"id": "1", "text": "This is great!", "metadata": {"source": "source1"}}
{"id": "2", "text": "This is terrible.", "metadata": {"source": "source2"}}转换脚本
import json
def convert_label_studio_to_potato(ls_file, potato_file):
"""Convert Label Studio JSON to Potato JSONL"""
with open(ls_file, 'r') as f:
ls_data = json.load(f)
with open(potato_file, 'w') as f:
for item in ls_data:
potato_item = {
"id": str(item["id"]),
"text": item["data"].get("text", ""),
}
# Convert nested data fields
if "data" in item:
for key, value in item["data"].items():
if key != "text":
if "metadata" not in potato_item:
potato_item["metadata"] = {}
potato_item["metadata"][key] = value
# Handle image URLs
if "image" in item.get("data", {}):
potato_item["image_url"] = item["data"]["image"]
f.write(json.dumps(potato_item) + "\n")
print(f"Converted {len(ls_data)} items")
# Usage
convert_label_studio_to_potato("label_studio_export.json", "data/items.jsonl")标注迁移
转换现有标注
def convert_annotations(ls_export, potato_output):
"""Convert Label Studio annotations to Potato format"""
with open(ls_export, 'r') as f:
ls_data = json.load(f)
with open(potato_output, 'w') as f:
for item in ls_data:
if "annotations" not in item or not item["annotations"]:
continue
for annotation in item["annotations"]:
potato_ann = {
"id": str(item["id"]),
"text": item["data"].get("text", ""),
"annotations": {},
"annotator": annotation.get("completed_by", {}).get("email", "unknown"),
"timestamp": annotation.get("created_at", "")
}
# Convert results
for result in annotation.get("result", []):
scheme_name = result.get("from_name", "unknown")
if result["type"] == "choices":
# Classification
potato_ann["annotations"][scheme_name] = result["value"]["choices"][0]
elif result["type"] == "labels":
# NER spans
if scheme_name not in potato_ann["annotations"]:
potato_ann["annotations"][scheme_name] = []
potato_ann["annotations"][scheme_name].append({
"start": result["value"]["start"],
"end": result["value"]["end"],
"label": result["value"]["labels"][0],
"text": result["value"]["text"]
})
elif result["type"] == "rating":
potato_ann["annotations"][scheme_name] = result["value"]["rating"]
f.write(json.dumps(potato_ann) + "\n")
# Usage
convert_annotations("ls_annotated_export.json", "annotations/migrated.jsonl")Span 标注转换
Label Studio 使用字符偏移;Potato 也使用字符偏移,所以转换很直接:
def convert_spans(ls_spans):
"""Convert Label Studio span format to Potato format"""
potato_spans = []
for span in ls_spans:
potato_spans.append({
"start": span["value"]["start"],
"end": span["value"]["end"],
"label": span["value"]["labels"][0],
"text": span["value"]["text"]
})
return potato_spans功能映射
| Label Studio | Potato |
|---|---|
| Choices (single) | radio |
| Choices (multiple) | multiselect |
| Labels | span |
| Rating | likert |
| TextArea | text |
| RectangleLabels | bounding_box |
| PolygonLabels | polygon |
| Taxonomy | (使用嵌套 multiselect) |
| Pairwise | comparison |
质量控制注意事项
从 Label Studio 迁移时,你需要手动实现质量控制措施。Potato 提供了一些基本的质量控制功能,但没有全面的内置质量控制系统。
质量控制方法
注意力检测:你可以手动将注意力检测项目添加到数据文件中。这些是包含已知正确答案的普通项目,用于验证标注者的注意力:
# Add attention check items to your data
attention_items = [
{"id": "attn_1", "text": "ATTENTION CHECK: Please select 'Positive'", "is_attention": True},
{"id": "attn_2", "text": "ATTENTION CHECK: Please select 'Negative'", "is_attention": True},
]
# Intersperse with regular items
import random
all_items = regular_items + attention_items
random.shuffle(all_items)一致性计算:使用收集的标注数据离线计算标注者间一致性:
from sklearn.metrics import cohen_kappa_score
import numpy as np
def compute_agreement(annotations_file):
"""Compute agreement from collected annotations"""
# Load annotations and compute metrics externally
# Potato does not have built-in agreement calculation
pass冗余标注:通过在数据管理过程中将项目分配给多个用户来配置每个项目的多个标注者。
用户迁移
从 Label Studio 导出用户
# Label Studio API call to get users
import requests
def export_ls_users(ls_url, api_key):
response = requests.get(
f"{ls_url}/api/users",
headers={"Authorization": f"Token {api_key}"}
)
return response.json()创建 Potato 用户配置
user_config:
# Simple auth for migrated users
auth_type: password
users:
- username: user1@example.com
password_hash: "..." # Generate new passwords
- username: user2@example.com
password_hash: "..."测试迁移
验证脚本
def validate_migration(original_ls, converted_potato):
"""Validate converted data matches original"""
with open(original_ls) as f:
ls_data = json.load(f)
with open(converted_potato) as f:
potato_data = [json.loads(line) for line in f]
# Check item count
assert len(ls_data) == len(potato_data), "Item count mismatch"
# Check IDs preserved
ls_ids = {str(item["id"]) for item in ls_data}
potato_ids = {item["id"] for item in potato_data}
assert ls_ids == potato_ids, "ID mismatch"
# Check text content
for ls_item, potato_item in zip(
sorted(ls_data, key=lambda x: x["id"]),
sorted(potato_data, key=lambda x: x["id"])
):
assert ls_item["data"]["text"] == potato_item["text"], \
f"Text mismatch for item {ls_item['id']}"
print("Validation passed!")
validate_migration("label_studio_export.json", "data/items.jsonl")迁移检查清单
- 从 Label Studio 导出数据(JSON 格式)
- 手动将模板 XML 转换为 Potato YAML
- 编写并运行 Python 脚本转换数据格式(JSON 到 JSONL)
- 编写并运行脚本转换现有标注(如有)
- 设置 Potato 项目结构
- 使用示例数据测试
- 验证转换后的数据与原始数据匹配
- 培训标注者使用新界面
- 运行试标注批次
常见问题
字符编码
Label Studio 和 Potato 都使用 UTF-8,但请检查数据中的编码问题。
图像路径
将本地路径转换为 URL,或更新路径以匹配 Potato 的预期格式。
自定义组件
Label Studio 自定义组件需要重新创建为 Potato 自定义模板。
API 差异
如果你自动化了 Label Studio,请更新脚本以使用 Potato 的 API。