Skip to content
Guides6 min read

从 Label Studio 迁移到 Potato

手动转换 Label Studio 项目、模板和标注到 Potato 格式的分步指南。

Potato Team·

从 Label Studio 迁移到 Potato

本指南帮助你手动将现有的 Label Studio 项目迁移到 Potato。迁移涉及手动转换配置和编写 Python 脚本来转换数据格式。

请注意,没有官方迁移工具——这是一个手动过程,需要了解两个平台。

为什么迁移?

Potato 在某些场景下具有优势:

  • 研究导向:专为学术标注研究设计
  • 众包集成:原生 Prolific 和 MTurk 集成
  • 简洁性:基于 YAML 的配置,无需数据库
  • 可定制:易于用 Python 扩展
  • 轻量级:基于文件的存储,易于部署

迁移概览

迁移是一个手动过程,包含以下步骤:

  1. 手动将 Label Studio XML 模板转换为 Potato YAML 配置
  2. 编写 Python 脚本转换数据格式(JSON 到 JSONL)
  3. 编写脚本迁移现有标注(如有)
  4. 充分测试并验证转换后的数据

模板转换

文本分类

Label Studio XML:

xml
<View>
  <Text name="text" value="$text"/>
  <Choices name="sentiment" toName="text" choice="single">
    <Choice value="Positive"/>
    <Choice value="Negative"/>
    <Choice value="Neutral"/>
  </Choices>
</View>

Potato YAML:

yaml
annotation_task_name: "Sentiment Classification"
 
data_files:
  - "data/items.jsonl"
 
item_properties:
  id_key: id
  text_key: text
 
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "What is the sentiment?"
    labels:
      - name: positive
        tooltip: "Positive sentiment"
      - name: negative
        tooltip: "Negative sentiment"
      - name: neutral
        tooltip: "Neutral sentiment"

多标签分类

Label Studio XML:

xml
<View>
  <Text name="text" value="$text"/>
  <Choices name="topics" toName="text" choice="multiple">
    <Choice value="Politics"/>
    <Choice value="Sports"/>
    <Choice value="Technology"/>
    <Choice value="Entertainment"/>
  </Choices>
</View>

Potato YAML:

yaml
annotation_schemes:
  - annotation_type: multiselect
    name: topics
    description: "Select all relevant topics"
    labels:
      - name: politics
        tooltip: "Politics content"
      - name: sports
        tooltip: "Sports content"
      - name: technology
        tooltip: "Technology content"
      - name: entertainment
        tooltip: "Entertainment content"

命名实体识别

Label Studio XML:

xml
<View>
  <Labels name="entities" toName="text">
    <Label value="PERSON" background="#FFC0CB"/>
    <Label value="ORG" background="#90EE90"/>
    <Label value="LOCATION" background="#ADD8E6"/>
  </Labels>
  <Text name="text" value="$text"/>
</View>

Potato YAML:

yaml
annotation_schemes:
  - annotation_type: span
    name: entities
    description: "Select entity spans in the text"
    labels:
      - name: PERSON
        tooltip: "Person names"
      - name: ORG
        tooltip: "Organization names"
      - name: LOCATION
        tooltip: "Location names"

注意:Potato 的 span 标注可能使用与 Label Studio 不同的高亮方式。请测试转换后的配置以确认显示效果符合需求。

图像分类

Label Studio XML:

xml
<View>
  <Image name="image" value="$image_url"/>
  <Choices name="category" toName="image">
    <Choice value="Cat"/>
    <Choice value="Dog"/>
    <Choice value="Other"/>
  </Choices>
</View>

Potato YAML:

yaml
data_files:
  - "data/images.jsonl"
 
item_properties:
  id_key: id
  text_key: image_url
 
annotation_schemes:
  - annotation_type: radio
    name: category
    description: "What animal is in the image?"
    labels:
      - name: cat
        tooltip: "Cat"
      - name: dog
        tooltip: "Dog"
      - name: other
        tooltip: "Other animal"

边界框标注

Label Studio XML:

xml
<View>
  <Image name="image" value="$image_url"/>
  <RectangleLabels name="objects" toName="image">
    <Label value="Car"/>
    <Label value="Person"/>
    <Label value="Bicycle"/>
  </RectangleLabels>
</View>

Potato YAML:

yaml
annotation_schemes:
  - annotation_type: bounding_box
    name: objects
    description: "Draw boxes around objects"
    labels:
      - name: car
        tooltip: "Car"
      - name: person
        tooltip: "Person"
      - name: bicycle
        tooltip: "Bicycle"

注意:Potato 中的边界框支持可能与 Label Studio 有所不同。请查看文档了解当前功能。

评分量表

Label Studio XML:

xml
<View>
  <Text name="text" value="$text"/>
  <Rating name="quality" toName="text" maxRating="5"/>
</View>

Potato YAML:

yaml
annotation_schemes:
  - annotation_type: likert
    name: quality
    description: "Rate the quality"
    size: 5
    labels:
      - name: "1"
        tooltip: "Poor"
      - name: "2"
        tooltip: "Below average"
      - name: "3"
        tooltip: "Average"
      - name: "4"
        tooltip: "Good"
      - name: "5"
        tooltip: "Excellent"

数据格式转换

Label Studio JSON 到 Potato JSONL

Label Studio 格式:

json
[
  {
    "id": 1,
    "data": {
      "text": "This is great!",
      "meta_info": "source1"
    }
  },
  {
    "id": 2,
    "data": {
      "text": "This is terrible.",
      "meta_info": "source2"
    }
  }
]

Potato JSONL 格式:

json
{"id": "1", "text": "This is great!", "metadata": {"source": "source1"}}
{"id": "2", "text": "This is terrible.", "metadata": {"source": "source2"}}

转换脚本

python
import json
 
def convert_label_studio_to_potato(ls_file, potato_file):
    """Convert Label Studio JSON to Potato JSONL"""
 
    with open(ls_file, 'r') as f:
        ls_data = json.load(f)
 
    with open(potato_file, 'w') as f:
        for item in ls_data:
            potato_item = {
                "id": str(item["id"]),
                "text": item["data"].get("text", ""),
            }
 
            # Convert nested data fields
            if "data" in item:
                for key, value in item["data"].items():
                    if key != "text":
                        if "metadata" not in potato_item:
                            potato_item["metadata"] = {}
                        potato_item["metadata"][key] = value
 
            # Handle image URLs
            if "image" in item.get("data", {}):
                potato_item["image_url"] = item["data"]["image"]
 
            f.write(json.dumps(potato_item) + "\n")
 
    print(f"Converted {len(ls_data)} items")
 
# Usage
convert_label_studio_to_potato("label_studio_export.json", "data/items.jsonl")

标注迁移

转换现有标注

python
def convert_annotations(ls_export, potato_output):
    """Convert Label Studio annotations to Potato format"""
 
    with open(ls_export, 'r') as f:
        ls_data = json.load(f)
 
    with open(potato_output, 'w') as f:
        for item in ls_data:
            if "annotations" not in item or not item["annotations"]:
                continue
 
            for annotation in item["annotations"]:
                potato_ann = {
                    "id": str(item["id"]),
                    "text": item["data"].get("text", ""),
                    "annotations": {},
                    "annotator": annotation.get("completed_by", {}).get("email", "unknown"),
                    "timestamp": annotation.get("created_at", "")
                }
 
                # Convert results
                for result in annotation.get("result", []):
                    scheme_name = result.get("from_name", "unknown")
 
                    if result["type"] == "choices":
                        # Classification
                        potato_ann["annotations"][scheme_name] = result["value"]["choices"][0]
 
                    elif result["type"] == "labels":
                        # NER spans
                        if scheme_name not in potato_ann["annotations"]:
                            potato_ann["annotations"][scheme_name] = []
 
                        potato_ann["annotations"][scheme_name].append({
                            "start": result["value"]["start"],
                            "end": result["value"]["end"],
                            "label": result["value"]["labels"][0],
                            "text": result["value"]["text"]
                        })
 
                    elif result["type"] == "rating":
                        potato_ann["annotations"][scheme_name] = result["value"]["rating"]
 
                f.write(json.dumps(potato_ann) + "\n")
 
# Usage
convert_annotations("ls_annotated_export.json", "annotations/migrated.jsonl")

Span 标注转换

Label Studio 使用字符偏移;Potato 也使用字符偏移,所以转换很直接:

python
def convert_spans(ls_spans):
    """Convert Label Studio span format to Potato format"""
    potato_spans = []
 
    for span in ls_spans:
        potato_spans.append({
            "start": span["value"]["start"],
            "end": span["value"]["end"],
            "label": span["value"]["labels"][0],
            "text": span["value"]["text"]
        })
 
    return potato_spans

功能映射

Label StudioPotato
Choices (single)radio
Choices (multiple)multiselect
Labelsspan
Ratinglikert
TextAreatext
RectangleLabelsbounding_box
PolygonLabelspolygon
Taxonomy(使用嵌套 multiselect)
Pairwisecomparison

质量控制注意事项

从 Label Studio 迁移时,你需要手动实现质量控制措施。Potato 提供了一些基本的质量控制功能,但没有全面的内置质量控制系统。

质量控制方法

注意力检测:你可以手动将注意力检测项目添加到数据文件中。这些是包含已知正确答案的普通项目,用于验证标注者的注意力:

python
# Add attention check items to your data
attention_items = [
    {"id": "attn_1", "text": "ATTENTION CHECK: Please select 'Positive'", "is_attention": True},
    {"id": "attn_2", "text": "ATTENTION CHECK: Please select 'Negative'", "is_attention": True},
]
 
# Intersperse with regular items
import random
all_items = regular_items + attention_items
random.shuffle(all_items)

一致性计算:使用收集的标注数据离线计算标注者间一致性:

python
from sklearn.metrics import cohen_kappa_score
import numpy as np
 
def compute_agreement(annotations_file):
    """Compute agreement from collected annotations"""
    # Load annotations and compute metrics externally
    # Potato does not have built-in agreement calculation
    pass

冗余标注:通过在数据管理过程中将项目分配给多个用户来配置每个项目的多个标注者。

用户迁移

从 Label Studio 导出用户

python
# Label Studio API call to get users
import requests
 
def export_ls_users(ls_url, api_key):
    response = requests.get(
        f"{ls_url}/api/users",
        headers={"Authorization": f"Token {api_key}"}
    )
    return response.json()

创建 Potato 用户配置

yaml
user_config:
  # Simple auth for migrated users
  auth_type: password
 
  users:
    - username: user1@example.com
      password_hash: "..."  # Generate new passwords
 
    - username: user2@example.com
      password_hash: "..."

测试迁移

验证脚本

python
def validate_migration(original_ls, converted_potato):
    """Validate converted data matches original"""
 
    with open(original_ls) as f:
        ls_data = json.load(f)
 
    with open(converted_potato) as f:
        potato_data = [json.loads(line) for line in f]
 
    # Check item count
    assert len(ls_data) == len(potato_data), "Item count mismatch"
 
    # Check IDs preserved
    ls_ids = {str(item["id"]) for item in ls_data}
    potato_ids = {item["id"] for item in potato_data}
    assert ls_ids == potato_ids, "ID mismatch"
 
    # Check text content
    for ls_item, potato_item in zip(
        sorted(ls_data, key=lambda x: x["id"]),
        sorted(potato_data, key=lambda x: x["id"])
    ):
        assert ls_item["data"]["text"] == potato_item["text"], \
            f"Text mismatch for item {ls_item['id']}"
 
    print("Validation passed!")
 
validate_migration("label_studio_export.json", "data/items.jsonl")

迁移检查清单

  • 从 Label Studio 导出数据(JSON 格式)
  • 手动将模板 XML 转换为 Potato YAML
  • 编写并运行 Python 脚本转换数据格式(JSON 到 JSONL)
  • 编写并运行脚本转换现有标注(如有)
  • 设置 Potato 项目结构
  • 使用示例数据测试
  • 验证转换后的数据与原始数据匹配
  • 培训标注者使用新界面
  • 运行试标注批次

常见问题

字符编码

Label Studio 和 Potato 都使用 UTF-8,但请检查数据中的编码问题。

图像路径

将本地路径转换为 URL,或更新路径以匹配 Potato 的预期格式。

自定义组件

Label Studio 自定义组件需要重新创建为 Potato 自定义模板。

API 差异

如果你自动化了 Label Studio,请更新脚本以使用 Potato 的 API。


需要帮助迁移?查看完整文档或在 GitHub 上联系我们。