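Data Formats

Potato input formats: text, JSON, JSONL, CSV, images, audio, and video. Output can be exported as CoNLL, HuggingFace Datasets, spaCy, COCO, YOLO, Parquet, and other formats.

Potato supports several data formats out of the box. This guide explains how to structure your data for annotation.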

Figure: From input file to exported labels. Input files become instances with an id, text, and metadata; the instances are annotated and then exported to many formats.

Supported Formats

Format	Extension	Description
JSON	`.json`	Array of objects
JSON Lines	`.jsonl`	One JSON object per line
CSV	`.csv`	Comma-separated values
TSV	`.tsv`	Tab-separated values

JSON Format

The most common format. The data must be an array of objects.

json

[
  {
    "id": "doc_001",
    "text": "This is the first document to annotate.",
    "source": "twitter",
    "date": "2024-01-15"
  },
  {
    "id": "doc_002",
    "text": "This is the second document.",
    "source": "reddit",
    "date": "2024-01-16"
  }
]

JSON Lines Format

Each line is a separate JSON object. This is useful for large datasets.

jsonl

{"id": "doc_001", "text": "First document"}
{"id": "doc_002", "text": "Second document"}
{"id": "doc_003", "text": "Third document"}
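
Because every line parses on its own, a JSON Lines file can be read one record at a time instead of being loaded whole. The following is a minimal Python sketch of that idea; `iter_jsonl` is a hypothetical helper name used for illustration and is not part of Potato.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed object per non-empty line of a JSON Lines stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, such as a trailing newline
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Invalid JSON on line {line_number}: {exc}") from exc


# Illustrative in-memory data; for a real file, pass open(path, encoding="utf-8").
sample = io.StringIO(
    '{"id": "doc_001", "text": "First document"}\n'
    '{"id": "doc_002", "text": "Second document"}\n'
)
records = list(iter_jsonl(sample))
```

Reporting the line number on failure makes a single malformed record easy to locate in a large file.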

CSV/TSV Format

Tabular data with a header row.

csv

id,text,source
doc_001,"This is the first document",twitter
doc_002,"This is the second document",reddit
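
The header row supplies the field names, and quoting lets a text value contain commas. As a rough illustration of how such rows map onto the JSON objects shown earlier, here is a small Python sketch using the standard `csv` module; `csv_to_jsonl` is a hypothetical helper, not a Potato function.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str) -> str:
    """Convert CSV text with a header row into JSON Lines text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Each row becomes a dict keyed by the header names.
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in reader)


csv_text = (
    "id,text,source\n"
    'doc_001,"This is the first document",twitter\n'
    'doc_002,"This, with a comma, is the second",reddit\n'
)
jsonl_text = csv_to_jsonl(csv_text)
```

For TSV input, passing `delimiter="\t"` to `csv.DictReader` should give the same mapping.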

Configuration

Basic Setup

Configure the data files and field mappings in YAML.

yaml

data_files:
  - "data/documents.json"
 
item_properties:
  id_key: id      # Field name for unique ID
  text_key: text  # Field name for content to annotate

Multiple Data Files

Combine multiple data sources.

yaml

data_files:
  - "data/batch_1.json"
  - "data/batch_2.json"
  - "data/batch_3.jsonl"

Files are processed in the order listed and combined into a single dataset.
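
To preview what a combined dataset would contain, the files can be concatenated offline in the same order. This is a sketch under the assumption that `.json` files hold arrays and `.jsonl` files hold one object per line; `load_instances` is a hypothetical helper and does not reproduce Potato's own loader.

```python
import json
import tempfile
from pathlib import Path


def load_instances(paths):
    """Load .json (array) and .jsonl files in order into one list."""
    instances = []
    for path in map(Path, paths):
        text = path.read_text(encoding="utf-8")
        if path.suffix == ".jsonl":
            # One JSON object per non-empty line.
            instances.extend(json.loads(line) for line in text.splitlines() if line.strip())
        else:
            # A single JSON array of objects.
            instances.extend(json.loads(text))
    return instances


# Illustrative files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    batch_1 = Path(tmp) / "batch_1.json"
    batch_3 = Path(tmp) / "batch_3.jsonl"
    batch_1.write_text('[{"id": "a", "text": "one"}]', encoding="utf-8")
    batch_3.write_text(
        '{"id": "b", "text": "two"}\n{"id": "c", "text": "three"}\n',
        encoding="utf-8",
    )
    combined = load_instances([batch_1, batch_3])

combined_ids = [item["id"] for item in combined]
```

Since IDs must be unique, inspecting `combined_ids` for repeats is a quick way to catch collisions between batches.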

Data Types

Plain Text

Simple text content.

json

{
  "id": "1",
  "text": "The product arrived quickly and works great!"
}

Media Files

Reference images, video, or audio by path.

json

{
  "id": "1",
  "image_path": "images/photo_001.jpg"
}

yaml

item_properties:
  id_key: id
  image_key: image_path

Dialogue/Lists

Lists are automatically displayed horizontally.

json

{
  "id": "1",
  "text": "Option A,Option B,Option C"
}

Text Pairs

Use text pairs for comparison tasks.

json

{
  "id": "pair_001",
  "text": {
    "A": "Response from Model A",
    "B": "Response from Model B"
  }
}
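
Pair instances like the one above are usually generated rather than written by hand. The sketch below assumes two position-aligned lists of model outputs; `make_pairs` and the sample strings are hypothetical and only illustrate the structure.

```python
import json


def make_pairs(outputs_a, outputs_b):
    """Build comparison instances with the two texts keyed "A" and "B"."""
    if len(outputs_a) != len(outputs_b):
        raise ValueError("Both output lists must have the same length.")
    return [
        {"id": f"pair_{index:03d}", "text": {"A": a, "B": b}}
        for index, (a, b) in enumerate(zip(outputs_a, outputs_b), start=1)
    ]


# Illustrative outputs from two systems, aligned by position.
model_a_outputs = ["Paris is the capital of France.", "Water boils at 100 C at sea level."]
model_b_outputs = ["The capital of France is Paris.", "At sea level, water boils at 100 C."]

pairs = make_pairs(model_a_outputs, model_b_outputs)
pairs_jsonl = "\n".join(json.dumps(pair) for pair in pairs)
```

Writing `pairs_jsonl` to a `.jsonl` file gives one pair per line.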

HTML Files

Reference HTML files stored in a folder.

json

{
  "id": "1",
  "html_file": "html/document_001.html"
}

Annotation with Context

Include context information alongside the main text.

json

{
  "id": "1",
  "text": "This is great!",
  "context": "Previous message: How do you like the new feature?"
}

yaml

item_properties:
  id_key: id
  text_key: text
  context_key: context

Display Configuration

List Display Options

Control how lists and dictionaries are displayed.

yaml

list_as_text:
  # Add prefixes to items
  text_prefix: "A"  # A., B., C. (or "1" for 1., 2., 3.)
 
  # Display orientation
  horizontal: true  # Side-by-side (false for vertical)
 
  # Randomization
  randomize_values: true   # Shuffle list items
  randomize_keys: true     # Shuffle dictionary keys

HTML Content

Enable HTML rendering in the text field.

yaml

html_content: true

json

{
  "id": "1",
  "text": "<p>This is <strong>formatted</strong> text.</p>"
}

Output Configuration

Output Directory

Specify where annotations are saved.

yaml

output_annotation_dir: "output/"

Output Format

Choose the output format.

yaml

output_annotation_format: "json"  # json, jsonl, csv, or tsv

Output Structure

Each annotation record includes the document ID and the annotator's responses.

json

{
  "id": "doc_001",
  "user": "annotator_1",
  "annotations": {
    "sentiment": "Positive",
    "confidence": 4
  },
  "timestamp": "2024-01-15T10:30:00Z"
}
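
Records of this shape are straightforward to aggregate. As an illustration, the sketch below computes a per-document majority label for one scheme, assuming a list of records with the `id` and `annotations` fields shown above; `majority_labels` is a hypothetical helper, and the sample records are made up.

```python
from collections import Counter, defaultdict


def majority_labels(records, scheme):
    """Return {document id: most common label} for one annotation scheme."""
    votes = defaultdict(Counter)
    for record in records:
        label = record.get("annotations", {}).get(scheme)
        if label is not None:
            votes[record["id"]][label] += 1
    return {doc_id: counts.most_common(1)[0][0] for doc_id, counts in votes.items()}


# Illustrative records from three annotators.
sample_records = [
    {"id": "doc_001", "user": "annotator_1", "annotations": {"sentiment": "Positive"}},
    {"id": "doc_001", "user": "annotator_2", "annotations": {"sentiment": "Positive"}},
    {"id": "doc_001", "user": "annotator_3", "annotations": {"sentiment": "Negative"}},
    {"id": "doc_002", "user": "annotator_1", "annotations": {"sentiment": "Neutral"}},
]
majority = majority_labels(sample_records, "sentiment")
```

Ties are not handled specially here; a real analysis would need an explicit tie-breaking rule.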

Special Data Types

Best-Worst Scaling

For ranking tasks, provide the items as a comma-separated string.

json

{
  "id": "1",
  "text": "Item A,Item B,Item C,Item D"
}
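
Because the comma acts as the separator, items that contain commas would be split incorrectly. The sketch below generates instances in this format from a pool of items and rejects such items up front. It uses plain random sampling for brevity, which is an assumption of this example and not a balanced Best-Worst design; `make_bws_instances` is a hypothetical helper.

```python
import random


def make_bws_instances(items, tuple_size=4, n_tuples=3, seed=0):
    """Sample item tuples and join each into a comma-separated "text" field."""
    if any("," in item for item in items):
        raise ValueError("Items must not contain commas; the comma is the separator.")
    if tuple_size > len(items):
        raise ValueError("tuple_size cannot exceed the number of items.")
    rng = random.Random(seed)  # fixed seed so the same tuples can be regenerated
    return [
        {"id": f"bws_{index:03d}", "text": ",".join(rng.sample(items, tuple_size))}
        for index in range(1, n_tuples + 1)
    ]


candidates = ["Item A", "Item B", "Item C", "Item D", "Item E", "Item F"]
bws_instances = make_bws_instances(candidates)
```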

Custom Arguments

Include additional fields for display or filtering.

json

{
  "id": "1",
  "text": "Document content",
  "category": "news",
  "priority": "high",
  "custom_field": "any value"
}

Database Backend

Use MySQL for large datasets.

yaml

database:
  type: mysql
  host: localhost
  database: potato_db
  user: ${DB_USER}
  password: ${DB_PASSWORD}

Potato automatically creates the required tables on first startup.

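Data Validation

Potato validates the data at startup. It checks for the following problems:

Missing ID field - every item needs a unique identifier
Missing text field - every item needs content to annotate
Duplicate IDs - all IDs must be unique
File not found - check the data file paths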

Full Example

yaml

task_name: "Document Classification"
task_dir: "."
port: 8000
 
# Data configuration
data_files:
  - "data/documents.json"
 
item_properties:
  id_key: id
  text_key: text
  context_key: metadata
 
# Display settings
list_as_text:
  text_prefix: "1"
  horizontal: false
 
# Output
output_annotation_dir: "output/"
output_annotation_format: "json"
 
# Annotation scheme
annotation_schemes:
  - annotation_type: radio
    name: category
    description: "Select the document category"
    labels:
      - News
      - Opinion
      - Tutorial
      - Other
 
allow_all_users: true

Best Practices

1. Use Meaningful IDs

Meaningful IDs make tracking and debugging easier.

json

{"id": "twitter_2024_001", "text": "..."}

2. Keep Text Concise

Long texts slow annotation down. Consider:

Truncating to the key parts
Providing summaries
Using scrollable containers
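
As one way to apply the first option, long texts can be shortened before loading. This is a minimal sketch; `truncate_text` and its default of 280 characters are assumptions for illustration, not Potato settings.

```python
def truncate_text(text: str, max_chars: int = 280, ellipsis: str = "...") -> str:
    """Shorten text to at most max_chars, breaking at a word boundary if possible."""
    if len(text) <= max_chars:
        return text
    cut = text[: max_chars - len(ellipsis)]
    # Prefer to break at the last space so that words are not split.
    head, sep, _ = cut.rpartition(" ")
    return (head if sep else cut).rstrip() + ellipsis


shortened = truncate_text("alpha beta gamma delta", max_chars=15)
```

Keeping the full text in a separate metadata field preserves the original for later analysis.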

3. Include Metadata

Metadata helps with filtering and analysis.

json

{
  "id": "1",
  "text": "Content",
  "source": "twitter",
  "date": "2024-01-15",
  "language": "en"
}

4. Validate Before Loading

Check your data offline first.

python

import json
 
with open('data.json') as f:
    data = json.load(f)
 
# Check for required fields
for item in data:
    assert 'id' in item, f"Missing id: {item}"
    assert 'text' in item, f"Missing text: {item}"
 
# Check for duplicates
ids = [item['id'] for item in data]
assert len(ids) == len(set(ids)), "Duplicate IDs found"
 
print(f"Validated {len(data)} items")

5. Back Up the Original Data

For reproducibility, keep the raw data separate from the annotations.

6. Use JSON Lines for Large Files

JSON Lines is more memory-efficient than a JSON array.

bash

# Convert JSON array to JSON Lines
jq -c '.[]' data.json > data.jsonl

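Further Reading

Data Directory Loading - load from a directory with live watching
Dialogue Annotation - displaying multi-item data
Export Formats - output format options

For implementation details, see the source documentation.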