Best-Worst Scaling

Efficient comparative annotation with Best-Worst Scaling in Potato: automatically generate comparison tuples and convert the selections into continuous quality scores.

Added in v2.3.0

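Best-Worst Scaling (BWS), also known as Maximum Difference Scaling (MaxDiff), is a comparative annotation method in which annotators are shown a tuple of items (typically 4) and select the best and the worst item according to some criterion. BWS produces reliable scalar scores from simple binary judgments and needs far fewer annotations than direct rating scales to reach the same statistical power.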

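BWS is especially useful when:

Direct numeric ratings suffer from annotator bias (people use the scale differently)
You need a reliable ranking of hundreds or thousands of items
The quality dimension is inherently relative (e.g., "Which translation is the most fluent?")
You want to maximize information per annotation (each BWS judgment provides more bits than a Likert rating)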

Figure: Potato's best-worst scaling interface for comparative annotation.

Basic Configuration

yaml

annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    description: "Select the BEST and WORST translation by fluency"
 
    # Items to compare
    items_key: "translations"    # key in instance data containing the list of items
 
    # Tuple size (how many items shown at once)
    tuple_size: 4                # typically 4; valid range is 3-8
 
    # Labels for best/worst buttons
    best_label: "Most Fluent"
    worst_label: "Least Fluent"
 
    # Display options
    show_item_labels: true       # show "A", "B", "C", "D" labels
    randomize_order: true        # randomize item order within each tuple
    show_source: false           # optionally show which system produced each item
 
    # Validation
    label_requirement:
      required: true             # must select both best and worst

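Data Format

Each instance in the data file must contain a list of items to compare. Potato generates tuples from this list automatically.

Option 1: All items in one instance

When you have a single set of items to rank (e.g., several translations of one sentence):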

json

{
  "id": "sent_001",
  "source": "The cat sat on the mat.",
  "translations": [
    {"id": "sys_a", "text": "Le chat s'est assis sur le tapis."},
    {"id": "sys_b", "text": "Le chat a assis sur le tapis."},
    {"id": "sys_c", "text": "Le chat etait assis sur le mat."},
    {"id": "sys_d", "text": "Le chat se tenait sur le tapis."}
  ]
}

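Option 2: Pre-generated tuples

To fully control which items appear together, supply pre-generated tuples. Each instance then lists exactly the items of one tuple.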

json

{
  "id": "tuple_001",
  "translations": [
    {"id": "sys_a", "text": "Le chat s'est assis sur le tapis."},
    {"id": "sys_b", "text": "Le chat a assis sur le tapis."},
    {"id": "sys_c", "text": "Le chat etait assis sur le mat."},
    {"id": "sys_d", "text": "Le chat se tenait sur le tapis."}
  ]
}

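Automatic Tuple Generation

When the item list is longer than the tuple size, Potato generates tuples automatically. The generation algorithm ensures that:

Every item appears in roughly the same number of tuples
Every pair of items appears together in at least one tuple (for reliable relative scores)
Tuples are balanced so that no item is always shown first or last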
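The first two properties above can be checked on any set of tuples. The following is a minimal sketch, not part of Potato: `tuple_design_stats` is a hypothetical helper that counts item appearances and pair co-occurrences. It does not check display-position balance.

```python
from collections import Counter
from itertools import combinations


def tuple_design_stats(tuples):
    """Summarize how often each item and each item pair appears across tuples."""
    appearances = Counter()
    pair_counts = Counter()
    for tup in tuples:
        appearances.update(tup)
        # Sort so that (a, b) and (b, a) count as the same pair.
        pair_counts.update(combinations(sorted(tup), 2))
    items = sorted(appearances)
    missing_pairs = [p for p in combinations(items, 2) if pair_counts[p] == 0]
    return {
        "min_appearances": min(appearances.values()),
        "max_appearances": max(appearances.values()),
        "missing_pairs": missing_pairs,
    }


# Hypothetical design: 4 items, tuple size 3, every triple used once.
design = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("b", "c", "d")]
stats = tuple_design_stats(design)
print(stats)
```

As a matter of arithmetic, an item that appears in K tuples of size T shares a tuple with at most K * (T - 1) other items, so full pair coverage is only possible when K * (T - 1) >= N - 1.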

Configure tuple generation:

yaml

annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    items_key: "translations"
    tuple_size: 4
 
    tuple_generation:
      method: balanced_incomplete  # balanced_incomplete or random
      tuples_per_item: 5           # each item appears in ~5 tuples
      seed: 42                     # for reproducibility
      ensure_pair_coverage: true   # every pair co-occurs at least once

For N items, tuple size T, and tuples_per_item = K, Potato generates roughly N * K / T tuples in total.
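As a quick planning aid, the estimate can be computed directly. This is a sketch with a hypothetical helper name; rounding up is an assumption, since the text only says "roughly".

```python
import math


def expected_tuple_count(n_items, tuples_per_item, tuple_size):
    """Approximate number of tuples, N * K / T, rounded up (assumed rounding)."""
    return math.ceil(n_items * tuples_per_item / tuple_size)


# 200 items, each shown in about 5 tuples of size 4.
count = expected_tuple_count(200, 5, 4)
print(count)
```

Multiplying the result by the number of annotators per tuple gives the total number of judgments to budget for.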

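Generation Methods

balanced_incomplete (default): Uses a balanced incomplete block design to maximize statistical efficiency. Every item appears with the same frequency, and pair co-occurrence is as uniform as possible. Recommended for most use cases.

random: Samples tuples at random with replacement. Faster for very large item sets (N > 10,000) but statistically less efficient. Use it when exact balance is not important.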
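To illustrate why balanced appearance counts are cheap to obtain, here is a simple shuffle-and-chunk sketch. It is neither Potato's balanced_incomplete algorithm nor its random method; the function name and the padding rule are assumptions made for illustration, and items are assumed to be unique.

```python
import random
from collections import Counter


def shuffled_round_tuples(items, tuple_size, tuples_per_item, seed=None):
    """Build tuples by shuffling the item list once per round and chunking it.

    Each round places every item in exactly one tuple, so after
    `tuples_per_item` rounds every item has appeared that many times,
    plus a few extra appearances when a short final chunk needs padding.
    """
    if len(items) < tuple_size:
        raise ValueError("need at least tuple_size items")
    rng = random.Random(seed)
    tuples = []
    for _ in range(tuples_per_item):
        order = list(items)
        rng.shuffle(order)
        for start in range(0, len(order), tuple_size):
            chunk = order[start:start + tuple_size]
            if len(chunk) < tuple_size:
                # Pad the last chunk with items that are not already in it.
                pool = [x for x in order if x not in chunk]
                chunk += rng.sample(pool, tuple_size - len(chunk))
            tuples.append(tuple(chunk))
    return tuples


tuples = shuffled_round_tuples([f"item_{i}" for i in range(200)], 4, 5, seed=42)
appearances = Counter(x for t in tuples for x in t)
print(len(tuples), set(appearances.values()))
```

This approach balances item frequency but makes no attempt to even out pair co-occurrence, which is the part a block design adds.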

Pre-generating Tuples with the CLI

For large projects, generate tuples ahead of time.

bash

python -m potato.bws generate-tuples \
  --items data/items.jsonl \
  --tuple-size 4 \
  --tuples-per-item 5 \
  --output data/tuples.jsonl \
  --seed 42

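Scoring Methods

Once annotation is complete, Potato computes item scores from the BWS judgments using one of three methods.

1. Counting (default)

The simplest method. Each item's score is the proportion of times it was chosen as "best" minus the proportion of times it was chosen as "worst".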

Score(item) = (best_count - worst_count) / total_appearances

Scores range from -1.0 (always worst) to +1.0 (always best).

bash

python -m potato.bws score \
  --config config.yaml \
  --method counting \
  --output scores.csv
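The counting formula above is easy to reproduce by hand. The sketch below is independent of Potato's implementation; the `(items, best, worst)` judgment layout is an assumption chosen for the example.

```python
from collections import Counter


def counting_scores(judgments):
    """Compute (best_count - worst_count) / total_appearances per item.

    `judgments` is an iterable of (items, best, worst), where `items`
    is the tuple that was shown to the annotator.
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, best_item, worst_item in judgments:
        seen.update(items)
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


judgments = [
    (("sys_a", "sys_b", "sys_c", "sys_d"), "sys_d", "sys_c"),
    (("sys_a", "sys_b", "sys_c", "sys_d"), "sys_a", "sys_c"),
]
scores = counting_scores(judgments)
print(scores)
```

Here sys_c is chosen as worst in both judgments and therefore receives the minimum score of -1.0.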

2. Bradley-Terry

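Fits a Bradley-Terry model to the pairwise comparisons implied by the BWS judgments. Each "best" choice implies that the item is preferred over every other item in the tuple, and each "worst" choice implies that every other item is preferred over the worst one.

Bradley-Terry produces scores on a log-odds scale and has better statistical properties than counting, especially with sparse data.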

bash

python -m potato.bws score \
  --config config.yaml \
  --method bradley_terry \
  --max-iter 1000 \
  --tolerance 1e-6 \
  --output scores.csv
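The pairwise expansion and model fit described above can be sketched as follows. This is an illustrative implementation of the standard minorization-maximization (MM) update for Bradley-Terry, not Potato's code. Counting the best-over-worst pair once and the small `prior` smoothing term are assumptions, and each judgment is assumed to have different best and worst items.

```python
import math
from collections import defaultdict


def bws_to_pairs(judgments):
    """Expand (items, best, worst) judgments into (winner, loser) pairs."""
    pairs = []
    for items, best, worst in judgments:
        for other in items:
            if other != best:
                pairs.append((best, other))   # best beats every other item
            if other not in (best, worst):
                pairs.append((other, worst))  # every middle item beats worst
    return pairs


def bradley_terry(pairs, max_iter=1000, tolerance=1e-6, prior=0.1):
    """Fit Bradley-Terry with MM updates and return log-strengths.

    `prior` (must be > 0) adds a small virtual win in each direction for
    every compared pair, so items that never win keep a finite score.
    """
    wins = defaultdict(float)  # wins[(i, j)] = times i beat j
    for winner, loser in pairs:
        wins[(winner, loser)] += 1.0
    for pair in {frozenset(p) for p in pairs}:
        i, j = tuple(pair)
        wins[(i, j)] += prior
        wins[(j, i)] += prior

    items = sorted({x for p in pairs for x in p})
    total_wins = {i: 0.0 for i in items}
    n = defaultdict(float)  # n[(i, j)] = comparisons between i and j
    neighbors = {i: set() for i in items}
    for (i, j), w in wins.items():
        total_wins[i] += w
        n[(i, j)] += w
        n[(j, i)] += w
        neighbors[i].add(j)
        neighbors[j].add(i)

    strength = {i: 1.0 for i in items}
    for _ in range(max_iter):
        new = {}
        for i in items:
            denom = sum(n[(i, j)] / (strength[i] + strength[j]) for j in neighbors[i])
            new[i] = total_wins[i] / denom
        # Normalize so that the log-scores sum to zero.
        log_mean = sum(math.log(v) for v in new.values()) / len(new)
        new = {i: v / math.exp(log_mean) for i, v in new.items()}
        delta = max(abs(math.log(new[i]) - math.log(strength[i])) for i in items)
        strength = new
        if delta < tolerance:
            break
    return {i: math.log(v) for i, v in strength.items()}


judgments = [
    (("sys_a", "sys_b", "sys_c", "sys_d"), "sys_d", "sys_c"),
    (("sys_a", "sys_b", "sys_c", "sys_d"), "sys_a", "sys_c"),
]
bt_scores = bradley_terry(bws_to_pairs(judgments))
print(bt_scores)
```

The `max_iter` and `tolerance` parameters mirror the CLI flags shown above in name only; how Potato uses them internally is not documented here.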

3. Plackett-Luce

A generalization of Bradley-Terry that models the full ranking implied by each tuple judgment (best > middle items > worst). Plackett-Luce extracts more information from each annotation than Bradley-Terry.

bash

python -m potato.bws score \
  --config config.yaml \
  --method plackett_luce \
  --output scores.csv

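Scoring Method Comparison

Method	Speed	Data efficiency	Handles sparse data	Statistical model
Counting	Fast	Low	Yes	None (descriptive)
Bradley-Terry	Medium	Medium	Moderate	Pairwise comparisons
Plackett-Luce	Slower	High	Moderate	Full ranking

For most projects, Bradley-Terry is the best default choice. Use counting for quick exploratory analysis, and Plackett-Luce when you need maximum statistical efficiency from a limited number of annotations.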

Configuring Scoring in YAML

You can also configure scoring directly in the project configuration so that scores are computed automatically.

yaml

annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    items_key: "translations"
    tuple_size: 4
 
    scoring:
      method: bradley_terry
      auto_compute: true           # compute scores after each annotation session
      output_file: "output/fluency_scores.csv"
      include_confidence: true     # include confidence intervals
      bootstrap_iterations: 1000   # for confidence interval estimation
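The bootstrap_iterations setting suggests resampling-based intervals. The sketch below shows one common way to do this, a percentile bootstrap over judgments. It uses counting scores for brevity, and whether Potato resamples judgments, tuples, or annotators is not stated here, so treat the details as assumptions.

```python
import random
from collections import Counter


def counting_scores(judgments):
    """(best_count - worst_count) / total_appearances for each item."""
    best, worst, seen = Counter(), Counter(), Counter()
    for items, best_item, worst_item in judgments:
        seen.update(items)
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


def bootstrap_intervals(judgments, iterations=1000, level=0.95, seed=0):
    """Percentile bootstrap intervals for counting scores."""
    rng = random.Random(seed)
    judgments = list(judgments)
    samples = {}
    for _ in range(iterations):
        # Resample judgments with replacement and rescore.
        resample = rng.choices(judgments, k=len(judgments))
        for item, score in counting_scores(resample).items():
            samples.setdefault(item, []).append(score)
    tail = (1 - level) / 2
    intervals = {}
    for item, values in samples.items():
        values.sort()
        low = values[int(tail * (len(values) - 1))]
        high = values[int((1 - tail) * (len(values) - 1))]
        intervals[item] = (low, high)
    return intervals


judgments = [
    (("sys_a", "sys_b", "sys_c", "sys_d"), "sys_d", "sys_c"),
    (("sys_a", "sys_b", "sys_c", "sys_d"), "sys_a", "sys_c"),
]
intervals = bootstrap_intervals(judgments, iterations=200, seed=1)
print(intervals)
```

With only two judgments the intervals are very wide for most items, which is the expected behavior of a bootstrap on tiny samples.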

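Admin Dashboard Integration

The admin dashboard includes a dedicated BWS tab that shows:

Score distribution: a histogram of current item scores
Annotation progress: number of annotated tuples out of the total
Per-item coverage: how many times each item has been shown
Inter-annotator consistency: split-half reliability of the BWS scores
Score convergence: a line chart showing how scores stabilize as more annotations are collected

Access BWS analysis from the command line: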

bash

python -m potato.bws stats --config config.yaml

text

BWS Statistics
==============
Schema: fluency
Items: 200
Tuples: 250 (annotated: 180 / 250)
Annotations: 540 (3 annotators)

Score Summary (Bradley-Terry):
  Mean:   0.02
  Std:    0.43
  Range: -0.91 to +0.87

Top 5 Items:
  sys_d:  0.87 (±0.08)
  sys_a:  0.72 (±0.09)
  sys_f:  0.65 (±0.10)
  sys_b:  0.51 (±0.11)
  sys_k:  0.48 (±0.09)

Split-Half Reliability: r = 0.94
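Split-half reliability, as reported above, is typically obtained by scoring two random halves of the judgments separately and correlating the results. The following sketch assumes counting scores, Pearson correlation, and a random split over all judgments; Potato's exact procedure may differ.

```python
import math
import random
from collections import Counter


def counting_scores(judgments):
    """(best_count - worst_count) / total_appearances for each item."""
    best, worst, seen = Counter(), Counter(), Counter()
    for items, best_item, worst_item in judgments:
        seen.update(items)
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


def pearson(xs, ys):
    """Pearson correlation, or None if either side has zero variance."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def split_half_reliability(judgments, trials=100, seed=0):
    """Mean Pearson r between counting scores from random halves of the data."""
    rng = random.Random(seed)
    pool = list(judgments)
    correlations = []
    for _ in range(trials):
        rng.shuffle(pool)
        mid = len(pool) // 2
        first = counting_scores(pool[:mid])
        second = counting_scores(pool[mid:])
        shared = sorted(set(first) & set(second))
        if len(shared) < 2:
            continue
        r = pearson([first[i] for i in shared], [second[i] for i in shared])
        if r is not None:
            correlations.append(r)
    return sum(correlations) / len(correlations) if correlations else None


# Six identical judgments: both halves must agree perfectly.
judgments = [(("a", "b", "c", "d"), "a", "d")] * 6
reliability = split_half_reliability(judgments, trials=10)
print(reliability)
```

Averaging over many random splits reduces the dependence on any single partition of the data.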

Multiple BWS Dimensions

You can run multiple BWS schemes over the same set of items to assess different quality dimensions.

yaml

annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    description: "Select BEST and WORST by fluency"
    items_key: "translations"
    tuple_size: 4
    best_label: "Most Fluent"
    worst_label: "Least Fluent"
 
  - annotation_type: best_worst_scaling
    name: adequacy
    description: "Select BEST and WORST by meaning preservation"
    items_key: "translations"
    tuple_size: 4
    best_label: "Most Accurate"
    worst_label: "Least Accurate"

Both schemes share the same tuples (Potato generates one set of tuples per items_key), so annotators see each tuple once but provide two judgments.

Output Format

BWS annotations are stored per tuple.

json

{
  "id": "tuple_001",
  "annotations": {
    "fluency": {
      "best": "sys_d",
      "worst": "sys_c"
    },
    "adequacy": {
      "best": "sys_a",
      "worst": "sys_c"
    }
  },
  "annotator": "user_1",
  "timestamp": "2026-03-01T14:22:00Z"
}
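Records in this shape are straightforward to load for custom analysis. The sketch below assumes one JSON record per line (as with a JSONL export) and uses a hypothetical helper, `load_judgments`, to pull out the choices for one scheme.

```python
import json


def load_judgments(lines, schema):
    """Yield (tuple_id, annotator, best, worst) for one scheme from JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        choice = record.get("annotations", {}).get(schema)
        if choice is None:
            continue  # this record has no judgment for the requested scheme
        yield record["id"], record["annotator"], choice["best"], choice["worst"]


sample = (
    '{"id": "tuple_001", "annotations": {"fluency": {"best": "sys_d", "worst": "sys_c"}}, '
    '"annotator": "user_1", "timestamp": "2026-03-01T14:22:00Z"}'
)
rows = list(load_judgments([sample], "fluency"))
print(rows)
```

To read from disk, pass an open file object as `lines`, since iterating over a text file yields one line at a time.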

Full Example

A full configuration for evaluating machine translation systems:

yaml

annotation_task_name: "MT System Ranking (BWS)"
task_dir: "."
 
data_files:
  - "data/mt_tuples.jsonl"
 
item_properties:
  id_key: id
  text_key: source
 
instance_display:
  fields:
    - key: source
      type: text
      display_options:
        label: "Source Sentence"
 
annotation_schemes:
  - annotation_type: best_worst_scaling
    name: overall_quality
    description: "Select the BEST and WORST translation"
    items_key: "translations"
    tuple_size: 4
    best_label: "Best Translation"
    worst_label: "Worst Translation"
    randomize_order: true
    show_item_labels: true
 
    tuple_generation:
      method: balanced_incomplete
      tuples_per_item: 5
      seed: 42
 
    scoring:
      method: bradley_terry
      auto_compute: true
      output_file: "output/quality_scores.csv"
      include_confidence: true
 
output_annotation_dir: "output/"
export_annotation_format: "jsonl"

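Further Reading

Pairwise Comparison -- simpler two-item comparison
Likert Scales -- direct rating alternative
Multirate -- multi-dimensional direct rating
Export Formats -- exporting BWS data for analysis

See the original documentation for implementation details.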