Best-Worst Scaling

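Efficient comparative annotation using Best-Worst Scaling, with automatic tuple generation and scoring.

New in v2.3.0

Best-Worst Scaling (BWS), also known as Maximum Difference Scaling (MaxDiff), is a comparative annotation method in which annotators are shown a tuple of items (typically four) and asked to select the best and the worst item according to some criterion. BWS turns these simple best/worst choices into reliable scalar scores, and it needs far fewer annotations than direct rating scales to reach the same statistical power.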

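BWS is particularly useful when:

Direct numeric ratings are affected by annotator bias (people use scales differently)
You need a reliable ranking of hundreds to thousands of items
The quality dimension is inherently relative (e.g., "Which translation is the most fluent?")
You want to maximize the information per annotation (each BWS judgment provides more bits than a Likert rating)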
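
To make the "more bits per judgment" point concrete, the following is a minimal sketch that counts how many distinct answers one judgment can take and how many pairwise orderings a single best/worst answer fixes. The function names are illustrative and not part of Potato, and the bit counts are upper bounds that assume every answer is equally likely.

```python
import math


def bws_outcomes(tuple_size: int) -> int:
    """Number of distinct (best, worst) answers for one tuple."""
    return tuple_size * (tuple_size - 1)


def implied_pairs(tuple_size: int) -> int:
    """Pairwise orderings fixed by a single best/worst answer.

    The best item beats the other T-1 items and the worst item loses to the
    other T-1 items; the best-vs-worst pair is counted once.
    """
    return 2 * (tuple_size - 1) - 1


def max_bits(n_outcomes: int) -> float:
    """Upper bound on information per answer (uniform distribution)."""
    return math.log2(n_outcomes)


if __name__ == "__main__":
    t = 4
    total_pairs = t * (t - 1) // 2
    print(f"BWS with {t} items: {bws_outcomes(t)} possible answers, "
          f"at most {max_bits(bws_outcomes(t)):.2f} bits")
    print(f"5-point Likert: 5 possible answers, at most {max_bits(5):.2f} bits")
    print(f"Pairs resolved per BWS answer: {implied_pairs(t)} of {total_pairs}")
```

With four items, one answer resolves five of the six item pairs, which is where most of the efficiency gain over separate pairwise comparisons comes from.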

Basic Configuration

yaml

annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    description: "Select the BEST and WORST translation by fluency"
 
    # Items to compare
    items_key: "translations"    # key in instance data containing the list of items
 
    # Tuple size (how many items shown at once)
    tuple_size: 4                # typically 4; valid range is 3-8
 
    # Labels for best/worst buttons
    best_label: "Most Fluent"
    worst_label: "Least Fluent"
 
    # Display options
    show_item_labels: true       # show "A", "B", "C", "D" labels
    randomize_order: true        # randomize item order within each tuple
    show_source: false           # optionally show which system produced each item
 
    # Validation
    label_requirement:
      required: true             # must select both best and worst

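Data Format

Each instance in the data file must contain the list of items to compare. Potato generates tuples from this list automatically.

Option 1: All Items in One Instance

Use this when you have a single set of items to rank (e.g., the translations of one sentence):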

json

{
  "id": "sent_001",
  "source": "The cat sat on the mat.",
  "translations": [
    {"id": "sys_a", "text": "Le chat s'est assis sur le tapis."},
    {"id": "sys_b", "text": "Le chat a assis sur le tapis."},
    {"id": "sys_c", "text": "Le chat etait assis sur le mat."},
    {"id": "sys_d", "text": "Le chat se tenait sur le tapis."}
  ]
}

Option 2: Pre-Generated Tuples

If you want full control over which items are shown together, provide pre-generated tuples:

json

{
  "id": "tuple_001",
  "translations": [
    {"id": "sys_a", "text": "Le chat s'est assis sur le tapis."},
    {"id": "sys_b", "text": "Le chat a assis sur le tapis."},
    {"id": "sys_c", "text": "Le chat etait assis sur le mat."},
    {"id": "sys_d", "text": "Le chat se tenait sur le tapis."}
  ]
}
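
Before loading a data file, it can help to check that each instance has the shape shown in the two examples above. The following is a minimal sketch of such a check. The function name `validate_instance` and the specific rules (unique item ids, a `text` field on every item) are assumptions made for illustration, not documented Potato behavior.

```python
def validate_instance(instance, items_key="translations", tuple_size=4):
    """Return a list of human-readable problems; an empty list means OK.

    The rules here are illustrative assumptions based on the example data.
    """
    problems = []
    if "id" not in instance:
        problems.append("instance has no 'id'")

    items = instance.get(items_key)
    if not isinstance(items, list):
        problems.append(f"'{items_key}' is missing or not a list")
        return problems

    if len(items) < tuple_size:
        problems.append(
            f"only {len(items)} items, fewer than tuple_size={tuple_size}"
        )

    ids = []
    for position, item in enumerate(items):
        if not isinstance(item, dict) or "id" not in item or "text" not in item:
            problems.append(f"item {position} needs both 'id' and 'text'")
        else:
            ids.append(item["id"])

    if len(ids) != len(set(ids)):
        problems.append("item ids are not unique")
    return problems


if __name__ == "__main__":
    example = {
        "id": "tuple_001",
        "translations": [
            {"id": "sys_a", "text": "Le chat s'est assis sur le tapis."},
            {"id": "sys_b", "text": "Le chat a assis sur le tapis."},
            {"id": "sys_c", "text": "Le chat etait assis sur le mat."},
            {"id": "sys_d", "text": "Le chat se tenait sur le tapis."},
        ],
    }
    print(validate_instance(example))
```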

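Automatic Tuple Generation

When the item list is longer than the tuple size, Potato generates tuples automatically. The generation algorithm ensures that:

Every item appears in approximately the same number of tuples
Every pair of items co-occurs in at least one tuple (for reliable relative scoring)
Tuples are balanced so that no item is always shown first or last

Configure tuple generation: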

yaml

annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    items_key: "translations"
    tuple_size: 4
 
    tuple_generation:
      method: balanced_incomplete  # balanced_incomplete or random
      tuples_per_item: 5           # each item appears in ~5 tuples
      seed: 42                     # for reproducibility
      ensure_pair_coverage: true   # every pair co-occurs at least once

For N items, tuple size T, and tuples_per_item = K, Potato generates approximately N * K / T tuples.
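
The arithmetic above can be sketched in a few lines. The helper names are illustrative, and rounding up is an assumption, since the text only says "approximately". The second function checks a simple necessary condition for full pair coverage: an item that appears in K tuples of size T meets at most K * (T - 1) distinct partners, so covering every pair requires K * (T - 1) >= N - 1.

```python
import math


def estimated_tuple_count(n_items: int, tuple_size: int, tuples_per_item: int) -> int:
    """Approximate number of tuples, N * K / T, rounded up (assumed rounding)."""
    return math.ceil(n_items * tuples_per_item / tuple_size)


def pair_coverage_feasible(n_items: int, tuple_size: int, tuples_per_item: int) -> bool:
    """Necessary (not sufficient) condition for every pair to co-occur."""
    return tuples_per_item * (tuple_size - 1) >= n_items - 1


if __name__ == "__main__":
    for n in (16, 200):
        print(
            f"N={n}, T=4, K=5: about {estimated_tuple_count(n, 4, 5)} tuples, "
            f"full pair coverage feasible: {pair_coverage_feasible(n, 4, 5)}"
        )
```

For large item sets, this suggests either raising tuples_per_item or accepting that only a subset of pairs will co-occur directly.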

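Generation Methods

balanced_incomplete (default): uses a balanced incomplete block design to maximize statistical efficiency. Every item appears equally often, and pair co-occurrence is as uniform as possible. Recommended for most use cases.

random: samples tuples at random with replacement. It is faster for very large item sets (N > 10,000) but statistically less efficient. Use it when exact balance is not important.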
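
To illustrate the idea of equal item exposure, here is a minimal sketch of a simple "shuffle and chunk" generator. It is not Potato's balanced_incomplete algorithm and does not construct a true balanced incomplete block design; it only guarantees that every item appears once per round, with a few extra appearances when N is not a multiple of T. The function name is hypothetical, and item ids are assumed to be unique.

```python
import random


def shuffled_round_tuples(item_ids, tuple_size=4, tuples_per_item=5, seed=42):
    """Build tuples by shuffling all items once per round and chunking.

    Each round places every item in exactly one tuple, so after K rounds
    every item has appeared at least K times.
    """
    rng = random.Random(seed)
    items = list(item_ids)
    if len(items) < tuple_size:
        raise ValueError("need at least tuple_size items")

    tuples = []
    for _ in range(tuples_per_item):
        rng.shuffle(items)
        for start in range(0, len(items), tuple_size):
            chunk = items[start:start + tuple_size]
            if len(chunk) < tuple_size:
                # Pad the final short chunk with items not already in it.
                pool = [item for item in items if item not in chunk]
                chunk += rng.sample(pool, tuple_size - len(chunk))
            tuples.append(tuple(chunk))
    return tuples


if __name__ == "__main__":
    generated = shuffled_round_tuples([f"item_{i}" for i in range(8)])
    print(len(generated), "tuples; first:", generated[0])
```

Because the order inside each tuple comes from a fresh shuffle, no item is systematically shown first or last, but pair co-occurrence is left to chance rather than balanced.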

Pre-Generating Tuples with the CLI

For large projects, generate the tuples in advance:

bash

python -m potato.bws generate-tuples \
  --items data/items.jsonl \
  --tuple-size 4 \
  --tuples-per-item 5 \
  --output data/tuples.jsonl \
  --seed 42

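Scoring Methods

After annotation, Potato computes item scores from the BWS judgments using one of three methods.

1. Counting (Default)

The simplest method. Each item's score is the proportion of its appearances in which it was selected as "best" minus the proportion in which it was selected as "worst":

Score(item) = (best_count - worst_count) / total_appearances

Scores range from -1.0 (always worst) to +1.0 (always best).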
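
The counting formula is short enough to sketch directly. The function below is an illustration, not Potato's implementation. It assumes each judgment is a dictionary with a hypothetical `items` field (the ids shown in the tuple) alongside `best` and `worst`, so the tuple contents have to be joined onto the annotation records first.

```python
from collections import Counter


def counting_scores(judgments):
    """Score(item) = (best_count - worst_count) / total_appearances."""
    best, worst, seen = Counter(), Counter(), Counter()
    for judgment in judgments:
        seen.update(judgment["items"])   # every item shown counts as an appearance
        best[judgment["best"]] += 1
        worst[judgment["worst"]] += 1
    return {item: (best[item] - worst[item]) / n for item, n in seen.items()}


if __name__ == "__main__":
    shown = ["sys_a", "sys_b", "sys_c", "sys_d"]
    judgments = [
        {"items": shown, "best": "sys_a", "worst": "sys_d"},
        {"items": shown, "best": "sys_a", "worst": "sys_c"},
    ]
    for item, score in sorted(counting_scores(judgments).items()):
        print(f"{item}: {score:+.2f}")
```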

bash

python -m potato.bws score \
  --config config.yaml \
  --method counting \
  --output scores.csv

2. Bradley-Terry

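Fits a Bradley-Terry model to the pairwise comparisons implied by the BWS judgments. Each "best" selection implies that the best item is preferred over every other item in the tuple, and each "worst" selection implies that every other item is preferred over the worst item.

Bradley-Terry produces scores on a log-odds scale and has better statistical properties than counting, especially with sparse data.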
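
The expansion into implied pairs and the model fit can be sketched as follows. This is an illustrative implementation using the standard minorization-maximization (MM) update for Bradley-Terry, not Potato's own code. The `pseudo` smoothing term, which adds a small virtual win in each direction for every observed pair so that items with no wins keep a finite score, is an assumption of this sketch, and the `max_iter` and `tolerance` parameters merely mirror the CLI flag names.

```python
import math
from collections import defaultdict


def implied_pairs(items, best, worst):
    """Expand one best/worst judgment into (winner, loser) pairs."""
    pairs = [(best, other) for other in items if other != best]
    pairs += [(other, worst) for other in items if other not in (best, worst)]
    return pairs


def fit_bradley_terry(pairs, max_iter=1000, tolerance=1e-6, pseudo=0.1):
    """Return log-strengths, centred on zero, fitted with MM updates."""
    wins = defaultdict(float)
    games = defaultdict(float)            # keyed by unordered item pair
    for winner, loser in pairs:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0
    for pair in list(games):              # smoothing (assumption of this sketch)
        games[pair] += 2 * pseudo
        for item in pair:
            wins[item] += pseudo

    neighbours = defaultdict(list)
    for pair, count in games.items():
        a, b = tuple(pair)
        neighbours[a].append((b, count))
        neighbours[b].append((a, count))

    items = sorted(wins)
    strength = {item: 1.0 for item in items}
    for _ in range(max_iter):
        updated = {
            i: wins[i] / sum(c / (strength[i] + strength[j]) for j, c in neighbours[i])
            for i in items
        }
        # Normalise so the geometric mean is 1 (log-scores sum to zero).
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        updated = {i: v / math.exp(log_mean) for i, v in updated.items()}
        delta = max(abs(math.log(updated[i]) - math.log(strength[i])) for i in items)
        strength = updated
        if delta < tolerance:
            break
    return {item: math.log(value) for item, value in strength.items()}


if __name__ == "__main__":
    shown = ["sys_a", "sys_b", "sys_c", "sys_d"]
    pairs = implied_pairs(shown, best="sys_a", worst="sys_d") * 2
    for item, score in sorted(fit_bradley_terry(pairs).items()):
        print(f"{item}: {score:+.3f}")
```

Note that a single judgment on four items yields five pairs; the relative order of the two middle items stays unknown.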

bash

python -m potato.bws score \
  --config config.yaml \
  --method bradley_terry \
  --max-iter 1000 \
  --tolerance 1e-6 \
  --output scores.csv

3. Plackett-Luce

A generalization of Bradley-Terry that models the full ranking implied by each tuple judgment (best > middle items > worst). Plackett-Luce extracts more information from each annotation than Bradley-Terry.
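
A best/worst judgment leaves the order of the middle items unknown. One common way to handle this under Plackett-Luce is to sum the probability of every full ranking that is consistent with the judgment. The sketch below computes that probability for given item strengths. Whether Potato treats the middle items this way is not stated here, so this is an assumption, and the function names are illustrative.

```python
from itertools import permutations


def ranking_probability(ranking, strength):
    """Plackett-Luce probability of one full ranking (best item first).

    At each step the next item is chosen with probability proportional
    to its strength among the items that remain.
    """
    probability = 1.0
    remaining = sum(strength[item] for item in ranking)
    for item in ranking[:-1]:
        probability *= strength[item] / remaining
        remaining -= strength[item]
    return probability


def best_worst_probability(items, best, worst, strength):
    """Probability of observing (best, worst), summed over middle orderings."""
    middle = [item for item in items if item not in (best, worst)]
    return sum(
        ranking_probability((best, *order, worst), strength)
        for order in permutations(middle)
    )


if __name__ == "__main__":
    shown = ["sys_a", "sys_b", "sys_c", "sys_d"]
    strength = {"sys_a": 4.0, "sys_b": 1.0, "sys_c": 1.0, "sys_d": 1.0}
    p = best_worst_probability(shown, "sys_a", "sys_d", strength)
    print(f"P(best=sys_a, worst=sys_d) = {p:.4f}")
```

Fitting the model then amounts to choosing the strengths that maximize the product of these probabilities over all judgments, which is more expensive than the pairwise Bradley-Terry fit.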

bash

python -m potato.bws score \
  --config config.yaml \
  --method plackett_luce \
  --output scores.csv

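Comparison of Scoring Methods

Method	Speed	Data Efficiency	Handles Sparse Data	Statistical Model
Counting	Fast	Low	Yes	None (descriptive)
Bradley-Terry	Moderate	Moderate	Partially	Pairwise comparisons
Plackett-Luce	Slow	High	Partially	Full rankings

For most projects, Bradley-Terry is the best default choice. Use counting for quick exploratory analysis, and use Plackett-Luce when you need maximum statistical efficiency from a limited number of annotations.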

Scoring Configuration in YAML

You can also configure scoring directly in the project configuration so that scores are computed automatically:

yaml

annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    items_key: "translations"
    tuple_size: 4
 
    scoring:
      method: bradley_terry
      auto_compute: true           # compute scores after each annotation session
      output_file: "output/fluency_scores.csv"
      include_confidence: true     # include confidence intervals
      bootstrap_iterations: 1000   # for confidence interval estimation
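
The include_confidence and bootstrap_iterations options suggest percentile bootstrap intervals. The following is a minimal sketch of that general technique, using counting scores for brevity. It assumes the resampling unit is a single judgment and that each judgment carries a hypothetical `items` field listing the ids shown; Potato's actual resampling scheme is not described here.

```python
import random
from collections import Counter


def counting_scores(judgments):
    """(best_count - worst_count) / total_appearances for each item."""
    best, worst, seen = Counter(), Counter(), Counter()
    for judgment in judgments:
        seen.update(judgment["items"])
        best[judgment["best"]] += 1
        worst[judgment["worst"]] += 1
    return {item: (best[item] - worst[item]) / n for item, n in seen.items()}


def bootstrap_intervals(judgments, iterations=1000, level=0.95, seed=42):
    """Percentile bootstrap interval per item, resampling whole judgments."""
    rng = random.Random(seed)
    judgments = list(judgments)
    samples = {}
    for _ in range(iterations):
        resample = rng.choices(judgments, k=len(judgments))  # with replacement
        for item, score in counting_scores(resample).items():
            samples.setdefault(item, []).append(score)

    tail = (1 - level) / 2
    intervals = {}
    for item, values in samples.items():
        values.sort()
        low = values[int(tail * (len(values) - 1))]
        high = values[int((1 - tail) * (len(values) - 1))]
        intervals[item] = (low, high)
    return intervals


if __name__ == "__main__":
    shown = ["sys_a", "sys_b", "sys_c", "sys_d"]
    data = [{"items": shown, "best": "sys_a", "worst": "sys_d"}] * 6
    data += [{"items": shown, "best": "sys_b", "worst": "sys_c"}] * 4
    for item, (low, high) in sorted(bootstrap_intervals(data).items()):
        print(f"{item}: [{low:+.2f}, {high:+.2f}]")
```

Swapping `counting_scores` for a model-based scorer gives intervals on the log-odds scale instead, at a proportionally higher cost per iteration.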

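Admin Dashboard Integration

The admin dashboard includes a dedicated BWS tab that shows:

Score distribution: a histogram of the current item scores
Annotation progress: number of annotated tuples vs. the total
Per-item coverage: how many times each item has been shown
Inter-annotator consistency: split-half reliability of the BWS scores
Score convergence: a line chart showing how scores stabilize as annotations accumulate

Access the BWS analysis from the command line: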

bash

python -m potato.bws stats --config config.yaml

text

BWS Statistics
==============
Schema: fluency
Items: 200
Tuples: 250 (annotated: 180 / 250)
Annotations: 540 (3 annotators)

Score Summary (Bradley-Terry):
  Mean:   0.02
  Std:    0.43
  Range: -0.91 to +0.87

Top 5 Items:
  sys_d:  0.87 (±0.08)
  sys_a:  0.72 (±0.09)
  sys_f:  0.65 (±0.10)
  sys_b:  0.51 (±0.11)
  sys_k:  0.48 (±0.09)

Split-Half Reliability: r = 0.94
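
Split-half reliability, as reported in the statistics above, is commonly computed by repeatedly splitting the judgments into two random halves, scoring each half separately, and correlating the two score vectors. The following is a minimal sketch of that procedure with counting scores and a Pearson correlation. It is an illustration under those assumptions rather than Potato's exact computation, and it applies no Spearman-Brown correction.

```python
import math
import random
from collections import Counter


def counting_scores(judgments):
    """(best_count - worst_count) / total_appearances for each item."""
    best, worst, seen = Counter(), Counter(), Counter()
    for judgment in judgments:
        seen.update(judgment["items"])
        best[judgment["best"]] += 1
        worst[judgment["worst"]] += 1
    return {item: (best[item] - worst[item]) / n for item, n in seen.items()}


def pearson(xs, ys):
    """Pearson correlation, or None when either side has zero variance."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def split_half_reliability(judgments, trials=100, seed=42):
    """Average correlation between scores from two random halves."""
    rng = random.Random(seed)
    judgments = list(judgments)
    correlations = []
    for _ in range(trials):
        rng.shuffle(judgments)
        half = len(judgments) // 2
        first = counting_scores(judgments[:half])
        second = counting_scores(judgments[half:])
        shared = sorted(set(first) & set(second))   # items scored in both halves
        if len(shared) < 2:
            continue
        r = pearson([first[i] for i in shared], [second[i] for i in shared])
        if r is not None:
            correlations.append(r)
    return sum(correlations) / len(correlations) if correlations else None


if __name__ == "__main__":
    shown = ["sys_a", "sys_b", "sys_c", "sys_d"]
    data = [{"items": shown, "best": "sys_a", "worst": "sys_d"}] * 10
    print(split_half_reliability(data))
```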

Multiple BWS Dimensions

You can run several BWS schemas over the same item set to evaluate different quality dimensions:

yaml

annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    description: "Select BEST and WORST by fluency"
    items_key: "translations"
    tuple_size: 4
    best_label: "Most Fluent"
    worst_label: "Least Fluent"
 
  - annotation_type: best_worst_scaling
    name: adequacy
    description: "Select BEST and WORST by meaning preservation"
    items_key: "translations"
    tuple_size: 4
    best_label: "Most Accurate"
    worst_label: "Least Accurate"

Both schemas share the same tuples (Potato generates one set of tuples per items_key), so annotators see each tuple once but provide two judgments.

Output Format

BWS annotations are saved per tuple:

json

{
  "id": "tuple_001",
  "annotations": {
    "fluency": {
      "best": "sys_d",
      "worst": "sys_c"
    },
    "adequacy": {
      "best": "sys_a",
      "worst": "sys_c"
    }
  },
  "annotator": "user_1",
  "timestamp": "2026-03-01T14:22:00Z"
}
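
For downstream analysis, records in this shape can be regrouped by schema name. The following is a minimal sketch that reads JSON Lines text, assuming one record per line with exactly the fields shown above. The function name `judgments_by_schema` is illustrative and not part of Potato.

```python
import json
from collections import defaultdict


def judgments_by_schema(lines):
    """Group best/worst choices by schema name from JSONL lines."""
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # tolerate blank lines
        record = json.loads(line)
        for schema, choice in record["annotations"].items():
            grouped[schema].append({
                "tuple_id": record["id"],
                "annotator": record["annotator"],
                "best": choice["best"],
                "worst": choice["worst"],
            })
    return dict(grouped)


if __name__ == "__main__":
    sample = json.dumps({
        "id": "tuple_001",
        "annotations": {
            "fluency": {"best": "sys_d", "worst": "sys_c"},
            "adequacy": {"best": "sys_a", "worst": "sys_c"},
        },
        "annotator": "user_1",
        "timestamp": "2026-03-01T14:22:00Z",
    })
    for schema, rows in judgments_by_schema([sample]).items():
        print(schema, rows)
```

Because the records store only the chosen ids, the list of items shown in each tuple has to be looked up from the input data by tuple id before any score that depends on appearance counts can be computed.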

Complete Example

A complete configuration for evaluating machine translation systems:

yaml

task_name: "MT System Ranking (BWS)"
task_dir: "."
 
data_files:
  - "data/mt_tuples.jsonl"
 
item_properties:
  id_key: id
  text_key: source
 
instance_display:
  fields:
    - key: source
      type: text
      display_options:
        label: "Source Sentence"
 
annotation_schemes:
  - annotation_type: best_worst_scaling
    name: overall_quality
    description: "Select the BEST and WORST translation"
    items_key: "translations"
    tuple_size: 4
    best_label: "Best Translation"
    worst_label: "Worst Translation"
    randomize_order: true
    show_item_labels: true
 
    tuple_generation:
      method: balanced_incomplete
      tuples_per_item: 5
      seed: 42
 
    scoring:
      method: bradley_terry
      auto_compute: true
      output_file: "output/quality_scores.csv"
      include_confidence: true
 
output_annotation_dir: "output/"
output_annotation_format: "jsonl"

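Further Reading

Pairwise Comparison -- simpler two-item comparisons
Likert Scale -- an alternative based on direct rating
Multirate -- multidimensional direct rating
Export Formats -- exporting BWS data for analysis

For implementation details, see the source documentation.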