Best-Worst Scaling

Efficient comparative annotation with Best-Worst Scaling, with support for automatic tuple generation and scoring.

New in v2.3.0

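Best-Worst Scaling (BWS), also known as Maximum Difference Scaling (MaxDiff), is a comparative annotation method. Annotators are shown a small set of items (typically 4) and pick the best and the worst according to a given criterion. BWS turns these simple choices into reliable scalar scores, and it needs far fewer annotations than direct rating scales to reach the same statistical power.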

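BWS is particularly well suited when:

Direct numeric ratings suffer from annotator bias (different people use rating scales differently)
You need a reliable ranking of hundreds or thousands of items
The quality dimension is inherently relative (for example, "Which translation is the most fluent?")
You want to maximize the information gained per annotation (each BWS judgment carries more bits of information than a Likert rating)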

Basic Configuration

yaml

annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    description: "Select the BEST and WORST translation by fluency"
 
    # Items to compare
    items_key: "translations"    # key in instance data containing the list of items
 
    # Tuple size (how many items shown at once)
    tuple_size: 4                # typically 4; valid range is 3-8
 
    # Labels for best/worst buttons
    best_label: "Most Fluent"
    worst_label: "Least Fluent"
 
    # Display options
    show_item_labels: true       # show "A", "B", "C", "D" labels
    randomize_order: true        # randomize item order within each tuple
    show_source: false           # optionally show which system produced each item
 
    # Validation
    label_requirement:
      required: true             # must select both best and worst

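Data Format

Each instance in the data file should contain a list of items to compare. Potato generates tuples from this list automatically.

Option 1: All Items in One Instance

If you have a set of items to rank (for example, several translations of one sentence):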

json

{
  "id": "sent_001",
  "source": "The cat sat on the mat.",
  "translations": [
    {"id": "sys_a", "text": "Le chat s'est assis sur le tapis."},
    {"id": "sys_b", "text": "Le chat a assis sur le tapis."},
    {"id": "sys_c", "text": "Le chat etait assis sur le mat."},
    {"id": "sys_d", "text": "Le chat se tenait sur le tapis."}
  ]
}

Option 2: Pre-Generated Tuples

If you want full control over which items appear together, provide pre-generated tuples:

json

{
  "id": "tuple_001",
  "translations": [
    {"id": "sys_a", "text": "Le chat s'est assis sur le tapis."},
    {"id": "sys_b", "text": "Le chat a assis sur le tapis."},
    {"id": "sys_c", "text": "Le chat etait assis sur le mat."},
    {"id": "sys_d", "text": "Le chat se tenait sur le tapis."}
  ]
}

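Automatic Tuple Generation

When the item list is longer than the tuple size, Potato generates tuples automatically. The generation algorithm ensures that:

Each item appears in roughly the same number of tuples
Each pair of items co-occurs in at least one tuple (for reliable relative scoring)
Tuples are balanced, so no item always appears first or last

To configure tuple generation: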

yaml

annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    items_key: "translations"
    tuple_size: 4
 
    tuple_generation:
      method: balanced_incomplete  # balanced_incomplete or random
      tuples_per_item: 5           # each item appears in ~5 tuples
      seed: 42                     # for reproducibility
      ensure_pair_coverage: true   # every pair co-occurs at least once

For a set of N items with tuple size T and tuples_per_item = K, Potato generates approximately N * K / T tuples.
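To make the arithmetic concrete, here is a small Python sketch of the N * K / T estimate. The function names are hypothetical and the rounding up is an assumption, since the text only says "approximately". The second helper computes a simple upper bound on how many item pairs a given tuple budget can cover, which may be useful when judging whether full pair coverage is realistic for a large N.

```python
from math import ceil, comb

def estimate_tuples(n_items: int, tuples_per_item: int, tuple_size: int) -> int:
    """Approximate tuple count N * K / T (rounding up is an assumption)."""
    return ceil(n_items * tuples_per_item / tuple_size)

def max_pair_coverage(n_items: int, n_tuples: int, tuple_size: int) -> float:
    """Upper bound on the fraction of item pairs that can co-occur.

    Assumes the best case in which no pair is ever shown twice.
    """
    total_pairs = comb(n_items, 2)
    pairs_shown = n_tuples * comb(tuple_size, 2)
    return min(1.0, pairs_shown / total_pairs)

n_tuples = estimate_tuples(n_items=200, tuples_per_item=5, tuple_size=4)
print(n_tuples)                                      # 250
print(round(max_pair_coverage(200, n_tuples, 4), 3)) # 0.075
```

With 200 items, 5 tuples per item, and tuples of 4, the budget of 250 tuples can touch at most about 7.5% of all item pairs, so covering every pair would need a much larger tuples_per_item or a smaller item set.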

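Generation Methods

balanced_incomplete (default): uses a balanced incomplete block design to maximize statistical efficiency. Every item appears equally often, and pair co-occurrences are as uniform as possible. Recommended for most use cases.

random: samples tuples at random with replacement. Faster for very large item sets (N > 10,000) but statistically less efficient. Use it when exact balance is not important.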
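For intuition about what balanced generation involves, the following Python sketch uses a simple shuffle-and-chunk scheme. It is a simplified stand-in, not Potato's balanced_incomplete or random implementation: the function name and defaults are hypothetical, and it makes no attempt to even out pair co-occurrences. When the number of items is divisible by the tuple size, each item lands in exactly tuples_per_item tuples; otherwise the leftover items of each pass are skipped for that pass.

```python
import random
from collections import Counter

def generate_tuples(item_ids, tuple_size=4, tuples_per_item=5, seed=42):
    """Shuffle all items once per pass and cut the shuffled list into tuples."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    tuples = []
    for _ in range(tuples_per_item):
        order = list(item_ids)
        rng.shuffle(order)
        # Only full-size chunks are kept; a short remainder is dropped.
        for start in range(0, len(order) - tuple_size + 1, tuple_size):
            tuples.append(order[start:start + tuple_size])
    return tuples

items = [f"item_{i}" for i in range(20)]
tuples = generate_tuples(items)
appearances = Counter(item for t in tuples for item in t)
print(len(tuples))                # 25, i.e. 20 * 5 / 4
print(set(appearances.values()))  # {5}
```

Because every pass is a fresh permutation, an item can never appear twice in the same tuple, and its position within a tuple varies from pass to pass.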

Pre-Generating Tuples via the CLI

For large-scale projects, generate tuples ahead of time:

bash

python -m potato.bws generate-tuples \
  --items data/items.jsonl \
  --tuple-size 4 \
  --tuples-per-item 5 \
  --output data/tuples.jsonl \
  --seed 42

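Scoring Methods

Once annotation is complete, Potato can compute item scores from the BWS judgments using any of three methods.

1. Counting (Default)

The simplest method. An item's score is the proportion of times it was chosen as best minus the proportion of times it was chosen as worst: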

Score(item) = (best_count - worst_count) / total_appearances

Scores range from -1.0 (always worst) to +1.0 (always best).
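The formula above can be sketched in a few lines of Python. This is an illustration rather than Potato's own code; the function name and the (items, best, worst) input shape are assumptions made for the example.

```python
from collections import Counter

def counting_scores(judgments):
    """Compute (best_count - worst_count) / total_appearances per item.

    judgments: iterable of (items_in_tuple, best_id, worst_id).
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, best_id, worst_id in judgments:
        seen.update(items)       # every shown item counts as one appearance
        best[best_id] += 1
        worst[worst_id] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}

judgments = [
    (["sys_a", "sys_b", "sys_c", "sys_d"], "sys_d", "sys_c"),
    (["sys_a", "sys_b", "sys_c", "sys_d"], "sys_a", "sys_c"),
]
scores = counting_scores(judgments)
print(scores)
# {'sys_a': 0.5, 'sys_b': 0.0, 'sys_c': -1.0, 'sys_d': 0.5}
```

Here sys_c is picked as worst in both of its appearances and so receives the minimum score of -1.0.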

bash

python -m potato.bws score \
  --config config.yaml \
  --method counting \
  --output scores.csv

2. Bradley-Terry

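Fits a Bradley-Terry model to the pairwise comparisons implied by the BWS judgments. Each "best" choice implies that the chosen item beats every other item in the tuple; each "worst" choice implies that every other item beats the chosen one.

Bradley-Terry produces scores on a log-odds scale and has better statistical properties than counting, especially when data is sparse.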
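As a rough illustration of this idea, the Python sketch below expands each judgment into its implied (winner, loser) pairs and fits Bradley-Terry strengths with the standard minorization-maximization update. It is not Potato's implementation. The function names are hypothetical, the best-over-worst pair is counted once, and the small `prior` pseudo-count is an assumption added so that items that never win still get a finite score. It expects at least two distinct items.

```python
import math
from collections import defaultdict
from itertools import combinations

def implied_pairs(items, best_id, worst_id):
    """Return the (winner, loser) pairs implied by one BWS judgment."""
    pairs = [(best_id, other) for other in items if other != best_id]
    pairs += [(other, worst_id) for other in items
              if other not in (best_id, worst_id)]
    return pairs

def bradley_terry(judgments, max_iter=1000, tolerance=1e-6, prior=0.1):
    """Fit log-strengths with the MM update; scores are centered on zero."""
    wins = defaultdict(float)  # wins[(i, j)] = number of times i beat j
    items = set()
    for tuple_items, best_id, worst_id in judgments:
        items.update(tuple_items)
        for winner, loser in implied_pairs(tuple_items, best_id, worst_id):
            wins[(winner, loser)] += 1.0
    items = sorted(items)
    for a, b in combinations(items, 2):  # smoothing (assumption of this sketch)
        wins[(a, b)] += prior
        wins[(b, a)] += prior

    strength = {i: 1.0 for i in items}
    for _ in range(max_iter):
        updated = {}
        for i in items:
            total_wins = sum(wins[(i, j)] for j in items if j != i)
            denom = sum((wins[(i, j)] + wins[(j, i)]) / (strength[i] + strength[j])
                        for j in items if j != i)
            updated[i] = total_wins / denom
        # Rescale so the log-strengths sum to zero.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        updated = {i: v / math.exp(log_mean) for i, v in updated.items()}
        change = max(abs(math.log(updated[i]) - math.log(strength[i]))
                     for i in items)
        strength = updated
        if change < tolerance:
            break
    return {i: math.log(v) for i, v in strength.items()}

judgments = [
    (["sys_a", "sys_b", "sys_c", "sys_d"], "sys_d", "sys_c"),
    (["sys_a", "sys_b", "sys_c", "sys_d"], "sys_a", "sys_c"),
]
bt_scores = bradley_terry(judgments)
for item, score in sorted(bt_scores.items(), key=lambda kv: -kv[1]):
    print(f"{item}: {score:+.2f}")
```

On this toy input sys_a and sys_d tie at the top, sys_b sits in the middle, and sys_c comes last, matching the counting scores in order but expressed on a log-odds scale.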

bash

python -m potato.bws score \
  --config config.yaml \
  --method bradley_terry \
  --max-iter 1000 \
  --tolerance 1e-6 \
  --output scores.csv

3. Plackett-Luce

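A generalization of Bradley-Terry that models the full ranking implied by each tuple judgment (best > middle items > worst). Plackett-Luce extracts more information from each annotation than Bradley-Terry.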

bash

python -m potato.bws score \
  --config config.yaml \
  --method plackett_luce \
  --output scores.csv

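Scoring Method Comparison

Method	Speed	Data Efficiency	Handles Sparse Data	Statistical Model
Counting	Fast	Low	Yes	None (descriptive)
Bradley-Terry	Medium	Medium	Moderate	Pairwise comparisons
Plackett-Luce	Slower	High	Moderate	Full ranking

For most projects, Bradley-Terry is the best general-purpose choice. Use counting for quick exploratory analysis, and use Plackett-Luce when you need maximum statistical efficiency from a limited number of annotations.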

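Scoring Configuration in YAML

You can also configure scoring directly in the project configuration so that scores are computed automatically: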

yaml

annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    items_key: "translations"
    tuple_size: 4
 
    scoring:
      method: bradley_terry
      auto_compute: true           # compute scores after each annotation session
      output_file: "output/fluency_scores.csv"
      include_confidence: true     # include confidence intervals
      bootstrap_iterations: 1000   # for confidence interval estimation

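Admin Dashboard Integration

The admin dashboard includes a dedicated BWS tab that shows:

Score distribution: a histogram of current item scores
Annotation progress: tuples annotated versus total
Per-item coverage: how many times each item has been seen
Inter-annotator agreement: split-half reliability of the BWS scores
Score convergence: a line chart showing how scores stabilize as more annotations are collected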
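Split-half reliability can be estimated in several ways, and this page does not say which one the dashboard uses. The Python sketch below shows one common approach under stated assumptions: shuffle the judgments, split them into two halves, score each half with the counting method, and average the Pearson correlation between the two score sets over many random splits. All function names are hypothetical and the demo data is synthetic.

```python
import random
from collections import Counter
from statistics import mean

def counting_scores(judgments):
    """(best_count - worst_count) / total_appearances per item."""
    best, worst, seen = Counter(), Counter(), Counter()
    for items, best_id, worst_id in judgments:
        seen.update(items)
        best[best_id] += 1
        worst[worst_id] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}

def pearson(xs, ys):
    """Pearson correlation, or None when either side has zero variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / (var_x * var_y) ** 0.5

def split_half_reliability(judgments, n_splits=100, seed=42):
    """Average correlation between scores from two random halves."""
    rng = random.Random(seed)
    pool = list(judgments)
    correlations = []
    for _ in range(n_splits):
        rng.shuffle(pool)
        half = len(pool) // 2
        first = counting_scores(pool[:half])
        second = counting_scores(pool[half:])
        shared = sorted(set(first) & set(second))  # items scored in both halves
        r = pearson([first[i] for i in shared], [second[i] for i in shared])
        if r is not None:
            correlations.append(r)
    return mean(correlations)

# Synthetic, noise-free judgments: an item's quality equals its index.
demo_rng = random.Random(0)
demo_judgments = []
for _ in range(60):
    shown = demo_rng.sample(range(8), 4)
    demo_judgments.append((shown, max(shown), min(shown)))

reliability = split_half_reliability(demo_judgments)
print(f"Split-half reliability: r = {reliability:.2f}")
```

Values close to 1 indicate that two independent halves of the annotations rank the items almost identically.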

Access BWS analytics from the command line:

bash

python -m potato.bws stats --config config.yaml

text

BWS Statistics
==============
Schema: fluency
Items: 200
Tuples: 250 (annotated: 180 / 250)
Annotations: 540 (3 annotators)

Score Summary (Bradley-Terry):
  Mean:   0.02
  Std:    0.43
  Range: -0.91 to +0.87

Top 5 Items:
  sys_d:  0.87 (±0.08)
  sys_a:  0.72 (±0.09)
  sys_f:  0.65 (±0.10)
  sys_b:  0.51 (±0.11)
  sys_k:  0.48 (±0.09)

Split-Half Reliability: r = 0.94

Multi-Dimensional BWS

You can run multiple BWS schemes over the same set of items to evaluate different quality dimensions:

yaml

annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    description: "Select BEST and WORST by fluency"
    items_key: "translations"
    tuple_size: 4
    best_label: "Most Fluent"
    worst_label: "Least Fluent"
 
  - annotation_type: best_worst_scaling
    name: adequacy
    description: "Select BEST and WORST by meaning preservation"
    items_key: "translations"
    tuple_size: 4
    best_label: "Most Accurate"
    worst_label: "Least Accurate"

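Both schemes share the same tuples (Potato generates one set of tuples per items_key), so annotators see each tuple once but provide two judgments.

Output Format

BWS annotations are saved per tuple: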

json

{
  "id": "tuple_001",
  "annotations": {
    "fluency": {
      "best": "sys_d",
      "worst": "sys_c"
    },
    "adequacy": {
      "best": "sys_a",
      "worst": "sys_c"
    }
  },
  "annotator": "user_1",
  "timestamp": "2026-03-01T14:22:00Z"
}
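Records in this shape are straightforward to post-process. The Python sketch below tallies best and worst selections per scheme. It assumes one JSON record per line (the jsonl output format) and uses a hypothetical helper name; the output file name is not specified here, so the function takes any iterable of lines, such as an open file.

```python
import json
from collections import Counter, defaultdict

def tally_choices(lines):
    """Count best/worst selections per scheme from JSONL annotation records."""
    best = defaultdict(Counter)
    worst = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        for scheme, choice in record["annotations"].items():
            best[scheme][choice["best"]] += 1
            worst[scheme][choice["worst"]] += 1
    return best, worst

sample_lines = [
    '{"id": "tuple_001", "annotations": {'
    '"fluency": {"best": "sys_d", "worst": "sys_c"}, '
    '"adequacy": {"best": "sys_a", "worst": "sys_c"}}, '
    '"annotator": "user_1", "timestamp": "2026-03-01T14:22:00Z"}'
]
best, worst = tally_choices(sample_lines)
print(best["fluency"]["sys_d"], worst["adequacy"]["sys_c"])  # 1 1
```

Note that a record stores only the chosen item ids, so computing counting scores also requires the tuple contents from the input data to know how often each item appeared.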

Complete Example

A complete configuration for evaluating machine translation systems:

yaml

task_name: "MT System Ranking (BWS)"
task_dir: "."
 
data_files:
  - "data/mt_tuples.jsonl"
 
item_properties:
  id_key: id
  text_key: source
 
instance_display:
  fields:
    - key: source
      type: text
      display_options:
        label: "Source Sentence"
 
annotation_schemes:
  - annotation_type: best_worst_scaling
    name: overall_quality
    description: "Select the BEST and WORST translation"
    items_key: "translations"
    tuple_size: 4
    best_label: "Best Translation"
    worst_label: "Worst Translation"
    randomize_order: true
    show_item_labels: true
 
    tuple_generation:
      method: balanced_incomplete
      tuples_per_item: 5
      seed: 42
 
    scoring:
      method: bradley_terry
      auto_compute: true
      output_file: "output/quality_scores.csv"
      include_confidence: true
 
output_annotation_dir: "output/"
output_annotation_format: "jsonl"
