Best-Worst Scaling

Efficient comparative annotation using Best-Worst Scaling with automatic tuple generation and scoring.

New in v2.3.0

Best-Worst Scaling (BWS), also known as Maximum Difference Scaling (MaxDiff), is a comparative annotation method in which annotators are shown a tuple of items (typically 4) and asked to select the best and the worst item according to some criterion. BWS produces reliable scalar scores from simple best/worst choices and requires far fewer annotations than direct rating scales to achieve the same statistical power.

BWS is especially useful when:

Direct numerical ratings suffer from annotator bias (scale use varies between people)
You need a reliable ranking of hundreds or thousands of items
The quality dimension is inherently relative (for example, "Which translation is the most fluent?")
You want maximum information per annotation (each BWS judgment provides more bits than a Likert rating)
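
The "more bits per annotation" point can be made concrete with a little counting. The sketch below is illustrative only and is not part of Potato: the helper names are hypothetical, and the bit figures are upper bounds that assume every answer is equally likely.

```python
import math

def bws_outcomes(tuple_size: int) -> int:
    """Number of distinct (best, worst) answers for one tuple."""
    return tuple_size * (tuple_size - 1)

def pairs_resolved(tuple_size: int) -> tuple:
    """Return (pairwise orderings implied by one answer, total pairs in the tuple)."""
    total = math.comb(tuple_size, 2)
    # The best item beats the other T-1 items; the worst item loses to
    # the remaining T-2 items (best-vs-worst is already counted).
    implied = (tuple_size - 1) + (tuple_size - 2)
    return implied, total

bws_bits = math.log2(bws_outcomes(4))   # one BWS answer on a 4-tuple
likert_bits = math.log2(5)              # one rating on a 5-point Likert scale

print(f"BWS: {bws_bits:.2f} bits, Likert: {likert_bits:.2f} bits")
print(pairs_resolved(4))  # (5, 6)
```

With a tuple size of 4, a single answer fixes 5 of the 6 pairwise orderings inside the tuple, which is where much of the efficiency comes from.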

Basic Configuration

yaml

annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    description: "Select the BEST and WORST translation by fluency"
 
    # Items to compare
    items_key: "translations"    # key in instance data containing the list of items
 
    # Tuple size (how many items shown at once)
    tuple_size: 4                # typically 4; valid range is 3-8
 
    # Labels for best/worst buttons
    best_label: "Most Fluent"
    worst_label: "Least Fluent"
 
    # Display options
    show_item_labels: true       # show "A", "B", "C", "D" labels
    randomize_order: true        # randomize item order within each tuple
    show_source: false           # optionally show which system produced each item
 
    # Validation
    label_requirement:
      required: true             # must select both best and worst

Data Format

Each instance in your data file must contain a list of items to compare. Potato automatically builds tuples from this list.

Option 1: All items in one instance

If you have a single set of items to rank (for example, translations of one sentence):

json

{
  "id": "sent_001",
  "source": "The cat sat on the mat.",
  "translations": [
    {"id": "sys_a", "text": "Le chat s'est assis sur le tapis."},
    {"id": "sys_b", "text": "Le chat a assis sur le tapis."},
    {"id": "sys_c", "text": "Le chat etait assis sur le mat."},
    {"id": "sys_d", "text": "Le chat se tenait sur le tapis."}
  ]
}
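
A quick pre-flight check can catch malformed instances before they are loaded. The sketch below is not part of Potato: validate_instance is a hypothetical helper, and it assumes each item is an object with "id" and "text" fields, as in the example above.

```python
import json

def validate_instance(instance, items_key="translations", tuple_size=4):
    """Return a list of problems found in one instance (empty list means OK)."""
    items = instance.get(items_key)
    if not isinstance(items, list):
        return [f"'{items_key}' is missing or is not a list"]
    problems = []
    if len(items) < tuple_size:
        problems.append(f"only {len(items)} items, need at least {tuple_size}")
    if not all(isinstance(i, dict) and "id" in i and "text" in i for i in items):
        problems.append("every item needs 'id' and 'text' fields")
        return problems
    ids = [i["id"] for i in items]
    if len(set(ids)) != len(ids):
        problems.append("item ids must be unique within an instance")
    return problems

line = ('{"id": "sent_001", "source": "The cat sat on the mat.", "translations": ['
        '{"id": "sys_a", "text": "Le chat s\'est assis sur le tapis."},'
        '{"id": "sys_b", "text": "Le chat a assis sur le tapis."},'
        '{"id": "sys_c", "text": "Le chat etait assis sur le mat."},'
        '{"id": "sys_d", "text": "Le chat se tenait sur le tapis."}]}')
print(validate_instance(json.loads(line)))  # []
```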

Option 2: Pre-generated tuples

If you want full control over which items appear together, provide pre-generated tuples:

json

{
  "id": "tuple_001",
  "translations": [
    {"id": "sys_a", "text": "Le chat s'est assis sur le tapis."},
    {"id": "sys_b", "text": "Le chat a assis sur le tapis."},
    {"id": "sys_c", "text": "Le chat etait assis sur le mat."},
    {"id": "sys_d", "text": "Le chat se tenait sur le tapis."}
  ]
}

Automatic Tuple Generation

When your items list is larger than the tuple size, Potato generates tuples automatically. The generation algorithm ensures that:

Each item appears in approximately the same number of tuples
Every pair of items co-occurs in at least one tuple (for reliable relative scoring)
Tuples are balanced so that no item always appears first or last

Configure tuple generation:

yaml

annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    items_key: "translations"
    tuple_size: 4
 
    tuple_generation:
      method: balanced_incomplete  # balanced_incomplete or random
      tuples_per_item: 5           # each item appears in ~5 tuples
      seed: 42                     # for reproducibility
      ensure_pair_coverage: true   # every pair co-occurs at least once

For a set of N items with tuple size T and tuples_per_item = K, Potato generates approximately N * K / T tuples in total.
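
The formula is easy to check by hand when planning annotation budgets. The sketch below is a planning aid, not Potato code: the function names are hypothetical, and rounding up is an assumption (the text only says "approximately"). It also computes a simple bound that follows from the design: an item that appears in K tuples of size T can meet at most K * (T - 1) distinct partners, so full pair coverage is only possible when that number is at least N - 1.

```python
import math

def expected_tuples(n_items, tuple_size, tuples_per_item):
    """Approximate total tuple count, N * K / T, rounded up (assumed rounding)."""
    return math.ceil(n_items * tuples_per_item / tuple_size)

def max_partners(tuple_size, tuples_per_item):
    """Upper bound on how many distinct items one item can co-occur with."""
    return tuples_per_item * (tuple_size - 1)

def pair_coverage_possible(n_items, tuple_size, tuples_per_item):
    """True if every pair could co-occur at least once, by the bound above."""
    return max_partners(tuple_size, tuples_per_item) >= n_items - 1

print(expected_tuples(200, 4, 5))         # 250
print(max_partners(4, 5))                 # 15
print(pair_coverage_possible(200, 4, 5))  # False
print(pair_coverage_possible(16, 4, 5))   # True
```

For large item sets, raise tuples_per_item or accept partial pair coverage. How Potato behaves when ensure_pair_coverage cannot be satisfied is not described here.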

Generation Methods

balanced_incomplete (default): Uses a balanced incomplete block design to maximize statistical efficiency. Each item appears equally often, and pair co-occurrence is as uniform as possible. Recommended for most use cases.

random: Samples tuples at random with replacement. Faster for very large item sets (N > 10,000) but statistically less efficient. Use it when exact balance is not important.
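
To give a feel for what balanced generation involves, here is a minimal shuffle-and-chunk sketch. It is not Potato's algorithm and not a true balanced incomplete block design: it only guarantees that each item appears at least tuples_per_item times with no repeats inside a tuple, and it does not enforce pair coverage. The function name and signature are hypothetical.

```python
import random

def generate_tuples(item_ids, tuple_size=4, tuples_per_item=5, seed=42):
    """Each round shuffles all items and cuts the list into tuples,
    so every item appears exactly once per round (plus rare top-ups)."""
    if len(item_ids) < tuple_size:
        raise ValueError("need at least tuple_size items")
    rng = random.Random(seed)
    tuples = []
    for _ in range(tuples_per_item):
        order = list(item_ids)
        rng.shuffle(order)
        for start in range(0, len(order), tuple_size):
            chunk = order[start:start + tuple_size]
            if len(chunk) < tuple_size:
                # Short final chunk: top up with items used earlier in this
                # round, which are guaranteed not to be in the chunk already.
                chunk += rng.sample(order[:start], tuple_size - len(chunk))
            tuples.append(chunk)
    return tuples

tuples = generate_tuples([f"item_{i}" for i in range(10)])
print(len(tuples))  # 15
print(tuples[0])
```

Because each round is a fresh shuffle, item positions vary from tuple to tuple, and a fixed seed makes the output reproducible.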

Pre-generating Tuples via the CLI

For large-scale projects, generate tuples ahead of time:

bash

python -m potato.bws generate-tuples \
  --items data/items.jsonl \
  --tuple-size 4 \
  --tuples-per-item 5 \
  --output data/tuples.jsonl \
  --seed 42

Scoring Methods

After annotation, Potato computes item scores from the BWS judgments using one of three methods.

1. Counting (default)

The simplest method. Each item's score is the proportion of times it was chosen as "best" minus the proportion of times it was chosen as "worst":

Score(item) = (best_count - worst_count) / total_appearances

Scores range from -1.0 (always worst) to +1.0 (always best).
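
The counting formula is small enough to reproduce directly. The sketch below is a standalone illustration rather than Potato's implementation. It assumes each judgment is available as a hypothetical (items, best, worst) triple, where items lists the ids shown in the tuple.

```python
from collections import Counter

def counting_scores(judgments):
    """Score(item) = (best_count - worst_count) / total_appearances."""
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in judgments:
        seen.update(items)   # every item shown counts as one appearance
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}

judgments = [
    (["a", "b", "c", "d"], "a", "d"),
    (["a", "b", "c", "e"], "a", "c"),
    (["b", "c", "d", "e"], "b", "d"),
]
scores = counting_scores(judgments)
for item in sorted(scores, key=scores.get, reverse=True):
    print(item, round(scores[item], 3))
```

Here "a" is picked as best in both of its appearances and scores 1.0, "d" is picked as worst in both and scores -1.0, and "e" is never picked and scores 0.0.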

bash

python -m potato.bws score \
  --config config.yaml \
  --method counting \
  --output scores.csv

2. Bradley-Terry

Fits a Bradley-Terry model to the pairwise comparisons implied by the BWS judgments. Each "best" selection implies that the best item beats every other item in the tuple; each "worst" selection implies that every other item beats the worst item.

Bradley-Terry produces scores on a log-odds scale and has better statistical properties than counting, especially with sparse data.
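
The expansion into pairwise comparisons and the model fit can be sketched as follows. This is an illustration under stated assumptions, not Potato's code: it uses the standard minorization-maximization (MM) update for Bradley-Terry, and it adds a small prior (virtual wins and losses against a dummy item) so that items that never win still get a finite score. The prior, the function names, and the (items, best, worst) input shape are all assumptions.

```python
import math
from collections import defaultdict

def bws_to_pairs(items, best, worst):
    """(winner, loser) pairs implied by one BWS judgment."""
    pairs = [(best, x) for x in items if x != best]
    pairs += [(x, worst) for x in items if x not in (best, worst)]
    return pairs

def bradley_terry(judgments, max_iter=1000, tolerance=1e-6, prior=0.1):
    wins = defaultdict(float)    # total wins per item
    games = defaultdict(float)   # comparisons per unordered pair
    items = set()
    for shown, best, worst in judgments:
        items.update(shown)
        for winner, loser in bws_to_pairs(shown, best, worst):
            wins[winner] += 1.0
            games[frozenset((winner, loser))] += 1.0
    dummy = object()             # regularizer: ties every item to a dummy
    for i in items:
        wins[i] += prior
        wins[dummy] += prior
        games[frozenset((i, dummy))] += 2 * prior
    neighbours = defaultdict(list)
    for pair, count in games.items():
        a, b = tuple(pair)
        neighbours[a].append((b, count))
        neighbours[b].append((a, count))
    everyone = list(items) + [dummy]
    p = {i: 1.0 for i in everyone}
    for _ in range(max_iter):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        new_p = {i: wins[i] / sum(c / (p[i] + p[j]) for j, c in neighbours[i])
                 for i in everyone}
        scale = math.exp(sum(math.log(v) for v in new_p.values()) / len(new_p))
        new_p = {i: v / scale for i, v in new_p.items()}
        delta = max(abs(math.log(new_p[i] / p[i])) for i in everyone)
        p = new_p
        if delta < tolerance:
            break
    mean = sum(math.log(p[i]) for i in items) / len(items)
    return {i: math.log(p[i]) - mean for i in items}  # centered log-strengths

judgments = [
    (["a", "b", "c", "d"], "a", "d"),
    (["a", "b", "c", "e"], "a", "c"),
    (["b", "c", "d", "e"], "b", "d"),
]
bt_scores = bradley_terry(judgments)
print(sorted(bt_scores, key=bt_scores.get, reverse=True))
```

The returned values are log-strengths centered at zero, so the difference between two items' scores is the modeled log-odds that one beats the other.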

bash

python -m potato.bws score \
  --config config.yaml \
  --method bradley_terry \
  --max-iter 1000 \
  --tolerance 1e-6 \
  --output scores.csv

3. Plackett-Luce

A generalization of Bradley-Terry that models the full ranking implied by each tuple judgment (best > middle items > worst). Plackett-Luce extracts more information from each annotation than Bradley-Terry.

bash

python -m potato.bws score \
  --config config.yaml \
  --method plackett_luce \
  --output scores.csv

Comparison of Scoring Methods

Method	Speed	Data efficiency	Handles sparse data	Statistical model
Counting	Fast	Low	Yes	None (descriptive)
Bradley-Terry	Medium	Medium	Moderate	Pairwise comparison
Plackett-Luce	Slow	High	Moderate	Full ranking

For most projects, Bradley-Terry is the best default choice. Use Counting for quick exploratory analysis, and Plackett-Luce when you need maximum statistical efficiency from a limited number of annotations.

Scoring Configuration in YAML

You can configure scoring directly in the project config:

yaml

annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    items_key: "translations"
    tuple_size: 4
 
    scoring:
      method: bradley_terry
      auto_compute: true           # compute scores after each annotation session
      output_file: "output/fluency_scores.csv"
      include_confidence: true     # include confidence intervals
      bootstrap_iterations: 1000   # for confidence interval estimation
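
The include_confidence and bootstrap_iterations options suggest bootstrap confidence intervals. One common way to compute them is a percentile bootstrap over judgments, sketched below with counting scores for brevity. Whether Potato resamples judgments, annotators, or tuples is not stated here, so treat the resampling unit, the helper names, and the (items, best, worst) input shape as assumptions.

```python
import random
from collections import Counter

def counting_scores(judgments):
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in judgments:
        seen.update(items)
        best[b] += 1
        worst[w] += 1
    return {i: (best[i] - worst[i]) / seen[i] for i in seen}

def bootstrap_intervals(judgments, iterations=1000, alpha=0.05, seed=42):
    """Percentile bootstrap: resample judgments with replacement,
    rescore, and take the alpha/2 and 1 - alpha/2 quantiles per item."""
    rng = random.Random(seed)
    judgments = list(judgments)
    samples = {}
    for _ in range(iterations):
        resample = rng.choices(judgments, k=len(judgments))
        for item, score in counting_scores(resample).items():
            samples.setdefault(item, []).append(score)
    intervals = {}
    for item, values in samples.items():
        values.sort()
        low = values[int((alpha / 2) * (len(values) - 1))]
        high = values[int((1 - alpha / 2) * (len(values) - 1))]
        intervals[item] = (low, high)
    return intervals

judgments = [
    (["a", "b", "c", "d"], "a", "d"),
    (["a", "b", "c", "e"], "a", "c"),
    (["b", "c", "d", "e"], "b", "d"),
]
intervals = bootstrap_intervals(judgments, iterations=200)
for item in sorted(intervals):
    print(item, intervals[item])
```

With only three judgments the intervals are very wide for most items. In practice the interval width is a useful signal for deciding whether more annotations are needed.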

Admin Dashboard Integration

The admin dashboard includes a dedicated BWS tab that shows:

Score distribution: a histogram of current item scores
Annotation progress: how many tuples have been annotated out of the total
Per-item coverage: how many times each item has been seen
Inter-annotator consistency: split-half reliability of the BWS scores
Score convergence: a line chart showing how scores stabilize as more annotations are collected

Access BWS analytics from the command line:

bash

python -m potato.bws stats --config config.yaml

text

BWS Statistics
==============
Schema: fluency
Items: 200
Tuples: 250 (annotated: 180 / 250)
Annotations: 540 (3 annotators)

Score Summary (Bradley-Terry):
  Mean:   0.02
  Std:    0.43
  Range: -0.91 to +0.87

Top 5 Items:
  sys_d:  0.87 (±0.08)
  sys_a:  0.72 (±0.09)
  sys_f:  0.65 (±0.10)
  sys_b:  0.51 (±0.11)
  sys_k:  0.48 (±0.09)

Split-Half Reliability: r = 0.94
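
Split-half reliability, as reported in the last line of the output, is typically computed by splitting the judgments into two halves, scoring each half independently, and correlating the two sets of item scores. The sketch below does this with counting scores and Pearson correlation, averaged over random splits. It is an assumption about the general technique, not a description of how Potato computes the figure, and the synthetic data is for demonstration only.

```python
import math
import random
from collections import Counter

def counting_scores(judgments):
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in judgments:
        seen.update(items)
        best[b] += 1
        worst[w] += 1
    return {i: (best[i] - worst[i]) / seen[i] for i in seen}

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None  # correlation is undefined for constant scores
    return cov / math.sqrt(vx * vy)

def split_half_reliability(judgments, trials=100, seed=42):
    """Average Pearson r between scores from two random halves."""
    rng = random.Random(seed)
    judgments = list(judgments)
    rs = []
    for _ in range(trials):
        rng.shuffle(judgments)
        mid = len(judgments) // 2
        a = counting_scores(judgments[:mid])
        b = counting_scores(judgments[mid:])
        common = sorted(set(a) & set(b))
        if len(common) >= 2:
            r = pearson([a[i] for i in common], [b[i] for i in common])
            if r is not None:
                rs.append(r)
    return sum(rs) / len(rs) if rs else None

# Synthetic demo: items 0..9, an idealized annotator always picks max and min.
demo_rng = random.Random(0)
data = []
for _ in range(200):
    shown = demo_rng.sample(range(10), 4)
    data.append((shown, max(shown), min(shown)))
r = split_half_reliability(data)
print(f"r = {r:.2f}")
```

Values close to 1.0 indicate that two independent halves of the annotations rank the items almost identically.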

Multiple BWS Dimensions

You can run multiple BWS schemes over the same item set to evaluate different quality dimensions:

yaml

annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    description: "Select BEST and WORST by fluency"
    items_key: "translations"
    tuple_size: 4
    best_label: "Most Fluent"
    worst_label: "Least Fluent"
 
  - annotation_type: best_worst_scaling
    name: adequacy
    description: "Select BEST and WORST by meaning preservation"
    items_key: "translations"
    tuple_size: 4
    best_label: "Most Accurate"
    worst_label: "Least Accurate"

Both schemes share the same tuples (Potato generates one set of tuples per items_key), so annotators see each tuple once but make two judgments.

Output Format

BWS annotations are saved per tuple:

json

{
  "id": "tuple_001",
  "annotations": {
    "fluency": {
      "best": "sys_d",
      "worst": "sys_c"
    },
    "adequacy": {
      "best": "sys_a",
      "worst": "sys_c"
    }
  },
  "annotator": "user_1",
  "timestamp": "2026-03-01T14:22:00Z"
}
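
For custom analysis, the saved records can be joined back to the tuple file to recover which items were shown. The sketch below assumes one JSON object per line in both files, with the field names from the examples in this page; collect_judgments is a hypothetical helper, not a Potato function.

```python
import json
from collections import defaultdict

def collect_judgments(annotation_lines, tuple_lines, items_key="translations"):
    """Group (items, best, worst) judgments by scheme name."""
    shown = {}
    for line in tuple_lines:
        record = json.loads(line)
        shown[record["id"]] = [item["id"] for item in record[items_key]]
    by_scheme = defaultdict(list)
    for line in annotation_lines:
        record = json.loads(line)
        items = shown[record["id"]]
        for scheme, choice in record["annotations"].items():
            by_scheme[scheme].append((items, choice["best"], choice["worst"]))
    return dict(by_scheme)

tuple_lines = [
    '{"id": "tuple_001", "translations": ['
    '{"id": "sys_a", "text": "..."}, {"id": "sys_b", "text": "..."}, '
    '{"id": "sys_c", "text": "..."}, {"id": "sys_d", "text": "..."}]}'
]
annotation_lines = [
    '{"id": "tuple_001", "annotations": {'
    '"fluency": {"best": "sys_d", "worst": "sys_c"}, '
    '"adequacy": {"best": "sys_a", "worst": "sys_c"}}, '
    '"annotator": "user_1", "timestamp": "2026-03-01T14:22:00Z"}'
]
result = collect_judgments(annotation_lines, tuple_lines)
print(result["fluency"])
print(result["adequacy"])
```

Each scheme's list can then be passed to whatever scoring routine you prefer, one dimension at a time.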

Complete Example

A complete configuration for evaluating machine translation systems:

yaml

task_name: "MT System Ranking (BWS)"
task_dir: "."
 
data_files:
  - "data/mt_tuples.jsonl"
 
item_properties:
  id_key: id
  text_key: source
 
instance_display:
  fields:
    - key: source
      type: text
      display_options:
        label: "Source Sentence"
 
annotation_schemes:
  - annotation_type: best_worst_scaling
    name: overall_quality
    description: "Select the BEST and WORST translation"
    items_key: "translations"
    tuple_size: 4
    best_label: "Best Translation"
    worst_label: "Worst Translation"
    randomize_order: true
    show_item_labels: true
 
    tuple_generation:
      method: balanced_incomplete
      tuples_per_item: 5
      seed: 42
 
    scoring:
      method: bradley_terry
      auto_compute: true
      output_file: "output/quality_scores.csv"
      include_confidence: true
 
output_annotation_dir: "output/"
output_annotation_format: "jsonl"

Further Reading

Pairwise Comparison -- simple two-item comparisons
Likert Scales -- direct rating options
Multirate -- multi-dimensional direct rating
Export Formats -- export BWS data for analysis

For implementation details, see the source documentation.