UltraFeedback Multi-Aspect Rating

Multi-aspect quality rating of AI model responses based on the UltraFeedback dataset (Cui et al., ICML 2024). Annotators rate responses on helpfulness, honesty, instruction following, and truthfulness, then provide a Likert agreement rating and overall feedback.

Configuration file: config.yaml

# UltraFeedback Multi-Aspect Rating
# Based on Cui et al., ICML 2024
# Paper: https://arxiv.org/abs/2310.01377
# Dataset: https://huggingface.co/datasets/openbmb/UltraFeedback
#
# Multi-aspect quality rating of AI model responses. Annotators evaluate
# each response on four key quality dimensions using a 5-point scale,
# provide an agreement rating on overall quality, and write free-text
# feedback. This data is used to train reward models for RLHF.
#
# Quality Dimensions (rated 1-5 each):
# - Helpfulness: Does the response help the user accomplish their goal?
# - Honesty: Is the response truthful and transparent about uncertainty?
# - Instruction Following: Does the response follow the given instructions?
# - Truthfulness: Is the factual content accurate and verifiable?
#
# Annotation Guidelines:
# 1. Read the instruction carefully
# 2. Read the response thoroughly
# 3. Rate each quality dimension independently on the 5-point scale
# 4. Indicate your overall agreement with response quality
# 5. Provide any additional feedback

annotation_task_name: "UltraFeedback Multi-Aspect Rating"
task_dir: "."

data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

port: 8000
server_name: localhost

annotation_schemes:
  # Step 1: Multi-aspect quality rating
  - annotation_type: multirate
    name: quality_dimensions
    description: "Rate the response on each quality dimension."
    labels:
      - "1 - Very Poor"
      - "2 - Poor"
      - "3 - Average"
      - "4 - Good"
      - "5 - Excellent"
    options:
      - "Helpfulness"
      - "Honesty"
      - "Instruction Following"
      - "Truthfulness"

  # Step 2: Overall agreement
  - annotation_type: likert
    name: overall_quality
    description: "Overall, this response is of high quality."
    min_label: "Strongly Disagree"
    max_label: "Strongly Agree"
    size: 5

  # Step 3: Free-text feedback
  - annotation_type: text
    name: feedback
    description: "Provide any additional feedback on the response quality."
    textarea: true
    required: false
    placeholder: "What are the strengths and weaknesses of this response?"

annotation_instructions: |
  You will evaluate AI model responses to user instructions on multiple quality dimensions.

  For each item:
  1. Read the user instruction carefully.
  2. Read the AI response thoroughly.
  3. Rate the response on each of the four quality dimensions (1-5 scale):
     - Helpfulness: Does it help the user accomplish their goal?
     - Honesty: Is it truthful and transparent about limitations?
     - Instruction Following: Does it follow the given instructions precisely?
     - Truthfulness: Is the factual content accurate?
  4. Indicate your overall agreement that the response is high quality.
  5. Optionally provide free-text feedback.

  Rating Scale:
  - 1 (Very Poor): Fails completely on this dimension
  - 2 (Poor): Significant deficiencies
  - 3 (Average): Acceptable but with notable room for improvement
  - 4 (Good): Strong performance with minor issues
  - 5 (Excellent): Outstanding performance on this dimension

html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Instruction:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="background: #f0fdf4; border: 1px solid #bbf7d0; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #166534;">Response:</strong>
      <div style="font-size: 15px; line-height: 1.7; margin: 8px 0 0 0; white-space: pre-wrap;">{{response}}</div>
    </div>
  </div>

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
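
Because annotation_per_instance is set to 2, each item ends up with two sets of multirate labels that usually need to be combined before reward-model training. The following is a minimal sketch of one way to do that, by averaging each dimension per item. The record shape used here (an "id" plus a "quality_dimensions" mapping from option name to label string) is a hypothetical layout chosen for illustration, not a documented Potato output format, and the ratings are made-up example values; adapt the loading step to whatever appears in annotation_output/.

```python
from collections import defaultdict
from statistics import mean

# The four options listed under the quality_dimensions scheme in config.yaml.
DIMENSIONS = ("Helpfulness", "Honesty", "Instruction Following", "Truthfulness")


def label_to_score(label):
    """Convert a label such as '4 - Good' to the integer 4."""
    return int(label.split(" - ", 1)[0])


def mean_scores_per_item(records):
    """Average each dimension over all annotators who rated the same item.

    `records` is assumed (hypothetically) to be a list of dicts shaped like
    {"id": ..., "quality_dimensions": {"Helpfulness": "4 - Good", ...}}.
    """
    collected = defaultdict(lambda: defaultdict(list))
    for record in records:
        for dimension in DIMENSIONS:
            label = record["quality_dimensions"].get(dimension)
            if label is not None:  # skipped dimensions are simply left out
                collected[record["id"]][dimension].append(label_to_score(label))
    return {
        item_id: {dim: mean(scores) for dim, scores in by_dimension.items()}
        for item_id, by_dimension in collected.items()
    }


# Made-up ratings from two annotators for the same item, for illustration only.
records = [
    {"id": "uf_001", "quality_dimensions": {
        "Helpfulness": "5 - Excellent", "Honesty": "4 - Good",
        "Instruction Following": "5 - Excellent", "Truthfulness": "4 - Good"}},
    {"id": "uf_001", "quality_dimensions": {
        "Helpfulness": "4 - Good", "Honesty": "4 - Good",
        "Instruction Following": "5 - Excellent", "Truthfulness": "3 - Average"}},
]

averaged = mean_scores_per_item(records)
print(averaged["uf_001"])
```

Averaging is only one option; keeping the individual ratings and measuring inter-annotator agreement first may be preferable when the two annotators diverge widely.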

Sample data: sample-data.json

[
  {
    "id": "uf_001",
    "text": "Explain the concept of blockchain technology in a way that a 12-year-old could understand.",
    "response": "Imagine you and your classmates have a shared notebook where everyone writes down who gave lunch money to whom. But here's the cool part: everyone has their own copy of this notebook, and whenever someone writes something new, all the copies update at the same time. Nobody can erase or change old entries because everyone else would notice their copy doesn't match. That's basically blockchain! It's a digital notebook (ledger) shared across many computers, where every transaction is recorded permanently and can't be secretly changed."
  },
  {
    "id": "uf_002",
    "text": "Write a Python function that reverses a linked list iteratively. Include comments explaining each step.",
    "response": "Here's a Python implementation of iterative linked list reversal:\n\n```python\nclass ListNode:\n    def __init__(self, val=0, next=None):\n        self.val = val\n        self.next = next\n\ndef reverse_linked_list(head):\n    # Initialize three pointers\n    prev = None      # Will become the new head\n    current = head    # Start at the original head\n    \n    while current is not None:\n        # Save the next node before we change the pointer\n        next_node = current.next\n        \n        # Reverse the pointer: instead of pointing forward,\n        # point to the previous node\n        current.next = prev\n        \n        # Move prev and current one step forward\n        prev = current\n        current = next_node\n    \n    # prev is now pointing to the last node,\n    # which is the new head of our reversed list\n    return prev\n```\n\nTime complexity: O(n) - we visit each node exactly once\nSpace complexity: O(1) - we only use three pointers regardless of list size"
  }
]

// ... and 8 more items
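
Each item needs the fields named in item_properties ("id" and "text") plus a "response" field, since html_layout renders {{response}}. Below is a small sketch of a pre-flight check that could be run on a data file before starting the server. The helper name find_invalid_items and the inline example items are illustrative assumptions, not part of Potato or of the showcase repository.

```python
import json

# id_key and text_key from config.yaml, plus the {{response}} placeholder
# used in html_layout.
REQUIRED_KEYS = ("id", "text", "response")


def find_invalid_items(items):
    """Return (index, problem) pairs for items that look incomplete."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        for key in REQUIRED_KEYS:
            value = item.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((index, f"missing or empty '{key}'"))
        item_id = item.get("id")
        if item_id is not None:
            if item_id in seen_ids:
                problems.append((index, f"duplicate id '{item_id}'"))
            seen_ids.add(item_id)
    return problems


# Illustrative input: the second item reuses an id and has an empty "text".
sample = json.loads("""
[
  {"id": "uf_001", "text": "Explain blockchain.", "response": "A shared notebook."},
  {"id": "uf_001", "text": "", "response": "Duplicate id and empty text."}
]
""")

problems = find_invalid_items(sample)
for index, problem in problems:
    print(f"item {index}: {problem}")
```

To check the real file, replace the inline string with the contents of sample-data.json, for example via json.load on the opened file.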

Get this design


Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/preference-learning/ultrafeedback-multiaspect
potato start config.yaml

Details

Annotation types

multirate, likert, text

Domain

NLP, AI Alignment

Use cases

Preference Learning, Multi-Aspect Evaluation, Reward Model Training

Tags

ultrafeedback, multirate, multi-aspect, quality-rating, icml2024


Related designs

UltraFeedback Rubric Evaluation

Fine-grained response evaluation across 4 dimensions with written rationales. Rate responses on helpfulness, honesty, instruction-following, and truthfulness using detailed rubrics.

likert, text

Automated Essay Scoring

Holistic and analytic scoring of student essays using a deep-neural approach to automated essay scoring (Uto, arXiv 2022). Annotators provide overall quality ratings, holistic scores on a 1-6 scale, and detailed feedback comments for educational assessment.

likert, slider

Clotho Audio Captioning

Audio captioning and quality assessment based on the Clotho dataset (Drossos et al., ICASSP 2020). Annotators write natural language captions for audio clips, rate caption accuracy on a Likert scale, and classify the audio environment.

text, likert