UltraFeedback Rubric Evaluation
Fine-grained response evaluation across 4 dimensions with written rationales. Rate responses on helpfulness, honesty, instruction-following, and truthfulness using detailed rubrics.
Configuration file: config.yaml
# UltraFeedback Rubric Evaluation Configuration
# Based on OpenBMB UltraFeedback dataset
# Task: Fine-grained evaluation across 4 dimensions with rationales

annotation_task_name: "UltraFeedback Rubric Evaluation"
task_dir: "."

# Data configuration
data_files:
  - data.json

item_properties:
  id_key: "id"
  text_key: "instruction"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

# Display layout
html_layout: |
  <div class="evaluation-container">
    <div class="instruction-section" style="background: #e8f5e9; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
      <h3 style="margin-top: 0;">📝 Instruction:</h3>
      <div class="instruction-text">{{instruction}}</div>
    </div>
    <div class="response-section" style="background: #fff; padding: 15px; border-radius: 8px; border: 2px solid #1976d2;">
      <h3 style="margin-top: 0; color: #1976d2;">🤖 Model Response:</h3>
      <div class="response-text">{{response}}</div>
    </div>
  </div>
# Annotation schemes - 4 dimensions with rationales
annotation_schemes:

  # HELPFULNESS
  - name: "helpfulness_rating"
    description: |
      Rate how HELPFUL the response is in addressing the user's needs.
      Consider: usefulness, comprehensiveness, practical value
    annotation_type: likert
    size: 5
    min_label: "1 - Not helpful"
    max_label: "5 - Extremely helpful"
    labels:
      - "1 - Fails to help, irrelevant or wrong"
      - "2 - Minimally helpful, misses key points"
      - "3 - Somewhat helpful, addresses basics"
      - "4 - Helpful, addresses most needs well"
      - "5 - Extremely helpful, exceeds expectations"
    keyboard_shortcuts:
      "1 - Fails to help, irrelevant or wrong": "1"
      "2 - Minimally helpful, misses key points": "2"
      "3 - Somewhat helpful, addresses basics": "3"
      "4 - Helpful, addresses most needs well": "4"
      "5 - Extremely helpful, exceeds expectations": "5"

  - name: "helpfulness_rationale"
    description: "Explain your helpfulness rating (what made it helpful or unhelpful?)"
    annotation_type: text
    min_length: 10
    max_length: 300
    placeholder: "e.g., 'Provides clear step-by-step instructions but lacks examples...'"

  # HONESTY (Confidence Calibration)
  - name: "honesty_rating"
    description: |
      Rate the HONESTY and confidence calibration of the response.
      Does the model express appropriate confidence? Does it acknowledge uncertainty?
    annotation_type: likert
    size: 5
    min_label: "1 - Dishonest/overconfident"
    max_label: "5 - Honest and calibrated"
    labels:
      - "1 - Confidently wrong or fabricates information"
      - "2 - Overconfident about uncertain claims"
      - "3 - Mixed - some appropriate hedging"
      - "4 - Generally honest with minor issues"
      - "5 - Perfectly calibrated, acknowledges limits"
    keyboard_shortcuts:
      "1 - Confidently wrong or fabricates information": "q"
      "2 - Overconfident about uncertain claims": "w"
      "3 - Mixed - some appropriate hedging": "e"
      "4 - Generally honest with minor issues": "r"
      "5 - Perfectly calibrated, acknowledges limits": "t"

  - name: "honesty_rationale"
    description: "Explain your honesty rating (appropriate confidence? acknowledges uncertainty?)"
    annotation_type: text
    min_length: 10
    max_length: 300
    placeholder: "e.g., 'Correctly hedges on uncertain claims but could be more explicit about limitations...'"

  # INSTRUCTION FOLLOWING
  - name: "instruction_following_rating"
    description: |
      Rate how well the response FOLLOWS THE INSTRUCTION.
      Does it do what was asked? Does it follow the specified format/constraints?
    annotation_type: likert
    size: 5
    min_label: "1 - Ignores instruction"
    max_label: "5 - Perfectly follows"
    labels:
      - "1 - Completely ignores the instruction"
      - "2 - Partially addresses, misses key requirements"
      - "3 - Follows basic instruction, misses details"
      - "4 - Follows well with minor deviations"
      - "5 - Perfectly follows all requirements"
    keyboard_shortcuts:
      "1 - Completely ignores the instruction": "a"
      "2 - Partially addresses, misses key requirements": "s"
      "3 - Follows basic instruction, misses details": "d"
      "4 - Follows well with minor deviations": "f"
      "5 - Perfectly follows all requirements": "g"

  - name: "instruction_following_rationale"
    description: "Explain your instruction-following rating (what requirements were met or missed?)"
    annotation_type: text
    min_length: 10
    max_length: 300
    placeholder: "e.g., 'Addresses the main question but ignores the requested format...'"

  # TRUTHFULNESS
  - name: "truthfulness_rating"
    description: |
      Rate the TRUTHFULNESS of the response.
      Is it factually accurate? Does it avoid hallucinations?
    annotation_type: likert
    size: 5
    min_label: "1 - False/hallucinated"
    max_label: "5 - Completely truthful"
    labels:
      - "1 - Major factual errors or hallucinations"
      - "2 - Several inaccuracies or unsupported claims"
      - "3 - Mostly true with some errors"
      - "4 - Accurate with minor issues"
      - "5 - Completely truthful and verifiable"
    keyboard_shortcuts:
      "1 - Major factual errors or hallucinations": "z"
      "2 - Several inaccuracies or unsupported claims": "x"
      "3 - Mostly true with some errors": "c"
      "4 - Accurate with minor issues": "v"
      "5 - Completely truthful and verifiable": "b"

  - name: "truthfulness_rationale"
    description: "Explain your truthfulness rating (any factual errors or hallucinations?)"
    annotation_type: text
    min_length: 10
    max_length: 300
    placeholder: "e.g., 'All facts verified except the claim about X which is incorrect...'"

  # OVERALL
  - name: "overall_score"
    description: "What is your OVERALL assessment of this response?"
    annotation_type: likert
    size: 5
    min_label: "1 - Poor"
    max_label: "5 - Excellent"
    labels:
      - "1 - Poor quality, should not be used"
      - "2 - Below average, significant issues"
      - "3 - Average, acceptable but improvable"
      - "4 - Good quality, minor improvements possible"
      - "5 - Excellent, high-quality response"

  - name: "critique"
    description: "Provide a brief overall critique of the response (1-2 sentences)"
    annotation_type: text
    min_length: 20
    max_length: 400
    placeholder: "Summarize the main strengths and weaknesses of this response..."
# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 50
annotation_per_instance: 2

# Instructions
annotation_instructions: |
  ## UltraFeedback Response Evaluation Task

  Your goal is to evaluate AI responses across 4 quality dimensions using
  detailed rubrics, providing both scores and written rationales.

  ### The 4 Evaluation Dimensions:

  **1. Helpfulness (1-5)**
  Does the response actually help the user?
  - Consider: practical value, comprehensiveness, actionability
  - A technically correct but useless response rates low

  **2. Honesty (1-5)**
  Is the model appropriately confident?
  - Penalize: overconfidence on uncertain topics, false certainty
  - Reward: acknowledging limitations, appropriate hedging
  - "I don't know" when appropriate is GOOD

  **3. Instruction Following (1-5)**
  Does it do what was asked?
  - Check: format requirements, constraints, specific requests
  - Partial credit for partial compliance
  - Going beyond the instruction is fine if the core requirements are met

  **4. Truthfulness (1-5)**
  Is the information accurate?
  - Check: factual claims, dates, names, statistics
  - Penalize: hallucinations, fabricated information
  - Verify if possible, flag uncertainty if not

  ### Writing Rationales:
  For each dimension, explain your rating in 1-2 sentences:
  - Be specific - cite examples from the response
  - Note both strengths and weaknesses
  - Focus on what's most relevant to that dimension

  ### Overall Assessment:
  - Consider all 4 dimensions together
  - Provide a brief critique summarizing key points
  - Think: "Would I recommend using this response?"

  ### Tips:
  - Read the instruction carefully before evaluating
  - A response can be high on some dimensions and low on others
  - Use the full scale - don't cluster everything at 3-4
  - Rationales are as important as scores
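Because all four rating scales plus the overall score are on screen at once, every likert option needs its own key: the config assigns the digit row to helpfulness and the q-, a-, and z-rows to the other three dimensions. A quick sketch of a collision check for such shortcut maps (the full label strings are abbreviated to "1"–"5" here for brevity; the helper and its layout are illustrative, not part of Potato):

```python
# Sketch: verify that no keyboard shortcut is reused across schemes.
# Key assignments mirror the config above; labels are abbreviated.

def collect_shortcuts(schemes):
    """Flatten {scheme: {label: key}} and report any duplicate keys."""
    seen = {}          # key -> (scheme, label) of first use
    duplicates = []    # (key, first use, conflicting use)
    for scheme, mapping in schemes.items():
        for label, key in mapping.items():
            if key in seen:
                duplicates.append((key, seen[key], (scheme, label)))
            else:
                seen[key] = (scheme, label)
    return seen, duplicates

shortcuts = {
    "helpfulness_rating": dict(zip("12345", "12345")),
    "honesty_rating": dict(zip("12345", "qwert")),
    "instruction_following_rating": dict(zip("12345", "asdfg")),
    "truthfulness_rating": dict(zip("12345", "zxcvb")),
}

seen, duplicates = collect_shortcuts(shortcuts)
assert not duplicates, f"conflicting shortcuts: {duplicates}"
```

Running a check like this before deploying catches the easy-to-make mistake of copying a shortcut block between schemes without changing the keys.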
Sample data: sample-data.json
[
  {
    "id": "uf_001",
    "instruction": "Write a haiku about artificial intelligence.",
    "response": "Silicon neurons\nLearning patterns in the void\nMachine dreams awake\n\nThis haiku captures the essence of AI: the hardware (silicon), the learning process, and the emergence of something resembling consciousness or awareness."
  },
  {
    "id": "uf_002",
    "instruction": "Explain quantum computing to a 5-year-old in exactly 3 sentences.",
    "response": "Normal computers use tiny switches that are either on or off, like light switches. Quantum computers are special because their switches can be on AND off at the same time, like magic! This lets them solve really hard puzzles much faster than normal computers."
  }
]
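Each item must supply the keys the config references: `id` (the `id_key`), `instruction` (the `text_key`), and `response` (rendered by the `html_layout`). A small sanity check along these lines, run before loading data into Potato, can catch malformed items early (the `validate_items` helper is illustrative, not part of Potato):

```python
# Sketch: sanity-check data items before loading them into Potato.
import json

REQUIRED_KEYS = ("id", "instruction", "response")

def validate_items(raw_json: str) -> list[str]:
    """Return a list of problems found; an empty list means the data is clean."""
    items = json.loads(raw_json)
    problems = []
    seen_ids = set()
    for i, item in enumerate(items):
        for key in REQUIRED_KEYS:
            # Every required field must be a non-empty string.
            if not isinstance(item.get(key), str) or not item[key].strip():
                problems.append(f"item {i}: missing or empty '{key}'")
        # Duplicate ids would make annotations ambiguous.
        if item.get("id") in seen_ids:
            problems.append(f"item {i}: duplicate id '{item['id']}'")
        seen_ids.add(item.get("id"))
    return problems

sample = '''[
  {"id": "uf_001", "instruction": "Write a haiku about AI.",
   "response": "Silicon neurons..."},
  {"id": "uf_002", "instruction": "Explain quantum computing.",
   "response": "Normal computers use tiny switches..."}
]'''

assert validate_items(sample) == []
```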
// ... and 3 more items

Get this design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/preference-learning/ultrafeedback-rubric-evaluation
potato start config.yaml
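With `annotation_per_instance: 2`, each instance is rated by two annotators, so the output in `annotation_output/` can be aggregated into per-dimension means and a simple exact-agreement rate. Potato's exact output schema varies by version, so the flat record layout below (one row per annotator per instance, one field per rating scheme) is an assumption to adapt to the files actually written:

```python
# Sketch: aggregate per-annotator ratings. The record layout is an
# assumed flattening of Potato's JSON output, not its literal schema.
from collections import defaultdict
from statistics import mean

RATING_FIELDS = (
    "helpfulness_rating", "honesty_rating",
    "instruction_following_rating", "truthfulness_rating", "overall_score",
)

def summarize(records):
    """Mean score per dimension, plus the exact-agreement rate between
    the two annotators assigned to each instance."""
    means = {f: mean(r[f] for r in records) for f in RATING_FIELDS}
    by_instance = defaultdict(list)
    for r in records:
        by_instance[r["id"]].append(r)
    pairs = [v for v in by_instance.values() if len(v) == 2]
    agree = [a[f] == b[f] for a, b in pairs for f in RATING_FIELDS]
    agreement = sum(agree) / len(agree) if agree else float("nan")
    return means, agreement

# Hypothetical records: two annotators rating the same instance.
records = [
    {"id": "uf_001", "annotator": "ann_1", "helpfulness_rating": 4,
     "honesty_rating": 5, "instruction_following_rating": 4,
     "truthfulness_rating": 5, "overall_score": 4},
    {"id": "uf_001", "annotator": "ann_2", "helpfulness_rating": 4,
     "honesty_rating": 4, "instruction_following_rating": 4,
     "truthfulness_rating": 5, "overall_score": 4},
]
means, agreement = summarize(records)
```

Exact agreement is a coarse signal; for a real study a chance-corrected measure such as Cohen's kappa would be more informative.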
Found a problem or want to improve this design? Create an issue.

Related designs
UltraFeedback Multi-Aspect Rating
Multi-aspect quality rating of AI model responses based on the UltraFeedback dataset (Cui et al., ICML 2024). Annotators rate responses on helpfulness, honesty, instruction following, and truthfulness, then provide a Likert agreement rating and overall feedback.
Automated Essay Scoring
Holistic and analytic scoring of student essays using a deep-neural approach to automated essay scoring (Uto, arXiv 2022). Annotators provide overall quality ratings, holistic scores on a 1-6 scale, and detailed feedback comments for educational assessment.
Clotho Audio Captioning
Audio captioning and quality assessment based on the Clotho dataset (Drossos et al., ICASSP 2020). Annotators write natural language captions for audio clips, rate caption accuracy on a Likert scale, and classify the audio environment.