Prometheus: Rubric-based LLM Evaluation
Fine-grained rubric-based evaluation of LLM outputs. Annotators score responses on a 1-5 scale against detailed rubrics that specify criteria for each score level, and provide written feedback.
Configuration File: config.yaml
# Prometheus: Rubric-based LLM Evaluation
# Based on "Prometheus: Inducing Fine-grained Evaluation Capability in Language Models" (Kim et al., ICLR 2024)
# Task: Score LLM responses against detailed rubrics and provide written feedback
annotation_task_name: "Prometheus Rubric-based Evaluation"
task_dir: "."
# Data configuration
# Data configuration
data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

# Display layout showing instruction, response, rubric, and reference answer
html_layout: |
  <div class="prometheus-container">
    <div class="instruction-section" style="background: #e8f5e9; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
      <h3 style="margin-top: 0;">Instruction:</h3>
      <div class="instruction-text" style="font-size: 15px;">{{text}}</div>
    </div>
    <div class="response-section" style="background: #e3f2fd; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #1976d2;">
      <h3 style="margin-top: 0; color: #1976d2;">Model Response:</h3>
      <div class="response-text" style="font-size: 15px; white-space: pre-wrap;">{{response}}</div>
    </div>
    <div class="rubric-section" style="background: #fff8e1; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #f9a825;">
      <h3 style="margin-top: 0; color: #f9a825;">Scoring Rubric:</h3>
      <div class="rubric-text" style="font-size: 14px; white-space: pre-wrap;">{{rubric}}</div>
    </div>
    <div class="reference-section" style="background: #f3e5f5; padding: 15px; border-radius: 8px; border: 2px solid #7b1fa2;">
      <h3 style="margin-top: 0; color: #7b1fa2;">Reference Answer:</h3>
      <div class="reference-text" style="font-size: 14px; white-space: pre-wrap;">{{reference_answer}}</div>
    </div>
  </div>

# Annotation schemes
annotation_schemes:
  # Rubric score (1-5)
  - name: "rubric_score"
    description: "Score the response according to the provided rubric (1-5 scale)."
    annotation_type: likert
    size: 5
    min_label: "1 - Fails criteria"
    max_label: "5 - Exceeds criteria"
    labels:
      - "1 - Does not meet rubric criteria at all"
      - "2 - Meets few rubric criteria with major gaps"
      - "3 - Partially meets rubric criteria"
      - "4 - Meets most rubric criteria well"
      - "5 - Fully meets or exceeds all rubric criteria"
    keyboard_shortcuts:
      "1 - Does not meet rubric criteria at all": "1"
      "2 - Meets few rubric criteria with major gaps": "2"
      "3 - Partially meets rubric criteria": "3"
      "4 - Meets most rubric criteria well": "4"
      "5 - Fully meets or exceeds all rubric criteria": "5"

  # Written feedback
  - name: "feedback"
    description: "Provide detailed written feedback explaining your score. Reference specific rubric criteria and cite examples from the response."
    annotation_type: text
    min_length: 50
    max_length: 800
    placeholder: "Explain how the response meets or fails to meet each rubric criterion. Cite specific examples from the response..."

  # Comparison to reference
  - name: "reference_comparison"
    description: "How does the response compare to the reference answer?"
    annotation_type: likert
    size: 5
    min_label: "1 - Much worse"
    max_label: "5 - Much better"
    labels:
      - "1 - Much worse than reference"
      - "2 - Somewhat worse than reference"
      - "3 - About equal to reference"
      - "4 - Somewhat better than reference"
      - "5 - Much better than reference"

# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 50
annotation_per_instance: 3
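Since each instance collects three independent judgments (`annotation_per_instance: 3`), a small post-processing step can aggregate the rubric scores and flag disagreement. A minimal sketch in Python, assuming the annotations have been loaded into a dict mapping each instance id to its list of `rubric_score` label strings (the exact layout of Potato's JSON output may differ, so adapt the loading step accordingly):

```python
import statistics

def label_to_score(label: str) -> int:
    """Parse the numeric score from a label like '4 - Meets most rubric criteria well'."""
    return int(label.split(" - ", 1)[0])

def aggregate_rubric_scores(annotations: dict) -> dict:
    """Aggregate per-annotator rubric_score labels for each instance.

    `annotations` is assumed to map instance id -> list of label strings,
    one per annotator.
    """
    summary = {}
    for item_id, labels in annotations.items():
        scores = [label_to_score(lbl) for lbl in labels]
        summary[item_id] = {
            "median": statistics.median(scores),
            "spread": max(scores) - min(scores),  # 0 means full agreement
        }
    return summary

# Example with the three judgments configured above
example = {"prom_001": [
    "4 - Meets most rubric criteria well",
    "5 - Fully meets or exceeds all rubric criteria",
    "4 - Meets most rubric criteria well",
]}
print(aggregate_rubric_scores(example))  # {'prom_001': {'median': 4, 'spread': 1}}
```

The median is more robust than the mean for a 1-5 ordinal scale, and a large spread is a useful signal for routing an instance to adjudication.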
Sample Data: sample-data.json
[
  {
    "id": "prom_001",
    "text": "Explain the difference between machine learning and deep learning to someone with no technical background.",
    "response": "Machine learning is like teaching a computer to learn from examples. Imagine showing a computer thousands of photos of cats and dogs - it learns patterns to tell them apart. Deep learning is a specific type of machine learning that uses layers of processing, like how your brain has layers of neurons. Think of machine learning as the big umbrella, and deep learning as one powerful tool under that umbrella. Deep learning is especially good at complex tasks like understanding images and language because it can automatically discover the important features to look for.",
    "rubric": "Score 1: The explanation uses heavy jargon and technical terms without simplification, making it inaccessible to non-technical readers.\nScore 2: The explanation attempts to simplify but still relies on some technical terms without adequate explanation.\nScore 3: The explanation uses some analogies but may be inconsistent in accessibility or miss key differences.\nScore 4: The explanation uses clear analogies and simple language to convey both concepts and their relationship, with only minor unclear points.\nScore 5: The explanation uses excellent analogies, is fully accessible to non-technical readers, clearly distinguishes the two concepts, and explains their relationship intuitively.",
    "reference_answer": "Think of machine learning as teaching a computer to make decisions by showing it lots of examples, similar to how a child learns. You show it many pictures of apples and oranges, and eventually it can tell them apart on its own. Deep learning is a more advanced version of this - it's inspired by how our brains work, using layers of artificial 'neurons.' It's particularly good at complex tasks like recognizing faces in photos or understanding spoken words. The key difference: regular machine learning often needs humans to point out what features to look for (like 'look at the color and shape'), while deep learning figures out the important features on its own."
  },
  {
    "id": "prom_002",
    "text": "Write a professional summary for a software engineer's resume with 5 years of experience in backend development.",
    "response": "Experienced software engineer with 5 years of professional experience specializing in backend development. Proficient in Java, Python, and Node.js with expertise in designing and implementing RESTful APIs, microservices architectures, and database optimization. Demonstrated track record of improving system performance by up to 40% through code optimization and infrastructure improvements. Strong collaborator with experience leading cross-functional teams in Agile environments. Passionate about writing clean, maintainable code and implementing best practices in CI/CD pipelines.",
    "rubric": "Score 1: The summary is generic, lacks specificity, uses first person, or reads like a job description rather than a personal summary.\nScore 2: The summary mentions relevant skills but lacks quantifiable achievements and uses vague language.\nScore 3: The summary includes relevant technologies and some achievements but could be more specific or impactful.\nScore 4: The summary is well-structured, includes specific technologies, quantifiable achievements, and conveys professional value clearly.\nScore 5: The summary is concise yet comprehensive, includes specific technologies and metrics, demonstrates clear value proposition, uses strong action-oriented language, and is tailored for a backend role.",
    "reference_answer": "Results-driven backend engineer with 5+ years building scalable, high-performance systems serving millions of users. Expert in Java and Python microservices, with hands-on experience designing distributed systems using AWS, Kubernetes, and PostgreSQL. Led migration of monolithic application to microservices architecture, reducing deployment time by 60% and improving system reliability to 99.9% uptime. Skilled in mentoring junior developers and driving engineering best practices across Agile teams."
  }
]
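The `html_layout` above references `{{text}}`, `{{response}}`, `{{rubric}}`, and `{{reference_answer}}`, so every data item must carry those fields (plus the `id` key named in `item_properties`). A quick pre-flight check can catch missing fields before launching the task; `validate_items` below is a hypothetical helper for illustration, not part of Potato:

```python
import json

REQUIRED_KEYS = {"id", "text", "response", "rubric", "reference_answer"}

def validate_items(items: list) -> list:
    """Return (id, missing_keys) pairs for items missing any required field."""
    problems = []
    for i, item in enumerate(items):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append((item.get("id", f"index {i}"), sorted(missing)))
    return problems

# Usage: check sample-data.json before running `potato start config.yaml`
# with open("sample-data.json") as f:
#     print(validate_items(json.load(f)))
good = [{"id": "prom_001", "text": "...", "response": "...",
         "rubric": "...", "reference_answer": "..."}]
assert validate_items(good) == []
```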
// ... and 8 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/prometheus-rubric-evaluation
potato start config.yaml
Found an issue or want to improve this design? Open an Issue

Related Designs
Unlearning Sensitive Content from LLMs
Evaluation of whether language models have successfully unlearned sensitive content, requiring annotators to assess model outputs for residual sensitive information leakage. Based on SemEval-2025 Task 4.
WildBench - LLM Evaluation on Real-World Tasks
Evaluation of LLM outputs on challenging real-world user queries from WildBench. Annotators compare two model responses via pairwise preference, rate overall quality on a Likert scale, and provide reasoning for their judgments.
Automated Essay Scoring
Holistic and analytic scoring of student essays, based on a deep-neural approach to automated essay scoring (Uto, arXiv 2022). Annotators provide overall quality ratings, holistic scores on a 1-6 scale, and detailed feedback comments for educational assessment.