advanced, survey

Prometheus: Rubric-based LLM Evaluation

Fine-grained, rubric-based evaluation of LLM outputs. Annotators score each response on a 1-5 scale against a detailed rubric that defines specific criteria for every score level, provide written feedback explaining the score, and rate how the response compares to a reference answer.

Configuration File: config.yaml

# Prometheus: Rubric-based LLM Evaluation
# Based on "Prometheus: Inducing Fine-grained Evaluation Capability in Language Models" (Kim et al., ICLR 2024)
# Task: Score LLM responses against detailed rubrics and provide written feedback

annotation_task_name: "Prometheus Rubric-based Evaluation"
task_dir: "."

# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

# Display layout showing instruction, response, rubric, and reference answer
html_layout: |
  <div class="prometheus-container">
    <div class="instruction-section" style="background: #e8f5e9; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
      <h3 style="margin-top: 0;">Instruction:</h3>
      <div class="instruction-text" style="font-size: 15px;">{{text}}</div>
    </div>
    <div class="response-section" style="background: #e3f2fd; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #1976d2;">
      <h3 style="margin-top: 0; color: #1976d2;">Model Response:</h3>
      <div class="response-text" style="font-size: 15px; white-space: pre-wrap;">{{response}}</div>
    </div>
    <div class="rubric-section" style="background: #fff8e1; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #f9a825;">
      <h3 style="margin-top: 0; color: #f9a825;">Scoring Rubric:</h3>
      <div class="rubric-text" style="font-size: 14px; white-space: pre-wrap;">{{rubric}}</div>
    </div>
    <div class="reference-section" style="background: #f3e5f5; padding: 15px; border-radius: 8px; border: 2px solid #7b1fa2;">
      <h3 style="margin-top: 0; color: #7b1fa2;">Reference Answer:</h3>
      <div class="reference-text" style="font-size: 14px; white-space: pre-wrap;">{{reference_answer}}</div>
    </div>
  </div>

# Annotation schemes
annotation_schemes:
  # Rubric score (1-5)
  - name: "rubric_score"
    description: "Score the response according to the provided rubric (1-5 scale)."
    annotation_type: likert
    size: 5
    min_label: "1 - Fails criteria"
    max_label: "5 - Exceeds criteria"
    labels:
      - "1 - Does not meet rubric criteria at all"
      - "2 - Meets few rubric criteria with major gaps"
      - "3 - Partially meets rubric criteria"
      - "4 - Meets most rubric criteria well"
      - "5 - Fully meets or exceeds all rubric criteria"
    keyboard_shortcuts:
      "1 - Does not meet rubric criteria at all": "1"
      "2 - Meets few rubric criteria with major gaps": "2"
      "3 - Partially meets rubric criteria": "3"
      "4 - Meets most rubric criteria well": "4"
      "5 - Fully meets or exceeds all rubric criteria": "5"

  # Written feedback
  - name: "feedback"
    description: "Provide detailed written feedback explaining your score. Reference specific rubric criteria and cite examples from the response."
    annotation_type: text
    min_length: 50
    max_length: 800
    placeholder: "Explain how the response meets or fails to meet each rubric criterion. Cite specific examples from the response..."

  # Comparison to reference
  - name: "reference_comparison"
    description: "How does the response compare to the reference answer?"
    annotation_type: likert
    size: 5
    min_label: "1 - Much worse"
    max_label: "5 - Much better"
    labels:
      - "1 - Much worse than reference"
      - "2 - Somewhat worse than reference"
      - "3 - About equal to reference"
      - "4 - Somewhat better than reference"
      - "5 - Much better than reference"

# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 50
annotation_per_instance: 3
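
The layout above pulls four fields from each data item (`{{text}}`, `{{response}}`, `{{rubric}}`, `{{reference_answer}}`), and `id_key` names a fifth. A minimal sketch of a pre-flight check that every item supplies them is shown below. The helper names `required_fields` and `find_missing_fields` are hypothetical and not part of Potato; the check assumes the data file has already been loaded into a list of dictionaries.

```python
import re

ID_KEY = "id"  # matches item_properties.id_key in config.yaml


def required_fields(layout: str, id_key: str = ID_KEY) -> set[str]:
    """Collect every {{name}} placeholder in the layout, plus the id key."""
    return set(re.findall(r"\{\{\s*(\w+)\s*\}\}", layout)) | {id_key}


def find_missing_fields(items: list[dict], fields: set[str]) -> dict[str, list[str]]:
    """Map each item's id (or list position) to the fields it lacks or leaves blank."""
    problems = {}
    for index, item in enumerate(items):
        missing = sorted(
            name for name in fields
            if item.get(name) is None or not str(item[name]).strip()
        )
        if missing:
            problems[str(item.get(ID_KEY, f"#{index}"))] = missing
    return problems


# Illustrative items only; real items would come from the data file.
layout = "{{text}} {{response}} {{rubric}} {{reference_answer}}"
items = [
    {"id": "ok", "text": "t", "response": "r", "rubric": "Score 1: x", "reference_answer": "a"},
    {"id": "bad", "text": "t", "response": ""},
]
print(find_missing_fields(items, required_fields(layout)))
```

Running a check like this before starting the server can catch items that would otherwise render with empty panels.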

Sample Data: sample-data.json

[
  {
    "id": "prom_001",
    "text": "Explain the difference between machine learning and deep learning to someone with no technical background.",
    "response": "Machine learning is like teaching a computer to learn from examples. Imagine showing a computer thousands of photos of cats and dogs - it learns patterns to tell them apart. Deep learning is a specific type of machine learning that uses layers of processing, like how your brain has layers of neurons. Think of machine learning as the big umbrella, and deep learning as one powerful tool under that umbrella. Deep learning is especially good at complex tasks like understanding images and language because it can automatically discover the important features to look for.",
    "rubric": "Score 1: The explanation uses heavy jargon and technical terms without simplification, making it inaccessible to non-technical readers.\nScore 2: The explanation attempts to simplify but still relies on some technical terms without adequate explanation.\nScore 3: The explanation uses some analogies but may be inconsistent in accessibility or miss key differences.\nScore 4: The explanation uses clear analogies and simple language to convey both concepts and their relationship, with only minor unclear points.\nScore 5: The explanation uses excellent analogies, is fully accessible to non-technical readers, clearly distinguishes the two concepts, and explains their relationship intuitively.",
    "reference_answer": "Think of machine learning as teaching a computer to make decisions by showing it lots of examples, similar to how a child learns. You show it many pictures of apples and oranges, and eventually it can tell them apart on its own. Deep learning is a more advanced version of this - it's inspired by how our brains work, using layers of artificial 'neurons.' It's particularly good at complex tasks like recognizing faces in photos or understanding spoken words. The key difference: regular machine learning often needs humans to point out what features to look for (like 'look at the color and shape'), while deep learning figures out the important features on its own."
  },
  {
    "id": "prom_002",
    "text": "Write a professional summary for a software engineer's resume with 5 years of experience in backend development.",
    "response": "Experienced software engineer with 5 years of professional experience specializing in backend development. Proficient in Java, Python, and Node.js with expertise in designing and implementing RESTful APIs, microservices architectures, and database optimization. Demonstrated track record of improving system performance by up to 40% through code optimization and infrastructure improvements. Strong collaborator with experience leading cross-functional teams in Agile environments. Passionate about writing clean, maintainable code and implementing best practices in CI/CD pipelines.",
    "rubric": "Score 1: The summary is generic, lacks specificity, uses first person, or reads like a job description rather than a personal summary.\nScore 2: The summary mentions relevant skills but lacks quantifiable achievements and uses vague language.\nScore 3: The summary includes relevant technologies and some achievements but could be more specific or impactful.\nScore 4: The summary is well-structured, includes specific technologies, quantifiable achievements, and conveys professional value clearly.\nScore 5: The summary is concise yet comprehensive, includes specific technologies and metrics, demonstrates clear value proposition, uses strong action-oriented language, and is tailored for a backend role.",
    "reference_answer": "Results-driven backend engineer with 5+ years building scalable, high-performance systems serving millions of users. Expert in Java and Python microservices, with hands-on experience designing distributed systems using AWS, Kubernetes, and PostgreSQL. Led migration of monolithic application to microservices architecture, reducing deployment time by 60% and improving system reliability to 99.9% uptime. Skilled in mentoring junior developers and driving engineering best practices across Agile teams."
  }
]

// ... and 8 more items
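
Each `rubric` value in the sample items is a single string with one `Score N: ...` line per level. The sketch below splits such a string into a score-to-criterion mapping and reports any level from 1 to 5 that has no description. It assumes every rubric follows the `Score N:` line format seen in the two items shown; `parse_rubric` and `missing_levels` are hypothetical helper names.

```python
import re


def parse_rubric(rubric: str) -> dict[int, str]:
    """Split a 'Score N: ...' rubric string into {score: criterion}."""
    levels = {}
    for line in rubric.splitlines():
        match = re.match(r"\s*Score\s+(\d+)\s*:\s*(.+)", line)
        if match:
            levels[int(match.group(1))] = match.group(2).strip()
    return levels


def missing_levels(rubric: str, low: int = 1, high: int = 5) -> list[int]:
    """Return the score levels in [low, high] that the rubric does not describe."""
    levels = parse_rubric(rubric)
    return [score for score in range(low, high + 1) if score not in levels]


# Shortened illustrative rubric in the same format as the sample data.
rubric = "Score 1: Heavy jargon.\nScore 2: Some jargon.\nScore 3: Some analogies."
print(parse_rubric(rubric))
print(missing_levels(rubric))
```

A non-empty result from `missing_levels` suggests the rubric would leave annotators without guidance for some points on the 1-5 scale.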

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/prometheus-rubric-evaluation
potato start config.yaml
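
Because the config sets `annotation_per_instance: 3`, each item should end up with three `rubric_score` values once annotation is complete. The sketch below shows one possible way to combine them, using the median as the item score and the max-min spread as a rough disagreement signal. It assumes the scores have already been loaded into `(item_id, annotator, score)` tuples; how Potato lays out its output files is not covered here, and `aggregate_scores` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import median


def aggregate_scores(records):
    """Group (item_id, annotator, score) tuples by item and summarize them."""
    by_item = defaultdict(list)
    for item_id, _annotator, score in records:
        by_item[item_id].append(score)
    return {
        item_id: {
            "median": median(scores),
            "spread": max(scores) - min(scores),
            "n": len(scores),
        }
        for item_id, scores in by_item.items()
    }


# Illustrative scores only, not real annotation results.
records = [
    ("prom_001", "ann_a", 4),
    ("prom_001", "ann_b", 5),
    ("prom_001", "ann_c", 4),
]
print(aggregate_scores(records))
```

Items with a large spread are natural candidates for a second look at the written feedback.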

Details

Annotation Types

likert, text

Domain

NLP, LLM Evaluation, Rubric-based Assessment

Use Cases

Rubric Evaluation, LLM Scoring, Quality Feedback

Tags

rubric, prometheus, llm-evaluation, fine-grained, feedback, scoring
