Prometheus: Rubric-based LLM Evaluation
Fine-grained rubric-based evaluation of LLM outputs. Annotators score responses on a 1-5 scale against detailed rubrics that specify criteria for each score level, and provide written feedback.
Configuration File: config.yaml
# Prometheus: Rubric-based LLM Evaluation
# Based on "Prometheus: Inducing Fine-grained Evaluation Capability in Language Models" (Kim et al., ICLR 2024)
# Task: Score LLM responses against detailed rubrics and provide written feedback
annotation_task_name: "Prometheus Rubric-based Evaluation"
task_dir: "."
# Data configuration
# Data configuration
data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

# Display layout showing instruction, response, rubric, and reference answer
html_layout: |
  <div class="prometheus-container">
    <div class="instruction-section" style="background: #e8f5e9; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
      <h3 style="margin-top: 0;">Instruction:</h3>
      <div class="instruction-text" style="font-size: 15px;">{{text}}</div>
    </div>
    <div class="response-section" style="background: #e3f2fd; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #1976d2;">
      <h3 style="margin-top: 0; color: #1976d2;">Model Response:</h3>
      <div class="response-text" style="font-size: 15px; white-space: pre-wrap;">{{response}}</div>
    </div>
    <div class="rubric-section" style="background: #fff8e1; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #f9a825;">
      <h3 style="margin-top: 0; color: #f9a825;">Scoring Rubric:</h3>
      <div class="rubric-text" style="font-size: 14px; white-space: pre-wrap;">{{rubric}}</div>
    </div>
    <div class="reference-section" style="background: #f3e5f5; padding: 15px; border-radius: 8px; border: 2px solid #7b1fa2;">
      <h3 style="margin-top: 0; color: #7b1fa2;">Reference Answer:</h3>
      <div class="reference-text" style="font-size: 14px; white-space: pre-wrap;">{{reference_answer}}</div>
    </div>
  </div>

# Annotation schemes
annotation_schemes:
  # Rubric score (1-5)
  - name: "rubric_score"
    description: "Score the response according to the provided rubric (1-5 scale)."
    annotation_type: likert
    size: 5
    min_label: "1 - Fails criteria"
    max_label: "5 - Exceeds criteria"
    labels:
      - "1 - Does not meet rubric criteria at all"
      - "2 - Meets few rubric criteria with major gaps"
      - "3 - Partially meets rubric criteria"
      - "4 - Meets most rubric criteria well"
      - "5 - Fully meets or exceeds all rubric criteria"
    keyboard_shortcuts:
      "1 - Does not meet rubric criteria at all": "1"
      "2 - Meets few rubric criteria with major gaps": "2"
      "3 - Partially meets rubric criteria": "3"
      "4 - Meets most rubric criteria well": "4"
      "5 - Fully meets or exceeds all rubric criteria": "5"

  # Written feedback
  - name: "feedback"
    description: "Provide detailed written feedback explaining your score. Reference specific rubric criteria and cite examples from the response."
    annotation_type: text
    min_length: 50
    max_length: 800
    placeholder: "Explain how the response meets or fails to meet each rubric criterion. Cite specific examples from the response..."

  # Comparison to reference
  - name: "reference_comparison"
    description: "How does the response compare to the reference answer?"
    annotation_type: likert
    size: 5
    min_label: "1 - Much worse"
    max_label: "5 - Much better"
    labels:
      - "1 - Much worse than reference"
      - "2 - Somewhat worse than reference"
      - "3 - About equal to reference"
      - "4 - Somewhat better than reference"
      - "5 - Much better than reference"

# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 50
annotation_per_instance: 3
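Since each instance collects three independent judgments (`annotation_per_instance: 3`), a small post-processing step can aggregate the rubric scores and flag disagreement. A minimal sketch in Python, assuming the annotations have been loaded into a dict mapping each instance id to its list of `rubric_score` label strings (the exact layout of Potato's JSON output may differ, so adapt the loading step accordingly):

```python
import statistics

def label_to_score(label: str) -> int:
    """Parse the numeric score from a label like '4 - Meets most rubric criteria well'."""
    return int(label.split(" - ", 1)[0])

def aggregate_rubric_scores(annotations: dict) -> dict:
    """Aggregate per-annotator rubric_score labels for each instance.

    `annotations` is assumed to map instance id -> list of label strings,
    one per annotator.
    """
    summary = {}
    for item_id, labels in annotations.items():
        scores = [label_to_score(lbl) for lbl in labels]
        summary[item_id] = {
            "median": statistics.median(scores),
            "spread": max(scores) - min(scores),  # 0 means full agreement
        }
    return summary

# Example with the three judgments configured above
example = {"prom_001": [
    "4 - Meets most rubric criteria well",
    "5 - Fully meets or exceeds all rubric criteria",
    "4 - Meets most rubric criteria well",
]}
print(aggregate_rubric_scores(example))  # {'prom_001': {'median': 4, 'spread': 1}}
```

The median is more robust than the mean for a 1-5 ordinal scale, and a large spread is a useful signal for routing an instance to adjudication.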
Sample Data: sample-data.json
[
  {
    "id": "prom_001",
    "text": "Explain the difference between machine learning and deep learning to someone with no technical background.",
    "response": "Machine learning is like teaching a computer to learn from examples. Imagine showing a computer thousands of photos of cats and dogs - it learns patterns to tell them apart. Deep learning is a specific type of machine learning that uses layers of processing, like how your brain has layers of neurons. Think of machine learning as the big umbrella, and deep learning as one powerful tool under that umbrella. Deep learning is especially good at complex tasks like understanding images and language because it can automatically discover the important features to look for.",
    "rubric": "Score 1: The explanation uses heavy jargon and technical terms without simplification, making it inaccessible to non-technical readers.\nScore 2: The explanation attempts to simplify but still relies on some technical terms without adequate explanation.\nScore 3: The explanation uses some analogies but may be inconsistent in accessibility or miss key differences.\nScore 4: The explanation uses clear analogies and simple language to convey both concepts and their relationship, with only minor unclear points.\nScore 5: The explanation uses excellent analogies, is fully accessible to non-technical readers, clearly distinguishes the two concepts, and explains their relationship intuitively.",
    "reference_answer": "Think of machine learning as teaching a computer to make decisions by showing it lots of examples, similar to how a child learns. You show it many pictures of apples and oranges, and eventually it can tell them apart on its own. Deep learning is a more advanced version of this - it's inspired by how our brains work, using layers of artificial 'neurons.' It's particularly good at complex tasks like recognizing faces in photos or understanding spoken words. The key difference: regular machine learning often needs humans to point out what features to look for (like 'look at the color and shape'), while deep learning figures out the important features on its own."
  },
  {
    "id": "prom_002",
    "text": "Write a professional summary for a software engineer's resume with 5 years of experience in backend development.",
    "response": "Experienced software engineer with 5 years of professional experience specializing in backend development. Proficient in Java, Python, and Node.js with expertise in designing and implementing RESTful APIs, microservices architectures, and database optimization. Demonstrated track record of improving system performance by up to 40% through code optimization and infrastructure improvements. Strong collaborator with experience leading cross-functional teams in Agile environments. Passionate about writing clean, maintainable code and implementing best practices in CI/CD pipelines.",
    "rubric": "Score 1: The summary is generic, lacks specificity, uses first person, or reads like a job description rather than a personal summary.\nScore 2: The summary mentions relevant skills but lacks quantifiable achievements and uses vague language.\nScore 3: The summary includes relevant technologies and some achievements but could be more specific or impactful.\nScore 4: The summary is well-structured, includes specific technologies, quantifiable achievements, and conveys professional value clearly.\nScore 5: The summary is concise yet comprehensive, includes specific technologies and metrics, demonstrates clear value proposition, uses strong action-oriented language, and is tailored for a backend role.",
    "reference_answer": "Results-driven backend engineer with 5+ years building scalable, high-performance systems serving millions of users. Expert in Java and Python microservices, with hands-on experience designing distributed systems using AWS, Kubernetes, and PostgreSQL. Led migration of monolithic application to microservices architecture, reducing deployment time by 60% and improving system reliability to 99.9% uptime. Skilled in mentoring junior developers and driving engineering best practices across Agile teams."
  }
]
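The `html_layout` above references `{{text}}`, `{{response}}`, `{{rubric}}`, and `{{reference_answer}}`, so every data item must carry those fields (plus the `id` key named in `item_properties`). A quick pre-flight check can catch missing fields before launching the task; `validate_items` below is a hypothetical helper for illustration, not part of Potato:

```python
import json

REQUIRED_KEYS = {"id", "text", "response", "rubric", "reference_answer"}

def validate_items(items: list) -> list:
    """Return (id, missing_keys) pairs for items missing any required field."""
    problems = []
    for i, item in enumerate(items):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append((item.get("id", f"index {i}"), sorted(missing)))
    return problems

# Usage: check sample-data.json before running `potato start config.yaml`
# with open("sample-data.json") as f:
#     print(validate_items(json.load(f)))
good = [{"id": "prom_001", "text": "...", "response": "...",
         "rubric": "...", "reference_answer": "..."}]
assert validate_items(good) == []
```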
// ... and 8 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/prometheus-rubric-evaluation
potato start config.yaml
Found an issue or want to improve this design? Open an Issue

Related Designs
Unlearning Sensitive Content from LLMs
Evaluation of whether language models have successfully unlearned sensitive content, requiring annotators to assess model outputs for residual sensitive information leakage. Based on SemEval-2025 Task 4.
WildBench - LLM Evaluation on Real-World Tasks
Evaluation of LLM outputs on challenging real-world user queries from WildBench. Annotators compare two model responses via pairwise preference, rate overall quality on a Likert scale, and provide reasoning for their judgments.
Automated Essay Scoring
Holistic and analytic scoring of student essays, based on a deep-neural approach to automated essay scoring (Uto, arXiv 2022). Annotators provide overall quality ratings, holistic scores on a 1-6 scale, and detailed feedback comments for educational assessment.