# Rubric Evaluation

Source: https://www.potatoannotator.com/docs/annotation-types/rubric-eval

The rubric evaluation annotation schema provides a structured grid interface for scoring content across multiple criteria on a defined scale. This schema is ideal for LLM output evaluation, essay grading, translation quality assessment, and any task requiring systematic multi-dimensional scoring.

## Overview

The rubric evaluation schema presents:
- **A grid of criteria** each with its own rating scale
- **Scale labels** ranging from Poor to Excellent (customizable)
- **Optional overall score** that summarizes across criteria
- **Descriptions** for each criterion to guide annotators

This is especially useful for structured human evaluation of generative AI outputs.

## Quick Start

```yaml
annotation_schemes:
  - annotation_type: rubric_eval
    name: response_quality
    description: Evaluate the quality of this AI-generated response.
    scale_points: 5
    criteria:
      - name: Accuracy
        description: Is the information factually correct?
      - name: Relevance
        description: Does the response address the question?
      - name: Fluency
        description: Is the response well-written and natural?
```

## Configuration Options

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `annotation_type` | string | Required | Must be `"rubric_eval"` |
| `name` | string | Required | Unique identifier for this schema |
| `description` | string | Required | Instructions displayed to annotators |
| `scale_points` | integer | `5` | Number of points on the rating scale |
| `scale_labels` | array | `["Poor", "Fair", "Average", "Good", "Excellent"]` | Labels for each scale point (length must match `scale_points`) |
| `criteria` | array | Required | List of criteria objects, each with `name` and optional `description` |
| `show_overall` | boolean | `false` | Show an additional overall score row below the criteria |

## Examples

### LLM Output Evaluation

```yaml
annotation_schemes:
  - annotation_type: rubric_eval
    name: llm_eval
    description: Rate the quality of this model-generated response.
    scale_points: 5
    scale_labels:
      - Poor
      - Fair
      - Average
      - Good
      - Excellent
    show_overall: true
    criteria:
      - name: Helpfulness
        description: Does the response provide useful and actionable information?
      - name: Accuracy
        description: Is the response factually correct and free of hallucinations?
      - name: Harmlessness
        description: Is the response free of harmful, biased, or inappropriate content?
      - name: Coherence
        description: Is the response logically structured and easy to follow?
```

### Essay Grading

```yaml
annotation_schemes:
  - annotation_type: rubric_eval
    name: essay_grade
    description: Grade this student essay using the rubric below.
    scale_points: 4
    scale_labels:
      - Below Expectations
      - Approaching
      - Meets Expectations
      - Exceeds Expectations
    criteria:
      - name: Thesis
        description: Is there a clear and arguable thesis statement?
      - name: Evidence
        description: Does the essay use relevant evidence to support claims?
      - name: Organization
        description: Is the essay logically organized with clear transitions?
      - name: Grammar
        description: Is the writing free of grammatical and spelling errors?
```

### Translation Quality Assessment

```yaml
annotation_schemes:
  - annotation_type: rubric_eval
    name: translation_quality
    description: Evaluate the quality of this machine translation.
    scale_points: 3
    scale_labels:
      - Unacceptable
      - Acceptable
      - Perfect
    criteria:
      - name: Adequacy
        description: Does the translation convey the same meaning as the source?
      - name: Fluency
        description: Does the translation read naturally in the target language?
      - name: Terminology
        description: Are domain-specific terms translated correctly?
```

## Output Format

```json
{
  "response_quality": {
    "labels": {
      "Accuracy": 4,
      "Relevance": 5,
      "Fluency": 3
    },
    "overall": 4
  }
}
```

Each criterion maps to its selected scale value (1-indexed). The `overall` field is included only when `show_overall` is `true`.

## Best Practices

1. **Keep criteria independent** - each criterion should measure a distinct dimension to avoid redundant scoring
2. **Write clear descriptions** - annotators should know exactly what each criterion measures without ambiguity
3. **Use 3-5 scale points** - fewer points reduce cognitive load; more than 7 points rarely improves reliability
4. **Provide anchor examples** - in the description, mention what constitutes each end of the scale
5. **Enable overall score for aggregation** - `show_overall` is useful when you need a single summary metric alongside detailed breakdowns

## Further Reading

- [Likert Scales](/docs/annotation-types/likert-scales) - Single-dimension rating scales
- [Pairwise Comparison](/docs/annotation-types/pairwise-comparison) - Side-by-side evaluation
- [Quality Control](/docs/features/quality-control) - Attention checks and gold standards

For implementation details, see the [source documentation](https://github.com/davidjurgens/potato/blob/main/docs/annotation_schemes.md).
