CodeUltraFeedback: Code Preference Evaluation

Pairwise comparison of code responses with multi-dimensional quality rating. Annotators compare two code solutions to a programming task, select which is better, rate both on correctness, efficiency, readability, completeness, and instruction following, then justify their preference.

Configuration File: config.yaml

This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.

yaml

# CodeUltraFeedback: Code Preference Evaluation
# Based on "CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences" (Weyssow et al., arXiv 2024)

annotation_task_name: "CodeUltraFeedback Code Preference"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

html_layout: |
  <div class="container" style="font-family: Arial, sans-serif; max-width: 1100px; margin: 0 auto;">
    <div style="background: #dbeafe; border-left: 4px solid #2563eb; padding: 14px 18px; margin-bottom: 16px; border-radius: 4px;">
      <h3 style="margin: 0 0 6px 0; color: #1e3a8a;">Coding Instruction</h3>
      <p style="margin: 0; color: #1e293b; font-size: 15px; line-height: 1.5;">{{text}}</p>
      <span style="display: inline-block; margin-top: 8px; background: #1e3a8a; color: #fff; padding: 2px 10px; border-radius: 12px; font-size: 12px;">{{language}}</span>
    </div>

    <div style="display: flex; gap: 16px; margin-bottom: 16px;">
      <div style="flex: 1; background: #f0fdf4; border: 1px solid #bbf7d0; border-radius: 8px; padding: 16px; overflow-x: auto;">
        <h4 style="margin: 0 0 10px 0; color: #166534;">Response A</h4>
        <pre style="margin: 0; font-family: 'Courier New', monospace; font-size: 13px; line-height: 1.5; white-space: pre-wrap; color: #1e293b;">{{response_a}}</pre>
      </div>
      <div style="flex: 1; background: #faf5ff; border: 1px solid #e9d5ff; border-radius: 8px; padding: 16px; overflow-x: auto;">
        <h4 style="margin: 0 0 10px 0; color: #6b21a8;">Response B</h4>
        <pre style="margin: 0; font-family: 'Courier New', monospace; font-size: 13px; line-height: 1.5; white-space: pre-wrap; color: #1e293b;">{{response_b}}</pre>
      </div>
    </div>

    <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 14px 18px;">
      <p style="margin: 0; color: #64748b; font-size: 13px;">Compare both responses and rate their quality on each dimension below. Then select your overall preference.</p>
    </div>
  </div>

annotation_schemes:
  - name: "preference"
    annotation_type: pairwise
    description: "Which response is better overall?"
    labels:
      - "Response A is better"
      - "Response B is better"
      - "Tie — roughly equal quality"

  - name: "quality_dimensions"
    annotation_type: multirate
    description: "Rate the PREFERRED response on each dimension."
    labels:
      - "1 - Very Poor"
      - "2 - Poor"
      - "3 - Average"
      - "4 - Good"
      - "5 - Excellent"
    options:
      - "Correctness"
      - "Efficiency"
      - "Readability"
      - "Completeness"
      - "Instruction Following"

  - name: "preference_rationale"
    annotation_type: text
    description: "Explain why you prefer one response over the other."

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
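The last two settings determine how much annotation work the task generates. The sketch below works through that arithmetic. It assumes `annotation_per_instance` is the number of distinct annotators each item should receive and `instances_per_annotator` is a per-person cap; the helper name `annotation_workload` is hypothetical and not part of Potato.

```python
import math


def annotation_workload(num_items, annotation_per_instance, instances_per_annotator):
    """Return (total_judgments, min_annotators) under the assumptions above."""
    total = num_items * annotation_per_instance
    # An annotator sees a given item at most once, so at least
    # `annotation_per_instance` distinct people are needed.
    by_capacity = math.ceil(total / instances_per_annotator)
    return total, max(annotation_per_instance, by_capacity)


# 8 items: the 2 shown in the sample data plus the "6 more" it mentions.
print(annotation_workload(8, 2, 50))  # (16, 2)
```

With this configuration the 50-item cap never binds for the sample data; it only matters once the dataset grows past 50 items.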

Sample Data: sample-data.json

json

[
  {
    "id": "cuf-001",
    "text": "Write a Python function that implements binary search on a sorted list. The function should return the index of the target element, or -1 if not found.",
    "language": "python",
    "response_a": "def binary_search(arr, target):\n    left, right = 0, len(arr) - 1\n    while left <= right:\n        mid = left + (right - left) // 2\n        if arr[mid] == target:\n            return mid\n        elif arr[mid] < target:\n            left = mid + 1\n        else:\n            right = mid - 1\n    return -1",
    "response_b": "def binary_search(arr, target):\n    for i in range(len(arr)):\n        if arr[i] == target:\n            return i\n    return -1"
  },
  {
    "id": "cuf-002",
    "text": "Create a JavaScript function that debounces another function, ensuring it only executes after a specified delay since the last call.",
    "language": "javascript",
    "response_a": "function debounce(fn, delay) {\n    let timer;\n    return function(...args) {\n        clearTimeout(timer);\n        timer = setTimeout(() => fn.apply(this, args), delay);\n    };\n}",
    "response_b": "function debounce(func, wait) {\n    let timeoutId = null;\n    let lastArgs = null;\n\n    function debounced(...args) {\n        lastArgs = args;\n        if (timeoutId !== null) {\n            clearTimeout(timeoutId);\n        }\n        timeoutId = setTimeout(() => {\n            func.apply(this, lastArgs);\n            timeoutId = null;\n            lastArgs = null;\n        }, wait);\n    }\n\n    debounced.cancel = function() {\n        if (timeoutId !== null) {\n            clearTimeout(timeoutId);\n            timeoutId = null;\n            lastArgs = null;\n        }\n    };\n\n    debounced.flush = function() {\n        if (timeoutId !== null) {\n            clearTimeout(timeoutId);\n            func.apply(this, lastArgs);\n            timeoutId = null;\n            lastArgs = null;\n        }\n    };\n\n    return debounced;\n}"
  }
]

// ... and 6 more items
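Each item has to supply every field that the `html_layout` template references (`text`, `language`, `response_a`, `response_b`). A minimal sketch of a pre-flight check follows, assuming the template uses only simple `{{name}}` placeholders; the helper `missing_fields` and the shortened template are illustrative and not part of Potato.

```python
import re

# Matches {{name}} with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def missing_fields(template, items):
    """Map each item id to the template placeholders it does not supply."""
    needed = set(PLACEHOLDER.findall(template))
    problems = {}
    for item in items:
        missing = sorted(needed - set(item))
        if missing:
            problems[item.get("id", "<no id>")] = missing
    return problems


# Shortened stand-in for the html_layout in config.yaml.
template = (
    "<p>{{text}}</p><span>{{language}}</span>"
    "<pre>{{response_a}}</pre><pre>{{response_b}}</pre>"
)
# Stand-in items; the second one deliberately lacks response_b.
items = [
    {"id": "cuf-001", "text": "t", "language": "python",
     "response_a": "a", "response_b": "b"},
    {"id": "broken-example", "text": "t", "language": "python",
     "response_a": "a"},
]
print(missing_fields(template, items))  # {'broken-example': ['response_b']}
```

In practice the template string would come from the `html_layout` key and the items from `json.load` on sample-data.json.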

Get This Design

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/codeultrafeedback-code-preference
potato start config.yaml

Dataset & paper

Weyssow et al., arXiv 2024

Citation (BibTeX)

bibtex

@article{weyssow2024codeultrafeedback,
  title={CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences},
  author={Weyssow, Martin and Kamanda, Aton and Zhou, Xin and Sahraoui, Houari},
  journal={arXiv preprint arXiv:2403.09032},
  year={2024}
}

Details

Annotation Types

pairwise, multirate, text

Domain

Code Generation, Software Engineering

Use Cases

RLHF, Code Quality Assessment
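For the RLHF use case, each pairwise judgment can be turned into a prompt/chosen/rejected record. The sketch below assumes the preference arrives as one of the label strings from the `preference` scheme in config.yaml; the function `to_preference_pair` and the output keys are hypothetical, and how Potato actually stores annotations on disk is not shown here.

```python
def to_preference_pair(item, preference):
    """Turn one pairwise judgment into a prompt/chosen/rejected record.

    Returns None for ties, which carry no ranking signal.
    """
    if preference == "Response A is better":
        chosen, rejected = item["response_a"], item["response_b"]
    elif preference == "Response B is better":
        chosen, rejected = item["response_b"], item["response_a"]
    elif preference.startswith("Tie"):
        return None
    else:
        raise ValueError(f"unknown preference label: {preference!r}")
    return {"prompt": item["text"], "chosen": chosen, "rejected": rejected}


# Stand-in item with the same fields as sample-data.json.
example = {"id": "demo", "text": "task", "response_a": "A code", "response_b": "B code"}
print(to_preference_pair(example, "Response B is better"))
```

Dropping ties is one design choice among several; they could also be kept and used to measure how often the two responses are indistinguishable.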

Related Designs

DevBench Repository Evaluation

Evaluate AI-generated repositories across the full software development lifecycle. Annotators rate architecture design, code quality, test coverage, documentation, and dependency management for generated projects.

multirate, radio

SWE-Bench+ Patch Screening

Screen and compare model-generated patches against gold patches for SWE-Bench+ instances. Annotators evaluate correctness, identify specific issues, and compare model vs. gold solutions side-by-side.

radio, multiselect

TrajEval Staged Evaluation

Evaluate code agent trajectories decomposed into search, edit, and verification stages, rating quality of each stage and determining overall pass/fail verdict.