CodeUltraFeedback: Code Preference Evaluation
Pairwise comparison of code responses with multi-dimensional quality rating. Annotators compare two code solutions to a programming task, select which is better, rate the preferred response on correctness, efficiency, readability, completeness, and instruction following, then justify their preference.
Configuration File: config.yaml
# CodeUltraFeedback: Code Preference Evaluation
# Based on "CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences" (Weyssow et al., arXiv 2024)
annotation_task_name: "CodeUltraFeedback Code Preference"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="font-family: Arial, sans-serif; max-width: 1100px; margin: 0 auto;">
    <div style="background: #dbeafe; border-left: 4px solid #2563eb; padding: 14px 18px; margin-bottom: 16px; border-radius: 4px;">
      <h3 style="margin: 0 0 6px 0; color: #1e3a8a;">Coding Instruction</h3>
      <p style="margin: 0; color: #1e293b; font-size: 15px; line-height: 1.5;">{{text}}</p>
      <span style="display: inline-block; margin-top: 8px; background: #1e3a8a; color: #fff; padding: 2px 10px; border-radius: 12px; font-size: 12px;">{{language}}</span>
    </div>
    <div style="display: flex; gap: 16px; margin-bottom: 16px;">
      <div style="flex: 1; background: #f0fdf4; border: 1px solid #bbf7d0; border-radius: 8px; padding: 16px; overflow-x: auto;">
        <h4 style="margin: 0 0 10px 0; color: #166534;">Response A</h4>
        <pre style="margin: 0; font-family: 'Courier New', monospace; font-size: 13px; line-height: 1.5; white-space: pre-wrap; color: #1e293b;">{{response_a}}</pre>
      </div>
      <div style="flex: 1; background: #faf5ff; border: 1px solid #e9d5ff; border-radius: 8px; padding: 16px; overflow-x: auto;">
        <h4 style="margin: 0 0 10px 0; color: #6b21a8;">Response B</h4>
        <pre style="margin: 0; font-family: 'Courier New', monospace; font-size: 13px; line-height: 1.5; white-space: pre-wrap; color: #1e293b;">{{response_b}}</pre>
      </div>
    </div>
    <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 14px 18px;">
      <p style="margin: 0; color: #64748b; font-size: 13px;">Compare both responses and rate their quality on each dimension below. Then select your overall preference.</p>
    </div>
  </div>
annotation_schemes:
  - name: "preference"
    annotation_type: pairwise
    description: "Which response is better overall?"
    labels:
      - "Response A is better"
      - "Response B is better"
      - "Tie — roughly equal quality"
  - name: "quality_dimensions"
    annotation_type: multirate
    description: "Rate the PREFERRED response on each dimension."
    labels:
      - "1 - Very Poor"
      - "2 - Poor"
      - "3 - Average"
      - "4 - Good"
      - "5 - Excellent"
    options:
      - "Correctness"
      - "Efficiency"
      - "Readability"
      - "Completeness"
      - "Instruction Following"
  - name: "preference_rationale"
    annotation_type: text
    description: "Explain why you prefer one response over the other."
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
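The layout template above references `{{text}}`, `{{language}}`, `{{response_a}}`, and `{{response_b}}`, so every data item must carry those fields alongside the `id` key. A minimal sketch of a pre-flight check (a hypothetical helper, not part of Potato itself) that validates a data file against the keys the config and layout assume:

```python
import json

# Fields referenced by item_properties ("id", "text") and by the
# {{...}} placeholders in html_layout ("language", "response_a", "response_b").
REQUIRED_KEYS = {"id", "text", "language", "response_a", "response_b"}

def validate_items(path):
    """Return a list of (item_id, missing_keys) for items missing any field."""
    with open(path) as f:
        items = json.load(f)
    problems = []
    for item in items:
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append((item.get("id", "<no id>"), sorted(missing)))
    return problems
```

Running this before `potato start` surfaces malformed items early, instead of annotators hitting blank template slots mid-session.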
Sample Data: sample-data.json
[
  {
    "id": "cuf-001",
    "text": "Write a Python function that implements binary search on a sorted list. The function should return the index of the target element, or -1 if not found.",
    "language": "python",
    "response_a": "def binary_search(arr, target):\n    left, right = 0, len(arr) - 1\n    while left <= right:\n        mid = left + (right - left) // 2\n        if arr[mid] == target:\n            return mid\n        elif arr[mid] < target:\n            left = mid + 1\n        else:\n            right = mid - 1\n    return -1",
    "response_b": "def binary_search(arr, target):\n    for i in range(len(arr)):\n        if arr[i] == target:\n            return i\n    return -1"
  },
  {
    "id": "cuf-002",
    "text": "Create a JavaScript function that debounces another function, ensuring it only executes after a specified delay since the last call.",
    "language": "javascript",
    "response_a": "function debounce(fn, delay) {\n  let timer;\n  return function(...args) {\n    clearTimeout(timer);\n    timer = setTimeout(() => fn.apply(this, args), delay);\n  };\n}",
    "response_b": "function debounce(func, wait) {\n  let timeoutId = null;\n  let lastArgs = null;\n\n  function debounced(...args) {\n    lastArgs = args;\n    if (timeoutId !== null) {\n      clearTimeout(timeoutId);\n    }\n    timeoutId = setTimeout(() => {\n      func.apply(this, lastArgs);\n      timeoutId = null;\n      lastArgs = null;\n    }, wait);\n  }\n\n  debounced.cancel = function() {\n    if (timeoutId !== null) {\n      clearTimeout(timeoutId);\n      timeoutId = null;\n      lastArgs = null;\n    }\n  };\n\n  debounced.flush = function() {\n    if (timeoutId !== null) {\n      clearTimeout(timeoutId);\n      func.apply(this, lastArgs);\n      timeoutId = null;\n      lastArgs = null;\n    }\n  };\n\n  return debounced;\n}"
  }
]
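Item cuf-001 illustrates a common judging pattern: both responses are functionally correct, but Response A is O(log n) while Response B is a linear scan that ignores the sorted-list premise, so they should diverge on the Efficiency and Instruction Following dimensions rather than Correctness. A quick harness an annotator might run to confirm the correctness dimension before rating (function bodies copied from the sample item, with names disambiguated):

```python
def binary_search_a(arr, target):
    # Response A: classic binary search over a sorted list, O(log n)
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = left + (right - left) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1

def binary_search_b(arr, target):
    # Response B: linear scan, O(n) -- correct, but never uses the sorted property
    for i in range(len(arr)):
        if arr[i] == target:
            return i
    return -1

# Both implementations agree on every hit and miss over a small sorted list.
data = [1, 3, 5, 7, 9, 11]
for t in range(13):
    assert binary_search_a(data, t) == binary_search_b(data, t)
```

Since correctness ties here, the preference and rationale hinge on the other dimensions, which is exactly the distinction the multirate scheme is designed to capture.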
// ... and 6 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/codeultrafeedback-code-preference
potato start config.yaml
Found an issue or want to improve this design?
Open an Issue

Related Designs
DevBench Repository Evaluation
Evaluate AI-generated repositories across the full software development lifecycle. Annotators rate architecture design, code quality, test coverage, documentation, and dependency management for generated projects.
SWE-Bench+ Patch Screening
Screen and compare model-generated patches against gold patches for SWE-Bench+ instances. Annotators evaluate correctness, identify specific issues, and compare model vs. gold solutions side-by-side.
TrajEval Staged Evaluation
Evaluate code agent trajectories decomposed into search, edit, and verification stages, rating quality of each stage and determining overall pass/fail verdict.