CodeUltraFeedback: Code Preference Evaluation
Pairwise comparison of code responses with multi-dimensional quality rating. Annotators compare two code solutions to a programming task, select the better one, rate the preferred response on correctness, efficiency, readability, completeness, and instruction following, then justify their preference.
Configuration File: config.yaml
This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.
# CodeUltraFeedback: Code Preference Evaluation
# Based on "CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences" (Weyssow et al., arXiv 2024)
annotation_task_name: "CodeUltraFeedback Code Preference"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="font-family: Arial, sans-serif; max-width: 1100px; margin: 0 auto;">
    <div style="background: #dbeafe; border-left: 4px solid #2563eb; padding: 14px 18px; margin-bottom: 16px; border-radius: 4px;">
      <h3 style="margin: 0 0 6px 0; color: #1e3a8a;">Coding Instruction</h3>
      <p style="margin: 0; color: #1e293b; font-size: 15px; line-height: 1.5;">{{text}}</p>
      <span style="display: inline-block; margin-top: 8px; background: #1e3a8a; color: #fff; padding: 2px 10px; border-radius: 12px; font-size: 12px;">{{language}}</span>
    </div>
    <div style="display: flex; gap: 16px; margin-bottom: 16px;">
      <div style="flex: 1; background: #f0fdf4; border: 1px solid #bbf7d0; border-radius: 8px; padding: 16px; overflow-x: auto;">
        <h4 style="margin: 0 0 10px 0; color: #166534;">Response A</h4>
        <pre style="margin: 0; font-family: 'Courier New', monospace; font-size: 13px; line-height: 1.5; white-space: pre-wrap; color: #1e293b;">{{response_a}}</pre>
      </div>
      <div style="flex: 1; background: #faf5ff; border: 1px solid #e9d5ff; border-radius: 8px; padding: 16px; overflow-x: auto;">
        <h4 style="margin: 0 0 10px 0; color: #6b21a8;">Response B</h4>
        <pre style="margin: 0; font-family: 'Courier New', monospace; font-size: 13px; line-height: 1.5; white-space: pre-wrap; color: #1e293b;">{{response_b}}</pre>
      </div>
    </div>
    <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 14px 18px;">
      <p style="margin: 0; color: #64748b; font-size: 13px;">Compare both responses and rate their quality on each dimension below. Then select your overall preference.</p>
    </div>
  </div>
annotation_schemes:
  - name: "preference"
    annotation_type: pairwise
    description: "Which response is better overall?"
    labels:
      - "Response A is better"
      - "Response B is better"
      - "Tie — roughly equal quality"
  - name: "quality_dimensions"
    annotation_type: multirate
    description: "Rate the PREFERRED response on each dimension."
    labels:
      - "1 - Very Poor"
      - "2 - Poor"
      - "3 - Average"
      - "4 - Good"
      - "5 - Excellent"
    options:
      - "Correctness"
      - "Efficiency"
      - "Readability"
      - "Completeness"
      - "Instruction Following"
  - name: "preference_rationale"
    annotation_type: text
    description: "Explain why you prefer one response over the other."
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
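The last two settings determine how much labeling the task requires. As a rough sketch of that arithmetic, the Python snippet below estimates the total number of annotations and the smallest annotator pool that could cover a dataset. The function name `annotation_budget` is hypothetical, and the calculation assumes each annotator labels a given item at most once; Potato's actual assignment logic may differ.

```python
import math


def annotation_budget(num_items, annotation_per_instance, instances_per_annotator):
    """Return (total_annotations, minimum_annotators) for a labeling run.

    Assumption (not taken from Potato itself): an annotator labels
    each item at most once.
    """
    total_annotations = num_items * annotation_per_instance
    # An annotator cannot label more items than exist, whatever the cap says.
    per_annotator = min(instances_per_annotator, num_items)
    # At least `annotation_per_instance` distinct people are needed per item.
    minimum_annotators = max(
        annotation_per_instance,
        math.ceil(total_annotations / per_annotator),
    )
    return total_annotations, minimum_annotators


print(annotation_budget(8, 2, 50))     # (16, 2)
print(annotation_budget(1000, 2, 50))  # (2000, 40)
```

With the values in this config and the 8 sample items, two annotators who each label every item would be enough; the cap of 50 only starts to matter for larger datasets.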
Sample Data: sample-data.json
[
  {
    "id": "cuf-001",
    "text": "Write a Python function that implements binary search on a sorted list. The function should return the index of the target element, or -1 if not found.",
    "language": "python",
    "response_a": "def binary_search(arr, target):\n    left, right = 0, len(arr) - 1\n    while left <= right:\n        mid = left + (right - left) // 2\n        if arr[mid] == target:\n            return mid\n        elif arr[mid] < target:\n            left = mid + 1\n        else:\n            right = mid - 1\n    return -1",
    "response_b": "def binary_search(arr, target):\n    for i in range(len(arr)):\n        if arr[i] == target:\n            return i\n    return -1"
  },
  {
    "id": "cuf-002",
    "text": "Create a JavaScript function that debounces another function, ensuring it only executes after a specified delay since the last call.",
    "language": "javascript",
    "response_a": "function debounce(fn, delay) {\n  let timer;\n  return function(...args) {\n    clearTimeout(timer);\n    timer = setTimeout(() => fn.apply(this, args), delay);\n  };\n}",
    "response_b": "function debounce(func, wait) {\n  let timeoutId = null;\n  let lastArgs = null;\n\n  function debounced(...args) {\n    lastArgs = args;\n    if (timeoutId !== null) {\n      clearTimeout(timeoutId);\n    }\n    timeoutId = setTimeout(() => {\n      func.apply(this, lastArgs);\n      timeoutId = null;\n      lastArgs = null;\n    }, wait);\n  }\n\n  debounced.cancel = function() {\n    if (timeoutId !== null) {\n      clearTimeout(timeoutId);\n      timeoutId = null;\n      lastArgs = null;\n    }\n  };\n\n  debounced.flush = function() {\n    if (timeoutId !== null) {\n      clearTimeout(timeoutId);\n      func.apply(this, lastArgs);\n      timeoutId = null;\n      lastArgs = null;\n    }\n  };\n\n  return debounced;\n}"
  }
]
// ... and 6 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/codeultrafeedback-code-preference
potato start config.yaml
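Before starting the server, it can help to confirm that every item in the data file supplies the fields the layout refers to. The Python sketch below is one way to do that; it is not part of Potato. The key set is read off this page's config (`id_key`, `text_key`, and the `{{language}}`, `{{response_a}}`, `{{response_b}}` placeholders in `html_layout`), and the helper names `find_placeholders`, `missing_fields`, and `check_file` are hypothetical.

```python
import json
import re

# Assumed from the config on this page: id_key, text_key, and the
# {{...}} placeholders used in html_layout.
REQUIRED_KEYS = {"id", "text", "language", "response_a", "response_b"}


def find_placeholders(template):
    """Return the set of names used as {{name}} placeholders in a template."""
    return set(re.findall(r"\{\{\s*(\w+)\s*\}\}", template))


def missing_fields(items, required=REQUIRED_KEYS):
    """Map each item's id (or list position) to the required keys it lacks."""
    problems = {}
    for position, item in enumerate(items):
        missing = sorted(set(required) - set(item))
        if missing:
            problems[item.get("id", f"#{position}")] = missing
    return problems


def check_file(path):
    """Load a JSON list of items from `path` and report missing fields."""
    with open(path, encoding="utf-8") as handle:
        return missing_fields(json.load(handle))
```

Calling `check_file("sample-data.json")` from the design directory returns an empty dict when every item is complete. `find_placeholders` can be pointed at the `html_layout` string to derive the required keys instead of hard-coding them.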
Dataset & paper
Weyssow et al., arXiv 2024
Citation (BibTeX)
@article{weyssow2024codeultrafeedback,
  title={CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences},
  author={Weyssow, Martin and Kamanda, Aton and Zhou, Xin and Sahraoui, Houari},
  journal={arXiv preprint arXiv:2403.09032},
  year={2024}
}
Related Designs
DevBench Repository Evaluation
Evaluate AI-generated repositories across the full software development lifecycle. Annotators rate architecture design, code quality, test coverage, documentation, and dependency management for generated projects.
SWE-Bench+ Patch Screening
Screen and compare model-generated patches against gold patches for SWE-Bench+ instances. Annotators evaluate correctness, identify specific issues, and compare model vs. gold solutions side-by-side.
TrajEval Staged Evaluation
Evaluate code agent trajectories decomposed into search, edit, and verification stages, rating quality of each stage and determining overall pass/fail verdict.