advanced, preference

CodeUltraFeedback: Code Preference Evaluation

Pairwise comparison of code responses with multi-dimensional quality rating. Annotators compare two code solutions to a programming task, select which is better (or mark a tie), rate the preferred response on correctness, efficiency, readability, completeness, and instruction following, and then justify their preference.


Configuration File: config.yaml

# CodeUltraFeedback: Code Preference Evaluation
# Based on "CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences" (Weyssow et al., arXiv 2024)

annotation_task_name: "CodeUltraFeedback Code Preference"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

html_layout: |
  <div class="container" style="font-family: Arial, sans-serif; max-width: 1100px; margin: 0 auto;">
    <div style="background: #dbeafe; border-left: 4px solid #2563eb; padding: 14px 18px; margin-bottom: 16px; border-radius: 4px;">
      <h3 style="margin: 0 0 6px 0; color: #1e3a8a;">Coding Instruction</h3>
      <p style="margin: 0; color: #1e293b; font-size: 15px; line-height: 1.5;">{{text}}</p>
      <span style="display: inline-block; margin-top: 8px; background: #1e3a8a; color: #fff; padding: 2px 10px; border-radius: 12px; font-size: 12px;">{{language}}</span>
    </div>

    <div style="display: flex; gap: 16px; margin-bottom: 16px;">
      <div style="flex: 1; background: #f0fdf4; border: 1px solid #bbf7d0; border-radius: 8px; padding: 16px; overflow-x: auto;">
        <h4 style="margin: 0 0 10px 0; color: #166534;">Response A</h4>
        <pre style="margin: 0; font-family: 'Courier New', monospace; font-size: 13px; line-height: 1.5; white-space: pre-wrap; color: #1e293b;">{{response_a}}</pre>
      </div>
      <div style="flex: 1; background: #faf5ff; border: 1px solid #e9d5ff; border-radius: 8px; padding: 16px; overflow-x: auto;">
        <h4 style="margin: 0 0 10px 0; color: #6b21a8;">Response B</h4>
        <pre style="margin: 0; font-family: 'Courier New', monospace; font-size: 13px; line-height: 1.5; white-space: pre-wrap; color: #1e293b;">{{response_b}}</pre>
      </div>
    </div>

    <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 14px 18px;">
      <p style="margin: 0; color: #64748b; font-size: 13px;">Compare both responses and select your overall preference. Then rate the preferred response on each dimension below and explain your choice.</p>
    </div>
  </div>

annotation_schemes:
  - name: "preference"
    annotation_type: pairwise
    description: "Which response is better overall?"
    labels:
      - "Response A is better"
      - "Response B is better"
      - "Tie — roughly equal quality"

  - name: "quality_dimensions"
    annotation_type: multirate
    description: "Rate the PREFERRED response on each dimension."
    labels:
      - "1 - Very Poor"
      - "2 - Poor"
      - "3 - Average"
      - "4 - Good"
      - "5 - Excellent"
    options:
      - "Correctness"
      - "Efficiency"
      - "Readability"
      - "Completeness"
      - "Instruction Following"

  - name: "preference_rationale"
    annotation_type: text
    description: "Explain why you prefer one response over the other."

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
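
If `annotation_per_instance: 2` means each item receives labels from two annotators, a natural follow-up is to measure how often they agree on the `preference` scheme. The sketch below computes Cohen's kappa over two parallel lists of labels. It does not read Potato's output files: the function name `cohen_kappa` and the in-memory input lists are assumptions made for illustration, and loading the labels from `annotation_output/` is left to the reader.

```python
from collections import Counter


def cohen_kappa(labels_1, labels_2):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("both annotators must label the same non-empty set of items")

    n = len(labels_1)
    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n

    # Expected chance agreement, from each annotator's own label distribution.
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    expected = sum(counts_1[label] * counts_2[label] for label in counts_1) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four items, using the label strings from the config.
annotator_1 = ["Response A is better", "Response A is better",
               "Response B is better", "Response B is better"]
annotator_2 = ["Response A is better", "Response B is better",
               "Response B is better", "Response B is better"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.5
```

Kappa corrects raw agreement for chance, which matters here because a three-label scheme with a rarely used tie option can show high raw agreement even when annotators are close to guessing.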

Sample Data: sample-data.json

[
  {
    "id": "cuf-001",
    "text": "Write a Python function that implements binary search on a sorted list. The function should return the index of the target element, or -1 if not found.",
    "language": "python",
    "response_a": "def binary_search(arr, target):\n    left, right = 0, len(arr) - 1\n    while left <= right:\n        mid = left + (right - left) // 2\n        if arr[mid] == target:\n            return mid\n        elif arr[mid] < target:\n            left = mid + 1\n        else:\n            right = mid - 1\n    return -1",
    "response_b": "def binary_search(arr, target):\n    for i in range(len(arr)):\n        if arr[i] == target:\n            return i\n    return -1"
  },
  {
    "id": "cuf-002",
    "text": "Create a JavaScript function that debounces another function, ensuring it only executes after a specified delay since the last call.",
    "language": "javascript",
    "response_a": "function debounce(fn, delay) {\n    let timer;\n    return function(...args) {\n        clearTimeout(timer);\n        timer = setTimeout(() => fn.apply(this, args), delay);\n    };\n}",
    "response_b": "function debounce(func, wait) {\n    let timeoutId = null;\n    let lastArgs = null;\n\n    function debounced(...args) {\n        lastArgs = args;\n        if (timeoutId !== null) {\n            clearTimeout(timeoutId);\n        }\n        timeoutId = setTimeout(() => {\n            func.apply(this, lastArgs);\n            timeoutId = null;\n            lastArgs = null;\n        }, wait);\n    }\n\n    debounced.cancel = function() {\n        if (timeoutId !== null) {\n            clearTimeout(timeoutId);\n            timeoutId = null;\n            lastArgs = null;\n        }\n    };\n\n    debounced.flush = function() {\n        if (timeoutId !== null) {\n            clearTimeout(timeoutId);\n            func.apply(this, lastArgs);\n            timeoutId = null;\n            lastArgs = null;\n        }\n    };\n\n    return debounced;\n}"
  }
]

// ... and 6 more items
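
The layout in `config.yaml` references `{{text}}`, `{{language}}`, `{{response_a}}`, and `{{response_b}}`, so every item in the data file needs those keys. Before launching the task, a quick check like the following can catch items that would render with an empty panel. This is a minimal sketch, not part of Potato: the helper names `placeholders` and `missing_fields` are hypothetical, and the layout string and items below are shortened stand-ins for the real files.

```python
import re

# Matches {{name}} placeholders, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def placeholders(layout):
    """Return the set of {{name}} placeholders found in an HTML layout string."""
    return set(PLACEHOLDER.findall(layout))


def missing_fields(layout, items, id_key="id"):
    """Map each incomplete item's id to the sorted placeholder names it lacks."""
    required = placeholders(layout)
    report = {}
    for index, item in enumerate(items):
        missing = sorted(required - set(item))
        if missing:
            # Fall back to the list position when the item has no id.
            report[item.get(id_key, f"<item {index}>")] = missing
    return report


# Shortened stand-in for the html_layout value in config.yaml.
layout = (
    "<p>{{text}}</p><span>{{language}}</span>"
    "<pre>{{response_a}}</pre><pre>{{response_b}}</pre>"
)

# Hypothetical items: the second one is missing response_b on purpose.
items = [
    {"id": "cuf-001", "text": "...", "language": "python",
     "response_a": "...", "response_b": "..."},
    {"id": "cuf-bad", "text": "...", "language": "python",
     "response_a": "..."},
]

print(missing_fields(layout, items))  # {'cuf-bad': ['response_b']}
```

In practice the `items` list would come from `json.load` on `sample-data.json`, and the layout string from the `html_layout` value in the config.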

Get This Design

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/codeultrafeedback-code-preference
potato start config.yaml

Details

Annotation Types

pairwise, multirate, text

Domain

Code Generation, Software Engineering

Use Cases

RLHF, Code Quality Assessment

Tags

code, pairwise-comparison, preference-learning, llm-alignment, software-engineering
