CodePRM Code Process Reward
Process reward annotation for step-by-step code generation with execution feedback. Annotators verify each incremental code generation step against problem requirements, check syntactic validity, and provide feedback informed by test execution results.
Configuration File (config.yaml)
# CodePRM Code Process Reward
# Based on "CodePRM: Execution Feedback-enhanced Process Reward Model for Code Generation" (Li et al., ACL 2025 Findings)
annotation_task_name: "CodePRM Code Process Reward"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="max-width: 850px; margin: 0 auto; font-family: 'Segoe UI', Arial, sans-serif;">
    <div style="background: #f0f4ff; border: 1px solid #b0c4de; border-radius: 8px; padding: 18px 22px; margin-bottom: 18px;">
      <h3 style="margin: 0 0 10px 0; color: #2c3e6b; font-size: 16px;">Problem Description</h3>
      <div style="font-size: 14px; color: #2c3e50; line-height: 1.6;">{{text}}</div>
    </div>
    <div style="margin-bottom: 18px;">
      <h3 style="margin: 0 0 12px 0; color: #2c3e50; font-size: 15px;">Incremental Code Steps</h3>
      <div style="background: #fafafa; border: 1px solid #e0e0e0; border-radius: 6px; padding: 14px 18px; font-family: 'Courier New', monospace; font-size: 13px; white-space: pre-wrap; line-height: 1.6;">{{steps}}</div>
    </div>
    <div style="background: #1a1a2e; border-radius: 8px; padding: 16px 20px; margin-bottom: 10px;">
      <h3 style="margin: 0 0 8px 0; color: #00e676; font-size: 14px; font-family: 'Courier New', monospace;">$ python run_tests.py</h3>
      <div style="font-family: 'Courier New', monospace; font-size: 13px; color: #00e676; white-space: pre-wrap; line-height: 1.5;">{{test_output}}</div>
    </div>
  </div>
annotation_schemes:
  - name: "step_correctness"
    annotation_type: radio
    description: "Verify each code generation step against the problem requirements."
    labels:
      - "Correct — code step is valid"
      - "Neutral — step doesn't affect outcome"
      - "Incorrect — step introduces a bug"
    keyboard_shortcuts:
      "Correct — code step is valid": "1"
      "Neutral — step doesn't affect outcome": "2"
      "Incorrect — step introduces a bug": "3"
  - name: "code_compiles"
    annotation_type: radio
    description: "Is the code in this step syntactically valid?"
    labels:
      - "Yes — code is syntactically valid"
      - "No — syntax error present"
      - "N/A — not a code step"
    keyboard_shortcuts:
      "Yes — code is syntactically valid": "4"
      "No — syntax error present": "5"
      "N/A — not a code step": "6"
  - name: "step_feedback"
    annotation_type: text
    description: "Provide feedback on the current step."
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
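The layout renders the `{{text}}`, `{{steps}}`, and `{{test_output}}` placeholders, so every data item must supply those fields alongside the `id` named by `id_key`. A short pre-flight check can catch items that would render with blank panels. A minimal sketch; `validate_items` is a hypothetical helper, not part of Potato:

```python
import json

# Fields referenced by the config (id_key, text_key) and by the
# {{steps}} and {{test_output}} placeholders in html_layout.
REQUIRED_KEYS = {"id", "text", "steps", "test_output"}

def validate_items(path):
    """Return a list of (item_id, missing_keys) pairs; empty means all items are complete."""
    with open(path) as f:
        items = json.load(f)
    problems = []
    for i, item in enumerate(items):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append((item.get("id", f"item-{i}"), sorted(missing)))
    return problems
```

Running `validate_items("sample-data.json")` before `potato start` surfaces incomplete items early instead of as empty boxes in the annotation UI.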
Sample Data (sample-data.json)
[
{
"id": "codeprm-001",
"text": "Write a function `merge_sorted(a, b)` that merges two sorted lists into one sorted list without using built-in sort functions.",
"steps": "Step 1: Define the function signature and initialize pointers.\n def merge_sorted(a, b):\n result = []\n i, j = 0, 0\n\nStep 2: Add the main merge loop comparing elements from both lists.\n while i < len(a) and j < len(b):\n if a[i] <= b[j]:\n result.append(a[i])\n i += 1\n else:\n result.append(b[j])\n j += 1\n\nStep 3: Append remaining elements from both lists.\n result.extend(a[i:])\n result.extend(b[j:])\n return result",
"test_output": "test_merge_empty_lists .............. PASS\ntest_merge_single_elements .......... PASS\ntest_merge_different_lengths ........ PASS\ntest_merge_duplicates ............... PASS\n\n4 passed, 0 failed"
},
{
"id": "codeprm-002",
"text": "Implement a function `is_balanced(s)` that checks whether a string of brackets '()[]{}' is balanced.",
"steps": "Step 1: Define the function and the bracket mappings.\n def is_balanced(s):\n stack = []\n mapping = {')': '(', ']': '[', '}': '{'}\n\nStep 2: Iterate through each character and push opening brackets.\n for char in s:\n if char in mapping.values():\n stack.append(char)\n\nStep 3: For closing brackets, check the stack.\n elif char in mapping:\n if not stack or stack.pop() != mapping[char]:\n return False\n\nStep 4: Return whether the stack is empty.\n return len(stack) == 0",
"test_output": "test_empty_string ................... PASS\ntest_simple_balanced ................ PASS\ntest_nested_balanced ................ PASS\ntest_unbalanced_open ................ PASS\ntest_unbalanced_close ............... PASS\ntest_mixed_balanced ................. PASS\n\n6 passed, 0 failed"
}
]
// ... and 6 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/codeprm-code-process-reward
potato start config.yaml
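The `steps` strings in the sample data interleave `Step N:` descriptions with indented code, so a pre-annotation pass could pre-fill the `code_compiles` judgment by parsing the code accumulated up to each step. A rough sketch, assuming the `Step N:` format shown above; `check_steps_compile` is a hypothetical helper, not part of Potato or CodePRM:

```python
import ast
import re
import textwrap

def check_steps_compile(steps_text):
    """Accumulate the code following each "Step N:" header and report
    whether the code so far parses as Python (a heuristic stand-in for
    the code_compiles annotation)."""
    # Everything between one "Step N: <description>" line and the next
    # is treated as that step's code.
    blocks = re.split(r"(?m)^Step \d+:.*$", steps_text)[1:]
    results = []
    code_so_far = ""
    for n, block in enumerate(blocks, start=1):
        code_so_far += block
        try:
            ast.parse(textwrap.dedent(code_so_far))
            results.append((n, "valid"))
        except SyntaxError:
            results.append((n, "syntax error"))
    return results
```

Because the steps are incremental, each step is parsed together with all preceding code; a step that only continues an earlier block (e.g. an `elif` clause) is judged in context rather than in isolation. Annotators would still resolve the `N/A — not a code step` cases by hand.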
Related Designs
BigCodeBench Human Baseline Evaluation
Evaluate agent-generated code solutions for BigCodeBench tasks. Annotators assess correctness against test suites, rate task complexity, evaluate code quality, and provide notes on the solution approach.
Code Generation Evaluation (HumanEval)
Evaluation of LLM-generated code based on the HumanEval benchmark. Annotators assess functional correctness, code quality, and efficiency of generated Python functions, and provide explanations of errors and improvement suggestions, supporting research in code generation and LLM evaluation.
DevBench Repository Evaluation
Evaluate AI-generated repositories across the full software development lifecycle. Annotators rate architecture design, code quality, test coverage, documentation, and dependency management for generated projects.