CodePRM Code Process Reward
Process reward annotation for step-by-step code generation with execution feedback. Annotators verify each incremental code generation step against problem requirements, check syntactic validity, and provide feedback informed by test execution results.
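The syntactic-validity question can be pre-screened before an annotator looks at a step. The following is a minimal sketch and is not part of the Potato config or the CodePRM paper. It assumes each step's code is available as a separate string with its original indentation, and it treats a step as valid when all code up to and including that step parses with Python's `ast.parse`. The helper name `prefix_syntax_status` is hypothetical.

```python
import ast


def prefix_syntax_status(step_snippets):
    """Return one bool per step: does the code accumulated so far parse?"""
    statuses = []
    prefix = ""
    for snippet in step_snippets:
        # Steps are incremental, so each one is checked together with
        # everything that came before it.
        prefix += snippet.rstrip("\n") + "\n"
        try:
            ast.parse(prefix)
            statuses.append(True)
        except SyntaxError:  # IndentationError is a subclass of SyntaxError
            statuses.append(False)
    return statuses


# Step code from the merge_sorted sample item, with indentation written out.
steps = [
    "def merge_sorted(a, b):\n    result = []\n    i, j = 0, 0",
    "    while i < len(a) and j < len(b):\n"
    "        if a[i] <= b[j]:\n"
    "            result.append(a[i])\n"
    "            i += 1\n"
    "        else:\n"
    "            result.append(b[j])\n"
    "            j += 1",
    "    result.extend(a[i:])\n    result.extend(b[j:])\n    return result",
]
print(prefix_syntax_status(steps))  # [True, True, True]
```

A prefix check like this is only a hint. A step that legitimately ends mid-block (for example, a function signature with no body yet) fails to parse, and once one step fails, every later step fails too, so the annotator's judgment still decides the final label.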
Configuration File: config.yaml
This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.
# CodePRM Code Process Reward
# Based on "CodePRM: Execution Feedback-enhanced Process Reward Model for Code Generation" (Li et al., ACL 2025 Findings)
annotation_task_name: "CodePRM Code Process Reward"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="max-width: 850px; margin: 0 auto; font-family: 'Segoe UI', Arial, sans-serif;">
    <div style="background: #f0f4ff; border: 1px solid #b0c4de; border-radius: 8px; padding: 18px 22px; margin-bottom: 18px;">
      <h3 style="margin: 0 0 10px 0; color: #2c3e6b; font-size: 16px;">Problem Description</h3>
      <div style="font-size: 14px; color: #2c3e50; line-height: 1.6;">{{text}}</div>
    </div>
    <div style="margin-bottom: 18px;">
      <h3 style="margin: 0 0 12px 0; color: #2c3e50; font-size: 15px;">Incremental Code Steps</h3>
      <div style="background: #fafafa; border: 1px solid #e0e0e0; border-radius: 6px; padding: 14px 18px; font-family: 'Courier New', monospace; font-size: 13px; white-space: pre-wrap; line-height: 1.6;">{{steps}}</div>
    </div>
    <div style="background: #1a1a2e; border-radius: 8px; padding: 16px 20px; margin-bottom: 10px;">
      <h3 style="margin: 0 0 8px 0; color: #00e676; font-size: 14px; font-family: 'Courier New', monospace;">$ python run_tests.py</h3>
      <div style="font-family: 'Courier New', monospace; font-size: 13px; color: #00e676; white-space: pre-wrap; line-height: 1.5;">{{test_output}}</div>
    </div>
  </div>
annotation_schemes:
  - name: "step_correctness"
    annotation_type: radio
    description: "Verify each code generation step against the problem requirements."
    labels:
      - "Correct — code step is valid"
      - "Neutral — step doesn't affect outcome"
      - "Incorrect — step introduces a bug"
    keyboard_shortcuts:
      "Correct — code step is valid": "1"
      "Neutral — step doesn't affect outcome": "2"
      "Incorrect — step introduces a bug": "3"
  - name: "code_compiles"
    annotation_type: radio
    description: "Is the code in this step syntactically valid?"
    labels:
      - "Yes — code is syntactically valid"
      - "No — syntax error present"
      - "N/A — not a code step"
    keyboard_shortcuts:
      "Yes — code is syntactically valid": "4"
      "No — syntax error present": "5"
      "N/A — not a code step": "6"
  - name: "step_feedback"
    annotation_type: text
    description: "Provide feedback on the current step."
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
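The layout above references three item fields through `{{text}}`, `{{steps}}`, and `{{test_output}}`. A small pre-flight check can catch data items that lack one of them. This is a sketch under the assumption that each `{{name}}` placeholder is filled from the item field of the same name, which is what the config and sample data suggest; it does not call any Potato API, and `missing_fields` is a hypothetical helper.

```python
import re

# Matches {{name}} placeholders, tolerating spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def missing_fields(html_layout, item):
    """Return the placeholder names used in the layout that the item lacks."""
    wanted = set(PLACEHOLDER.findall(html_layout))
    return sorted(wanted - set(item))


# Shortened stand-in for the html_layout value in config.yaml.
layout = "<div>{{text}}</div><div>{{steps}}</div><div>{{test_output}}</div>"

# Hypothetical item that forgot the test_output field.
item = {"id": "codeprm-001", "text": "Write a function ...", "steps": "Step 1: ..."}

print(missing_fields(layout, item))  # ['test_output']
```

Running such a check over every entry in the data file before starting the server is cheaper than discovering an empty panel partway through an annotation session.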
Sample Data: sample-data.json
[
  {
    "id": "codeprm-001",
    "text": "Write a function `merge_sorted(a, b)` that merges two sorted lists into one sorted list without using built-in sort functions.",
    "steps": "Step 1: Define the function signature and initialize pointers.\n def merge_sorted(a, b):\n result = []\n i, j = 0, 0\n\nStep 2: Add the main merge loop comparing elements from both lists.\n while i < len(a) and j < len(b):\n if a[i] <= b[j]:\n result.append(a[i])\n i += 1\n else:\n result.append(b[j])\n j += 1\n\nStep 3: Append remaining elements from both lists.\n result.extend(a[i:])\n result.extend(b[j:])\n return result",
    "test_output": "test_merge_empty_lists .............. PASS\ntest_merge_single_elements .......... PASS\ntest_merge_different_lengths ........ PASS\ntest_merge_duplicates ............... PASS\n\n4 passed, 0 failed"
  },
  {
    "id": "codeprm-002",
    "text": "Implement a function `is_balanced(s)` that checks whether a string of brackets '()[]{}' is balanced.",
    "steps": "Step 1: Define the function and the bracket mappings.\n def is_balanced(s):\n stack = []\n mapping = {')': '(', ']': '[', '}': '{'}\n\nStep 2: Iterate through each character and push opening brackets.\n for char in s:\n if char in mapping.values():\n stack.append(char)\n\nStep 3: For closing brackets, check the stack.\n elif char in mapping:\n if not stack or stack.pop() != mapping[char]:\n return False\n\nStep 4: Return whether the stack is empty.\n return len(stack) == 0",
    "test_output": "test_empty_string ................... PASS\ntest_simple_balanced ................ PASS\ntest_nested_balanced ................ PASS\ntest_unbalanced_open ................ PASS\ntest_unbalanced_close ............... PASS\ntest_mixed_balanced ................. PASS\n\n6 passed, 0 failed"
  }
]
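Each item packs all steps into one `steps` string and the whole test run into one `test_output` string. For scripts that post-process annotations per step, those strings may need to be split. The sketch below assumes the format seen in the two sample items: step headers of the form `Step N: description` at the start of a line, and a final summary of the form `N passed, M failed`. Other items in the dataset may deviate from this, and both helper names are hypothetical.

```python
import re

STEP_HEADER = re.compile(r"^Step (\d+): (.*)$", re.MULTILINE)
SUMMARY = re.compile(r"(\d+) passed, (\d+) failed")


def split_steps(steps_text):
    """Split a `steps` string into (number, description, code) tuples."""
    matches = list(STEP_HEADER.finditer(steps_text))
    parsed = []
    for k, match in enumerate(matches):
        # A step's code runs from the end of its header line to the next header.
        end = matches[k + 1].start() if k + 1 < len(matches) else len(steps_text)
        code = steps_text[match.end():end].strip("\n")
        parsed.append((int(match.group(1)), match.group(2), code))
    return parsed


def parse_test_summary(test_output):
    """Return (passed, failed) from the summary line, or None if it is absent."""
    match = SUMMARY.search(test_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Shortened excerpt in the same shape as the sample items.
steps_text = (
    "Step 1: Define the function signature and initialize pointers.\n"
    " def merge_sorted(a, b):\n result = []\n i, j = 0, 0\n\n"
    "Step 3: Append remaining elements from both lists.\n"
    " result.extend(a[i:])\n result.extend(b[j:])\n return result"
)
test_output = "test_merge_empty_lists .............. PASS\n\n4 passed, 0 failed"

print([number for number, _, _ in split_steps(steps_text)])  # [1, 3]
print(parse_test_summary(test_output))  # (4, 0)
```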
// ... and 6 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/codeprm-code-process-reward
potato start config.yaml
Dataset & paper
Li et al., ACL 2025 Findings
Citation (BibTeX)
@inproceedings{li2025codeprm,
  title={CodePRM: Execution Feedback-enhanced Process Reward Model for Code Generation},
  author={Li, Qingyao and Dai, Xinyi and Li, Xiangyang and Zhang, Weinan and Wang, Yasheng and Tang, Ruiming and Yu, Yong},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
  year={2025}
}
Details
Annotation Types
Domain
Use Cases
Tags
Related Designs
BigCodeBench Human Baseline Evaluation
Evaluate agent-generated code solutions for BigCodeBench tasks. Annotators assess correctness against test suites, rate task complexity, evaluate code quality, and provide notes on the solution approach.
DevBench Repository Evaluation
Evaluate AI-generated repositories across the full software development lifecycle. Annotators rate architecture design, code quality, test coverage, documentation, and dependency management for generated projects.
HumanEval: Python Code Generation Benchmark
HumanEval is OpenAI's set of 164 hand-written Python problems that test whether generated code runs and passes unit tests. This Potato config has annotators rate functional correctness and quality of model-written functions.