CodePRM Code Process Reward

Process reward annotation for step-by-step code generation with execution feedback. Annotators verify each incremental code generation step against problem requirements, check syntactic validity, and provide feedback informed by test execution results.
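The syntactic-validity check that annotators perform can be approximated mechanically. The sketch below is an illustration only, not part of the Potato config or the CodePRM method: it assumes the steps are Python and that each step is a complete block, so the code accumulated up to that step should parse on its own. The helper name `first_invalid_step` is hypothetical.

```python
import ast


def first_invalid_step(steps):
    """Return the 1-based index of the first step at which the
    accumulated code no longer parses, or None if every prefix parses."""
    code_so_far = ""
    for index, step in enumerate(steps, start=1):
        code_so_far += step + "\n"
        try:
            ast.parse(code_so_far)
        except SyntaxError:
            return index
    return None


# Steps modelled on the merge_sorted sample item, re-indented as plain Python.
steps = [
    "def merge_sorted(a, b):\n    result = []\n    i, j = 0, 0",
    "    while i < len(a) and j < len(b):\n"
    "        if a[i] <= b[j]:\n"
    "            result.append(a[i])\n"
    "            i += 1\n"
    "        else:\n"
    "            result.append(b[j])\n"
    "            j += 1",
    "    result.extend(a[i:])\n    result.extend(b[j:])\n    return result",
]

print(first_invalid_step(steps))  # prints None: every prefix parses
```

A parse check only covers syntax. A step that ends mid-construct (for example, an `if` header with no body yet) would be flagged even though it is a reasonable partial step, so a human judgment is still needed for the correctness label.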

Configuration File: config.yaml

This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.

yaml

# CodePRM Code Process Reward
# Based on "CodePRM: Execution Feedback-enhanced Process Reward Model for Code Generation" (Li et al., ACL 2025 Findings)

annotation_task_name: "CodePRM Code Process Reward"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

html_layout: |
  <div class="container" style="max-width: 850px; margin: 0 auto; font-family: 'Segoe UI', Arial, sans-serif;">
    <div style="background: #f0f4ff; border: 1px solid #b0c4de; border-radius: 8px; padding: 18px 22px; margin-bottom: 18px;">
      <h3 style="margin: 0 0 10px 0; color: #2c3e6b; font-size: 16px;">Problem Description</h3>
      <div style="font-size: 14px; color: #2c3e50; line-height: 1.6;">{{text}}</div>
    </div>
    <div style="margin-bottom: 18px;">
      <h3 style="margin: 0 0 12px 0; color: #2c3e50; font-size: 15px;">Incremental Code Steps</h3>
      <div style="background: #fafafa; border: 1px solid #e0e0e0; border-radius: 6px; padding: 14px 18px; font-family: 'Courier New', monospace; font-size: 13px; white-space: pre-wrap; line-height: 1.6;">{{steps}}</div>
    </div>
    <div style="background: #1a1a2e; border-radius: 8px; padding: 16px 20px; margin-bottom: 10px;">
      <h3 style="margin: 0 0 8px 0; color: #00e676; font-size: 14px; font-family: 'Courier New', monospace;">$ python run_tests.py</h3>
      <div style="font-family: 'Courier New', monospace; font-size: 13px; color: #00e676; white-space: pre-wrap; line-height: 1.5;">{{test_output}}</div>
    </div>
  </div>

annotation_schemes:
  - name: "step_correctness"
    annotation_type: radio
    description: "Verify each code generation step against the problem requirements."
    labels:
      - "Correct — code step is valid"
      - "Neutral — step doesn't affect outcome"
      - "Incorrect — step introduces a bug"
    keyboard_shortcuts:
      "Correct — code step is valid": "1"
      "Neutral — step doesn't affect outcome": "2"
      "Incorrect — step introduces a bug": "3"

  - name: "code_compiles"
    annotation_type: radio
    description: "Is the code in this step syntactically valid?"
    labels:
      - "Yes — code is syntactically valid"
      - "No — syntax error present"
      - "N/A — not a code step"
    keyboard_shortcuts:
      "Yes — code is syntactically valid": "4"
      "No — syntax error present": "5"
      "N/A — not a code step": "6"

  - name: "step_feedback"
    annotation_type: text
    description: "Provide feedback on the current step."

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
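The layout above references three item fields through double-brace placeholders: `{{text}}`, `{{steps}}`, and `{{test_output}}`. Assuming these are filled from same-named keys in each data item (as the sample data suggests), a small pre-flight check can catch items that lack a field before annotation starts. This is a sketch using only the standard library; `missing_fields` and the second item are hypothetical and not part of Potato.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def missing_fields(layout, items):
    """Map each item id to the placeholder names it does not provide.
    Items that provide every placeholder are left out of the result."""
    needed = set(PLACEHOLDER.findall(layout))
    report = {}
    for item in items:
        missing = sorted(needed - set(item))
        if missing:
            report[item.get("id", "<no id>")] = missing
    return report


# Abbreviated layout containing the same three placeholders as the config.
layout = "<div>{{text}}</div><div>{{steps}}</div><div>{{test_output}}</div>"

items = [
    {"id": "codeprm-001", "text": "...", "steps": "...", "test_output": "..."},
    {"id": "hypothetical-item", "text": "..."},  # made up to show a failure
]

print(missing_fields(layout, items))
# prints {'hypothetical-item': ['steps', 'test_output']}
```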

Sample Data: sample-data.json

json

[
  {
    "id": "codeprm-001",
    "text": "Write a function `merge_sorted(a, b)` that merges two sorted lists into one sorted list without using built-in sort functions.",
    "steps": "Step 1: Define the function signature and initialize pointers.\n  def merge_sorted(a, b):\n      result = []\n      i, j = 0, 0\n\nStep 2: Add the main merge loop comparing elements from both lists.\n      while i < len(a) and j < len(b):\n          if a[i] <= b[j]:\n              result.append(a[i])\n              i += 1\n          else:\n              result.append(b[j])\n              j += 1\n\nStep 3: Append remaining elements from both lists.\n      result.extend(a[i:])\n      result.extend(b[j:])\n      return result",
    "test_output": "test_merge_empty_lists .............. PASS\ntest_merge_single_elements .......... PASS\ntest_merge_different_lengths ........ PASS\ntest_merge_duplicates ............... PASS\n\n4 passed, 0 failed"
  },
  {
    "id": "codeprm-002",
    "text": "Implement a function `is_balanced(s)` that checks whether a string of brackets '()[]{}' is balanced.",
    "steps": "Step 1: Define the function and the bracket mappings.\n  def is_balanced(s):\n      stack = []\n      mapping = {')': '(', ']': '[', '}': '{'}\n\nStep 2: Iterate through each character and push opening brackets.\n      for char in s:\n          if char in mapping.values():\n              stack.append(char)\n\nStep 3: For closing brackets, check the stack.\n          elif char in mapping:\n              if not stack or stack.pop() != mapping[char]:\n                  return False\n\nStep 4: Return whether the stack is empty.\n      return len(stack) == 0",
    "test_output": "test_empty_string ................... PASS\ntest_simple_balanced ................ PASS\ntest_nested_balanced ................ PASS\ntest_unbalanced_open ................ PASS\ntest_unbalanced_close ............... PASS\ntest_mixed_balanced ................. PASS\n\n6 passed, 0 failed"
  }
]

// ... and 6 more items
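Each item packs all steps into one `steps` string and the test log into one `test_output` string. For analysis outside the annotation interface it can help to split these back apart. The sketch below assumes the conventions visible in the two items shown, namely `Step N: description` header lines and a final `X passed, Y failed` summary; other items may differ. The function names are hypothetical.

```python
import re

STEP_HEADER = re.compile(r"^Step (\d+): (.*)$", re.MULTILINE)
SUMMARY = re.compile(r"(\d+) passed, (\d+) failed")


def split_steps(steps_text):
    """Split a `steps` string into (number, description, code) tuples."""
    matches = list(STEP_HEADER.finditer(steps_text))
    parsed = []
    for pos, match in enumerate(matches):
        # The code for a step runs until the next header or the end of text.
        if pos + 1 < len(matches):
            end = matches[pos + 1].start()
        else:
            end = len(steps_text)
        code = steps_text[match.end():end].strip("\n")
        parsed.append((int(match.group(1)), match.group(2), code))
    return parsed


def parse_summary(test_output):
    """Return (passed, failed) from the summary line, or None if absent."""
    match = SUMMARY.search(test_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


example_steps = (
    "Step 1: Define the function.\n  def f():\n      x = 1\n\n"
    "Step 2: Return the value.\n      return x"
)
for number, description, code in split_steps(example_steps):
    print(number, description)
    print(code)

print(parse_summary("test_a ... PASS\n\n4 passed, 0 failed"))  # prints (4, 0)
```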

Get This Design

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/codeprm-code-process-reward
potato start config.yaml

Dataset & paper

Li et al., ACL 2025 Findings

Citation (BibTeX)

bibtex

@inproceedings{li2025codeprm,
  title={CodePRM: Execution Feedback-enhanced Process Reward Model for Code Generation},
  author={Li, Qingyao and Dai, Xinyi and Li, Xiangyang and Zhang, Weinan and Wang, Yasheng and Tang, Ruiming and Yu, Yong},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
  year={2025}
}

Details

Annotation Types

radio, text

Domain

Code Generation, Programming

Use Cases

Process Reward Models, Code Verification, RLHF

Related Designs

BigCodeBench Human Baseline Evaluation

Evaluate agent-generated code solutions for BigCodeBench tasks. Annotators assess correctness against test suites, rate task complexity, evaluate code quality, and provide notes on the solution approach.

radio, likert

DevBench Repository Evaluation

Evaluate AI-generated repositories across the full software development lifecycle. Annotators rate architecture design, code quality, test coverage, documentation, and dependency management for generated projects.

multirate, radio

HumanEval: Python Code Generation Benchmark

HumanEval is OpenAI's set of 164 hand-written Python problems that test whether generated code runs and passes unit tests. This Potato config has annotators rate functional correctness and quality of model-written functions.