intermediate, evaluation

Code Generation Evaluation (HumanEval)

Evaluation of LLM-generated code based on the HumanEval benchmark. Annotators assess the functional correctness, code quality, and efficiency of generated Python functions, describe any errors they find, and suggest improvements, supporting research in code generation and LLM evaluation.


Configuration File: config.yaml

# Code Generation Evaluation (HumanEval)
# Based on the HumanEval benchmark (Chen et al., 2021, arXiv:2107.03374)
#
# This configuration supports evaluation of LLM-generated Python code
# against function specifications from the HumanEval benchmark.
#
# Evaluation Dimensions:
# - Functional Correctness: Does the code produce correct output for all inputs?
# - Code Quality: Is the code well-structured, readable, and maintainable?
# - Efficiency: Is the algorithm efficient in terms of time and space complexity?
#
# Annotation Guidelines:
# 1. Read the prompt/docstring carefully to understand the expected behavior
# 2. Examine the generated code and mentally trace through test cases
# 3. Assess functional correctness: does the code handle all cases correctly?
# 4. Evaluate code quality: readability, structure, naming, error handling
# 5. Rate efficiency: time complexity, space complexity, unnecessary operations
# 6. If errors are found, describe them clearly in the error description field
# 7. Provide concrete improvement suggestions when applicable
#
# Common Issues to Watch For:
# - Off-by-one errors in loops and indexing
# - Missing edge cases (empty input, negative numbers, None values)
# - Incorrect type handling or implicit type conversions
# - Inefficient algorithms where better approaches exist
# - Code that works for examples but fails on edge cases

annotation_task_name: "Code Generation Evaluation (HumanEval)"
task_dir: "."

data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "prompt"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

annotation_schemes:
  # Step 1: Functional correctness
  - annotation_type: radio
    name: functional_correctness
    description: "Does the generated code correctly implement the specified function?"
    labels:
      - "correct"
      - "partially-correct"
      - "incorrect"
      - "cannot-determine"
    tooltips:
      "correct": "Code produces correct output for all inputs including edge cases"
      "partially-correct": "Code works for common cases but fails on edge cases or specific inputs"
      "incorrect": "Code produces wrong output, crashes, or does not implement the specification"
      "cannot-determine": "Specification is ambiguous or code behavior cannot be determined without running"

  # Step 2: Code quality rating
  - annotation_type: radio
    name: code_quality
    description: "How would you rate the overall quality of the generated code?"
    labels:
      - "excellent"
      - "good"
      - "acceptable"
      - "poor"
      - "very-poor"
    tooltips:
      "excellent": "Clean, idiomatic, well-structured code with proper naming and documentation; production-ready"
      "good": "Well-written code with minor style issues; follows best practices for the most part"
      "acceptable": "Functional but could use improvement in readability, structure, or naming"
      "poor": "Hard to read, poorly structured, non-idiomatic, or contains bad practices"
      "very-poor": "Extremely messy, incomprehensible, or fundamentally flawed in approach"

  # Step 3: Efficiency rating
  - annotation_type: radio
    name: efficiency
    description: "How efficient is the generated code?"
    labels:
      - "optimal"
      - "acceptable"
      - "suboptimal"
      - "inefficient"
    tooltips:
      "optimal": "Uses the best known algorithm and data structures; cannot be meaningfully improved"
      "acceptable": "Reasonable approach; not the most efficient but adequate for typical inputs"
      "suboptimal": "Uses a less efficient approach where a clearly better one exists"
      "inefficient": "Unnecessarily slow or memory-intensive; would fail on large inputs"

  # Step 4: Error description
  - annotation_type: text
    name: error_description
    description: "Describe any errors found in the generated code. Include specific failing inputs if possible. Leave blank if code is correct."

  # Step 5: Improvement suggestion
  - annotation_type: text
    name: improvement_suggestion
    description: "How could the code be improved? Mention better algorithms, cleaner structure, or missing edge case handling."

annotation_instructions: |
  You are evaluating Python code generated by a language model.
  For each item:
  1. Read the prompt/docstring to understand what the function should do
  2. Examine the generated code below the prompt
  3. Rate the functional correctness (correct, partially correct, incorrect, or cannot determine)
  4. Rate the code quality (excellent to very poor)
  5. Rate the efficiency (optimal to inefficient)
  6. Describe any errors found with specific failing inputs
  7. Suggest improvements to the code

html_layout: |
  <div style="padding: 15px; font-family: sans-serif;">
    <div style="margin-bottom: 8px; color: #6b7280; font-size: 13px;">
      <strong>Task ID:</strong> {{task_id}} |
      <strong>Expected Behavior:</strong> {{expected_behavior}}
    </div>
    <div style="margin-bottom: 12px;">
      <div style="font-weight: bold; font-size: 14px; color: #374151; margin-bottom: 4px;">Prompt / Docstring:</div>
      <pre style="background: #1e293b; color: #e2e8f0; padding: 14px; border-radius: 8px; overflow-x: auto; font-size: 13px; line-height: 1.5; white-space: pre-wrap; word-wrap: break-word; border-left: 4px solid #3b82f6;">{{prompt}}</pre>
    </div>
    <div>
      <div style="font-weight: bold; font-size: 14px; color: #374151; margin-bottom: 4px;">Generated Code:</div>
      <pre style="background: #1e293b; color: #a5f3fc; padding: 14px; border-radius: 8px; overflow-x: auto; font-size: 13px; line-height: 1.5; white-space: pre-wrap; word-wrap: break-word; border-left: 4px solid #22c55e;">{{generated_code}}</pre>
    </div>
  </div>

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
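
To make the rating labels concrete, here is a small hypothetical snippet (not drawn from the HumanEval set) showing the kind of submission that would typically be rated partially-correct: it handles common inputs but misses an edge case of the sort listed under Common Issues above.

def mean(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)  # raises ZeroDivisionError on an empty list

An annotator would select "partially-correct" for functional_correctness, note the failing input [] in error_description, and suggest handling the empty-list case in improvement_suggestion.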

Sample Data: sample-data.json

[
  {
    "id": "he_001",
    "prompt": "def has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\"Check if in given list of numbers, are any two numbers closer to each other\n    than given threshold.\n    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n    False\n    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n    True\n    \"\"\"",
    "generated_code": "def has_close_elements(numbers: List[float], threshold: float) -> bool:\n    for i in range(len(numbers)):\n        for j in range(i + 1, len(numbers)):\n            if abs(numbers[i] - numbers[j]) < threshold:\n                return True\n    return False",
    "task_id": "HumanEval/0",
    "expected_behavior": "Return True if any two numbers in the list differ by less than threshold"
  },
  {
    "id": "he_002",
    "prompt": "def separate_paren_groups(paren_string: str) -> List[str]:\n    \"\"\"Input to this function is a string containing multiple groups of nested\n    parentheses. Your goal is to separate those groups into separate strings\n    and return the list of those.\n    >>> separate_paren_groups('( ) (( )) (( )( ))')\n    ['()', '(())', '(()())']\n    \"\"\"",
    "generated_code": "def separate_paren_groups(paren_string: str) -> List[str]:\n    result = []\n    current = ''\n    depth = 0\n    for char in paren_string:\n        if char == '(':\n            depth += 1\n            current += char\n        elif char == ')':\n            depth -= 1\n            current += char\n            if depth == 0:\n                result.append(current)\n                current = ''\n    return result",
    "task_id": "HumanEval/1",
    "expected_behavior": "Split string of nested parentheses into separate balanced groups, ignoring spaces"
  }
]

// ... and 8 more items
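
Annotators are asked to trace test cases mentally, but task designers (or annotators working outside the interface) can also run an item's generated code against the docstring examples in its prompt. A minimal sketch in Python for item he_001, assuming the sample-data.json layout shown above:

import json
from typing import List

with open("sample-data.json") as f:
    items = json.load(f)

item = next(i for i in items if i["id"] == "he_001")

# Execute the generated function in an isolated namespace; the HumanEval
# prompts annotate arguments with typing.List, so provide it explicitly.
namespace = {"List": List}
exec(item["generated_code"], namespace)
has_close_elements = namespace["has_close_elements"]

# Check the two examples from the prompt's docstring.
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True

Both checks pass, so functional_correctness for this item would be "correct". The nested-loop solution runs in O(n^2) time; a reasonable improvement_suggestion is to sort the list and compare adjacent values, which takes O(n log n), for example (has_close_elements_sorted is an illustrative name, not part of the benchmark):

def has_close_elements_sorted(numbers: List[float], threshold: float) -> bool:
    ordered = sorted(numbers)
    return any(b - a < threshold for a, b in zip(ordered, ordered[1:]))

With that note recorded, the efficiency rating would likely be "acceptable" rather than "optimal".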

Get This Design

View on GitHub

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/text/code-annotation/humaneval-code-generation
potato start config.yaml

Details

Annotation Types

radio, text

Domain

Code Generation, Program Synthesis, AI Evaluation

Use Cases

LLM Evaluation, Code Correctness Assessment, Code Quality Rating

Tags

code-generation, humaneval, program-synthesis, llm-evaluation, openai

Found an issue or want to improve this design?

Open an Issue