Code Generation Evaluation (HumanEval)
Evaluation of LLM-generated code based on the HumanEval benchmark. Annotators assess the functional correctness, code quality, and efficiency of generated Python functions, and provide error descriptions and improvement suggestions, supporting research in code generation and LLM evaluation.
Configuration File: config.yaml
# Code Generation Evaluation (HumanEval)
# Based on Chen et al., arXiv 2021
#
# This configuration supports evaluation of LLM-generated Python code
# against function specifications from the HumanEval benchmark.
#
# Evaluation Dimensions:
# - Functional Correctness: Does the code produce correct output for all inputs?
# - Code Quality: Is the code well-structured, readable, and maintainable?
# - Efficiency: Is the algorithm efficient in terms of time and space complexity?
#
# Annotation Guidelines:
# 1. Read the prompt/docstring carefully to understand the expected behavior
# 2. Examine the generated code and mentally trace through test cases
# 3. Assess functional correctness: does the code handle all cases correctly?
# 4. Evaluate code quality: readability, structure, naming, error handling
# 5. Rate efficiency: time complexity, space complexity, unnecessary operations
# 6. If errors are found, describe them clearly in the error description field
# 7. Provide concrete improvement suggestions when applicable
#
# Common Issues to Watch For:
# - Off-by-one errors in loops and indexing
# - Missing edge cases (empty input, negative numbers, None values)
# - Incorrect type handling or implicit type conversions
# - Inefficient algorithms where better approaches exist
# - Code that works for examples but fails on edge cases
annotation_task_name: "Code Generation Evaluation (HumanEval)"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "prompt"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
annotation_schemes:
  # Step 1: Functional correctness
  - annotation_type: radio
    name: functional_correctness
    description: "Does the generated code correctly implement the specified function?"
    labels:
      - "correct"
      - "partially-correct"
      - "incorrect"
      - "cannot-determine"
    tooltips:
      "correct": "Code produces correct output for all inputs including edge cases"
      "partially-correct": "Code works for common cases but fails on edge cases or specific inputs"
      "incorrect": "Code produces wrong output, crashes, or does not implement the specification"
      "cannot-determine": "Specification is ambiguous or code behavior cannot be determined without running"
  # Step 2: Code quality rating
  - annotation_type: radio
    name: code_quality
    description: "How would you rate the overall quality of the generated code?"
    labels:
      - "excellent"
      - "good"
      - "acceptable"
      - "poor"
      - "very-poor"
    tooltips:
      "excellent": "Clean, idiomatic, well-structured code with proper naming and documentation; production-ready"
      "good": "Well-written code with minor style issues; follows best practices for the most part"
      "acceptable": "Functional but could use improvement in readability, structure, or naming"
      "poor": "Hard to read, poorly structured, non-idiomatic, or contains bad practices"
      "very-poor": "Extremely messy, incomprehensible, or fundamentally flawed in approach"
  # Step 3: Efficiency rating
  - annotation_type: radio
    name: efficiency
    description: "How efficient is the generated code?"
    labels:
      - "optimal"
      - "acceptable"
      - "suboptimal"
      - "inefficient"
    tooltips:
      "optimal": "Uses the best known algorithm and data structures; cannot be meaningfully improved"
      "acceptable": "Reasonable approach; not the most efficient but adequate for typical inputs"
      "suboptimal": "Uses a less efficient approach where a clearly better one exists"
      "inefficient": "Unnecessarily slow or memory-intensive; would fail on large inputs"
  # Step 4: Error description
  - annotation_type: text
    name: error_description
    description: "Describe any errors found in the generated code. Include specific failing inputs if possible. Leave blank if code is correct."
  # Step 5: Improvement suggestion
  - annotation_type: text
    name: improvement_suggestion
    description: "How could the code be improved? Mention better algorithms, cleaner structure, or missing edge case handling."
annotation_instructions: |
  You are evaluating Python code generated by a language model.
  For each item:
  1. Read the prompt/docstring to understand what the function should do
  2. Examine the generated code below the prompt
  3. Rate the functional correctness (correct, partially correct, incorrect, or cannot determine)
  4. Rate the code quality (excellent to very poor)
  5. Rate the efficiency (optimal to inefficient)
  6. Describe any errors found with specific failing inputs
  7. Suggest improvements to the code
html_layout: |
  <div style="padding: 15px; font-family: sans-serif;">
    <div style="margin-bottom: 8px; color: #6b7280; font-size: 13px;">
      <strong>Task ID:</strong> {{task_id}} |
      <strong>Expected Behavior:</strong> {{expected_behavior}}
    </div>
    <div style="margin-bottom: 12px;">
      <div style="font-weight: bold; font-size: 14px; color: #374151; margin-bottom: 4px;">Prompt / Docstring:</div>
      <pre style="background: #1e293b; color: #e2e8f0; padding: 14px; border-radius: 8px; overflow-x: auto; font-size: 13px; line-height: 1.5; white-space: pre-wrap; word-wrap: break-word; border-left: 4px solid #3b82f6;">{{prompt}}</pre>
    </div>
    <div>
      <div style="font-weight: bold; font-size: 14px; color: #374151; margin-bottom: 4px;">Generated Code:</div>
      <pre style="background: #1e293b; color: #a5f3fc; padding: 14px; border-radius: 8px; overflow-x: auto; font-size: 13px; line-height: 1.5; white-space: pre-wrap; word-wrap: break-word; border-left: 4px solid #22c55e;">{{generated_code}}</pre>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
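Before launching the task, it can help to confirm that every item in the data file carries the fields the config references (id_key, text_key, and the placeholders in html_layout). The sketch below is a minimal, hypothetical check; the REQUIRED_KEYS set is inferred from the config above and is not part of Potato itself.

```python
import json

# Fields referenced by item_properties and the html_layout template above.
REQUIRED_KEYS = {"id", "prompt", "generated_code", "task_id", "expected_behavior"}

def validate_items(items):
    """Return (index, missing_keys) pairs for items lacking a required field."""
    problems = []
    for i, item in enumerate(items):
        missing = sorted(REQUIRED_KEYS - set(item))
        if missing:
            problems.append((i, missing))
    return problems

# Typical use:
#   with open("sample-data.json") as f:
#       print(validate_items(json.load(f)))
```

An empty result means all items are renderable; any reported pair points at an item the layout template would render with blank placeholders.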
Sample Data: sample-data.json
[
  {
    "id": "he_001",
    "prompt": "def has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\"Check if in given list of numbers, are any two numbers closer to each other\n    than given threshold.\n    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n    False\n    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n    True\n    \"\"\"",
    "generated_code": "def has_close_elements(numbers: List[float], threshold: float) -> bool:\n    for i in range(len(numbers)):\n        for j in range(i + 1, len(numbers)):\n            if abs(numbers[i] - numbers[j]) < threshold:\n                return True\n    return False",
    "task_id": "HumanEval/0",
    "expected_behavior": "Return True if any two numbers in the list differ by less than threshold"
  },
  {
    "id": "he_002",
    "prompt": "def separate_paren_groups(paren_string: str) -> List[str]:\n    \"\"\"Input to this function is a string containing multiple groups of nested\n    parentheses. Your goal is to separate those groups into separate strings\n    and return the list of those.\n    >>> separate_paren_groups('( ) (( )) (( )( ))')\n    ['()', '(())', '(()())']\n    \"\"\"",
    "generated_code": "def separate_paren_groups(paren_string: str) -> List[str]:\n    result = []\n    current = ''\n    depth = 0\n    for char in paren_string:\n        if char == '(':\n            depth += 1\n            current += char\n        elif char == ')':\n            depth -= 1\n            current += char\n            if depth == 0:\n                result.append(current)\n                current = ''\n    return result",
    "task_id": "HumanEval/1",
    "expected_behavior": "Split string of nested parentheses into separate balanced groups, ignoring spaces"
  }
]
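The annotation guidelines ask annotators to mentally trace the docstring examples; when a trace is unclear, the examples can also be executed directly. Below is a hypothetical harness for item he_001, with the generated function body copied from the sample data above. The empty-list probe at the end is an added assumption illustrating the "missing edge cases" checklist item, not part of the benchmark's tests.

```python
from typing import List

# Generated code for he_001 (HumanEval/0), copied from sample-data.json.
generated = """
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for i in range(len(numbers)):
        for j in range(i + 1, len(numbers)):
            if abs(numbers[i] - numbers[j]) < threshold:
                return True
    return False
"""

# Execute the code in a namespace that provides the List annotation.
namespace = {"List": List}
exec(generated, namespace)
has_close_elements = namespace["has_close_elements"]

# The docstring's own examples:
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True

# An edge case worth probing per the guidelines: empty input.
assert has_close_elements([], 1.0) is False
```

If all assertions pass, the "correct" label is supported; a failing input belongs verbatim in the error_description field.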
// ... and 8 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/text/code-annotation/humaneval-code-generation
potato start config.yaml
Related Designs
BIG-Bench Task Evaluation
Evaluate language model responses on diverse reasoning tasks from the BIG-Bench benchmark. Annotators assess correctness, provide reasoning explanations, and rate confidence for model outputs across multiple task categories.
IFEval: Instruction-Following Evaluation for LLMs
Evaluate how well large language models follow verifiable instructions with specific constraints such as word count limits, formatting requirements, keyword inclusion, and structural rules. Annotators assess pass/fail per constraint and overall response quality.
MMLU Knowledge Evaluation
Multiple-choice knowledge evaluation across diverse academic subjects, based on the Massive Multitask Language Understanding benchmark (Hendrycks et al., ICLR 2021). Annotators select the correct answer from four options and provide an explanation.