HumanEval Code Correctness Evaluation
Evaluation of code generation correctness based on the HumanEval benchmark (Chen et al., arXiv 2021). Annotators assess whether AI-generated code solutions are correct, provide code review comments, and rate code quality on a numeric scale.
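The underlying benchmark scores models automatically with the pass@k metric; the human judgments collected here complement that automatic signal. As a reference point, a minimal sketch of the unbiased pass@k estimator described in the HumanEval paper (n generated samples per task, c of which pass the tests):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n - c, k) / C(n, k), the probability that at least one
    of k samples drawn from the n generations passes the tests."""
    if n - c < k:
        return 1.0  # too few failures left: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples, 3 correct, k = 1  ->  0.3
```
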
Configuration File: config.yaml
# HumanEval Code Correctness Evaluation
# Based on Chen et al., arXiv 2021
# Paper: https://arxiv.org/abs/2107.03374
# Dataset: https://github.com/openai/human-eval
#
# Evaluation of AI-generated code solutions for correctness and quality.
# Each item presents a programming task description and a candidate
# solution. Annotators judge correctness, write review comments, and
# assign a numeric quality score.
#
# Correctness Labels:
# - Correct: The solution passes all expected test cases
# - Partially Correct: The solution handles some cases but fails on edge cases
# - Incorrect: The solution produces wrong output or has logical errors
# - Runtime Error: The solution would crash or raise an exception
#
# Annotation Guidelines:
# 1. Read the task description carefully, noting input/output requirements
# 2. Review the code solution for correctness and edge case handling
# 3. Select the correctness label
# 4. Write code review comments noting any issues
# 5. Assign a code quality score from 1 to 10
annotation_task_name: "HumanEval Code Correctness Evaluation"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  # Step 1: Correctness judgment
  - annotation_type: radio
    name: correctness
    description: "Is this code solution correct?"
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
      - "Runtime Error"
    keyboard_shortcuts:
      "Correct": "1"
      "Partially Correct": "2"
      "Incorrect": "3"
      "Runtime Error": "4"
    tooltips:
      "Correct": "The solution passes all expected test cases including edge cases"
      "Partially Correct": "The solution handles some cases but fails on edge cases or special inputs"
      "Incorrect": "The solution produces wrong output or has fundamental logical errors"
      "Runtime Error": "The solution would crash, raise an exception, or enter an infinite loop"
  # Step 2: Code review comments
  - annotation_type: text
    name: review_comments
    description: "Provide code review comments noting any issues, bugs, or improvements."
    textarea: true
    required: false
    placeholder: "Describe any issues, bugs, or suggested improvements..."
  # Step 3: Code quality score
  - annotation_type: number
    name: quality_score
    description: "Code quality score (1-10)"
annotation_instructions: |
  You will evaluate AI-generated code solutions from the HumanEval benchmark.
  For each item:
  1. Read the task description carefully, noting expected inputs and outputs.
  2. Review the code solution line by line.
  3. Judge whether the solution is Correct, Partially Correct, Incorrect, or would cause a Runtime Error.
  4. Write code review comments noting any bugs, edge case failures, or improvements.
  5. Assign a quality score from 1 (very poor) to 10 (excellent).
  Consider:
  - Does the solution handle edge cases (empty input, large values, negative numbers)?
  - Is the logic correct for all valid inputs?
  - Would the code actually run without errors?
  - Is the code readable and well-structured?
html_layout: |
  <div style="padding: 15px; max-width: 900px; margin: auto;">
    <div style="background: #e8eaf6; padding: 8px 15px; border-radius: 8px; margin-bottom: 16px;">
      <strong>Language:</strong> {{language}}
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Task Description:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="background: #1e1e1e; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #9cdcfe;">Code Solution:</strong>
      <pre style="color: #d4d4d4; font-family: 'Consolas', 'Courier New', monospace; font-size: 14px; line-height: 1.6; margin: 8px 0 0 0; white-space: pre-wrap; overflow-x: auto;">{{code_solution}}</pre>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
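Downstream, the per-annotator JSON written to annotation_output/ can be aggregated into a majority correctness label and a mean quality score per item. The exact record schema depends on the Potato version, so the record shape below is a hypothetical example, not the tool's documented output format:

```python
from collections import Counter
from statistics import mean

# Hypothetical record shape; the real schema depends on the Potato version.
records = [
    {"id": "humaneval_001", "correctness": "Correct", "quality_score": 9},
    {"id": "humaneval_001", "correctness": "Correct", "quality_score": 8},
    {"id": "humaneval_002", "correctness": "Partially Correct", "quality_score": 6},
]

def aggregate(records):
    """Majority correctness label and mean quality score per item."""
    by_item = {}
    for r in records:
        by_item.setdefault(r["id"], []).append(r)
    out = {}
    for item_id, anns in by_item.items():
        # Most common label wins; ties resolve by first occurrence.
        label, _ = Counter(a["correctness"] for a in anns).most_common(1)[0]
        out[item_id] = {
            "correctness": label,
            "mean_quality": mean(a["quality_score"] for a in anns),
        }
    return out
```

With annotation_per_instance set to 2, each item yields two records, so the majority vote degenerates to "first label on ties"; a real pipeline would flag disagreements for adjudication instead.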
Sample Data: sample-data.json
[
  {
    "id": "humaneval_001",
    "text": "Write a function that takes a list of integers and returns the sum of all even numbers in the list.",
    "code_solution": "def sum_even(numbers):\n    return sum(n for n in numbers if n % 2 == 0)",
    "language": "Python"
  },
  {
    "id": "humaneval_002",
    "text": "Write a function that checks if a given string is a palindrome. The function should be case-insensitive and ignore non-alphanumeric characters.",
    "code_solution": "def is_palindrome(s):\n    cleaned = ''.join(c.lower() for c in s if c.isalnum())\n    return cleaned == cleaned[::-1]",
    "language": "Python"
  }
]
// ... and 8 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/humaneval-code-correctness
potato start config.yaml
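Before annotating, the two sample solutions shown above can be exercised directly in a Python REPL. A quick sketch, with the candidate functions copied from sample-data.json; the spot-check inputs are illustrative, not the benchmark's hidden test cases:

```python
# Candidate solutions copied from sample-data.json.
def sum_even(numbers):
    return sum(n for n in numbers if n % 2 == 0)

def is_palindrome(s):
    cleaned = ''.join(c.lower() for c in s if c.isalnum())
    return cleaned == cleaned[::-1]

# Illustrative spot checks, including the edge cases the guidelines mention.
assert sum_even([1, 2, 3, 4]) == 6
assert sum_even([]) == 0                 # empty input
assert sum_even([-2, -3]) == -2          # negative numbers
assert is_palindrome("A man, a plan, a canal: Panama")
assert not is_palindrome("hello")
```

Exercising edge cases like these is exactly what separates a "Correct" label from "Partially Correct" in the annotation scheme.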
Related Designs
BIG-Bench Task Evaluation
Evaluate language model responses on diverse reasoning tasks from the BIG-Bench benchmark. Annotators assess correctness, provide reasoning explanations, and rate confidence for model outputs across multiple task categories.
GPQA - Graduate-Level Expert QA Evaluation
Expert-level question answering evaluation on graduate-level science questions from the GPQA benchmark (Rein et al., ICLR 2024). Questions span physics, chemistry, and biology, designed to be answerable only by domain experts.
NumEval - Numeral-Aware Language Understanding
Numeral-aware language understanding task requiring annotators to predict numerical values from text, classify numeral types, and provide explanations. Based on SemEval-2024 Task 7 (NumEval).