Code Generation Evaluation (HumanEval)
Evaluation of LLM-generated code based on the HumanEval benchmark. Annotators assess the functional correctness, code quality, and efficiency of generated Python functions, and provide error descriptions and improvement suggestions, supporting research in code generation and LLM evaluation.
Configuration File: config.yaml
# Code Generation Evaluation (HumanEval)
# Based on Chen et al., arXiv 2021
#
# This configuration supports evaluation of LLM-generated Python code
# against function specifications from the HumanEval benchmark.
#
# Evaluation Dimensions:
# - Functional Correctness: Does the code produce correct output for all inputs?
# - Code Quality: Is the code well-structured, readable, and maintainable?
# - Efficiency: Is the algorithm efficient in terms of time and space complexity?
#
# Annotation Guidelines:
# 1. Read the prompt/docstring carefully to understand the expected behavior
# 2. Examine the generated code and mentally trace through test cases
# 3. Assess functional correctness: does the code handle all cases correctly?
# 4. Evaluate code quality: readability, structure, naming, error handling
# 5. Rate efficiency: time complexity, space complexity, unnecessary operations
# 6. If errors are found, describe them clearly in the error description field
# 7. Provide concrete improvement suggestions when applicable
#
# Common Issues to Watch For:
# - Off-by-one errors in loops and indexing
# - Missing edge cases (empty input, negative numbers, None values)
# - Incorrect type handling or implicit type conversions
# - Inefficient algorithms where better approaches exist
# - Code that works for examples but fails on edge cases
annotation_task_name: "Code Generation Evaluation (HumanEval)"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "prompt"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
annotation_schemes:
  # Step 1: Functional correctness
  - annotation_type: radio
    name: functional_correctness
    description: "Does the generated code correctly implement the specified function?"
    labels:
      - "correct"
      - "partially-correct"
      - "incorrect"
      - "cannot-determine"
    tooltips:
      "correct": "Code produces correct output for all inputs including edge cases"
      "partially-correct": "Code works for common cases but fails on edge cases or specific inputs"
      "incorrect": "Code produces wrong output, crashes, or does not implement the specification"
      "cannot-determine": "Specification is ambiguous or code behavior cannot be determined without running"
  # Step 2: Code quality rating
  - annotation_type: radio
    name: code_quality
    description: "How would you rate the overall quality of the generated code?"
    labels:
      - "excellent"
      - "good"
      - "acceptable"
      - "poor"
      - "very-poor"
    tooltips:
      "excellent": "Clean, idiomatic, well-structured code with proper naming and documentation; production-ready"
      "good": "Well-written code with minor style issues; follows best practices for the most part"
      "acceptable": "Functional but could use improvement in readability, structure, or naming"
      "poor": "Hard to read, poorly structured, non-idiomatic, or contains bad practices"
      "very-poor": "Extremely messy, incomprehensible, or fundamentally flawed in approach"
  # Step 3: Efficiency rating
  - annotation_type: radio
    name: efficiency
    description: "How efficient is the generated code?"
    labels:
      - "optimal"
      - "acceptable"
      - "suboptimal"
      - "inefficient"
    tooltips:
      "optimal": "Uses the best known algorithm and data structures; cannot be meaningfully improved"
      "acceptable": "Reasonable approach; not the most efficient but adequate for typical inputs"
      "suboptimal": "Uses a less efficient approach where a clearly better one exists"
      "inefficient": "Unnecessarily slow or memory-intensive; would fail on large inputs"
  # Step 4: Error description
  - annotation_type: text
    name: error_description
    description: "Describe any errors found in the generated code. Include specific failing inputs if possible. Leave blank if code is correct."
  # Step 5: Improvement suggestion
  - annotation_type: text
    name: improvement_suggestion
    description: "How could the code be improved? Mention better algorithms, cleaner structure, or missing edge case handling."
annotation_instructions: |
  You are evaluating Python code generated by a language model.
  For each item:
  1. Read the prompt/docstring to understand what the function should do
  2. Examine the generated code below the prompt
  3. Rate the functional correctness (correct, partially correct, incorrect, or cannot determine)
  4. Rate the code quality (excellent to very poor)
  5. Rate the efficiency (optimal to inefficient)
  6. Describe any errors found with specific failing inputs
  7. Suggest improvements to the code
html_layout: |
  <div style="padding: 15px; font-family: sans-serif;">
    <div style="margin-bottom: 8px; color: #6b7280; font-size: 13px;">
      <strong>Task ID:</strong> {{task_id}} |
      <strong>Expected Behavior:</strong> {{expected_behavior}}
    </div>
    <div style="margin-bottom: 12px;">
      <div style="font-weight: bold; font-size: 14px; color: #374151; margin-bottom: 4px;">Prompt / Docstring:</div>
      <pre style="background: #1e293b; color: #e2e8f0; padding: 14px; border-radius: 8px; overflow-x: auto; font-size: 13px; line-height: 1.5; white-space: pre-wrap; word-wrap: break-word; border-left: 4px solid #3b82f6;">{{prompt}}</pre>
    </div>
    <div>
      <div style="font-weight: bold; font-size: 14px; color: #374151; margin-bottom: 4px;">Generated Code:</div>
      <pre style="background: #1e293b; color: #a5f3fc; padding: 14px; border-radius: 8px; overflow-x: auto; font-size: 13px; line-height: 1.5; white-space: pre-wrap; word-wrap: break-word; border-left: 4px solid #22c55e;">{{generated_code}}</pre>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
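Before launching the task, it can help to confirm that every item in the data file carries the fields the config references (id_key, text_key, and the placeholders in html_layout). The sketch below is a minimal, hypothetical check; the REQUIRED_KEYS set is inferred from the config above and is not part of Potato itself.

```python
import json

# Fields referenced by item_properties and the html_layout template above.
REQUIRED_KEYS = {"id", "prompt", "generated_code", "task_id", "expected_behavior"}

def validate_items(items):
    """Return (index, missing_keys) pairs for items lacking a required field."""
    problems = []
    for i, item in enumerate(items):
        missing = sorted(REQUIRED_KEYS - set(item))
        if missing:
            problems.append((i, missing))
    return problems

# Typical use:
#   with open("sample-data.json") as f:
#       print(validate_items(json.load(f)))
```

An empty result means all items are renderable; any reported pair points at an item the layout template would render with blank placeholders.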
Sample Data: sample-data.json
[
  {
    "id": "he_001",
    "prompt": "def has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\"Check if in given list of numbers, are any two numbers closer to each other\n    than given threshold.\n    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n    False\n    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n    True\n    \"\"\"",
    "generated_code": "def has_close_elements(numbers: List[float], threshold: float) -> bool:\n    for i in range(len(numbers)):\n        for j in range(i + 1, len(numbers)):\n            if abs(numbers[i] - numbers[j]) < threshold:\n                return True\n    return False",
    "task_id": "HumanEval/0",
    "expected_behavior": "Return True if any two numbers in the list differ by less than threshold"
  },
  {
    "id": "he_002",
    "prompt": "def separate_paren_groups(paren_string: str) -> List[str]:\n    \"\"\"Input to this function is a string containing multiple groups of nested\n    parentheses. Your goal is to separate those groups into separate strings\n    and return the list of those.\n    >>> separate_paren_groups('( ) (( )) (( )( ))')\n    ['()', '(())', '(()())']\n    \"\"\"",
    "generated_code": "def separate_paren_groups(paren_string: str) -> List[str]:\n    result = []\n    current = ''\n    depth = 0\n    for char in paren_string:\n        if char == '(':\n            depth += 1\n            current += char\n        elif char == ')':\n            depth -= 1\n            current += char\n            if depth == 0:\n                result.append(current)\n                current = ''\n    return result",
    "task_id": "HumanEval/1",
    "expected_behavior": "Split string of nested parentheses into separate balanced groups, ignoring spaces"
  }
]
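The annotation guidelines ask annotators to mentally trace the docstring examples; when a trace is unclear, the examples can also be executed directly. Below is a hypothetical harness for item he_001, with the generated function body copied from the sample data above. The empty-list probe at the end is an added assumption illustrating the "missing edge cases" checklist item, not part of the benchmark's tests.

```python
from typing import List

# Generated code for he_001 (HumanEval/0), copied from sample-data.json.
generated = """
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for i in range(len(numbers)):
        for j in range(i + 1, len(numbers)):
            if abs(numbers[i] - numbers[j]) < threshold:
                return True
    return False
"""

# Execute the code in a namespace that provides the List annotation.
namespace = {"List": List}
exec(generated, namespace)
has_close_elements = namespace["has_close_elements"]

# The docstring's own examples:
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True

# An edge case worth probing per the guidelines: empty input.
assert has_close_elements([], 1.0) is False
```

If all assertions pass, the "correct" label is supported; a failing input belongs verbatim in the error_description field.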
// ... and 8 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/text/code-annotation/humaneval-code-generation
potato start config.yaml
Related Designs
BIG-Bench Task Evaluation
Evaluate language model responses on diverse reasoning tasks from the BIG-Bench benchmark. Annotators assess correctness, provide reasoning explanations, and rate confidence for model outputs across multiple task categories.
IFEval: Instruction-Following Evaluation for LLMs
Evaluate how well large language models follow verifiable instructions with specific constraints such as word count limits, formatting requirements, keyword inclusion, and structural rules. Annotators assess pass/fail per constraint and overall response quality.
MMLU Knowledge Evaluation
Multiple-choice knowledge evaluation across diverse academic subjects, based on the Massive Multitask Language Understanding benchmark (Hendrycks et al., ICLR 2021). Annotators select the correct answer from four options and provide an explanation.