
BigCodeBench Human Baseline Evaluation

Evaluate agent-generated code solutions for BigCodeBench tasks. Annotators assess correctness against test suites, rate task complexity, evaluate code quality, and provide notes on the solution approach.


Configuration File: config.yaml

# BigCodeBench Human Baseline Evaluation
# Based on "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions" (Zhuo et al., ICLR 2025)
# Task: Evaluate agent-generated code solutions for correctness, complexity, and quality

annotation_task_name: "BigCodeBench Human Baseline Evaluation"
task_dir: "."

data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

html_layout: |
  <div class="container" style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif; max-width: 1400px; margin: 0 auto;">
    <div style="border: 1px solid #d0d7de; border-radius: 6px; padding: 16px; margin-bottom: 12px; background: #f6f8fa;">
      <h3 style="margin-top: 0; color: #24292f; border-bottom: 1px solid #d0d7de; padding-bottom: 8px;">Task Description</h3>
      <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6; color: #1f2328;">{{text}}</div>
    </div>
    <div style="border: 1px solid #30363d; border-radius: 6px; overflow: hidden; margin-bottom: 12px;">
      <div style="background: #2d333b; color: #adbac7; padding: 8px 16px; font-weight: 600; font-size: 13px;">Function Signature</div>
      <pre style="margin: 0; padding: 12px 16px; background: #22272e; color: #adbac7; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 13px; line-height: 1.5;">{{function_signature}}</pre>
    </div>
    <div style="border: 1px solid #30363d; border-radius: 6px; overflow: hidden; margin-bottom: 12px;">
      <div style="background: #1a2233; color: #58a6ff; padding: 8px 16px; font-weight: 600; font-size: 13px;">Agent's Generated Code</div>
      <pre style="margin: 0; padding: 12px 16px; background: #0d1117; color: #c9d1d9; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{agent_code}}</pre>
    </div>
    <div style="border: 1px solid #1a3a1a; border-radius: 6px; overflow: hidden;">
      <div style="background: #0d1117; color: #3fb950; padding: 8px 16px; font-weight: 600; font-size: 13px;">Execution Results</div>
      <pre style="margin: 0; padding: 12px 16px; background: #0a0e14; color: #3fb950; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{execution_result}}</pre>
    </div>
  </div>

annotation_schemes:
  - name: "solution_correctness"
    description: "Does the generated solution pass the test suite?"
    annotation_type: radio
    labels:
      - "Correct — passes all tests"
      - "Partially Correct — passes some tests"
      - "Incorrect — fails tests"
    keyboard_shortcuts:
      "Correct — passes all tests": "1"
      "Partially Correct — passes some tests": "2"
      "Incorrect — fails tests": "3"

  - name: "complexity_assessment"
    description: "How complex is this coding task?"
    annotation_type: radio
    labels:
      - "Easy"
      - "Medium"
      - "Hard"
      - "Expert"
    keyboard_shortcuts:
      "Easy": "4"
      "Medium": "5"
      "Hard": "6"
      "Expert": "7"

  - name: "code_quality"
    description: "Rate the overall quality of the generated code"
    annotation_type: likert
    min_label: "Very Poor"
    max_label: "Excellent"
    size: 5

  - name: "human_notes"
    description: "Notes on the solution approach, edge cases, or issues"
    annotation_type: text

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
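With annotation_per_instance set to 2, each item is labeled twice, so raw agreement on solution_correctness is worth checking before treating the annotations as a baseline. A minimal sketch, assuming the output JSON is a list of records each carrying the item id and the annotator's solution_correctness label (the actual Potato output schema may differ):

```python
import json
from collections import defaultdict

def correctness_agreement(path: str) -> float:
    """Fraction of doubly-annotated items whose two solution_correctness
    labels match. The record layout (flat "id" / "solution_correctness"
    keys) is an assumption, not Potato's documented output format."""
    with open(path) as f:
        records = json.load(f)

    # Collect every label assigned to each item id.
    labels = defaultdict(list)
    for rec in records:
        labels[rec["id"]].append(rec["solution_correctness"])

    # Only items with exactly two annotations count toward agreement.
    pairs = [v for v in labels.values() if len(v) == 2]
    if not pairs:
        return 0.0
    agree = sum(1 for a, b in pairs if a == b)
    return agree / len(pairs)
```

For a proper baseline report you would likely move from raw agreement to a chance-corrected statistic such as Cohen's kappa, but the aggregation step looks the same.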

Sample Data: sample-data.json

[
  {
    "id": "bcb-001",
    "text": "Write a function that reads a CSV file, groups rows by a specified column, computes the mean of a numeric column for each group, and returns a sorted bar chart as a matplotlib Figure. Handle missing values by dropping them before aggregation. The function should validate that both columns exist in the CSV.",
    "function_signature": "def task_func(csv_path: str, group_col: str, value_col: str) -> matplotlib.figure.Figure:",
    "agent_code": "import pandas as pd\nimport matplotlib.pyplot as plt\nimport matplotlib.figure\n\ndef task_func(csv_path: str, group_col: str, value_col: str) -> matplotlib.figure.Figure:\n    df = pd.read_csv(csv_path)\n    \n    if group_col not in df.columns:\n        raise ValueError(f\"Column '{group_col}' not found in CSV\")\n    if value_col not in df.columns:\n        raise ValueError(f\"Column '{value_col}' not found in CSV\")\n    \n    df = df.dropna(subset=[group_col, value_col])\n    grouped = df.groupby(group_col)[value_col].mean().sort_values(ascending=False)\n    \n    fig, ax = plt.subplots(figsize=(10, 6))\n    grouped.plot(kind='bar', ax=ax)\n    ax.set_xlabel(group_col)\n    ax.set_ylabel(f'Mean {value_col}')\n    ax.set_title(f'Mean {value_col} by {group_col}')\n    plt.tight_layout()\n    \n    return fig",
    "execution_result": "$ python -m pytest test_task_001.py -v\ntest_task_001.py::test_basic_grouping PASSED\ntest_task_001.py::test_missing_values_dropped PASSED\ntest_task_001.py::test_invalid_group_column PASSED\ntest_task_001.py::test_invalid_value_column PASSED\ntest_task_001.py::test_sorted_descending PASSED\ntest_task_001.py::test_returns_figure PASSED\n\n6 passed in 1.23s"
  },
  {
    "id": "bcb-002",
    "text": "Write a function that takes a directory path, recursively finds all Python files, parses each file's AST to extract function definitions, and returns a dictionary mapping file paths to lists of function names. Skip files that have syntax errors. Include nested functions.",
    "function_signature": "def task_func(directory: str) -> dict[str, list[str]]:",
    "agent_code": "import os\nimport ast\nfrom typing import Dict, List\n\ndef task_func(directory: str) -> Dict[str, List[str]]:\n    result = {}\n    \n    for root, dirs, files in os.walk(directory):\n        for fname in files:\n            if not fname.endswith('.py'):\n                continue\n            fpath = os.path.join(root, fname)\n            try:\n                with open(fpath, 'r') as f:\n                    source = f.read()\n                tree = ast.parse(source)\n            except SyntaxError:\n                continue\n            \n            functions = []\n            for node in ast.walk(tree):\n                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):\n                    functions.append(node.name)\n            \n            if functions:\n                result[fpath] = functions\n    \n    return result",
    "execution_result": "$ python -m pytest test_task_002.py -v\ntest_task_002.py::test_basic_extraction PASSED\ntest_task_002.py::test_nested_functions PASSED\ntest_task_002.py::test_async_functions PASSED\ntest_task_002.py::test_syntax_error_skip PASSED\ntest_task_002.py::test_empty_directory PASSED\ntest_task_002.py::test_no_python_files PASSED\ntest_task_002.py::test_subdirectories PASSED\n\n7 passed in 0.45s"
  }
]

// ... and 6 more items
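Beyond the id and text keys declared under item_properties, the html_layout also substitutes function_signature, agent_code, and execution_result, so every item in sample-data.json needs all five fields for the interface to render fully. A quick sanity check (a sketch; the field names are read off the layout's placeholders):

```python
import json

# Fields the HTML layout above expects on every item.
REQUIRED = {"id", "text", "function_signature", "agent_code", "execution_result"}

def check_items(path: str) -> list[str]:
    """Return the ids of items missing any field the layout references."""
    with open(path) as f:
        items = json.load(f)
    return [it.get("id", "<no id>")
            for it in items
            if not REQUIRED <= it.keys()]
```

Running this against sample-data.json before launching the task catches a missing field as a listed id rather than as a blank panel in the annotation interface.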

Get This Design

View on GitHub

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/bigcodebench-human-baseline
potato start config.yaml

Details

Annotation Types

radio, likert, text

Domain

Software Engineering, Code Generation

Use Cases

Code Evaluation, Benchmark Baseline

Tags

bigcodebench, code-generation, function-calls, benchmark, agentic-coding

Found an issue or want to improve this design?

Open an Issue