
BigCodeBench Human Baseline Evaluation

Evaluate agent-generated code solutions for BigCodeBench tasks. Annotators assess correctness against test suites, rate task complexity, evaluate code quality, and provide notes on the solution approach.


Configuration File: config.yaml

# BigCodeBench Human Baseline Evaluation
# Based on "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions" (Zhuo et al., ICLR 2025)
# Task: Evaluate agent-generated code solutions for correctness, complexity, and quality

annotation_task_name: "BigCodeBench Human Baseline Evaluation"
task_dir: "."

data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

html_layout: |
  <div class="container" style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif; max-width: 1400px; margin: 0 auto;">
    <div style="border: 1px solid #d0d7de; border-radius: 6px; padding: 16px; margin-bottom: 12px; background: #f6f8fa;">
      <h3 style="margin-top: 0; color: #24292f; border-bottom: 1px solid #d0d7de; padding-bottom: 8px;">Task Description</h3>
      <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6; color: #1f2328;">{{text}}</div>
    </div>
    <div style="border: 1px solid #30363d; border-radius: 6px; overflow: hidden; margin-bottom: 12px;">
      <div style="background: #2d333b; color: #adbac7; padding: 8px 16px; font-weight: 600; font-size: 13px;">Function Signature</div>
      <pre style="margin: 0; padding: 12px 16px; background: #22272e; color: #adbac7; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 13px; line-height: 1.5;">{{function_signature}}</pre>
    </div>
    <div style="border: 1px solid #30363d; border-radius: 6px; overflow: hidden; margin-bottom: 12px;">
      <div style="background: #1a2233; color: #58a6ff; padding: 8px 16px; font-weight: 600; font-size: 13px;">Agent's Generated Code</div>
      <pre style="margin: 0; padding: 12px 16px; background: #0d1117; color: #c9d1d9; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{agent_code}}</pre>
    </div>
    <div style="border: 1px solid #1a3a1a; border-radius: 6px; overflow: hidden;">
      <div style="background: #0d1117; color: #3fb950; padding: 8px 16px; font-weight: 600; font-size: 13px;">Execution Results</div>
      <pre style="margin: 0; padding: 12px 16px; background: #0a0e14; color: #3fb950; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{execution_result}}</pre>
    </div>
  </div>

annotation_schemes:
  - name: "solution_correctness"
    description: "Does the generated solution pass the test suite?"
    annotation_type: radio
    labels:
      - "Correct — passes all tests"
      - "Partially Correct — passes some tests"
      - "Incorrect — fails tests"
    keyboard_shortcuts:
      "Correct — passes all tests": "1"
      "Partially Correct — passes some tests": "2"
      "Incorrect — fails tests": "3"

  - name: "complexity_assessment"
    description: "How complex is this coding task?"
    annotation_type: radio
    labels:
      - "Easy"
      - "Medium"
      - "Hard"
      - "Expert"
    keyboard_shortcuts:
      "Easy": "4"
      "Medium": "5"
      "Hard": "6"
      "Expert": "7"

  - name: "code_quality"
    description: "Rate the overall quality of the generated code"
    annotation_type: likert
    min_label: "Very Poor"
    max_label: "Excellent"
    size: 5

  - name: "human_notes"
    description: "Notes on the solution approach, edge cases, or issues"
    annotation_type: text

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
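With annotation_per_instance set to 2, each item is labeled twice, so raw agreement on solution_correctness is worth checking before treating the annotations as a baseline. A minimal sketch, assuming the output JSON is a list of records each carrying the item id and the annotator's solution_correctness label (the actual Potato output schema may differ):

```python
import json
from collections import defaultdict

def correctness_agreement(path: str) -> float:
    """Fraction of doubly-annotated items whose two solution_correctness
    labels match. The record layout (flat "id" / "solution_correctness"
    keys) is an assumption, not Potato's documented output format."""
    with open(path) as f:
        records = json.load(f)

    # Collect every label assigned to each item id.
    labels = defaultdict(list)
    for rec in records:
        labels[rec["id"]].append(rec["solution_correctness"])

    # Only items with exactly two annotations count toward agreement.
    pairs = [v for v in labels.values() if len(v) == 2]
    if not pairs:
        return 0.0
    agree = sum(1 for a, b in pairs if a == b)
    return agree / len(pairs)
```

For a proper baseline report you would likely move from raw agreement to a chance-corrected statistic such as Cohen's kappa, but the aggregation step looks the same.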

Sample Data: sample-data.json

[
  {
    "id": "bcb-001",
    "text": "Write a function that reads a CSV file, groups rows by a specified column, computes the mean of a numeric column for each group, and returns a sorted bar chart as a matplotlib Figure. Handle missing values by dropping them before aggregation. The function should validate that both columns exist in the CSV.",
    "function_signature": "def task_func(csv_path: str, group_col: str, value_col: str) -> matplotlib.figure.Figure:",
    "agent_code": "import pandas as pd\nimport matplotlib.pyplot as plt\nimport matplotlib.figure\n\ndef task_func(csv_path: str, group_col: str, value_col: str) -> matplotlib.figure.Figure:\n    df = pd.read_csv(csv_path)\n    \n    if group_col not in df.columns:\n        raise ValueError(f\"Column '{group_col}' not found in CSV\")\n    if value_col not in df.columns:\n        raise ValueError(f\"Column '{value_col}' not found in CSV\")\n    \n    df = df.dropna(subset=[group_col, value_col])\n    grouped = df.groupby(group_col)[value_col].mean().sort_values(ascending=False)\n    \n    fig, ax = plt.subplots(figsize=(10, 6))\n    grouped.plot(kind='bar', ax=ax)\n    ax.set_xlabel(group_col)\n    ax.set_ylabel(f'Mean {value_col}')\n    ax.set_title(f'Mean {value_col} by {group_col}')\n    plt.tight_layout()\n    \n    return fig",
    "execution_result": "$ python -m pytest test_task_001.py -v\ntest_task_001.py::test_basic_grouping PASSED\ntest_task_001.py::test_missing_values_dropped PASSED\ntest_task_001.py::test_invalid_group_column PASSED\ntest_task_001.py::test_invalid_value_column PASSED\ntest_task_001.py::test_sorted_descending PASSED\ntest_task_001.py::test_returns_figure PASSED\n\n6 passed in 1.23s"
  },
  {
    "id": "bcb-002",
    "text": "Write a function that takes a directory path, recursively finds all Python files, parses each file's AST to extract function definitions, and returns a dictionary mapping file paths to lists of function names. Skip files that have syntax errors. Include nested functions.",
    "function_signature": "def task_func(directory: str) -> dict[str, list[str]]:",
    "agent_code": "import os\nimport ast\nfrom typing import Dict, List\n\ndef task_func(directory: str) -> Dict[str, List[str]]:\n    result = {}\n    \n    for root, dirs, files in os.walk(directory):\n        for fname in files:\n            if not fname.endswith('.py'):\n                continue\n            fpath = os.path.join(root, fname)\n            try:\n                with open(fpath, 'r') as f:\n                    source = f.read()\n                tree = ast.parse(source)\n            except SyntaxError:\n                continue\n            \n            functions = []\n            for node in ast.walk(tree):\n                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):\n                    functions.append(node.name)\n            \n            if functions:\n                result[fpath] = functions\n    \n    return result",
    "execution_result": "$ python -m pytest test_task_002.py -v\ntest_task_002.py::test_basic_extraction PASSED\ntest_task_002.py::test_nested_functions PASSED\ntest_task_002.py::test_async_functions PASSED\ntest_task_002.py::test_syntax_error_skip PASSED\ntest_task_002.py::test_empty_directory PASSED\ntest_task_002.py::test_no_python_files PASSED\ntest_task_002.py::test_subdirectories PASSED\n\n7 passed in 0.45s"
  }
]

// ... and 6 more items
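Beyond the id and text keys declared under item_properties, the html_layout also substitutes function_signature, agent_code, and execution_result, so every item in sample-data.json needs all five fields for the interface to render fully. A quick sanity check (a sketch; the field names are read off the layout's placeholders):

```python
import json

# Fields the HTML layout above expects on every item.
REQUIRED = {"id", "text", "function_signature", "agent_code", "execution_result"}

def check_items(path: str) -> list[str]:
    """Return the ids of items missing any field the layout references."""
    with open(path) as f:
        items = json.load(f)
    return [it.get("id", "<no id>")
            for it in items
            if not REQUIRED <= it.keys()]
```

Running this against sample-data.json before launching the task catches a missing field as a listed id rather than as a blank panel in the annotation interface.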

Get This Design

View on GitHub

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/bigcodebench-human-baseline
potato start config.yaml

Details

Annotation Types

radio, likert, text

Domain

Software Engineering, Code Generation

Use Cases

Code Evaluation, Benchmark Baseline

Tags

bigcodebench, code-generation, function-calls, benchmark, agentic-coding

Found an issue or want to improve this design?

Open an Issue