BigCodeBench Human Baseline Evaluation
Evaluate agent-generated code solutions for BigCodeBench tasks. Annotators assess correctness against test suites, rate task complexity, evaluate code quality, and provide notes on the solution approach.
Configuration File: config.yaml
# BigCodeBench Human Baseline Evaluation
# Based on "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions" (Zhuo et al., ICLR 2025)
# Task: Evaluate agent-generated code solutions for correctness, complexity, and quality
annotation_task_name: "BigCodeBench Human Baseline Evaluation"
task_dir: "."
data_files:
- sample-data.json
item_properties:
id_key: "id"
text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
<div class="container" style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif; max-width: 1400px; margin: 0 auto;">
<div style="border: 1px solid #d0d7de; border-radius: 6px; padding: 16px; margin-bottom: 12px; background: #f6f8fa;">
<h3 style="margin-top: 0; color: #24292f; border-bottom: 1px solid #d0d7de; padding-bottom: 8px;">Task Description</h3>
<div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6; color: #1f2328;">{{text}}</div>
</div>
<div style="border: 1px solid #30363d; border-radius: 6px; overflow: hidden; margin-bottom: 12px;">
<div style="background: #2d333b; color: #adbac7; padding: 8px 16px; font-weight: 600; font-size: 13px;">Function Signature</div>
<pre style="margin: 0; padding: 12px 16px; background: #22272e; color: #adbac7; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 13px; line-height: 1.5;">{{function_signature}}</pre>
</div>
<div style="border: 1px solid #30363d; border-radius: 6px; overflow: hidden; margin-bottom: 12px;">
<div style="background: #1a2233; color: #58a6ff; padding: 8px 16px; font-weight: 600; font-size: 13px;">Agent's Generated Code</div>
<pre style="margin: 0; padding: 12px 16px; background: #0d1117; color: #c9d1d9; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{agent_code}}</pre>
</div>
<div style="border: 1px solid #1a3a1a; border-radius: 6px; overflow: hidden;">
<div style="background: #0d1117; color: #3fb950; padding: 8px 16px; font-weight: 600; font-size: 13px;">Execution Results</div>
<pre style="margin: 0; padding: 12px 16px; background: #0a0e14; color: #3fb950; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{execution_result}}</pre>
</div>
</div>
annotation_schemes:
- name: "solution_correctness"
description: "Does the generated solution pass the test suite?"
annotation_type: radio
labels:
- "Correct — passes all tests"
- "Partially Correct — passes some tests"
- "Incorrect — fails tests"
keyboard_shortcuts:
"Correct — passes all tests": "1"
"Partially Correct — passes some tests": "2"
"Incorrect — fails tests": "3"
- name: "complexity_assessment"
description: "How complex is this coding task?"
annotation_type: radio
labels:
- "Easy"
- "Medium"
- "Hard"
- "Expert"
keyboard_shortcuts:
"Easy": "4"
"Medium": "5"
"Hard": "6"
"Expert": "7"
- name: "code_quality"
description: "Rate the overall quality of the generated code"
annotation_type: likert
min_label: "Very Poor"
max_label: "Excellent"
size: 5
- name: "human_notes"
description: "Notes on the solution approach, edge cases, or issues"
annotation_type: text
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
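Because both radio schemes share a single keyboard namespace, the complexity labels start at "4" so they never collide with the correctness shortcuts "1"–"3". A minimal sketch of that uniqueness check — the dicts below simply mirror the config above; this is an illustrative check, not part of Potato itself:

```python
# Keyboard shortcuts copied from the two radio schemes in config.yaml.
schemes = {
    "solution_correctness": {
        "Correct — passes all tests": "1",
        "Partially Correct — passes some tests": "2",
        "Incorrect — fails tests": "3",
    },
    "complexity_assessment": {
        "Easy": "4", "Medium": "5", "Hard": "6", "Expert": "7",
    },
}

# Flatten all shortcut keys and assert no two labels claim the same key.
all_keys = [key for scheme in schemes.values() for key in scheme.values()]
assert len(all_keys) == len(set(all_keys)), "duplicate keyboard shortcut"
```

Running this after any edit to the schemes catches a reused key before annotators hit it in the interface.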
Sample Data: sample-data.json
[
{
"id": "bcb-001",
"text": "Write a function that reads a CSV file, groups rows by a specified column, computes the mean of a numeric column for each group, and returns a sorted bar chart as a matplotlib Figure. Handle missing values by dropping them before aggregation. The function should validate that both columns exist in the CSV.",
"function_signature": "def task_func(csv_path: str, group_col: str, value_col: str) -> matplotlib.figure.Figure:",
"agent_code": "import pandas as pd\nimport matplotlib.pyplot as plt\nimport matplotlib.figure\n\ndef task_func(csv_path: str, group_col: str, value_col: str) -> matplotlib.figure.Figure:\n df = pd.read_csv(csv_path)\n \n if group_col not in df.columns:\n raise ValueError(f\"Column '{group_col}' not found in CSV\")\n if value_col not in df.columns:\n raise ValueError(f\"Column '{value_col}' not found in CSV\")\n \n df = df.dropna(subset=[group_col, value_col])\n grouped = df.groupby(group_col)[value_col].mean().sort_values(ascending=False)\n \n fig, ax = plt.subplots(figsize=(10, 6))\n grouped.plot(kind='bar', ax=ax)\n ax.set_xlabel(group_col)\n ax.set_ylabel(f'Mean {value_col}')\n ax.set_title(f'Mean {value_col} by {group_col}')\n plt.tight_layout()\n \n return fig",
"execution_result": "$ python -m pytest test_task_001.py -v\ntest_task_001.py::test_basic_grouping PASSED\ntest_task_001.py::test_missing_values_dropped PASSED\ntest_task_001.py::test_invalid_group_column PASSED\ntest_task_001.py::test_invalid_value_column PASSED\ntest_task_001.py::test_sorted_descending PASSED\ntest_task_001.py::test_returns_figure PASSED\n\n6 passed in 1.23s"
},
{
"id": "bcb-002",
"text": "Write a function that takes a directory path, recursively finds all Python files, parses each file's AST to extract function definitions, and returns a dictionary mapping file paths to lists of function names. Skip files that have syntax errors. Include nested functions.",
"function_signature": "def task_func(directory: str) -> dict[str, list[str]]:",
"agent_code": "import os\nimport ast\nfrom typing import Dict, List\n\ndef task_func(directory: str) -> Dict[str, List[str]]:\n result = {}\n \n for root, dirs, files in os.walk(directory):\n for fname in files:\n if not fname.endswith('.py'):\n continue\n fpath = os.path.join(root, fname)\n try:\n with open(fpath, 'r') as f:\n source = f.read()\n tree = ast.parse(source)\n except SyntaxError:\n continue\n \n functions = []\n for node in ast.walk(tree):\n if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):\n functions.append(node.name)\n \n if functions:\n result[fpath] = functions\n \n return result",
"execution_result": "$ python -m pytest test_task_002.py -v\ntest_task_002.py::test_basic_extraction PASSED\ntest_task_002.py::test_nested_functions PASSED\ntest_task_002.py::test_async_functions PASSED\ntest_task_002.py::test_syntax_error_skip PASSED\ntest_task_002.py::test_empty_directory PASSED\ntest_task_002.py::test_no_python_files PASSED\ntest_task_002.py::test_subdirectories PASSED\n\n7 passed in 0.45s"
}
]
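Each data item must supply the `id` key named by `id_key` plus every field the `html_layout` placeholders (`{{text}}`, `{{function_signature}}`, `{{agent_code}}`, `{{execution_result}}`) substitute in. A quick sanity check, sketched here with an inline stand-in for reading sample-data.json from disk; the helper is hypothetical, not part of Potato:

```python
# Fields referenced by id_key and the {{...}} placeholders in html_layout.
REQUIRED = {"id", "text", "function_signature", "agent_code", "execution_result"}

def check_items(items):
    """Return the ids of items missing any required field (illustrative)."""
    bad = []
    for item in items:
        if REQUIRED - item.keys():
            bad.append(item.get("id", "<no id>"))
    return bad

# Inline stand-in for json.load(open("sample-data.json")):
items = [
    {"id": "bcb-001", "text": "...", "function_signature": "...",
     "agent_code": "...", "execution_result": "..."},
    {"id": "bcb-003", "text": "..."},  # deliberately incomplete
]
print(check_items(items))  # → ['bcb-003']
```

A check like this is worth running before launching an annotation job, since a missing field renders as an empty pane rather than an error.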
// ... and 6 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/bigcodebench-human-baseline
potato start config.yaml
Related Designs
RefactorBench Multi-File Evaluation
Evaluate multi-file refactoring operations generated by coding agents. Annotators assess whether refactorings preserve behavior, identify the types of refactoring applied, rate code improvement, and provide detailed review comments.
AgentBoard Progress Scoring
Assess multi-turn LLM agent progress by identifying achieved milestones, scoring overall progress, categorizing the agent environment, and noting partial progress observations.
Coreference Resolution (OntoNotes)
Link pronouns and noun phrases to the entities they refer to in text. Based on the OntoNotes coreference annotation guidelines and CoNLL shared tasks. Identify mention spans and cluster coreferent mentions together.