SWE-bench: Code Agent Solution Evaluation
Evaluation of code agent solutions to real GitHub issues. Annotators review code patches generated by AI agents, assess correctness, check test compatibility, and evaluate code quality.
Configuration File: config.yaml
# SWE-bench: Code Agent Solution Evaluation
# Based on "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" (Jimenez et al., ICLR 2024)
# Task: Review code patches generated by AI agents for real GitHub issues
annotation_task_name: "SWE-bench Code Agent Evaluation"
task_dir: "."
# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Display layout showing issue, patch, and test results
html_layout: |
  <div class="swebench-container">
    <div class="repo-info" style="background: #e8eaf6; padding: 10px; border-radius: 8px; margin-bottom: 15px;">
      <strong>Repository:</strong> {{repo_name}}
    </div>
    <div class="issue-section" style="background: #e8f5e9; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
      <h3 style="margin-top: 0;">GitHub Issue:</h3>
      <div class="issue-text" style="font-size: 15px; white-space: pre-wrap;">{{text}}</div>
    </div>
    <div class="patch-section" style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #424242;">
      <h3 style="margin-top: 0;">Agent-Generated Patch:</h3>
      <pre style="white-space: pre-wrap; font-family: monospace; font-size: 13px; line-height: 1.5; overflow-x: auto;">{{patch}}</pre>
    </div>
    <div class="test-section" style="background: #fff3e0; padding: 15px; border-radius: 8px; border: 2px solid #f9a825;">
      <h3 style="margin-top: 0; color: #f9a825;">Test Results:</h3>
      <pre style="white-space: pre-wrap; font-family: monospace; font-size: 13px;">{{test_results}}</pre>
    </div>
  </div>
# Annotation schemes
annotation_schemes:
  # Correctness assessment
  - name: "correctness"
    description: "Does the patch correctly fix the reported issue?"
    annotation_type: radio
    labels:
      - "Correct - fully fixes the issue"
      - "Partially Correct - fixes some aspects but not all"
      - "Incorrect - does not fix the issue"
      - "Introduces new bugs - fix creates other problems"
    keyboard_shortcuts:
      "Correct - fully fixes the issue": "1"
      "Partially Correct - fixes some aspects but not all": "2"
      "Incorrect - does not fix the issue": "3"
      "Introduces new bugs - fix creates other problems": "4"
  # Code quality assessment
  - name: "code_quality"
    description: "Rate the overall quality of the generated code."
    annotation_type: radio
    labels:
      - "Clean - well-structured, follows conventions, maintainable"
      - "Acceptable - works but could be improved"
      - "Poor - messy, violates conventions, hard to maintain"
    keyboard_shortcuts:
      "Clean - well-structured, follows conventions, maintainable": "a"
      "Acceptable - works but could be improved": "s"
      "Poor - messy, violates conventions, hard to maintain": "d"
  # Review comments
  - name: "review_comments"
    description: "Provide code review comments explaining your assessment. Note specific issues, suggestions for improvement, or edge cases."
    annotation_type: text
    min_length: 20
    max_length: 600
    placeholder: "Provide specific code review feedback: correctness issues, style concerns, edge cases missed, test coverage gaps..."
  # Test compatibility
  - name: "test_compatibility"
    description: "Are the existing tests passing with this patch?"
    annotation_type: radio
    labels:
      - "All tests pass"
      - "Most tests pass (minor failures)"
      - "Some tests fail"
      - "Many tests fail"
      - "Cannot determine from provided info"
    keyboard_shortcuts:
      "All tests pass": "z"
      "Most tests pass (minor failures)": "x"
      "Some tests fail": "c"
      "Many tests fail": "v"
      "Cannot determine from provided info": "b"
# User configuration
allow_all_users: true
# Task assignment
instances_per_annotator: 50
annotation_per_instance: 2
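The three radio schemes in this config are shown on the same page, so their keyboard shortcuts effectively share one key space, and a key bound twice would likely be ambiguous. The sketch below checks a parsed config for two properties: every label has a shortcut, and no key is bound more than once. It assumes the YAML has already been loaded into plain Python objects (for example with PyYAML's `yaml.safe_load`); the helper name `check_shortcuts` is hypothetical and not part of Potato, and the labels in the example are shortened stand-ins.

```python
from collections import defaultdict


def check_shortcuts(schemes):
    """Return a list of human-readable problems found in the radio schemes.

    `schemes` is the parsed `annotation_schemes` list from config.yaml.
    This helper is illustrative only; it is not part of Potato.
    """
    problems = []
    key_owners = defaultdict(list)  # key -> [(scheme name, label), ...]

    for scheme in schemes:
        if scheme.get("annotation_type") != "radio":
            continue  # text schemes have no labels or shortcuts
        name = scheme["name"]
        labels = scheme.get("labels", [])
        shortcuts = scheme.get("keyboard_shortcuts", {})

        for label in labels:
            if label not in shortcuts:
                problems.append(f"{name}: label {label!r} has no shortcut")
        for label, key in shortcuts.items():
            if label not in labels:
                problems.append(f"{name}: shortcut for unknown label {label!r}")
            key_owners[key].append((name, label))

    for key, owners in key_owners.items():
        if len(owners) > 1:
            problems.append(f"key {key!r} is bound {len(owners)} times: {owners}")
    return problems


# Abbreviated stand-in for the parsed config (labels shortened for the example).
schemes = [
    {"name": "correctness", "annotation_type": "radio",
     "labels": ["Correct", "Incorrect"],
     "keyboard_shortcuts": {"Correct": "1", "Incorrect": "3"}},
    {"name": "code_quality", "annotation_type": "radio",
     "labels": ["Clean", "Poor"],
     "keyboard_shortcuts": {"Clean": "a", "Poor": "d"}},
    {"name": "review_comments", "annotation_type": "text"},
]

print(check_shortcuts(schemes))  # prints [] for this example
```

In the full config the bound keys are 1 to 4, a/s/d, and z/x/c/v/b, which are all distinct, so a check like this would be expected to report nothing.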
Sample Data: sample-data.json
[
{
"id": "swe_001",
"text": "Issue #4521: `DataFrame.groupby().apply()` raises TypeError when the applied function returns a scalar value for groups with a single row.\n\nSteps to reproduce:\n```python\nimport pandas as pd\ndf = pd.DataFrame({'key': ['a', 'b', 'b'], 'value': [1, 2, 3]})\nresult = df.groupby('key').apply(lambda x: x['value'].sum())\n```\nExpected: Returns a Series with group sums.\nActual: TypeError: 'int' object is not iterable",
"patch": "--- a/pandas/core/groupby/groupby.py\n+++ b/pandas/core/groupby/groupby.py\n@@ -1245,7 +1245,10 @@ class GroupBy:\n def _apply_func(self, func, data):\n result = func(data)\n- if isinstance(result, DataFrame):\n+ if is_scalar(result):\n+ return Series([result], index=data.index[:1])\n+ elif isinstance(result, DataFrame):\n return result\n elif isinstance(result, Series):\n return result\n else:\n return Series(result)",
"repo_name": "pandas-dev/pandas",
"test_results": "PASSED: test_groupby_apply_scalar_return (3 subtests)\nPASSED: test_groupby_apply_dataframe_return\nPASSED: test_groupby_apply_series_return\nFAILED: test_groupby_apply_empty_groups - AssertionError: Expected empty DataFrame, got Series\n\nTotal: 15 passed, 1 failed, 0 errors"
},
{
"id": "swe_002",
"text": "Issue #8734: `json.dumps()` does not properly handle `datetime.date` objects, raising `TypeError: Object of type date is not JSON serializable`.\n\nThis should be handled gracefully with a default serializer or a clear error message suggesting a solution.",
"patch": "--- a/Lib/json/encoder.py\n+++ b/Lib/json/encoder.py\n@@ -178,6 +178,8 @@ class JSONEncoder:\n def default(self, o):\n if isinstance(o, datetime.datetime):\n return o.isoformat()\n+ elif isinstance(o, datetime.date):\n+ return o.isoformat()\n raise TypeError(f\"Object of type {type(o).__name__} is not JSON serializable. \"\n f\"Consider using a custom encoder or converting to a \"\n f\"serializable type first.\")",
"repo_name": "python/cpython",
"test_results": "PASSED: test_date_serialization\nPASSED: test_datetime_serialization\nPASSED: test_custom_encoder_still_works\nPASSED: test_error_message_improved\nPASSED: test_nested_date_objects\n\nTotal: 12 passed, 0 failed, 0 errors"
}
]
// ... and 8 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/swebench-code-agent-eval
potato start config.yaml
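Before starting the server, it can help to confirm that every item in sample-data.json supplies each field the layout references (`repo_name`, `text`, `patch`, `test_results`) plus the `id` key. The sketch below does that and also extracts the counts from the `Total: ... passed, ... failed, ... errors` summary line. Both helper names are hypothetical, the `demo_missing` item is made up to show a failing case, and the summary-line format is assumed from the two sample items on this page rather than from a documented schema.

```python
import json
import re

# Fields referenced by the layout's {{...}} placeholders, plus the id key.
REQUIRED_FIELDS = {"id", "text", "patch", "repo_name", "test_results"}

# Matches e.g. "Total: 15 passed, 1 failed, 0 errors" (format assumed from the samples).
TOTAL_RE = re.compile(r"Total:\s*(\d+) passed,\s*(\d+) failed,\s*(\d+) errors")


def missing_fields(items):
    """Map item id (or list position) to the sorted list of fields it lacks."""
    report = {}
    for pos, item in enumerate(items):
        missing = sorted(REQUIRED_FIELDS - set(item))
        if missing:
            report[item.get("id", f"#{pos}")] = missing
    return report


def parse_totals(test_results):
    """Return (passed, failed, errors), or None if no summary line is found."""
    match = TOTAL_RE.search(test_results)
    return tuple(int(n) for n in match.groups()) if match else None


# Inline stand-in for sample-data.json. To check the real file instead, load it
# with json.load(open("sample-data.json", encoding="utf-8")).
items = json.loads("""
[
  {"id": "swe_001", "text": "...", "patch": "...",
   "repo_name": "pandas-dev/pandas",
   "test_results": "Total: 15 passed, 1 failed, 0 errors"},
  {"id": "demo_missing", "text": "...", "repo_name": "example/repo"}
]
""")

print(missing_fields(items))
print(parse_totals(items[0]["test_results"]))
```

The parsed counts are only a convenience for spotting malformed items; the `test_compatibility` label remains the annotator's judgment, since the summary line does not say which failures matter.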
Related Designs
AgentRewardBench Trajectory Scoring
Evaluate web agent trajectories by rating step-level quality across multiple dimensions, judging overall success, and identifying where automatic evaluators disagree with human judgment.
Code Review Annotation (CodeReviewer)
Annotation of code review activities based on the CodeReviewer benchmark. Annotators identify issues in code diffs, classify defect types, assign severity levels, make review decisions, and provide natural language review comments, supporting research in automated code review and software engineering.
DevBench Repository Evaluation
Evaluate AI-generated repositories across the full software development lifecycle. Annotators rate architecture design, code quality, test coverage, documentation, and dependency management for generated projects.