SWE-bench: Code Agent Solution Evaluation

An annotation task for evaluating code agent solutions to real GitHub issues. Annotators review patches generated by AI agents, assess their correctness, check test compatibility, and rate code quality.

Configuration File: config.yaml

yaml

# SWE-bench: Code Agent Solution Evaluation
# Based on "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" (Jimenez et al., ICLR 2024)
# Task: Review code patches generated by AI agents for real GitHub issues

annotation_task_name: "SWE-bench Code Agent Evaluation"
task_dir: "."

# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

# Display layout showing issue, patch, and test results
html_layout: |
  <div class="swebench-container">
    <div class="repo-info" style="background: #e8eaf6; padding: 10px; border-radius: 8px; margin-bottom: 15px;">
      <strong>Repository:</strong> {{repo_name}}
    </div>
    <div class="issue-section" style="background: #e8f5e9; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
      <h3 style="margin-top: 0;">GitHub Issue:</h3>
      <div class="issue-text" style="font-size: 15px; white-space: pre-wrap;">{{text}}</div>
    </div>
    <div class="patch-section" style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #424242;">
      <h3 style="margin-top: 0;">Agent-Generated Patch:</h3>
      <pre style="white-space: pre-wrap; font-family: monospace; font-size: 13px; line-height: 1.5; overflow-x: auto;">{{patch}}</pre>
    </div>
    <div class="test-section" style="background: #fff3e0; padding: 15px; border-radius: 8px; border: 2px solid #f9a825;">
      <h3 style="margin-top: 0; color: #f9a825;">Test Results:</h3>
      <pre style="white-space: pre-wrap; font-family: monospace; font-size: 13px;">{{test_results}}</pre>
    </div>
  </div>

# Annotation schemes
annotation_schemes:
  # Correctness assessment
  - name: "correctness"
    description: "Does the patch correctly fix the reported issue?"
    annotation_type: radio
    labels:
      - "Correct - fully fixes the issue"
      - "Partially Correct - fixes some aspects but not all"
      - "Incorrect - does not fix the issue"
      - "Introduces new bugs - fix creates other problems"
    keyboard_shortcuts:
      "Correct - fully fixes the issue": "1"
      "Partially Correct - fixes some aspects but not all": "2"
      "Incorrect - does not fix the issue": "3"
      "Introduces new bugs - fix creates other problems": "4"

  # Code quality assessment
  - name: "code_quality"
    description: "Rate the overall quality of the generated code."
    annotation_type: radio
    labels:
      - "Clean - well-structured, follows conventions, maintainable"
      - "Acceptable - works but could be improved"
      - "Poor - messy, violates conventions, hard to maintain"
    keyboard_shortcuts:
      "Clean - well-structured, follows conventions, maintainable": "a"
      "Acceptable - works but could be improved": "s"
      "Poor - messy, violates conventions, hard to maintain": "d"

  # Review comments
  - name: "review_comments"
    description: "Provide code review comments explaining your assessment. Note specific issues, suggestions for improvement, or edge cases."
    annotation_type: text
    min_length: 20
    max_length: 600
    placeholder: "Provide specific code review feedback: correctness issues, style concerns, edge cases missed, test coverage gaps..."

  # Test compatibility
  - name: "test_compatibility"
    description: "Are the existing tests passing with this patch?"
    annotation_type: radio
    labels:
      - "All tests pass"
      - "Most tests pass (minor failures)"
      - "Some tests fail"
      - "Many tests fail"
      - "Cannot determine from provided info"
    keyboard_shortcuts:
      "All tests pass": "z"
      "Most tests pass (minor failures)": "x"
      "Some tests fail": "c"
      "Many tests fail": "v"
      "Cannot determine from provided info": "b"

# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 50
annotation_per_instance: 2
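
The layout above pulls item fields in through `{{...}}` placeholders (`repo_name`, `text`, `patch`, `test_results`), so every record in the data file needs those keys plus the configured `id_key`. The following is a minimal sketch of a pre-flight check, not part of Potato itself: the regex, the helper names `layout_fields` and `missing_fields`, and the abbreviated `LAYOUT` string are all assumptions made for illustration.

```python
import re

# Abbreviated stand-in for html_layout: only the placeholder-bearing lines are kept.
LAYOUT = """
<strong>Repository:</strong> {{repo_name}}
<div class="issue-text">{{text}}</div>
<pre>{{patch}}</pre>
<pre>{{test_results}}</pre>
"""

# Assumed placeholder syntax: a word-like name wrapped in double braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def layout_fields(layout: str) -> set:
    """Return the field names referenced as {{name}} in a layout string."""
    return set(PLACEHOLDER.findall(layout))


def missing_fields(item: dict, layout: str, id_key: str = "id") -> set:
    """Return the fields required by the layout (plus id_key) that the item lacks."""
    required = layout_fields(layout) | {id_key}
    return required - set(item)


# Hypothetical, deliberately incomplete record: it has no "test_results" key.
item = {
    "id": "swe_001",
    "text": "issue text",
    "patch": "diff text",
    "repo_name": "pandas-dev/pandas",
}
print(missing_fields(item, LAYOUT))  # {'test_results'}
```

Running a check like this over the whole data file before starting the server can surface records that would otherwise render with an empty panel.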

Sample Data: sample-data.json

json

[
  {
    "id": "swe_001",
    "text": "Issue #4521: `DataFrame.groupby().apply()` raises TypeError when the applied function returns a scalar value for groups with a single row.\n\nSteps to reproduce:\n```python\nimport pandas as pd\ndf = pd.DataFrame({'key': ['a', 'b', 'b'], 'value': [1, 2, 3]})\nresult = df.groupby('key').apply(lambda x: x['value'].sum())\n```\nExpected: Returns a Series with group sums.\nActual: TypeError: 'int' object is not iterable",
    "patch": "--- a/pandas/core/groupby/groupby.py\n+++ b/pandas/core/groupby/groupby.py\n@@ -1245,7 +1245,10 @@ class GroupBy:\n     def _apply_func(self, func, data):\n         result = func(data)\n-        if isinstance(result, DataFrame):\n+        if is_scalar(result):\n+            return Series([result], index=data.index[:1])\n+        elif isinstance(result, DataFrame):\n             return result\n         elif isinstance(result, Series):\n             return result\n         else:\n             return Series(result)",
    "repo_name": "pandas-dev/pandas",
    "test_results": "PASSED: test_groupby_apply_scalar_return (3 subtests)\nPASSED: test_groupby_apply_dataframe_return\nPASSED: test_groupby_apply_series_return\nFAILED: test_groupby_apply_empty_groups - AssertionError: Expected empty DataFrame, got Series\n\nTotal: 15 passed, 1 failed, 0 errors"
  },
  {
    "id": "swe_002",
    "text": "Issue #8734: `json.dumps()` does not properly handle `datetime.date` objects, raising `TypeError: Object of type date is not JSON serializable`.\n\nThis should be handled gracefully with a default serializer or a clear error message suggesting a solution.",
    "patch": "--- a/Lib/json/encoder.py\n+++ b/Lib/json/encoder.py\n@@ -178,6 +178,8 @@ class JSONEncoder:\n     def default(self, o):\n         if isinstance(o, datetime.datetime):\n             return o.isoformat()\n+        elif isinstance(o, datetime.date):\n+            return o.isoformat()\n         raise TypeError(f\"Object of type {type(o).__name__} is not JSON serializable. \"\n                        f\"Consider using a custom encoder or converting to a \"\n                        f\"serializable type first.\")",
    "repo_name": "python/cpython",
    "test_results": "PASSED: test_date_serialization\nPASSED: test_datetime_serialization\nPASSED: test_custom_encoder_still_works\nPASSED: test_error_message_improved\nPASSED: test_nested_date_objects\n\nTotal: 12 passed, 0 failed, 0 errors"
  }
]

// ... and 8 more items
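
Each `test_results` string in the samples above ends with a `Total: N passed, N failed, N errors` line and lists failures on lines starting with `FAILED:`. As a rough sketch, assuming every item follows that same format (only two items are shown here, so this is not guaranteed), the counts and failing test names could be extracted like this. The helper names `parse_summary` and `failed_tests` are hypothetical.

```python
import re

# Assumed summary format, taken from the two sample items shown above.
SUMMARY = re.compile(r"Total:\s*(\d+) passed,\s*(\d+) failed,\s*(\d+) errors")


def parse_summary(test_results: str):
    """Return pass/fail/error counts from the summary line, or None if absent."""
    match = SUMMARY.search(test_results)
    if match is None:
        return None
    passed, failed, errors = (int(group) for group in match.groups())
    return {"passed": passed, "failed": failed, "errors": errors}


def failed_tests(test_results: str) -> list:
    """Return the names of tests reported on lines that start with 'FAILED:'."""
    names = []
    for line in test_results.splitlines():
        if line.startswith("FAILED:"):
            rest = line.split(":", 1)[1]
            # The test name precedes the " - <error message>" suffix.
            names.append(rest.split(" - ")[0].strip())
    return names


# test_results of sample item swe_001.
results = (
    "PASSED: test_groupby_apply_scalar_return (3 subtests)\n"
    "PASSED: test_groupby_apply_dataframe_return\n"
    "PASSED: test_groupby_apply_series_return\n"
    "FAILED: test_groupby_apply_empty_groups - AssertionError: "
    "Expected empty DataFrame, got Series\n\n"
    "Total: 15 passed, 1 failed, 0 errors"
)
print(parse_summary(results))  # {'passed': 15, 'failed': 1, 'errors': 0}
print(failed_tests(results))   # ['test_groupby_apply_empty_groups']
```

The parsed counts are only a reading aid. Choosing between labels such as "Most tests pass (minor failures)" and "Some tests fail" remains the annotator's judgment.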

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/swebench-code-agent-eval
potato start config.yaml
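
Because the configuration assigns two annotators to each instance, a natural follow-up once annotation is done is to measure how often they agree on a radio scheme such as `correctness`. The sketch below computes Cohen's kappa from two parallel lists of labels. How those lists are extracted from the files in `annotation_output/` is not shown, since the output layout is not described here. The function name `cohens_kappa` and the example ratings are illustrative assumptions.

```python
from collections import Counter

CORRECT = "Correct - fully fixes the issue"
INCORRECT = "Incorrect - does not fix the issue"


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up ratings for four instances (not real annotation results).
annotator_1 = [CORRECT, CORRECT, INCORRECT, INCORRECT]
annotator_2 = [CORRECT, INCORRECT, INCORRECT, INCORRECT]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because the label distributions are likely to be skewed.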

Details

Annotation Types

radio, text

Domain

Software Engineering, Code Agents, Evaluation

Use Cases

Code Review, Agent Evaluation, Patch Assessment
