
SWE-bench: Code Agent Solution Evaluation

Evaluation of code agent solutions to real GitHub issues. Annotators review code patches generated by AI agents, assess correctness, check test compatibility, and evaluate code quality.


Configuration file: config.yaml

# SWE-bench: Code Agent Solution Evaluation
# Based on "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" (Jimenez et al., ICLR 2024)
# Task: Review code patches generated by AI agents for real GitHub issues

annotation_task_name: "SWE-bench Code Agent Evaluation"
task_dir: "."

# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

# Display layout showing issue, patch, and test results
html_layout: |
  <div class="swebench-container">
    <div class="repo-info" style="background: #e8eaf6; padding: 10px; border-radius: 8px; margin-bottom: 15px;">
      <strong>Repository:</strong> {{repo_name}}
    </div>
    <div class="issue-section" style="background: #e8f5e9; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
      <h3 style="margin-top: 0;">GitHub Issue:</h3>
      <div class="issue-text" style="font-size: 15px; white-space: pre-wrap;">{{text}}</div>
    </div>
    <div class="patch-section" style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #424242;">
      <h3 style="margin-top: 0;">Agent-Generated Patch:</h3>
      <pre style="white-space: pre-wrap; font-family: monospace; font-size: 13px; line-height: 1.5; overflow-x: auto;">{{patch}}</pre>
    </div>
    <div class="test-section" style="background: #fff3e0; padding: 15px; border-radius: 8px; border: 2px solid #f9a825;">
      <h3 style="margin-top: 0; color: #f9a825;">Test Results:</h3>
      <pre style="white-space: pre-wrap; font-family: monospace; font-size: 13px;">{{test_results}}</pre>
    </div>
  </div>

# Annotation schemes
annotation_schemes:
  # Correctness assessment
  - name: "correctness"
    description: "Does the patch correctly fix the reported issue?"
    annotation_type: radio
    labels:
      - "Correct - fully fixes the issue"
      - "Partially Correct - fixes some aspects but not all"
      - "Incorrect - does not fix the issue"
      - "Introduces new bugs - fix creates other problems"
    keyboard_shortcuts:
      "Correct - fully fixes the issue": "1"
      "Partially Correct - fixes some aspects but not all": "2"
      "Incorrect - does not fix the issue": "3"
      "Introduces new bugs - fix creates other problems": "4"

  # Code quality assessment
  - name: "code_quality"
    description: "Rate the overall quality of the generated code."
    annotation_type: radio
    labels:
      - "Clean - well-structured, follows conventions, maintainable"
      - "Acceptable - works but could be improved"
      - "Poor - messy, violates conventions, hard to maintain"
    keyboard_shortcuts:
      "Clean - well-structured, follows conventions, maintainable": "a"
      "Acceptable - works but could be improved": "s"
      "Poor - messy, violates conventions, hard to maintain": "d"

  # Review comments
  - name: "review_comments"
    description: "Provide code review comments explaining your assessment. Note specific issues, suggestions for improvement, or edge cases."
    annotation_type: text
    min_length: 20
    max_length: 600
    placeholder: "Provide specific code review feedback: correctness issues, style concerns, edge cases missed, test coverage gaps..."

  # Test compatibility
  - name: "test_compatibility"
    description: "Are the existing tests passing with this patch?"
    annotation_type: radio
    labels:
      - "All tests pass"
      - "Most tests pass (minor failures)"
      - "Some tests fail"
      - "Many tests fail"
      - "Cannot determine from provided info"
    keyboard_shortcuts:
      "All tests pass": "z"
      "Most tests pass (minor failures)": "x"
      "Some tests fail": "c"
      "Many tests fail": "v"
      "Cannot determine from provided info": "b"

# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 50
annotation_per_instance: 2
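
The `{{...}}` placeholders in `html_layout` above are filled in from each data item's fields (`repo_name`, `text`, `patch`, `test_results`). Potato performs this substitution internally; as a rough illustration only (not Potato's actual renderer), the mechanism behaves roughly like:

```python
import re

# A minimal template fragment mirroring part of the html_layout above.
html_layout = '<div class="repo-info"><strong>Repository:</strong> {{repo_name}}</div>'

# One item in the shape expected by item_properties (id_key/text_key).
item = {"id": "swe_001", "text": "Issue #4521: ...", "repo_name": "pandas-dev/pandas"}

def render(layout: str, item: dict) -> str:
    # Replace every {{key}} placeholder with the item's value for that key;
    # missing keys become empty strings.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(item.get(m.group(1), "")), layout)

html = render(html_layout, item)
```

Any field referenced in the layout must therefore exist in every item of sample-data.json, or the corresponding section will render empty.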

Sample data: sample-data.json

[
  {
    "id": "swe_001",
    "text": "Issue #4521: `DataFrame.groupby().apply()` raises TypeError when the applied function returns a scalar value for groups with a single row.\n\nSteps to reproduce:\n```python\nimport pandas as pd\ndf = pd.DataFrame({'key': ['a', 'b', 'b'], 'value': [1, 2, 3]})\nresult = df.groupby('key').apply(lambda x: x['value'].sum())\n```\nExpected: Returns a Series with group sums.\nActual: TypeError: 'int' object is not iterable",
    "patch": "--- a/pandas/core/groupby/groupby.py\n+++ b/pandas/core/groupby/groupby.py\n@@ -1245,7 +1245,10 @@ class GroupBy:\n     def _apply_func(self, func, data):\n         result = func(data)\n-        if isinstance(result, DataFrame):\n+        if is_scalar(result):\n+            return Series([result], index=data.index[:1])\n+        elif isinstance(result, DataFrame):\n             return result\n         elif isinstance(result, Series):\n             return result\n         else:\n             return Series(result)",
    "repo_name": "pandas-dev/pandas",
    "test_results": "PASSED: test_groupby_apply_scalar_return (3 subtests)\nPASSED: test_groupby_apply_dataframe_return\nPASSED: test_groupby_apply_series_return\nFAILED: test_groupby_apply_empty_groups - AssertionError: Expected empty DataFrame, got Series\n\nTotal: 15 passed, 1 failed, 0 errors"
  },
  {
    "id": "swe_002",
    "text": "Issue #8734: `json.dumps()` does not properly handle `datetime.date` objects, raising `TypeError: Object of type date is not JSON serializable`.\n\nThis should be handled gracefully with a default serializer or a clear error message suggesting a solution.",
    "patch": "--- a/Lib/json/encoder.py\n+++ b/Lib/json/encoder.py\n@@ -178,6 +178,8 @@ class JSONEncoder:\n     def default(self, o):\n         if isinstance(o, datetime.datetime):\n             return o.isoformat()\n+        elif isinstance(o, datetime.date):\n+            return o.isoformat()\n         raise TypeError(f\"Object of type {type(o).__name__} is not JSON serializable. \"\n                        f\"Consider using a custom encoder or converting to a \"\n                        f\"serializable type first.\")",
    "repo_name": "python/cpython",
    "test_results": "PASSED: test_date_serialization\nPASSED: test_datetime_serialization\nPASSED: test_custom_encoder_still_works\nPASSED: test_error_message_improved\nPASSED: test_nested_date_objects\n\nTotal: 12 passed, 0 failed, 0 errors"
  }
]

// ... and 8 more items
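
The `test_results` field is free text, but the samples above end with a consistent `Total: N passed, M failed, K errors` line. A small parsing sketch (assuming that summary-line format holds across items) could pre-sort items or pre-fill the test-compatibility judgment:

```python
import re

# Example test_results string in the format used by sample-data.json above.
test_results = (
    "PASSED: test_groupby_apply_scalar_return (3 subtests)\n"
    "FAILED: test_groupby_apply_empty_groups - AssertionError\n\n"
    "Total: 15 passed, 1 failed, 0 errors"
)

def summarize(results: str):
    # Pull counts from the trailing summary line; None mirrors the
    # "Cannot determine from provided info" annotation option.
    m = re.search(r"Total: (\d+) passed, (\d+) failed, (\d+) errors", results)
    if m is None:
        return None
    passed, failed, errors = map(int, m.groups())
    return {"passed": passed, "failed": failed, "errors": errors,
            "all_pass": failed == 0 and errors == 0}

summary = summarize(test_results)
```

Items where `summarize` returns `None` would still need the annotator's "Cannot determine" judgment; the parser is only a convenience, not a replacement for review.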

Get this design

View on GitHub

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/swebench-code-agent-eval
potato start config.yaml
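
Since `annotation_per_instance: 2` gives each item two independent judgments, a natural post-processing step is to measure how often annotators agree. A minimal sketch with made-up label pairs (the real pairs would come from the files in `annotation_output/`):

```python
# Hypothetical pairs of "correctness" labels from two annotators per item;
# these example values are illustrative, not real annotation output.
pairs = [
    ("Correct - fully fixes the issue", "Correct - fully fixes the issue"),
    ("Incorrect - does not fix the issue", "Partially Correct - fixes some aspects but not all"),
    ("Correct - fully fixes the issue", "Correct - fully fixes the issue"),
    ("Introduces new bugs - fix creates other problems", "Introduces new bugs - fix creates other problems"),
]

# Raw agreement: fraction of items where both annotators chose the same label.
agreement = sum(a == b for a, b in pairs) / len(pairs)
```

Raw agreement is the simplest measure; for reporting, a chance-corrected statistic such as Cohen's kappa over the same pairs is the usual next step.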

Details

Annotation types

radio, text

Domains

Software Engineering, Code Agents, Evaluation

Use cases

Code Review, Agent Evaluation, Patch Assessment

Tags

swe-bench, code-agent, github-issues, patch-review, software-engineering, evaluation

Found a problem, or want to improve this design?

Create an issue