SWE-bench Verified Issue Validation
Manually validate GitHub issues from SWE-bench to ensure they are well-specified, have adequate test patches, and are solvable. Annotators review the issue description, test patch, and gold patch to determine the quality of each benchmark instance.
Configuration File: config.yaml
# SWE-bench Verified Issue Validation
# Based on "SWE-bench Verified" (Neil Chowdhury, James Aung, Chan Jun Shern et al., OpenAI Technical Report 2024)
# Task: Validate GitHub issues from SWE-bench for specification quality, test adequacy, and solvability
annotation_task_name: "SWE-bench Verified Issue Validation"
task_dir: "."
data_files:
- sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif; max-width: 1400px; margin: 0 auto;">
    <div style="background: #0d1117; color: #c9d1d9; padding: 10px 16px; border-radius: 6px 6px 0 0; font-size: 14px; font-weight: 600;">
      <span style="color: #58a6ff;">{{repo_name}}</span>
    </div>
    <div style="display: flex; gap: 16px; margin-top: 4px;">
      <div style="flex: 1; border: 1px solid #30363d; border-radius: 6px; padding: 16px; background: #f6f8fa;">
        <h3 style="margin-top: 0; color: #24292f; border-bottom: 1px solid #d0d7de; padding-bottom: 8px;">Issue Description</h3>
        <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6; color: #1f2328;">{{text}}</div>
      </div>
      <div style="flex: 1; border: 1px solid #30363d; border-radius: 6px; overflow: hidden;">
        <div style="background: #161b22; color: #c9d1d9; padding: 8px 16px; font-weight: 600; font-size: 13px;">Test Patch</div>
        <pre style="margin: 0; padding: 12px 16px; background: #0d1117; color: #c9d1d9; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{test_patch}}</pre>
      </div>
    </div>
    <div style="margin-top: 12px; border: 1px solid #30363d; border-radius: 6px; overflow: hidden;">
      <div style="background: #161b22; color: #c9d1d9; padding: 8px 16px; font-weight: 600; font-size: 13px;">Gold Patch (Reference Fix)</div>
      <pre style="margin: 0; padding: 12px 16px; background: #0d1117; color: #c9d1d9; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{gold_patch}}</pre>
    </div>
  </div>
annotation_schemes:
  - name: "issue_valid"
    description: "Is the issue well-specified with a clear, reproducible problem?"
    annotation_type: radio
    labels:
      - "Well-specified — clear problem with reproducible steps"
      - "Ambiguous — issue unclear or underspecified"
      - "Invalid — not a real bug or feature request"
    keyboard_shortcuts:
      "Well-specified — clear problem with reproducible steps": "1"
      "Ambiguous — issue unclear or underspecified": "2"
      "Invalid — not a real bug or feature request": "3"
  - name: "test_adequate"
    description: "Do the tests adequately verify the fix?"
    annotation_type: radio
    labels:
      - "Sufficient — tests verify the fix completely"
      - "Partial — tests cover some aspects"
      - "Insufficient — tests don't adequately verify"
    keyboard_shortcuts:
      "Sufficient — tests verify the fix completely": "4"
      "Partial — tests cover some aspects": "5"
      "Insufficient — tests don't adequately verify": "6"
  - name: "solution_exists"
    description: "Is a solution feasible within the existing codebase?"
    annotation_type: radio
    labels:
      - "Solvable — clear fix exists in codebase"
      - "Likely solvable — fix probable but complex"
      - "Unlikely — may require architectural changes"
      - "Unsolvable — impossible given constraints"
    keyboard_shortcuts:
      "Solvable — clear fix exists in codebase": "7"
      "Likely solvable — fix probable but complex": "8"
      "Unlikely — may require architectural changes": "9"
      "Unsolvable — impossible given constraints": "0"
  - name: "validation_notes"
    description: "Explain your validation reasoning"
    annotation_type: text
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
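Before launching the task, it can help to sanity-check that every data item carries the fields the layout templates reference ({{text}}, {{repo_name}}, {{test_patch}}, {{gold_patch}}) plus the configured id_key. A minimal sketch (the field set is taken from the config and sample data above; the helper name is illustrative):

```python
import json

# Fields referenced by id_key/text_key and the html_layout placeholders above.
REQUIRED_KEYS = {"id", "text", "repo_name", "test_patch", "gold_patch"}

def validate_items(items):
    """Return (item_id, missing_keys) pairs for items missing required fields."""
    problems = []
    for item in items:
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append((item.get("id", "<no id>"), sorted(missing)))
    return problems

if __name__ == "__main__":
    with open("sample-data.json") as f:
        items = json.load(f)
    for item_id, missing in validate_items(items):
        print(f"{item_id}: missing {missing}")
```

A missing field would otherwise render as an empty pane in the annotation UI, which is easy to miss during a pilot run.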
Sample Data: sample-data.json
[
{
"id": "swebench-val-001",
"text": "django__django-16527: QuerySet.only() doesn't work with select_related() on reverse OneToOneField relation.\n\nWhen using .only() with .select_related() on a reverse OneToOneField, Django generates a query that includes all fields instead of only the specified ones.\n\nSteps to reproduce:\n1. Create models with OneToOneField relationship\n2. Use queryset.select_related('reverse_relation').only('id', 'reverse_relation__name')\n3. Inspect the generated SQL\n\nExpected: SELECT only specified columns\nActual: SELECT includes all columns from both tables",
"repo_name": "django/django",
"test_patch": "diff --git a/tests/select_related_onetoone/tests.py b/tests/select_related_onetoone/tests.py\nindex 3a4e512f8a..b7c2d91e03 100644\n--- a/tests/select_related_onetoone/tests.py\n+++ b/tests/select_related_onetoone/tests.py\n@@ -187,6 +187,18 @@ class ReverseSelectRelatedTestCase(TestCase):\n+ def test_only_with_select_related_reverse_onetoone(self):\n+ with self.assertNumQueries(1):\n+ qs = UserProfile.objects.select_related('user').only(\n+ 'id', 'user__username'\n+ )\n+ result = list(qs)\n+ self.assertEqual(len(result), 1)\n+ query_str = str(qs.query)\n+ self.assertNotIn('email', query_str)\n+ self.assertNotIn('first_name', query_str)",
"gold_patch": "diff --git a/django/db/models/sql/compiler.py b/django/db/models/sql/compiler.py\nindex 8e4a37b2ec..f3c5d12a91 100644\n--- a/django/db/models/sql/compiler.py\n+++ b/django/db/models/sql/compiler.py\n@@ -1042,7 +1042,10 @@ class SQLCompiler:\n if opts.proxy:\n return self.deferred_to_columns_cb(opts.proxy_for_model._meta, start_alias)\n- if start_alias:\n+ if start_alias and self.query.deferred_loading[0]:\n+ only_load = self.query.deferred_loading[0]\n+ fields = [f for f in opts.concrete_fields if f.attname in only_load]\n+ return {start_alias: {f.column for f in fields}}\n columns = {start_alias: set()}\n for f in opts.concrete_fields:\n if f.column in columns[start_alias]:"
},
{
"id": "swebench-val-002",
"text": "scikit-learn__scikit-learn-25638: HistGradientBoostingClassifier does not accept dataframes with feature names containing special characters.\n\nWhen passing a pandas DataFrame with column names containing brackets or dots, fit() raises a ValueError.\n\nSteps to reproduce:\n```python\nimport pandas as pd\nfrom sklearn.ensemble import HistGradientBoostingClassifier\ndf = pd.DataFrame({'feature[0]': [1,2,3], 'target': [0,1,0]})\nclf = HistGradientBoostingClassifier()\nclf.fit(df[['feature[0]']], df['target'])\n```\nRaises: ValueError: Feature names must match pattern '^[a-zA-Z0-9_]+$'",
"repo_name": "scikit-learn/scikit-learn",
"test_patch": "diff --git a/sklearn/tests/test_common.py b/sklearn/tests/test_common.py\nindex 4f2a891b2..e83c90d17 100644\n--- a/sklearn/tests/test_common.py\n+++ b/sklearn/tests/test_common.py\n@@ -421,6 +421,15 @@ def test_estimators_feature_names():\n+def test_feature_names_special_characters():\n+ pd = pytest.importorskip('pandas')\n+ X = pd.DataFrame({'col[0]': [1, 2, 3], 'col.1': [4, 5, 6]})\n+ y = [0, 1, 0]\n+ est = HistGradientBoostingClassifier(max_iter=1)\n+ est.fit(X, y)\n+ assert est.feature_names_in_[0] == 'col[0]'\n+ assert est.feature_names_in_[1] == 'col.1'",
"gold_patch": "diff --git a/sklearn/utils/validation.py b/sklearn/utils/validation.py\nindex 72ef2a50c..a5b3f891d 100644\n--- a/sklearn/utils/validation.py\n+++ b/sklearn/utils/validation.py\n@@ -1843,8 +1843,7 @@ def _check_feature_names(X, *, reset, feature_names_out=None):\n if hasattr(X, 'columns'):\n feature_names = np.asarray(X.columns, dtype=object)\n- pattern = re.compile(r'^[a-zA-Z0-9_]+$')\n- invalid = [name for name in feature_names if not pattern.match(name)]\n- if invalid:\n- raise ValueError(f'Feature names must match...')\n+ # Accept any string feature names - special characters are valid\n+ pass"
}
]
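Since annotation_per_instance is 2, each instance is labeled by two annotators, so the output supports an agreement check on the radio schemes. A minimal Cohen's kappa sketch (label strings shortened for readability; how you pair up the two annotators' JSON outputs depends on your output files and is assumed here):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same instances in order."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label frequencies.
    expected = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical issue_valid labels from two annotators on four instances.
a = ["Well-specified", "Ambiguous", "Well-specified", "Invalid"]
b = ["Well-specified", "Ambiguous", "Ambiguous", "Invalid"]
print(round(cohens_kappa(a, b), 3))  # → 0.636
```

Low kappa on issue_valid or test_adequate would suggest the label definitions need tightening before scaling up past the 50 instances per annotator.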
// ... and 6 more items

Get This Design
Clone or download from the repository.
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/swebench-verified-validation
potato start config.yaml
Related Designs
BigCodeBench Human Baseline Evaluation
Evaluate agent-generated code solutions for BigCodeBench tasks. Annotators assess correctness against test suites, rate task complexity, evaluate code quality, and provide notes on the solution approach.
DevBench Repository Evaluation
Evaluate AI-generated repositories across the full software development lifecycle. Annotators rate architecture design, code quality, test coverage, documentation, and dependency management for generated projects.
SWE-Bench+ Patch Screening
Screen and compare model-generated patches against gold patches for SWE-Bench+ instances. Annotators evaluate correctness, identify specific issues, and compare model vs. gold solutions side-by-side.