SWE-bench Verified Issue Validation
Manually validate GitHub issues from SWE-bench to ensure they are well-specified, have adequate test patches, and are solvable. Annotators review the issue description, test patch, and gold patch to determine the quality of each benchmark instance.
Configuration File: config.yaml
# SWE-bench Verified Issue Validation
# Based on "SWE-bench Verified" (Neil Chowdhury, James Aung, Chan Jun Shern et al., OpenAI Technical Report 2024)
# Task: Validate GitHub issues from SWE-bench for specification quality, test adequacy, and solvability
annotation_task_name: "SWE-bench Verified Issue Validation"
task_dir: "."
data_files:
- sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif; max-width: 1400px; margin: 0 auto;">
    <div style="background: #0d1117; color: #c9d1d9; padding: 10px 16px; border-radius: 6px 6px 0 0; font-size: 14px; font-weight: 600;">
      <span style="color: #58a6ff;">{{repo_name}}</span>
    </div>
    <div style="display: flex; gap: 16px; margin-top: 4px;">
      <div style="flex: 1; border: 1px solid #30363d; border-radius: 6px; padding: 16px; background: #f6f8fa;">
        <h3 style="margin-top: 0; color: #24292f; border-bottom: 1px solid #d0d7de; padding-bottom: 8px;">Issue Description</h3>
        <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6; color: #1f2328;">{{text}}</div>
      </div>
      <div style="flex: 1; border: 1px solid #30363d; border-radius: 6px; overflow: hidden;">
        <div style="background: #161b22; color: #c9d1d9; padding: 8px 16px; font-weight: 600; font-size: 13px;">Test Patch</div>
        <pre style="margin: 0; padding: 12px 16px; background: #0d1117; color: #c9d1d9; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{test_patch}}</pre>
      </div>
    </div>
    <div style="margin-top: 12px; border: 1px solid #30363d; border-radius: 6px; overflow: hidden;">
      <div style="background: #161b22; color: #c9d1d9; padding: 8px 16px; font-weight: 600; font-size: 13px;">Gold Patch (Reference Fix)</div>
      <pre style="margin: 0; padding: 12px 16px; background: #0d1117; color: #c9d1d9; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{gold_patch}}</pre>
    </div>
  </div>
annotation_schemes:
  - name: "issue_valid"
    description: "Is the issue well-specified with a clear, reproducible problem?"
    annotation_type: radio
    labels:
      - "Well-specified — clear problem with reproducible steps"
      - "Ambiguous — issue unclear or underspecified"
      - "Invalid — not a real bug or feature request"
    keyboard_shortcuts:
      "Well-specified — clear problem with reproducible steps": "1"
      "Ambiguous — issue unclear or underspecified": "2"
      "Invalid — not a real bug or feature request": "3"
  - name: "test_adequate"
    description: "Do the tests adequately verify the fix?"
    annotation_type: radio
    labels:
      - "Sufficient — tests verify the fix completely"
      - "Partial — tests cover some aspects"
      - "Insufficient — tests don't adequately verify"
    keyboard_shortcuts:
      "Sufficient — tests verify the fix completely": "4"
      "Partial — tests cover some aspects": "5"
      "Insufficient — tests don't adequately verify": "6"
  - name: "solution_exists"
    description: "Is a solution feasible within the existing codebase?"
    annotation_type: radio
    labels:
      - "Solvable — clear fix exists in codebase"
      - "Likely solvable — fix probable but complex"
      - "Unlikely — may require architectural changes"
      - "Unsolvable — impossible given constraints"
    keyboard_shortcuts:
      "Solvable — clear fix exists in codebase": "7"
      "Likely solvable — fix probable but complex": "8"
      "Unlikely — may require architectural changes": "9"
      "Unsolvable — impossible given constraints": "0"
  - name: "validation_notes"
    description: "Explain your validation reasoning"
    annotation_type: text
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
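Before launching the task, it can help to sanity-check that every data item carries the fields the layout templates reference ({{text}}, {{repo_name}}, {{test_patch}}, {{gold_patch}}) plus the configured id_key. A minimal sketch (the field set is taken from the config and sample data above; the helper name is illustrative):

```python
import json

# Fields referenced by id_key/text_key and the html_layout placeholders above.
REQUIRED_KEYS = {"id", "text", "repo_name", "test_patch", "gold_patch"}

def validate_items(items):
    """Return (item_id, missing_keys) pairs for items missing required fields."""
    problems = []
    for item in items:
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append((item.get("id", "<no id>"), sorted(missing)))
    return problems

if __name__ == "__main__":
    with open("sample-data.json") as f:
        items = json.load(f)
    for item_id, missing in validate_items(items):
        print(f"{item_id}: missing {missing}")
```

A missing field would otherwise render as an empty pane in the annotation UI, which is easy to miss during a pilot run.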
Sample Data: sample-data.json
[
{
"id": "swebench-val-001",
"text": "django__django-16527: QuerySet.only() doesn't work with select_related() on reverse OneToOneField relation.\n\nWhen using .only() with .select_related() on a reverse OneToOneField, Django generates a query that includes all fields instead of only the specified ones.\n\nSteps to reproduce:\n1. Create models with OneToOneField relationship\n2. Use queryset.select_related('reverse_relation').only('id', 'reverse_relation__name')\n3. Inspect the generated SQL\n\nExpected: SELECT only specified columns\nActual: SELECT includes all columns from both tables",
"repo_name": "django/django",
"test_patch": "diff --git a/tests/select_related_onetoone/tests.py b/tests/select_related_onetoone/tests.py\nindex 3a4e512f8a..b7c2d91e03 100644\n--- a/tests/select_related_onetoone/tests.py\n+++ b/tests/select_related_onetoone/tests.py\n@@ -187,6 +187,18 @@ class ReverseSelectRelatedTestCase(TestCase):\n+ def test_only_with_select_related_reverse_onetoone(self):\n+ with self.assertNumQueries(1):\n+ qs = UserProfile.objects.select_related('user').only(\n+ 'id', 'user__username'\n+ )\n+ result = list(qs)\n+ self.assertEqual(len(result), 1)\n+ query_str = str(qs.query)\n+ self.assertNotIn('email', query_str)\n+ self.assertNotIn('first_name', query_str)",
"gold_patch": "diff --git a/django/db/models/sql/compiler.py b/django/db/models/sql/compiler.py\nindex 8e4a37b2ec..f3c5d12a91 100644\n--- a/django/db/models/sql/compiler.py\n+++ b/django/db/models/sql/compiler.py\n@@ -1042,7 +1042,10 @@ class SQLCompiler:\n if opts.proxy:\n return self.deferred_to_columns_cb(opts.proxy_for_model._meta, start_alias)\n- if start_alias:\n+ if start_alias and self.query.deferred_loading[0]:\n+ only_load = self.query.deferred_loading[0]\n+ fields = [f for f in opts.concrete_fields if f.attname in only_load]\n+ return {start_alias: {f.column for f in fields}}\n columns = {start_alias: set()}\n for f in opts.concrete_fields:\n if f.column in columns[start_alias]:"
},
{
"id": "swebench-val-002",
"text": "scikit-learn__scikit-learn-25638: HistGradientBoostingClassifier does not accept dataframes with feature names containing special characters.\n\nWhen passing a pandas DataFrame with column names containing brackets or dots, fit() raises a ValueError.\n\nSteps to reproduce:\n```python\nimport pandas as pd\nfrom sklearn.ensemble import HistGradientBoostingClassifier\ndf = pd.DataFrame({'feature[0]': [1,2,3], 'target': [0,1,0]})\nclf = HistGradientBoostingClassifier()\nclf.fit(df[['feature[0]']], df['target'])\n```\nRaises: ValueError: Feature names must match pattern '^[a-zA-Z0-9_]+$'",
"repo_name": "scikit-learn/scikit-learn",
"test_patch": "diff --git a/sklearn/tests/test_common.py b/sklearn/tests/test_common.py\nindex 4f2a891b2..e83c90d17 100644\n--- a/sklearn/tests/test_common.py\n+++ b/sklearn/tests/test_common.py\n@@ -421,6 +421,15 @@ def test_estimators_feature_names():\n+def test_feature_names_special_characters():\n+ pd = pytest.importorskip('pandas')\n+ X = pd.DataFrame({'col[0]': [1, 2, 3], 'col.1': [4, 5, 6]})\n+ y = [0, 1, 0]\n+ est = HistGradientBoostingClassifier(max_iter=1)\n+ est.fit(X, y)\n+ assert est.feature_names_in_[0] == 'col[0]'\n+ assert est.feature_names_in_[1] == 'col.1'",
"gold_patch": "diff --git a/sklearn/utils/validation.py b/sklearn/utils/validation.py\nindex 72ef2a50c..a5b3f891d 100644\n--- a/sklearn/utils/validation.py\n+++ b/sklearn/utils/validation.py\n@@ -1843,8 +1843,7 @@ def _check_feature_names(X, *, reset, feature_names_out=None):\n if hasattr(X, 'columns'):\n feature_names = np.asarray(X.columns, dtype=object)\n- pattern = re.compile(r'^[a-zA-Z0-9_]+$')\n- invalid = [name for name in feature_names if not pattern.match(name)]\n- if invalid:\n- raise ValueError(f'Feature names must match...')\n+ # Accept any string feature names - special characters are valid\n+ pass"
}
]
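Since annotation_per_instance is 2, each instance is labeled by two annotators, so the output supports an agreement check on the radio schemes. A minimal Cohen's kappa sketch (label strings shortened for readability; how you pair up the two annotators' JSON outputs depends on your output files and is assumed here):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same instances in order."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label frequencies.
    expected = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical issue_valid labels from two annotators on four instances.
a = ["Well-specified", "Ambiguous", "Well-specified", "Invalid"]
b = ["Well-specified", "Ambiguous", "Ambiguous", "Invalid"]
print(round(cohens_kappa(a, b), 3))  # → 0.636
```

Low kappa on issue_valid or test_adequate would suggest the label definitions need tightening before scaling up past the 50 instances per annotator.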
// ... and 6 more items

Get This Design
Clone or download from the repository.
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/swebench-verified-validation
potato start config.yaml
Related Designs
BigCodeBench Human Baseline Evaluation
Evaluate agent-generated code solutions for BigCodeBench tasks. Annotators assess correctness against test suites, rate task complexity, evaluate code quality, and provide notes on the solution approach.
DevBench Repository Evaluation
Evaluate AI-generated repositories across the full software development lifecycle. Annotators rate architecture design, code quality, test coverage, documentation, and dependency management for generated projects.
SWE-Bench+ Patch Screening
Screen and compare model-generated patches against gold patches for SWE-Bench+ instances. Annotators evaluate correctness, identify specific issues, and compare model vs. gold solutions side-by-side.