SWE-Bench+ Patch Screening
Screen model-generated patches against gold patches for SWE-Bench+ instances. Annotators judge correctness, flag specific issues, and compare the model and gold solutions side by side.
Configuration File: config.yaml
# SWE-Bench+ Patch Screening
# Based on "SWE-Bench+: Enhanced Coding Benchmark for LLMs" (Aleithan et al., arXiv 2024)
# Task: Screen model-generated patches against gold patches for correctness and quality
annotation_task_name: "SWE-Bench+ Patch Screening"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif; max-width: 1500px; margin: 0 auto;">
    <div style="background: #0d1117; color: #c9d1d9; padding: 10px 16px; border-radius: 6px 6px 0 0; font-size: 14px;">
      <span style="font-weight: 600; color: #58a6ff;">{{repo_name}}</span>
      <span style="margin-left: 12px; color: #8b949e;">Test Results: {{test_results}}</span>
    </div>
    <div style="border: 1px solid #d0d7de; border-radius: 6px; padding: 16px; margin: 8px 0; background: #f6f8fa;">
      <h3 style="margin-top: 0; color: #24292f;">Issue Description</h3>
      <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6; color: #1f2328;">{{text}}</div>
    </div>
    <div style="display: flex; gap: 12px; margin-top: 8px;">
      <div style="flex: 1; border: 1px solid #30363d; border-radius: 6px; overflow: hidden;">
        <div style="background: #1a2233; color: #58a6ff; padding: 8px 16px; font-weight: 600; font-size: 13px;">Model Patch</div>
        <pre style="margin: 0; padding: 12px 16px; background: #0d1117; color: #c9d1d9; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{model_patch}}</pre>
      </div>
      <div style="flex: 1; border: 1px solid #30363d; border-radius: 6px; overflow: hidden;">
        <div style="background: #1a2d1a; color: #3fb950; padding: 8px 16px; font-weight: 600; font-size: 13px;">Gold Patch</div>
        <pre style="margin: 0; padding: 12px 16px; background: #0d1117; color: #c9d1d9; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{gold_patch}}</pre>
      </div>
    </div>
  </div>
annotation_schemes:
  - name: "patch_correctness"
    description: "Does the model patch correctly fix the issue?"
    annotation_type: radio
    labels:
      - "Correct — properly fixes the issue"
      - "Partially Correct — fixes some but not all aspects"
      - "Incorrect — does not fix the issue"
      - "Overfitting — passes tests but wrong approach"
    keyboard_shortcuts:
      "Correct — properly fixes the issue": "1"
      "Partially Correct — fixes some but not all aspects": "2"
      "Incorrect — does not fix the issue": "3"
      "Overfitting — passes tests but wrong approach": "4"
  - name: "patch_issues"
    description: "Select all issues present in the model patch"
    annotation_type: multiselect
    labels:
      - "Missing Edge Case"
      - "Incorrect Logic"
      - "Style Violation"
      - "Unnecessary Changes"
      - "Incomplete Fix"
      - "Test Overfitting"
      - "No Issues"
  - name: "patch_preference"
    description: "Which patch is better overall?"
    annotation_type: pairwise
    labels:
      - "Model Patch"
      - "Gold Patch"
      - "Equivalent"
  - name: "screening_notes"
    description: "Detailed screening observations"
    annotation_type: text
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
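Every `{{...}}` placeholder in `html_layout` has to be backed by a field of the same name in each data item, otherwise that part of the page renders empty. The following is a minimal sketch of a pre-flight check, not part of Potato itself: the helper names `layout_fields` and `missing_fields` are hypothetical, and `LAYOUT_SNIPPET` only lists the placeholders used in the layout above rather than loading the real file.

```python
import re

# Placeholders taken from the html_layout in config.yaml (abbreviated here;
# in practice you would read the layout string from the loaded config).
LAYOUT_SNIPPET = (
    "{{repo_name}} {{test_results}} {{text}} {{model_patch}} {{gold_patch}}"
)


def layout_fields(layout: str) -> set[str]:
    """Return every field name referenced as {{name}} in the layout."""
    return set(re.findall(r"\{\{\s*(\w+)\s*\}\}", layout))


def missing_fields(item: dict, layout: str, id_key: str = "id") -> list[str]:
    """List the fields the layout (plus the id key) needs but the item lacks."""
    required = layout_fields(layout) | {id_key}
    return sorted(name for name in required if name not in item)


# An item that forgot the test_results field.
item = {
    "id": "swebp-001",
    "text": "...",
    "repo_name": "pandas-dev/pandas",
    "model_patch": "...",
    "gold_patch": "...",
}
print(missing_fields(item, LAYOUT_SNIPPET))  # ['test_results']
```

Running a check like this over `sample-data.json` before starting the server catches renamed or missing keys early.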
Sample Data: sample-data.json
[
{
"id": "swebp-001",
"text": "pandas-dev/pandas#52345: DataFrame.merge() produces incorrect result when merging on nullable Int64 column with NaN values.\n\nWhen merging two DataFrames on a column with Int64 (nullable integer) dtype that contains pd.NA values, the merge incorrectly includes rows where both sides have NA, treating NA == NA as True.",
"repo_name": "pandas-dev/pandas",
"model_patch": "diff --git a/pandas/core/reshape/merge.py b/pandas/core/reshape/merge.py\nindex 8a3f1b2e..c4d590a1 100644\n--- a/pandas/core/reshape/merge.py\n+++ b/pandas/core/reshape/merge.py\n@@ -1672,6 +1672,8 @@ class _MergeOperation:\n def _get_join_indexers(self):\n left_keys = self.left_join_keys\n right_keys = self.right_join_keys\n+ # Filter out NA values for nullable integer dtypes\n+ if hasattr(left_keys[0], '_mask'):\n+ mask_l = ~left_keys[0]._mask\n+ mask_r = ~right_keys[0]._mask\n+ left_keys = [k[mask_l] for k in left_keys]\n+ right_keys = [k[mask_r] for k in right_keys]\n return get_join_indexers(",
"gold_patch": "diff --git a/pandas/core/reshape/merge.py b/pandas/core/reshape/merge.py\nindex 8a3f1b2e..d7e2b301 100644\n--- a/pandas/core/reshape/merge.py\n+++ b/pandas/core/reshape/merge.py\n@@ -1672,6 +1672,14 @@ class _MergeOperation:\n def _get_join_indexers(self):\n left_keys = self.left_join_keys\n right_keys = self.right_join_keys\n+ # Handle nullable dtypes: NA should not match NA\n+ for i, (lk, rk) in enumerate(zip(left_keys, right_keys)):\n+ if isinstance(lk, BaseMaskedArray):\n+ left_keys[i] = lk._data.copy()\n+ left_keys[i][lk._mask] = _get_na_sentinel(lk.dtype)\n+ if isinstance(rk, BaseMaskedArray):\n+ right_keys[i] = rk._data.copy()\n+ right_keys[i][rk._mask] = _get_na_sentinel(rk.dtype)\n return get_join_indexers(",
"test_results": "PASSED: 47/50 tests | FAILED: test_merge_nullable_int_na, test_merge_nat_handling, test_merge_masked_array"
},
{
"id": "swebp-002",
"text": "django/django#34521: Migration autodetector misses index changes when only the index condition changes.\n\nWhen modifying only the condition of a conditional index (partial index), makemigrations doesn't detect the change and produces no migration.",
"repo_name": "django/django",
"model_patch": "diff --git a/django/db/migrations/autodetector.py b/django/db/migrations/autodetector.py\nindex 1b3e5a2f..9c4d7801 100644\n--- a/django/db/migrations/autodetector.py\n+++ b/django/db/migrations/autodetector.py\n@@ -1134,7 +1134,7 @@ class MigrationAutodetector:\n def generate_altered_indexes(self):\n for app_label, model_name in sorted(self.kept_model_keys):\n old_indexes = self.from_state.models[app_label, model_name].options.get('indexes', [])\n new_indexes = self.to_state.models[app_label, model_name].options.get('indexes', [])\n- if old_indexes != new_indexes:\n+ if str(old_indexes) != str(new_indexes):\n self._generate_altered_indexes(app_label, model_name, old_indexes, new_indexes)",
"gold_patch": "diff --git a/django/db/migrations/autodetector.py b/django/db/migrations/autodetector.py\nindex 1b3e5a2f..a5c8e901 100644\n--- a/django/db/migrations/autodetector.py\n+++ b/django/db/migrations/autodetector.py\n@@ -1134,8 +1134,12 @@ class MigrationAutodetector:\n def generate_altered_indexes(self):\n for app_label, model_name in sorted(self.kept_model_keys):\n old_indexes = self.from_state.models[app_label, model_name].options.get('indexes', [])\n new_indexes = self.to_state.models[app_label, model_name].options.get('indexes', [])\n- if old_indexes != new_indexes:\n+ old_index_map = {idx.name: idx for idx in old_indexes}\n+ new_index_map = {idx.name: idx for idx in new_indexes}\n+ # Compare using deconstruct() to catch condition changes\n+ if any(\n+ old_index_map.get(name, None) is None or\n+ idx.deconstruct() != old_index_map[name].deconstruct()\n+ for name, idx in new_index_map.items()\n+ ) or set(old_index_map) != set(new_index_map):\n self._generate_altered_indexes(app_label, model_name, old_indexes, new_indexes)",
"test_results": "PASSED: 51/51 tests"
}
]
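The `test_results` field is a free-text summary, so any downstream analysis has to parse it. Below is a rough sketch that assumes every item follows the two shapes seen in the sample (`PASSED: n/m tests`, optionally followed by `| FAILED: name, name, ...`). The `ResultSummary` class and `parse_test_results` function are illustrative names, not part of the dataset or of Potato.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ResultSummary:
    passed: int
    total: int
    failed_tests: list[str] = field(default_factory=list)

    @property
    def all_passed(self) -> bool:
        return self.passed == self.total


def parse_test_results(summary: str) -> ResultSummary:
    """Parse a summary such as 'PASSED: 47/50 tests | FAILED: a, b'."""
    match = re.search(r"PASSED:\s*(\d+)/(\d+)", summary)
    if match is None:
        raise ValueError(f"unrecognized test_results format: {summary!r}")
    failed: list[str] = []
    if "FAILED:" in summary:
        # Everything after 'FAILED:' is assumed to be a comma-separated list.
        tail = summary.split("FAILED:", 1)[1]
        failed = [name.strip() for name in tail.split(",") if name.strip()]
    return ResultSummary(int(match.group(1)), int(match.group(2)), failed)


first = parse_test_results(
    "PASSED: 47/50 tests | FAILED: test_merge_nullable_int_na, "
    "test_merge_nat_handling, test_merge_masked_array"
)
second = parse_test_results("PASSED: 51/51 tests")
print(first.passed, first.total, len(first.failed_tests))  # 47 50 3
print(second.all_passed)  # True
```

Items where `all_passed` is true, such as `swebp-002`, are the ones where the "Overfitting" label matters most, because the test suite alone cannot separate a sound fix from one that merely satisfies the tests.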
// ... and 6 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/swebench-plus-patch-screening
potato start config.yaml
Related Designs
RefactorBench Multi-File Evaluation
Evaluate multi-file refactoring operations generated by coding agents. Annotators assess whether refactorings preserve behavior, identify the types of refactoring applied, rate code improvement, and provide detailed review comments.
SWE-PRM Coding Process Reward
Process reward annotation for software engineering agent traces. Annotators verify each coding action step taken by an SWE agent when resolving GitHub issues, identifying the first step where the agent goes astray and classifying the error type.
BigCodeBench Human Baseline Evaluation
Evaluate agent-generated code solutions for BigCodeBench tasks. Annotators assess correctness against test suites, rate task complexity, evaluate code quality, and provide notes on the solution approach.