SWE-Bench+ Patch Screening
Screen and compare model-generated patches against gold patches for SWE-Bench+ instances. Annotators evaluate correctness, identify specific issues, and compare model vs. gold solutions side-by-side.
Configuration File: config.yaml
This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.
# SWE-Bench+ Patch Screening
# Based on "SWE-Bench+: Enhanced Coding Benchmark for LLMs" (Aleithan et al., arXiv 2024)
# Task: Screen model-generated patches against gold patches for correctness and quality
annotation_task_name: "SWE-Bench+ Patch Screening"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif; max-width: 1500px; margin: 0 auto;">
    <div style="background: #0d1117; color: #c9d1d9; padding: 10px 16px; border-radius: 6px 6px 0 0; font-size: 14px;">
      <span style="font-weight: 600; color: #58a6ff;">{{repo_name}}</span>
      <span style="margin-left: 12px; color: #8b949e;">Test Results: {{test_results}}</span>
    </div>
    <div style="border: 1px solid #d0d7de; border-radius: 6px; padding: 16px; margin: 8px 0; background: #f6f8fa;">
      <h3 style="margin-top: 0; color: #24292f;">Issue Description</h3>
      <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6; color: #1f2328;">{{text}}</div>
    </div>
    <div style="display: flex; gap: 12px; margin-top: 8px;">
      <div style="flex: 1; border: 1px solid #30363d; border-radius: 6px; overflow: hidden;">
        <div style="background: #1a2233; color: #58a6ff; padding: 8px 16px; font-weight: 600; font-size: 13px;">Model Patch</div>
        <pre style="margin: 0; padding: 12px 16px; background: #0d1117; color: #c9d1d9; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{model_patch}}</pre>
      </div>
      <div style="flex: 1; border: 1px solid #30363d; border-radius: 6px; overflow: hidden;">
        <div style="background: #1a2d1a; color: #3fb950; padding: 8px 16px; font-weight: 600; font-size: 13px;">Gold Patch</div>
        <pre style="margin: 0; padding: 12px 16px; background: #0d1117; color: #c9d1d9; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{gold_patch}}</pre>
      </div>
    </div>
  </div>
annotation_schemes:
  - name: "patch_correctness"
    description: "Does the model patch correctly fix the issue?"
    annotation_type: radio
    labels:
      - "Correct — properly fixes the issue"
      - "Partially Correct — fixes some but not all aspects"
      - "Incorrect — does not fix the issue"
      - "Overfitting — passes tests but wrong approach"
    keyboard_shortcuts:
      "Correct — properly fixes the issue": "1"
      "Partially Correct — fixes some but not all aspects": "2"
      "Incorrect — does not fix the issue": "3"
      "Overfitting — passes tests but wrong approach": "4"
  - name: "patch_issues"
    description: "Select all issues present in the model patch"
    annotation_type: multiselect
    labels:
      - "Missing Edge Case"
      - "Incorrect Logic"
      - "Style Violation"
      - "Unnecessary Changes"
      - "Incomplete Fix"
      - "Test Overfitting"
      - "No Issues"
  - name: "patch_preference"
    description: "Which patch is better overall?"
    annotation_type: pairwise
    labels:
      - "Model Patch"
      - "Gold Patch"
      - "Equivalent"
  - name: "screening_notes"
    description: "Detailed screening observations"
    annotation_type: text
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
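The html_layout template above references five placeholders ({{repo_name}}, {{test_results}}, {{text}}, {{model_patch}}, {{gold_patch}}), and item_properties names "id" as the record key. The following is a minimal Python sketch, not part of Potato, for checking a data record against those names before starting the server. It assumes each {{name}} placeholder is filled from the record field of the same name; the helper names required_fields and missing_fields are hypothetical.

```python
import re

# Placeholder-bearing fragments copied from the html_layout above
# (styling stripped; only the {{...}} references matter here).
LAYOUT_SNIPPET = """
<span>{{repo_name}}</span>
<span>Test Results: {{test_results}}</span>
<div>{{text}}</div>
<pre>{{model_patch}}</pre>
<pre>{{gold_patch}}</pre>
"""

# Matches {{name}} with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def required_fields(layout: str) -> set[str]:
    """Return every field name referenced as {{name}} in the layout."""
    return set(PLACEHOLDER.findall(layout))


def missing_fields(record: dict, layout: str, id_key: str = "id") -> list[str]:
    """List fields the layout (or id_key) needs that the record lacks."""
    needed = required_fields(layout) | {id_key}
    return sorted(needed - set(record))


if __name__ == "__main__":
    # Hypothetical record that omits test_results on purpose.
    record = {
        "id": "swebp-001",
        "text": "issue description",
        "repo_name": "pandas-dev/pandas",
        "model_patch": "diff ...",
        "gold_patch": "diff ...",
    }
    print(missing_fields(record, LAYOUT_SNIPPET))
```

Running the check over every record in the data file before annotation starts is a cheap way to catch a renamed or missing field.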
Sample Data: sample-data.json
[
{
"id": "swebp-001",
"text": "pandas-dev/pandas#52345: DataFrame.merge() produces incorrect result when merging on nullable Int64 column with NaN values.\n\nWhen merging two DataFrames on a column with Int64 (nullable integer) dtype that contains pd.NA values, the merge incorrectly includes rows where both sides have NA, treating NA == NA as True.",
"repo_name": "pandas-dev/pandas",
"model_patch": "diff --git a/pandas/core/reshape/merge.py b/pandas/core/reshape/merge.py\nindex 8a3f1b2e..c4d590a1 100644\n--- a/pandas/core/reshape/merge.py\n+++ b/pandas/core/reshape/merge.py\n@@ -1672,6 +1672,8 @@ class _MergeOperation:\n def _get_join_indexers(self):\n left_keys = self.left_join_keys\n right_keys = self.right_join_keys\n+ # Filter out NA values for nullable integer dtypes\n+ if hasattr(left_keys[0], '_mask'):\n+ mask_l = ~left_keys[0]._mask\n+ mask_r = ~right_keys[0]._mask\n+ left_keys = [k[mask_l] for k in left_keys]\n+ right_keys = [k[mask_r] for k in right_keys]\n return get_join_indexers(",
"gold_patch": "diff --git a/pandas/core/reshape/merge.py b/pandas/core/reshape/merge.py\nindex 8a3f1b2e..d7e2b301 100644\n--- a/pandas/core/reshape/merge.py\n+++ b/pandas/core/reshape/merge.py\n@@ -1672,6 +1672,14 @@ class _MergeOperation:\n def _get_join_indexers(self):\n left_keys = self.left_join_keys\n right_keys = self.right_join_keys\n+ # Handle nullable dtypes: NA should not match NA\n+ for i, (lk, rk) in enumerate(zip(left_keys, right_keys)):\n+ if isinstance(lk, BaseMaskedArray):\n+ left_keys[i] = lk._data.copy()\n+ left_keys[i][lk._mask] = _get_na_sentinel(lk.dtype)\n+ if isinstance(rk, BaseMaskedArray):\n+ right_keys[i] = rk._data.copy()\n+ right_keys[i][rk._mask] = _get_na_sentinel(rk.dtype)\n return get_join_indexers(",
"test_results": "PASSED: 47/50 tests | FAILED: test_merge_nullable_int_na, test_merge_nat_handling, test_merge_masked_array"
},
{
"id": "swebp-002",
"text": "django/django#34521: Migration autodetector misses index changes when only the index condition changes.\n\nWhen modifying only the condition of a conditional index (partial index), makemigrations doesn't detect the change and produces no migration.",
"repo_name": "django/django",
"model_patch": "diff --git a/django/db/migrations/autodetector.py b/django/db/migrations/autodetector.py\nindex 1b3e5a2f..9c4d7801 100644\n--- a/django/db/migrations/autodetector.py\n+++ b/django/db/migrations/autodetector.py\n@@ -1134,7 +1134,7 @@ class MigrationAutodetector:\n def generate_altered_indexes(self):\n for app_label, model_name in sorted(self.kept_model_keys):\n old_indexes = self.from_state.models[app_label, model_name].options.get('indexes', [])\n new_indexes = self.to_state.models[app_label, model_name].options.get('indexes', [])\n- if old_indexes != new_indexes:\n+ if str(old_indexes) != str(new_indexes):\n self._generate_altered_indexes(app_label, model_name, old_indexes, new_indexes)",
"gold_patch": "diff --git a/django/db/migrations/autodetector.py b/django/db/migrations/autodetector.py\nindex 1b3e5a2f..a5c8e901 100644\n--- a/django/db/migrations/autodetector.py\n+++ b/django/db/migrations/autodetector.py\n@@ -1134,8 +1134,12 @@ class MigrationAutodetector:\n def generate_altered_indexes(self):\n for app_label, model_name in sorted(self.kept_model_keys):\n old_indexes = self.from_state.models[app_label, model_name].options.get('indexes', [])\n new_indexes = self.to_state.models[app_label, model_name].options.get('indexes', [])\n- if old_indexes != new_indexes:\n+ old_index_map = {idx.name: idx for idx in old_indexes}\n+ new_index_map = {idx.name: idx for idx in new_indexes}\n+ # Compare using deconstruct() to catch condition changes\n+ if any(\n+ old_index_map.get(name, None) is None or\n+ idx.deconstruct() != old_index_map[name].deconstruct()\n+ for name, idx in new_index_map.items()\n+ ) or set(old_index_map) != set(new_index_map):\n self._generate_altered_indexes(app_label, model_name, old_indexes, new_indexes)",
"test_results": "PASSED: 51/51 tests"
}
]
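The test_results strings in the two records shown here follow a simple pattern: a "PASSED: n/m tests" count, optionally followed by "| FAILED:" and a comma-separated list of test names. The sketch below parses that pattern in Python so that records can be triaged before screening. It assumes the remaining items use the same format, which the excerpt does not confirm; ResultSummary and parse_test_results are hypothetical names, not part of Potato or SWE-Bench+.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ResultSummary:
    passed: int
    total: int
    failed_tests: list[str] = field(default_factory=list)

    @property
    def all_passed(self) -> bool:
        """True when every test passed and no failures are listed."""
        return self.passed == self.total and not self.failed_tests


_PASSED = re.compile(r"PASSED:\s*(\d+)\s*/\s*(\d+)")
_FAILED = re.compile(r"FAILED:\s*(.+)$")


def parse_test_results(summary: str) -> ResultSummary:
    """Parse a string such as 'PASSED: 47/50 tests | FAILED: a, b'."""
    counts = _PASSED.search(summary)
    if counts is None:
        raise ValueError(f"unrecognized test_results format: {summary!r}")
    failed = _FAILED.search(summary)
    names = []
    if failed:
        # Split the comma-separated test names and drop empty entries.
        names = [n.strip() for n in failed.group(1).split(",") if n.strip()]
    return ResultSummary(int(counts.group(1)), int(counts.group(2)), names)


if __name__ == "__main__":
    for text in (
        "PASSED: 47/50 tests | FAILED: test_merge_nullable_int_na, "
        "test_merge_nat_handling, test_merge_masked_array",
        "PASSED: 51/51 tests",
    ):
        print(parse_test_results(text))
```

A fully passing run is not by itself evidence of a correct fix. Records where all_passed is true, such as swebp-002, are the ones where the "Overfitting" and "Test Overfitting" labels may deserve the closest look.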
// ... and 6 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/swebench-plus-patch-screening
potato start config.yaml
Dataset & paper
Aleithan et al., arXiv 2024
Citation (BibTeX)
@article{aleithan2024swebenchplus,
  title={SWE-Bench+: Enhanced Coding Benchmark for LLMs},
  author={Reem Aleithan and Haoran Xue and Mohammad Mahdi Mohajer and Elijah Nnorom and Gias Uddin and Song Wang},
  journal={arXiv preprint arXiv:2410.06992},
  year={2024}
}
Related Designs
RefactorBench Multi-File Evaluation
Evaluate multi-file refactoring operations generated by coding agents. Annotators assess whether refactorings preserve behavior, identify the types of refactoring applied, rate code improvement, and provide detailed review comments.
SWE-PRM Process Reward Labels for Coding Agents
Step-level process reward dataset for coding agents from the SWE-PRM paper (Gandhi et al., 2025). The Potato config reproduces per-step correctness rating, error typing, and explanation over SWE-bench Verified traces.
BigCodeBench Human Baseline Evaluation
Evaluate agent-generated code solutions for BigCodeBench tasks. Annotators assess correctness against test suites, rate task complexity, evaluate code quality, and provide notes on the solution approach.