
SWE-Bench+ Patch Screening

Screen and compare model-generated patches against gold patches for SWE-Bench+ instances. Annotators evaluate correctness, identify specific issues, and compare model vs. gold solutions side-by-side.

Configuration File: config.yaml

# SWE-Bench+ Patch Screening
# Based on "SWE-Bench+: Enhanced Coding Benchmark for LLMs" (Aleithan et al., arXiv 2024)
# Task: Screen model-generated patches against gold patches for correctness and quality

annotation_task_name: "SWE-Bench+ Patch Screening"
task_dir: "."

data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

html_layout: |
  <div class="container" style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif; max-width: 1500px; margin: 0 auto;">
    <div style="background: #0d1117; color: #c9d1d9; padding: 10px 16px; border-radius: 6px 6px 0 0; font-size: 14px;">
      <span style="font-weight: 600; color: #58a6ff;">{{repo_name}}</span>
      <span style="margin-left: 12px; color: #8b949e;">Test Results: {{test_results}}</span>
    </div>
    <div style="border: 1px solid #d0d7de; border-radius: 6px; padding: 16px; margin: 8px 0; background: #f6f8fa;">
      <h3 style="margin-top: 0; color: #24292f;">Issue Description</h3>
      <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6; color: #1f2328;">{{text}}</div>
    </div>
    <div style="display: flex; gap: 12px; margin-top: 8px;">
      <div style="flex: 1; border: 1px solid #30363d; border-radius: 6px; overflow: hidden;">
        <div style="background: #1a2233; color: #58a6ff; padding: 8px 16px; font-weight: 600; font-size: 13px;">Model Patch</div>
        <pre style="margin: 0; padding: 12px 16px; background: #0d1117; color: #c9d1d9; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{model_patch}}</pre>
      </div>
      <div style="flex: 1; border: 1px solid #30363d; border-radius: 6px; overflow: hidden;">
        <div style="background: #1a2d1a; color: #3fb950; padding: 8px 16px; font-weight: 600; font-size: 13px;">Gold Patch</div>
        <pre style="margin: 0; padding: 12px 16px; background: #0d1117; color: #c9d1d9; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{gold_patch}}</pre>
      </div>
    </div>
  </div>

annotation_schemes:
  - name: "patch_correctness"
    description: "Does the model patch correctly fix the issue?"
    annotation_type: radio
    labels:
      - "Correct — properly fixes the issue"
      - "Partially Correct — fixes some but not all aspects"
      - "Incorrect — does not fix the issue"
      - "Overfitting — passes tests but wrong approach"
    keyboard_shortcuts:
      "Correct — properly fixes the issue": "1"
      "Partially Correct — fixes some but not all aspects": "2"
      "Incorrect — does not fix the issue": "3"
      "Overfitting — passes tests but wrong approach": "4"

  - name: "patch_issues"
    description: "Select all issues present in the model patch"
    annotation_type: multiselect
    labels:
      - "Missing Edge Case"
      - "Incorrect Logic"
      - "Style Violation"
      - "Unnecessary Changes"
      - "Incomplete Fix"
      - "Test Overfitting"
      - "No Issues"

  - name: "patch_preference"
    description: "Which patch is better overall?"
    annotation_type: pairwise
    labels:
      - "Model Patch"
      - "Gold Patch"
      - "Equivalent"

  - name: "screening_notes"
    description: "Detailed screening observations"
    annotation_type: text

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
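
The html_layout above references five placeholders (repo_name, test_results, text, model_patch, gold_patch), so a data item that lacks one of them would presumably render an empty panel. The sketch below checks a data item against a layout string before annotation starts. It assumes each `{{name}}` placeholder is filled from the same-named key of the data item, which is inferred from the sample data rather than stated on this page; `template_fields` and `missing_fields` are hypothetical helper names, not part of Potato.

```python
import re

# Matches {{name}} placeholders of the kind used in html_layout.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def template_fields(layout: str) -> set[str]:
    """Return the set of placeholder names referenced by a layout string."""
    return set(PLACEHOLDER.findall(layout))


def missing_fields(layout: str, item: dict) -> set[str]:
    """Return placeholder names that have no matching key in the data item."""
    return template_fields(layout) - set(item)


# Abbreviated stand-in for the html_layout in config.yaml (styling omitted).
layout = (
    "<span>{{repo_name}}</span><span>Test Results: {{test_results}}</span>"
    "<div>{{text}}</div><pre>{{model_patch}}</pre><pre>{{gold_patch}}</pre>"
)

# Abbreviated stand-in for one entry of sample-data.json.
item = {
    "id": "swebp-002",
    "text": "django/django#34521: ...",
    "repo_name": "django/django",
    "model_patch": "diff --git ...",
    "gold_patch": "diff --git ...",
    "test_results": "PASSED: 51/51 tests",
}

print(sorted(missing_fields(layout, item)))
```

Running the same check over every entry of sample-data.json, with the real html_layout string read from config.yaml, would flag incomplete items in one pass.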

Sample Data: sample-data.json

[
  {
    "id": "swebp-001",
    "text": "pandas-dev/pandas#52345: DataFrame.merge() produces incorrect result when merging on nullable Int64 column with pd.NA values.\n\nWhen merging two DataFrames on a column with Int64 (nullable integer) dtype that contains pd.NA values, the merge incorrectly includes rows where both sides have NA, treating NA == NA as True.",
    "repo_name": "pandas-dev/pandas",
    "model_patch": "diff --git a/pandas/core/reshape/merge.py b/pandas/core/reshape/merge.py\nindex 8a3f1b2e..c4d590a1 100644\n--- a/pandas/core/reshape/merge.py\n+++ b/pandas/core/reshape/merge.py\n@@ -1672,6 +1672,8 @@ class _MergeOperation:\n     def _get_join_indexers(self):\n         left_keys = self.left_join_keys\n         right_keys = self.right_join_keys\n+        # Filter out NA values for nullable integer dtypes\n+        if hasattr(left_keys[0], '_mask'):\n+            mask_l = ~left_keys[0]._mask\n+            mask_r = ~right_keys[0]._mask\n+            left_keys = [k[mask_l] for k in left_keys]\n+            right_keys = [k[mask_r] for k in right_keys]\n         return get_join_indexers(",
    "gold_patch": "diff --git a/pandas/core/reshape/merge.py b/pandas/core/reshape/merge.py\nindex 8a3f1b2e..d7e2b301 100644\n--- a/pandas/core/reshape/merge.py\n+++ b/pandas/core/reshape/merge.py\n@@ -1672,6 +1672,14 @@ class _MergeOperation:\n     def _get_join_indexers(self):\n         left_keys = self.left_join_keys\n         right_keys = self.right_join_keys\n+        # Handle nullable dtypes: NA should not match NA\n+        for i, (lk, rk) in enumerate(zip(left_keys, right_keys)):\n+            if isinstance(lk, BaseMaskedArray):\n+                left_keys[i] = lk._data.copy()\n+                left_keys[i][lk._mask] = _get_na_sentinel(lk.dtype)\n+            if isinstance(rk, BaseMaskedArray):\n+                right_keys[i] = rk._data.copy()\n+                right_keys[i][rk._mask] = _get_na_sentinel(rk.dtype)\n         return get_join_indexers(",
    "test_results": "PASSED: 47/50 tests | FAILED: test_merge_nullable_int_na, test_merge_nat_handling, test_merge_masked_array"
  },
  {
    "id": "swebp-002",
    "text": "django/django#34521: Migration autodetector misses index changes when only the index condition changes.\n\nWhen modifying only the condition of a conditional index (partial index), makemigrations doesn't detect the change and produces no migration.",
    "repo_name": "django/django",
    "model_patch": "diff --git a/django/db/migrations/autodetector.py b/django/db/migrations/autodetector.py\nindex 1b3e5a2f..9c4d7801 100644\n--- a/django/db/migrations/autodetector.py\n+++ b/django/db/migrations/autodetector.py\n@@ -1134,7 +1134,7 @@ class MigrationAutodetector:\n     def generate_altered_indexes(self):\n         for app_label, model_name in sorted(self.kept_model_keys):\n             old_indexes = self.from_state.models[app_label, model_name].options.get('indexes', [])\n             new_indexes = self.to_state.models[app_label, model_name].options.get('indexes', [])\n-            if old_indexes != new_indexes:\n+            if str(old_indexes) != str(new_indexes):\n                 self._generate_altered_indexes(app_label, model_name, old_indexes, new_indexes)",
    "gold_patch": "diff --git a/django/db/migrations/autodetector.py b/django/db/migrations/autodetector.py\nindex 1b3e5a2f..a5c8e901 100644\n--- a/django/db/migrations/autodetector.py\n+++ b/django/db/migrations/autodetector.py\n@@ -1134,8 +1134,12 @@ class MigrationAutodetector:\n     def generate_altered_indexes(self):\n         for app_label, model_name in sorted(self.kept_model_keys):\n             old_indexes = self.from_state.models[app_label, model_name].options.get('indexes', [])\n             new_indexes = self.to_state.models[app_label, model_name].options.get('indexes', [])\n-            if old_indexes != new_indexes:\n+            old_index_map = {idx.name: idx for idx in old_indexes}\n+            new_index_map = {idx.name: idx for idx in new_indexes}\n+            # Compare using deconstruct() to catch condition changes\n+            if any(\n+                old_index_map.get(name, None) is None or\n+                idx.deconstruct() != old_index_map[name].deconstruct()\n+                for name, idx in new_index_map.items()\n+            ) or set(old_index_map) != set(new_index_map):\n                 self._generate_altered_indexes(app_label, model_name, old_indexes, new_indexes)",
    "test_results": "PASSED: 51/51 tests"
  }
]

// ... and 6 more items
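
The two items above differ in a way that matters for screening: swebp-001 has failing tests, while swebp-002 passes every test with a patch that differs from the gold one, which is the situation the "Overfitting" label targets. The sketch below shows one way to pre-sort items along those lines. It assumes the test_results string always follows the "PASSED: n/m tests | FAILED: a, b" shape seen in the two samples, and that the patches are unified diffs; `parse_test_results`, `diff_stats`, and `needs_overfitting_review` are hypothetical helpers, and the flag is only a hint for annotators, not a verdict.

```python
import re

# Matches the "PASSED: 47/50" part of a test_results string.
SUMMARY = re.compile(r"PASSED:\s*(\d+)/(\d+)")


def parse_test_results(summary: str) -> dict:
    """Split a test_results string into passed/total counts and failed test names."""
    match = SUMMARY.search(summary)
    if match is None:
        raise ValueError(f"unrecognized test_results format: {summary!r}")
    failed = []
    if "FAILED:" in summary:
        tail = summary.split("FAILED:", 1)[1]
        failed = [name.strip() for name in tail.split(",") if name.strip()]
    return {"passed": int(match.group(1)), "total": int(match.group(2)), "failed": failed}


def diff_stats(patch: str) -> tuple[int, int]:
    """Count (added, removed) lines inside the hunks of a unified diff."""
    added = removed = 0
    in_hunk = False
    for line in patch.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # file headers (---/+++) follow; do not count them
        elif line.startswith("@@"):
            in_hunk = True  # hunk body starts on the next line
        elif in_hunk and line.startswith("+"):
            added += 1
        elif in_hunk and line.startswith("-"):
            removed += 1
    return added, removed


def needs_overfitting_review(item: dict) -> bool:
    """Flag items where all tests pass but the model patch is not the gold patch."""
    results = parse_test_results(item["test_results"])
    all_pass = results["passed"] == results["total"] and not results["failed"]
    return all_pass and item["model_patch"].strip() != item["gold_patch"].strip()


# Shortened, made-up diffs standing in for the long patches in sample-data.json.
item = {
    "test_results": "PASSED: 51/51 tests",
    "model_patch": "diff --git a/f.py b/f.py\n--- a/f.py\n+++ b/f.py\n@@ -1,2 +1,2 @@\n ctx\n-old\n+new",
    "gold_patch": "diff --git a/f.py b/f.py\n--- a/f.py\n+++ b/f.py\n@@ -1,2 +1,3 @@\n ctx\n-old\n+new\n+extra",
}

print(diff_stats(item["model_patch"]), diff_stats(item["gold_patch"]))
print(needs_overfitting_review(item))
```

Comparing patches as raw strings is deliberately crude: two patches that differ only in whitespace or comments would still be flagged, so the result is best used to order the annotation queue rather than to pre-fill a label.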

Get This Design

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/swebench-plus-patch-screening
potato start config.yaml

Details

Annotation Types

radio, multiselect, pairwise, text

Domain

Software Engineering, Code Generation

Use Cases

Patch Evaluation, Code Review

Tags

swe-bench, patch-screening, code-diff, benchmark, agentic-coding
