SWE-Bench+ Patch Screening

Screen model-generated patches against gold patches for SWE-Bench+ instances. Annotators judge whether each model patch is correct, flag specific issues, and compare the model and gold solutions side by side.

Configuration File: config.yaml

# SWE-Bench+ Patch Screening
# Based on "SWE-Bench+: Enhanced Coding Benchmark for LLMs" (Aleithan et al., arXiv 2024)
# Task: Screen model-generated patches against gold patches for correctness and quality

annotation_task_name: "SWE-Bench+ Patch Screening"
task_dir: "."

data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

html_layout: |
  <div class="container" style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif; max-width: 1500px; margin: 0 auto;">
    <div style="background: #0d1117; color: #c9d1d9; padding: 10px 16px; border-radius: 6px 6px 0 0; font-size: 14px;">
      <span style="font-weight: 600; color: #58a6ff;">{{repo_name}}</span>
      <span style="margin-left: 12px; color: #8b949e;">Test Results: {{test_results}}</span>
    </div>
    <div style="border: 1px solid #d0d7de; border-radius: 6px; padding: 16px; margin: 8px 0; background: #f6f8fa;">
      <h3 style="margin-top: 0; color: #24292f;">Issue Description</h3>
      <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6; color: #1f2328;">{{text}}</div>
    </div>
    <div style="display: flex; gap: 12px; margin-top: 8px;">
      <div style="flex: 1; border: 1px solid #30363d; border-radius: 6px; overflow: hidden;">
        <div style="background: #1a2233; color: #58a6ff; padding: 8px 16px; font-weight: 600; font-size: 13px;">Model Patch</div>
        <pre style="margin: 0; padding: 12px 16px; background: #0d1117; color: #c9d1d9; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{model_patch}}</pre>
      </div>
      <div style="flex: 1; border: 1px solid #30363d; border-radius: 6px; overflow: hidden;">
        <div style="background: #1a2d1a; color: #3fb950; padding: 8px 16px; font-weight: 600; font-size: 13px;">Gold Patch</div>
        <pre style="margin: 0; padding: 12px 16px; background: #0d1117; color: #c9d1d9; font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{gold_patch}}</pre>
      </div>
    </div>
  </div>

annotation_schemes:
  - name: "patch_correctness"
    description: "Does the model patch correctly fix the issue?"
    annotation_type: radio
    labels:
      - "Correct — properly fixes the issue"
      - "Partially Correct — fixes some but not all aspects"
      - "Incorrect — does not fix the issue"
      - "Overfitting — passes tests but wrong approach"
    keyboard_shortcuts:
      "Correct — properly fixes the issue": "1"
      "Partially Correct — fixes some but not all aspects": "2"
      "Incorrect — does not fix the issue": "3"
      "Overfitting — passes tests but wrong approach": "4"

  - name: "patch_issues"
    description: "Select all issues present in the model patch"
    annotation_type: multiselect
    labels:
      - "Missing Edge Case"
      - "Incorrect Logic"
      - "Style Violation"
      - "Unnecessary Changes"
      - "Incomplete Fix"
      - "Test Overfitting"
      - "No Issues"

  - name: "patch_preference"
    description: "Which patch is better overall?"
    annotation_type: pairwise
    labels:
      - "Model Patch"
      - "Gold Patch"
      - "Equivalent"

  - name: "screening_notes"
    description: "Detailed screening observations"
    annotation_type: text

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
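
The html_layout in the configuration above references item fields through {{name}} placeholders, so every item in the data file needs those keys alongside the id_key. The following is a minimal sketch of a pre-flight check, not part of Potato itself: the helper names `layout_fields` and `missing_fields` are hypothetical, and `LAYOUT_SNIPPET` is a shortened stand-in that keeps only the placeholders from the real layout.

```python
import re

# Shortened stand-in for html_layout: same placeholders, markup trimmed.
LAYOUT_SNIPPET = """
<span>{{repo_name}}</span>
<span>Test Results: {{test_results}}</span>
<div>{{text}}</div>
<pre>{{model_patch}}</pre>
<pre>{{gold_patch}}</pre>
"""

# Matches {{name}} and tolerates optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def layout_fields(layout: str) -> set:
    """Return the field names referenced as {{name}} in the layout."""
    return set(PLACEHOLDER.findall(layout))


def missing_fields(item: dict, layout: str, id_key: str = "id") -> set:
    """Return layout fields (plus the id key) that the item does not define."""
    required = layout_fields(layout) | {id_key}
    return required - set(item)


if __name__ == "__main__":
    # Hypothetical item that forgot the gold patch.
    incomplete = {
        "id": "example-1",
        "text": "issue text",
        "repo_name": "org/repo",
        "model_patch": "diff ...",
        "test_results": "PASSED: 1/1 tests",
    }
    print(missing_fields(incomplete, LAYOUT_SNIPPET))
```

Running a check like this over the data file before starting the server would surface items that are missing a field the layout expects.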

Sample Data: sample-data.json

[
  {
    "id": "swebp-001",
    "text": "pandas-dev/pandas#52345: DataFrame.merge() produces incorrect result when merging on nullable Int64 column with NaN values.\n\nWhen merging two DataFrames on a column with Int64 (nullable integer) dtype that contains pd.NA values, the merge incorrectly includes rows where both sides have NA, treating NA == NA as True.",
    "repo_name": "pandas-dev/pandas",
    "model_patch": "diff --git a/pandas/core/reshape/merge.py b/pandas/core/reshape/merge.py\nindex 8a3f1b2e..c4d590a1 100644\n--- a/pandas/core/reshape/merge.py\n+++ b/pandas/core/reshape/merge.py\n@@ -1672,6 +1672,8 @@ class _MergeOperation:\n     def _get_join_indexers(self):\n         left_keys = self.left_join_keys\n         right_keys = self.right_join_keys\n+        # Filter out NA values for nullable integer dtypes\n+        if hasattr(left_keys[0], '_mask'):\n+            mask_l = ~left_keys[0]._mask\n+            mask_r = ~right_keys[0]._mask\n+            left_keys = [k[mask_l] for k in left_keys]\n+            right_keys = [k[mask_r] for k in right_keys]\n         return get_join_indexers(",
    "gold_patch": "diff --git a/pandas/core/reshape/merge.py b/pandas/core/reshape/merge.py\nindex 8a3f1b2e..d7e2b301 100644\n--- a/pandas/core/reshape/merge.py\n+++ b/pandas/core/reshape/merge.py\n@@ -1672,6 +1672,14 @@ class _MergeOperation:\n     def _get_join_indexers(self):\n         left_keys = self.left_join_keys\n         right_keys = self.right_join_keys\n+        # Handle nullable dtypes: NA should not match NA\n+        for i, (lk, rk) in enumerate(zip(left_keys, right_keys)):\n+            if isinstance(lk, BaseMaskedArray):\n+                left_keys[i] = lk._data.copy()\n+                left_keys[i][lk._mask] = _get_na_sentinel(lk.dtype)\n+            if isinstance(rk, BaseMaskedArray):\n+                right_keys[i] = rk._data.copy()\n+                right_keys[i][rk._mask] = _get_na_sentinel(rk.dtype)\n         return get_join_indexers(",
    "test_results": "PASSED: 47/50 tests | FAILED: test_merge_nullable_int_na, test_merge_nat_handling, test_merge_masked_array"
  },
  {
    "id": "swebp-002",
    "text": "django/django#34521: Migration autodetector misses index changes when only the index condition changes.\n\nWhen modifying only the condition of a conditional index (partial index), makemigrations doesn't detect the change and produces no migration.",
    "repo_name": "django/django",
    "model_patch": "diff --git a/django/db/migrations/autodetector.py b/django/db/migrations/autodetector.py\nindex 1b3e5a2f..9c4d7801 100644\n--- a/django/db/migrations/autodetector.py\n+++ b/django/db/migrations/autodetector.py\n@@ -1134,7 +1134,7 @@ class MigrationAutodetector:\n     def generate_altered_indexes(self):\n         for app_label, model_name in sorted(self.kept_model_keys):\n             old_indexes = self.from_state.models[app_label, model_name].options.get('indexes', [])\n             new_indexes = self.to_state.models[app_label, model_name].options.get('indexes', [])\n-            if old_indexes != new_indexes:\n+            if str(old_indexes) != str(new_indexes):\n                 self._generate_altered_indexes(app_label, model_name, old_indexes, new_indexes)",
    "gold_patch": "diff --git a/django/db/migrations/autodetector.py b/django/db/migrations/autodetector.py\nindex 1b3e5a2f..a5c8e901 100644\n--- a/django/db/migrations/autodetector.py\n+++ b/django/db/migrations/autodetector.py\n@@ -1134,8 +1134,12 @@ class MigrationAutodetector:\n     def generate_altered_indexes(self):\n         for app_label, model_name in sorted(self.kept_model_keys):\n             old_indexes = self.from_state.models[app_label, model_name].options.get('indexes', [])\n             new_indexes = self.to_state.models[app_label, model_name].options.get('indexes', [])\n-            if old_indexes != new_indexes:\n+            old_index_map = {idx.name: idx for idx in old_indexes}\n+            new_index_map = {idx.name: idx for idx in new_indexes}\n+            # Compare using deconstruct() to catch condition changes\n+            if any(\n+                old_index_map.get(name, None) is None or\n+                idx.deconstruct() != old_index_map[name].deconstruct()\n+                for name, idx in new_index_map.items()\n+            ) or set(old_index_map) != set(new_index_map):\n                 self._generate_altered_indexes(app_label, model_name, old_indexes, new_indexes)",
    "test_results": "PASSED: 51/51 tests"
  }
]

// ... and 6 more items
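
The test_results strings in the two items shown follow a "PASSED: n/m tests" pattern, optionally followed by "| FAILED:" and a comma-separated list of test names. Assuming the remaining items use the same format (not confirmed by the source), a small parser can separate fully passing items from failing ones. Fully passing items are the ones where the "Overfitting" label could apply, since that label describes patches that pass tests with the wrong approach. The names `ResultSummary` and `parse_test_results` are hypothetical.

```python
import re
from dataclasses import dataclass, field

# Assumed format, inferred from the two sample items shown above.
SUMMARY = re.compile(r"PASSED:\s*(\d+)/(\d+)\s*tests")


@dataclass
class ResultSummary:
    passed: int
    total: int
    failed_names: list = field(default_factory=list)

    @property
    def all_passed(self) -> bool:
        """True when every test in the run passed."""
        return self.passed == self.total


def parse_test_results(raw: str) -> ResultSummary:
    """Parse a test_results string into counts and failed test names."""
    match = SUMMARY.search(raw)
    if match is None:
        raise ValueError(f"unrecognized test_results string: {raw!r}")
    passed, total = int(match.group(1)), int(match.group(2))
    failed_names = []
    if "FAILED:" in raw:
        # Everything after "FAILED:" is treated as a comma-separated name list.
        tail = raw.split("FAILED:", 1)[1]
        failed_names = [name.strip() for name in tail.split(",") if name.strip()]
    return ResultSummary(passed, total, failed_names)


if __name__ == "__main__":
    for raw in (
        "PASSED: 47/50 tests | FAILED: test_merge_nullable_int_na, "
        "test_merge_nat_handling, test_merge_masked_array",
        "PASSED: 51/51 tests",
    ):
        summary = parse_test_results(raw)
        print(summary.passed, summary.total, summary.all_passed, summary.failed_names)
```

A passing test suite does not by itself mean the patch is correct, so a flag like `all_passed` is only a way to prioritize items for closer comparison against the gold patch.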

Get This Design

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/swebench-plus-patch-screening
potato start config.yaml

Details

Annotation Types

radio, multiselect, pairwise, text

Domain

Software Engineering, Code Generation

Use Cases

Patch Evaluation, Code Review

Related Designs

RefactorBench Multi-File Evaluation

Evaluate multi-file refactoring operations generated by coding agents. Annotators assess whether refactorings preserve behavior, identify the types of refactoring applied, rate code improvement, and provide detailed review comments.

radio, multiselect

SWE-PRM Coding Process Reward

Process reward annotation for software engineering agent traces. Annotators verify each coding action step taken by an SWE agent when resolving GitHub issues, identifying the first step where the agent goes astray and classifying the error type.