SWE-bench: Code Agent Solution Evaluation
Evaluation of code agent solutions to real GitHub issues. Annotators review code patches generated by AI agents, assess correctness, check test compatibility, and evaluate code quality.
Configuration file: config.yaml
# SWE-bench: Code Agent Solution Evaluation
# Based on "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" (Jimenez et al., ICLR 2024)
# Task: Review code patches generated by AI agents for real GitHub issues
annotation_task_name: "SWE-bench Code Agent Evaluation"
task_dir: "."
# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Display layout showing issue, patch, and test results
html_layout: |
  <div class="swebench-container">
    <div class="repo-info" style="background: #e8eaf6; padding: 10px; border-radius: 8px; margin-bottom: 15px;">
      <strong>Repository:</strong> {{repo_name}}
    </div>
    <div class="issue-section" style="background: #e8f5e9; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
      <h3 style="margin-top: 0;">GitHub Issue:</h3>
      <div class="issue-text" style="font-size: 15px; white-space: pre-wrap;">{{text}}</div>
    </div>
    <div class="patch-section" style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #424242;">
      <h3 style="margin-top: 0;">Agent-Generated Patch:</h3>
      <pre style="white-space: pre-wrap; font-family: monospace; font-size: 13px; line-height: 1.5; overflow-x: auto;">{{patch}}</pre>
    </div>
    <div class="test-section" style="background: #fff3e0; padding: 15px; border-radius: 8px; border: 2px solid #f9a825;">
      <h3 style="margin-top: 0; color: #f9a825;">Test Results:</h3>
      <pre style="white-space: pre-wrap; font-family: monospace; font-size: 13px;">{{test_results}}</pre>
    </div>
  </div>
# Annotation schemes
annotation_schemes:
  # Correctness assessment
  - name: "correctness"
    description: "Does the patch correctly fix the reported issue?"
    annotation_type: radio
    labels:
      - "Correct - fully fixes the issue"
      - "Partially Correct - fixes some aspects but not all"
      - "Incorrect - does not fix the issue"
      - "Introduces new bugs - fix creates other problems"
    keyboard_shortcuts:
      "Correct - fully fixes the issue": "1"
      "Partially Correct - fixes some aspects but not all": "2"
      "Incorrect - does not fix the issue": "3"
      "Introduces new bugs - fix creates other problems": "4"
  # Code quality assessment
  - name: "code_quality"
    description: "Rate the overall quality of the generated code."
    annotation_type: radio
    labels:
      - "Clean - well-structured, follows conventions, maintainable"
      - "Acceptable - works but could be improved"
      - "Poor - messy, violates conventions, hard to maintain"
    keyboard_shortcuts:
      "Clean - well-structured, follows conventions, maintainable": "a"
      "Acceptable - works but could be improved": "s"
      "Poor - messy, violates conventions, hard to maintain": "d"
  # Review comments
  - name: "review_comments"
    description: "Provide code review comments explaining your assessment. Note specific issues, suggestions for improvement, or edge cases."
    annotation_type: text
    min_length: 20
    max_length: 600
    placeholder: "Provide specific code review feedback: correctness issues, style concerns, edge cases missed, test coverage gaps..."
  # Test compatibility
  - name: "test_compatibility"
    description: "Are the existing tests passing with this patch?"
    annotation_type: radio
    labels:
      - "All tests pass"
      - "Most tests pass (minor failures)"
      - "Some tests fail"
      - "Many tests fail"
      - "Cannot determine from provided info"
    keyboard_shortcuts:
      "All tests pass": "z"
      "Most tests pass (minor failures)": "x"
      "Some tests fail": "c"
      "Many tests fail": "v"
      "Cannot determine from provided info": "b"
# User configuration
allow_all_users: true
# Task assignment
instances_per_annotator: 50
annotation_per_instance: 2
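The `{{...}}` placeholders in `html_layout` are filled from each data item's fields at display time. As a rough illustration of that substitution (a hand-rolled sketch, not Potato's actual template engine; the `render` helper and the abbreviated item are illustrative):

```python
import re

# Abbreviated item mirroring the structure of sample-data.json.
item = {
    "id": "swe_001",
    "repo_name": "pandas-dev/pandas",
}

# One line excerpted from the html_layout template above.
HTML_LAYOUT = "<strong>Repository:</strong> {{repo_name}}"

def render(layout: str, item: dict) -> str:
    # Replace each {{key}} placeholder with the item's value for that key;
    # unknown keys render as an empty string.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(item.get(m.group(1), "")), layout)

print(render(HTML_LAYOUT, item))  # <strong>Repository:</strong> pandas-dev/pandas
```

The same substitution applies to `{{text}}`, `{{patch}}`, and `{{test_results}}` in the full template.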
Sample data: sample-data.json
[
{
"id": "swe_001",
"text": "Issue #4521: `DataFrame.groupby().apply()` raises TypeError when the applied function returns a scalar value for groups with a single row.\n\nSteps to reproduce:\n```python\nimport pandas as pd\ndf = pd.DataFrame({'key': ['a', 'b', 'b'], 'value': [1, 2, 3]})\nresult = df.groupby('key').apply(lambda x: x['value'].sum())\n```\nExpected: Returns a Series with group sums.\nActual: TypeError: 'int' object is not iterable",
"patch": "--- a/pandas/core/groupby/groupby.py\n+++ b/pandas/core/groupby/groupby.py\n@@ -1245,7 +1245,10 @@ class GroupBy:\n def _apply_func(self, func, data):\n result = func(data)\n- if isinstance(result, DataFrame):\n+ if is_scalar(result):\n+ return Series([result], index=data.index[:1])\n+ elif isinstance(result, DataFrame):\n return result\n elif isinstance(result, Series):\n return result\n else:\n return Series(result)",
"repo_name": "pandas-dev/pandas",
"test_results": "PASSED: test_groupby_apply_scalar_return (3 subtests)\nPASSED: test_groupby_apply_dataframe_return\nPASSED: test_groupby_apply_series_return\nFAILED: test_groupby_apply_empty_groups - AssertionError: Expected empty DataFrame, got Series\n\nTotal: 15 passed, 1 failed, 0 errors"
},
{
"id": "swe_002",
"text": "Issue #8734: `json.dumps()` does not properly handle `datetime.date` objects, raising `TypeError: Object of type date is not JSON serializable`.\n\nThis should be handled gracefully with a default serializer or a clear error message suggesting a solution.",
"patch": "--- a/Lib/json/encoder.py\n+++ b/Lib/json/encoder.py\n@@ -178,6 +178,8 @@ class JSONEncoder:\n def default(self, o):\n if isinstance(o, datetime.datetime):\n return o.isoformat()\n+ elif isinstance(o, datetime.date):\n+ return o.isoformat()\n raise TypeError(f\"Object of type {type(o).__name__} is not JSON serializable. \"\n f\"Consider using a custom encoder or converting to a \"\n f\"serializable type first.\")",
"repo_name": "python/cpython",
"test_results": "PASSED: test_date_serialization\nPASSED: test_datetime_serialization\nPASSED: test_custom_encoder_still_works\nPASSED: test_error_message_improved\nPASSED: test_nested_date_objects\n\nTotal: 12 passed, 0 failed, 0 errors"
}
]
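Because the layout references `id`, `text`, `patch`, `repo_name`, and `test_results`, every item in the data file should carry all five keys, or the corresponding panel renders blank. A small sanity-check sketch (the inline `items` list abbreviates the JSON above; in practice you would `json.load()` the file):

```python
# Keys the html_layout template references; a missing key renders a blank field.
REQUIRED_KEYS = {"id", "text", "patch", "repo_name", "test_results"}

# Abbreviated stand-in for json.load(open("sample-data.json")).
items = [
    {"id": "swe_001", "text": "...", "patch": "...",
     "repo_name": "pandas-dev/pandas", "test_results": "..."},
    {"id": "swe_002", "text": "...", "patch": "...",
     "repo_name": "python/cpython", "test_results": "..."},
]

# Collect the ids of any items missing one or more required keys.
bad = [it["id"] for it in items if REQUIRED_KEYS - it.keys()]
print("missing-key items:", bad)  # expect []
```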
// ... and 8 more items
Get this layout
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/swebench-code-agent-eval
potato start config.yaml
Found a problem or want to improve this layout?
Open an issue
Related layouts
Code Review Annotation (CodeReviewer)
Annotation of code review activities based on the CodeReviewer benchmark. Annotators identify issues in code diffs, classify defect types, assign severity levels, make review decisions, and provide natural language review comments, supporting research in automated code review and software engineering.
FActScore: Fine-grained Atomic Evaluation of Factual Precision
Atomic fact evaluation in LLM-generated text. Annotators decompose generated text into atomic facts and verify each fact as supported, not-supported, or irrelevant against a reference source. Based on the FActScore framework for evaluating factual precision in long-form text generation.
GPQA - Graduate-Level Expert QA Evaluation
Expert-level question answering evaluation on graduate-level science questions from the GPQA benchmark (Rein et al., ICLR 2024). Questions span physics, chemistry, and biology, designed to be answerable only by domain experts.