DevBench Repository Evaluation
Evaluate AI-generated repositories across the full software development lifecycle. Annotators rate architecture design, code quality, test coverage, documentation, and dependency management for generated projects.
Configuration File: config.yaml
# DevBench Repository Evaluation
# Based on "Prompting Large Language Models to Tackle the Full Software Development Lifecycle" (Li et al., arXiv 2024)
# Task: Evaluate AI-generated repositories for architecture, code quality, testing, documentation, and dependencies
annotation_task_name: "DevBench Repository Evaluation"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif; max-width: 1500px; margin: 0 auto;">
    <div style="background: #24292f; color: #ffffff; padding: 10px 16px; border-radius: 6px 6px 0 0; font-size: 14px; font-weight: 600;">
      Project: {{language}} Repository
    </div>
    <div style="border: 1px solid #d0d7de; border-radius: 6px; padding: 16px; margin: 8px 0; background: #f6f8fa;">
      <h3 style="margin-top: 0; color: #24292f;">Project Specification</h3>
      <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6; color: #1f2328;">{{text}}</div>
    </div>
    <div style="display: flex; gap: 12px; margin-top: 8px;">
      <div style="flex: 0 0 280px; border: 1px solid #d0d7de; border-radius: 6px; overflow: hidden;">
        <div style="background: #2d333b; color: #adbac7; padding: 8px 12px; font-weight: 600; font-size: 13px;">Repository Structure</div>
        <pre style="margin: 0; padding: 12px; background: #22272e; color: #adbac7; font-family: 'SFMono-Regular', Consolas, monospace; font-size: 12px; line-height: 1.6; overflow-x: auto; white-space: pre;">{{repo_structure}}</pre>
      </div>
      <div style="flex: 1; border: 1px solid #d0d7de; border-radius: 6px; overflow: hidden;">
        <div style="background: #2d333b; color: #adbac7; padding: 8px 12px; font-weight: 600; font-size: 13px;">Key Source Files</div>
        <pre style="margin: 0; padding: 12px; background: #22272e; color: #adbac7; font-family: 'SFMono-Regular', Consolas, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{key_files}}</pre>
      </div>
    </div>
    <div style="margin-top: 8px; border: 1px solid #30363d; border-radius: 6px; overflow: hidden;">
      <div style="background: #0d1117; color: #3fb950; padding: 8px 12px; font-weight: 600; font-size: 13px;">Test Output</div>
      <pre style="margin: 0; padding: 12px; background: #161b22; color: #c9d1d9; font-family: 'SFMono-Regular', Consolas, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{test_output}}</pre>
    </div>
  </div>
annotation_schemes:
  - name: "repo_criteria"
    description: "Rate the repository on each quality dimension."
    annotation_type: multirate
    labels:
      - "1 - Very Poor"
      - "2 - Poor"
      - "3 - Average"
      - "4 - Good"
      - "5 - Excellent"
    options:
      - "Architecture Design"
      - "Code Quality"
      - "Test Coverage"
      - "Documentation"
      - "Dependency Management"
  - name: "overall_grade"
    description: "Assign an overall letter grade to this repository"
    annotation_type: radio
    labels:
      - "A — Excellent"
      - "B — Good"
      - "C — Average"
      - "D — Below Average"
      - "F — Failing"
    keyboard_shortcuts:
      "A — Excellent": "1"
      "B — Good": "2"
      "C — Average": "3"
      "D — Below Average": "4"
      "F — Failing": "5"
  - name: "review_comments"
    description: "Detailed code review notes referencing specific files"
    annotation_type: text
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
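With `annotation_per_instance: 2`, each repository is rated by two annotators, so the multirate scores need to be aggregated downstream. A minimal sketch of that step, assuming a hypothetical per-annotator output shape with a `repo_criteria` dict keyed by dimension (not necessarily Potato's actual output format):

```python
# Hypothetical aggregation of DevBench multirate annotations.
# The input shape below is an assumption, not Potato's exact output format.

def mean_ratings(annotations):
    """Average the 1-5 scores each annotator gave per quality dimension."""
    totals = {}
    for ann in annotations:  # one dict per annotator
        for dimension, label in ann["repo_criteria"].items():
            score = int(label.split(" - ")[0])  # "4 - Good" -> 4
            totals.setdefault(dimension, []).append(score)
    return {dim: sum(s) / len(s) for dim, s in totals.items()}

annotator_a = {"repo_criteria": {"Architecture Design": "4 - Good",
                                 "Test Coverage": "5 - Excellent"}}
annotator_b = {"repo_criteria": {"Architecture Design": "3 - Average",
                                 "Test Coverage": "5 - Excellent"}}

print(mean_ratings([annotator_a, annotator_b]))
# {'Architecture Design': 3.5, 'Test Coverage': 5.0}
```

Parsing the numeric prefix out of the label string works because every multirate label in the config follows the `"N - Description"` pattern.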
Sample Data: sample-data.json
[
{
"id": "devbench-001",
"text": "Build a URL shortener service in Python. Requirements:\n- REST API with endpoints: POST /shorten (create short URL), GET /{code} (redirect), GET /stats/{code} (click analytics)\n- SQLite database for persistence\n- Base62 encoding for short codes\n- Rate limiting (100 requests/minute per IP)\n- Click tracking with timestamp, referrer, and user-agent\n- Input validation for URLs\n- Docker support",
"repo_structure": "url-shortener/\n├── README.md\n├── requirements.txt\n├── Dockerfile\n├── docker-compose.yml\n├── setup.py\n├── src/\n│ ├── __init__.py\n│ ├── app.py\n│ ├── models.py\n│ ├── routes.py\n│ ├── encoder.py\n│ └── middleware.py\n├── tests/\n│ ├── __init__.py\n│ ├── test_routes.py\n│ ├── test_encoder.py\n│ └── test_models.py\n└── migrations/\n └── 001_initial.sql",
"key_files": "# src/app.py\nfrom flask import Flask\nfrom src.models import db\nfrom src.routes import api_bp\nfrom src.middleware import RateLimiter\n\ndef create_app(config=None):\n app = Flask(__name__)\n app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite:///urls.db'\n if config:\n app.config.update(config)\n db.init_app(app)\n app.register_blueprint(api_bp)\n RateLimiter(app, limit=100, window=60)\n return app\n\n# src/encoder.py\nimport string\n\nALPHABET = string.digits + string.ascii_letters # 62 chars\n\ndef encode(num: int) -> str:\n if num == 0:\n return ALPHABET[0]\n result = []\n while num:\n num, rem = divmod(num, 62)\n result.append(ALPHABET[rem])\n return ''.join(reversed(result))\n\ndef decode(code: str) -> int:\n num = 0\n for char in code:\n num = num * 62 + ALPHABET.index(char)\n return num",
"test_output": "$ python -m pytest tests/ -v\ntests/test_encoder.py::test_encode_zero PASSED\ntests/test_encoder.py::test_encode_roundtrip PASSED\ntests/test_encoder.py::test_encode_large_number PASSED\ntests/test_routes.py::test_shorten_valid_url PASSED\ntests/test_routes.py::test_shorten_invalid_url PASSED\ntests/test_routes.py::test_redirect PASSED\ntests/test_routes.py::test_stats_endpoint PASSED\ntests/test_routes.py::test_rate_limiting PASSED\ntests/test_models.py::test_create_url PASSED\ntests/test_models.py::test_click_tracking PASSED\n\n10 passed in 0.84s",
"language": "Python"
},
{
"id": "devbench-002",
"text": "Build a task queue system in Go. Requirements:\n- In-memory priority queue with persistent WAL (write-ahead log)\n- HTTP API: POST /enqueue, GET /dequeue, GET /status\n- Support task priorities (low, normal, high, critical)\n- Configurable retry with exponential backoff\n- Dead letter queue for failed tasks\n- Graceful shutdown with in-flight task completion\n- Prometheus metrics endpoint",
"repo_structure": "taskqueue/\n├── README.md\n├── go.mod\n├── go.sum\n├── Makefile\n├── cmd/\n│ └── server/\n│ └── main.go\n├── internal/\n│ ├── queue/\n│ │ ├── priority_queue.go\n│ │ ├── wal.go\n│ │ └── dlq.go\n│ ├── handler/\n│ │ └── api.go\n│ └── metrics/\n│ └── prometheus.go\n├── pkg/\n│ └── models/\n│ └── task.go\n└── tests/\n ├── queue_test.go\n ├── wal_test.go\n └── api_test.go",
"key_files": "// internal/queue/priority_queue.go\npackage queue\n\nimport (\n \"container/heap\"\n \"sync\"\n \"taskqueue/pkg/models\"\n)\n\ntype PriorityQueue struct {\n mu sync.RWMutex\n items []*models.Task\n wal *WAL\n}\n\nfunc (pq *PriorityQueue) Len() int { return len(pq.items) }\nfunc (pq *PriorityQueue) Less(i, j int) bool {\n return pq.items[i].Priority > pq.items[j].Priority\n}\nfunc (pq *PriorityQueue) Swap(i, j int) {\n pq.items[i], pq.items[j] = pq.items[j], pq.items[i]\n}\n\nfunc (pq *PriorityQueue) Enqueue(task *models.Task) error {\n pq.mu.Lock()\n defer pq.mu.Unlock()\n if err := pq.wal.Append(task); err != nil {\n return err\n }\n heap.Push(pq, task)\n return nil\n}\n\n// pkg/models/task.go\npackage models\n\nimport \"time\"\n\ntype Priority int\nconst (\n Low Priority = 0\n Normal Priority = 1\n High Priority = 2\n Critical Priority = 3\n)\n\ntype Task struct {\n ID string `json:\"id\"`\n Payload []byte `json:\"payload\"`\n Priority Priority `json:\"priority\"`\n Retries int `json:\"retries\"`\n MaxRetry int `json:\"max_retry\"`\n CreatedAt time.Time `json:\"created_at\"`\n}",
"test_output": "$ go test ./... -v\n=== RUN TestEnqueueDequeue\n--- PASS: TestEnqueueDequeue (0.00s)\n=== RUN TestPriorityOrdering\n--- PASS: TestPriorityOrdering (0.00s)\n=== RUN TestWALPersistence\n--- PASS: TestWALPersistence (0.02s)\n=== RUN TestWALRecovery\n--- PASS: TestWALRecovery (0.01s)\n=== RUN TestDeadLetterQueue\n--- PASS: TestDeadLetterQueue (0.00s)\n=== RUN TestAPIEnqueue\n--- PASS: TestAPIEnqueue (0.01s)\n=== RUN TestAPIDequeue\n--- PASS: TestAPIDequeue (0.00s)\n=== RUN TestRetryBackoff\n--- FAIL: TestRetryBackoff (0.05s)\n queue_test.go:89: expected backoff 4s, got 2s\n\n7 passed, 1 failed",
"language": "Go"
}
]
// ... and 6 more items

Get This Design
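The Base62 encoder shown in devbench-001's key files is self-contained and can be exercised standalone; the sketch below copies that sample code verbatim and adds a small roundtrip check (the driver loop and printed code are additions for illustration):

```python
import string

ALPHABET = string.digits + string.ascii_letters  # 62 chars: 0-9, a-z, A-Z

def encode(num: int) -> str:
    """Convert a non-negative integer (e.g. a database row id) to a short code."""
    if num == 0:
        return ALPHABET[0]
    result = []
    while num:
        num, rem = divmod(num, 62)
        result.append(ALPHABET[rem])
    return ''.join(reversed(result))

def decode(code: str) -> int:
    """Invert encode(): map a Base62 short code back to its integer."""
    num = 0
    for char in code:
        num = num * 62 + ALPHABET.index(char)
    return num

# Roundtrip property the sample's test_encoder.py tests presumably cover.
for n in (0, 61, 62, 123456789):
    assert decode(encode(n)) == n

print(encode(123456789))
# → 8m0Kx
```

Running snippets like this during annotation is one way for reviewers to ground their "Code Quality" and "Test Coverage" ratings in observed behavior rather than appearance alone.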
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/devbench-repo-eval
potato start config.yaml
Related Designs
TrajEval Staged Evaluation
Evaluate code agent trajectories decomposed into search, edit, and verification stages, rating quality of each stage and determining overall pass/fail verdict.
AgentRewardBench Trajectory Scoring
Evaluate web agent trajectories by rating step-level quality across multiple dimensions, judging overall success, and identifying where automatic evaluators disagree with human judgment.
BigCodeBench Human Baseline Evaluation
Evaluate agent-generated code solutions for BigCodeBench tasks. Annotators assess correctness against test suites, rate task complexity, evaluate code quality, and provide notes on the solution approach.