
DevBench Repository Evaluation

Evaluate AI-generated repositories across the full software development lifecycle. Annotators rate architecture design, code quality, test coverage, documentation, and dependency management for generated projects.


Configuration File (config.yaml)

# DevBench Repository Evaluation
# Based on "Prompting Large Language Models to Tackle the Full Software Development Lifecycle" (Li et al., arXiv 2024)
# Task: Evaluate AI-generated repositories for architecture, code quality, testing, documentation, and dependencies

annotation_task_name: "DevBench Repository Evaluation"
task_dir: "."

data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

html_layout: |
  <div class="container" style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif; max-width: 1500px; margin: 0 auto;">
    <div style="background: #24292f; color: #ffffff; padding: 10px 16px; border-radius: 6px 6px 0 0; font-size: 14px; font-weight: 600;">
      Project: {{language}} Repository
    </div>
    <div style="border: 1px solid #d0d7de; border-radius: 6px; padding: 16px; margin: 8px 0; background: #f6f8fa;">
      <h3 style="margin-top: 0; color: #24292f;">Project Specification</h3>
      <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6; color: #1f2328;">{{text}}</div>
    </div>
    <div style="display: flex; gap: 12px; margin-top: 8px;">
      <div style="flex: 0 0 280px; border: 1px solid #d0d7de; border-radius: 6px; overflow: hidden;">
        <div style="background: #2d333b; color: #adbac7; padding: 8px 12px; font-weight: 600; font-size: 13px;">Repository Structure</div>
        <pre style="margin: 0; padding: 12px; background: #22272e; color: #adbac7; font-family: 'SFMono-Regular', Consolas, monospace; font-size: 12px; line-height: 1.6; overflow-x: auto; white-space: pre;">{{repo_structure}}</pre>
      </div>
      <div style="flex: 1; border: 1px solid #d0d7de; border-radius: 6px; overflow: hidden;">
        <div style="background: #2d333b; color: #adbac7; padding: 8px 12px; font-weight: 600; font-size: 13px;">Key Source Files</div>
        <pre style="margin: 0; padding: 12px; background: #22272e; color: #adbac7; font-family: 'SFMono-Regular', Consolas, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{key_files}}</pre>
      </div>
    </div>
    <div style="margin-top: 8px; border: 1px solid #30363d; border-radius: 6px; overflow: hidden;">
      <div style="background: #0d1117; color: #3fb950; padding: 8px 12px; font-weight: 600; font-size: 13px;">Test Output</div>
      <pre style="margin: 0; padding: 12px; background: #161b22; color: #c9d1d9; font-family: 'SFMono-Regular', Consolas, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{test_output}}</pre>
    </div>
  </div>

annotation_schemes:
  - name: "repo_criteria"
    description: "Rate the repository on each quality dimension."
    annotation_type: multirate
    labels:
      - "1 - Very Poor"
      - "2 - Poor"
      - "3 - Average"
      - "4 - Good"
      - "5 - Excellent"
    options:
      - "Architecture Design"
      - "Code Quality"
      - "Test Coverage"
      - "Documentation"
      - "Dependency Management"

  - name: "overall_grade"
    description: "Assign an overall letter grade to this repository"
    annotation_type: radio
    labels:
      - "A — Excellent"
      - "B — Good"
      - "C — Average"
      - "D — Below Average"
      - "F — Failing"
    keyboard_shortcuts:
      "A — Excellent": "1"
      "B — Good": "2"
      "C — Average": "3"
      "D — Below Average": "4"
      "F — Failing": "5"

  - name: "review_comments"
    description: "Detailed code review notes referencing specific files"
    annotation_type: text

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2

Sample Data (sample-data.json)

[
  {
    "id": "devbench-001",
    "text": "Build a URL shortener service in Python. Requirements:\n- REST API with endpoints: POST /shorten (create short URL), GET /{code} (redirect), GET /stats/{code} (click analytics)\n- SQLite database for persistence\n- Base62 encoding for short codes\n- Rate limiting (100 requests/minute per IP)\n- Click tracking with timestamp, referrer, and user-agent\n- Input validation for URLs\n- Docker support",
    "repo_structure": "url-shortener/\n├── README.md\n├── requirements.txt\n├── Dockerfile\n├── docker-compose.yml\n├── setup.py\n├── src/\n│   ├── __init__.py\n│   ├── app.py\n│   ├── models.py\n│   ├── routes.py\n│   ├── encoder.py\n│   └── middleware.py\n├── tests/\n│   ├── __init__.py\n│   ├── test_routes.py\n│   ├── test_encoder.py\n│   └── test_models.py\n└── migrations/\n    └── 001_initial.sql",
    "key_files": "# src/app.py\nfrom flask import Flask\nfrom src.models import db\nfrom src.routes import api_bp\nfrom src.middleware import RateLimiter\n\ndef create_app(config=None):\n    app = Flask(__name__)\n    app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite:///urls.db'\n    if config:\n        app.config.update(config)\n    db.init_app(app)\n    app.register_blueprint(api_bp)\n    RateLimiter(app, limit=100, window=60)\n    return app\n\n# src/encoder.py\nimport string\n\nALPHABET = string.digits + string.ascii_letters  # 62 chars\n\ndef encode(num: int) -> str:\n    if num == 0:\n        return ALPHABET[0]\n    result = []\n    while num:\n        num, rem = divmod(num, 62)\n        result.append(ALPHABET[rem])\n    return ''.join(reversed(result))\n\ndef decode(code: str) -> int:\n    num = 0\n    for char in code:\n        num = num * 62 + ALPHABET.index(char)\n    return num",
    "test_output": "$ python -m pytest tests/ -v\ntests/test_encoder.py::test_encode_zero PASSED\ntests/test_encoder.py::test_encode_roundtrip PASSED\ntests/test_encoder.py::test_encode_large_number PASSED\ntests/test_routes.py::test_shorten_valid_url PASSED\ntests/test_routes.py::test_shorten_invalid_url PASSED\ntests/test_routes.py::test_redirect PASSED\ntests/test_routes.py::test_stats_endpoint PASSED\ntests/test_routes.py::test_rate_limiting PASSED\ntests/test_models.py::test_create_url PASSED\ntests/test_models.py::test_click_tracking PASSED\n\n10 passed in 0.84s",
    "language": "Python"
  },
  {
    "id": "devbench-002",
    "text": "Build a task queue system in Go. Requirements:\n- In-memory priority queue with persistent WAL (write-ahead log)\n- HTTP API: POST /enqueue, GET /dequeue, GET /status\n- Support task priorities (low, normal, high, critical)\n- Configurable retry with exponential backoff\n- Dead letter queue for failed tasks\n- Graceful shutdown with in-flight task completion\n- Prometheus metrics endpoint",
    "repo_structure": "taskqueue/\n├── README.md\n├── go.mod\n├── go.sum\n├── Makefile\n├── cmd/\n│   └── server/\n│       └── main.go\n├── internal/\n│   ├── queue/\n│   │   ├── priority_queue.go\n│   │   ├── wal.go\n│   │   └── dlq.go\n│   ├── handler/\n│   │   └── api.go\n│   └── metrics/\n│       └── prometheus.go\n├── pkg/\n│   └── models/\n│       └── task.go\n└── tests/\n    ├── queue_test.go\n    ├── wal_test.go\n    └── api_test.go",
    "key_files": "// internal/queue/priority_queue.go\npackage queue\n\nimport (\n    \"container/heap\"\n    \"sync\"\n    \"taskqueue/pkg/models\"\n)\n\ntype PriorityQueue struct {\n    mu    sync.RWMutex\n    items []*models.Task\n    wal   *WAL\n}\n\nfunc (pq *PriorityQueue) Len() int { return len(pq.items) }\nfunc (pq *PriorityQueue) Less(i, j int) bool {\n    return pq.items[i].Priority > pq.items[j].Priority\n}\nfunc (pq *PriorityQueue) Swap(i, j int) {\n    pq.items[i], pq.items[j] = pq.items[j], pq.items[i]\n}\n\nfunc (pq *PriorityQueue) Enqueue(task *models.Task) error {\n    pq.mu.Lock()\n    defer pq.mu.Unlock()\n    if err := pq.wal.Append(task); err != nil {\n        return err\n    }\n    heap.Push(pq, task)\n    return nil\n}\n\n// pkg/models/task.go\npackage models\n\nimport \"time\"\n\ntype Priority int\nconst (\n    Low      Priority = 0\n    Normal   Priority = 1\n    High     Priority = 2\n    Critical Priority = 3\n)\n\ntype Task struct {\n    ID        string    `json:\"id\"`\n    Payload   []byte    `json:\"payload\"`\n    Priority  Priority  `json:\"priority\"`\n    Retries   int       `json:\"retries\"`\n    MaxRetry  int       `json:\"max_retry\"`\n    CreatedAt time.Time `json:\"created_at\"`\n}",
    "test_output": "$ go test ./... -v\n=== RUN   TestEnqueueDequeue\n--- PASS: TestEnqueueDequeue (0.00s)\n=== RUN   TestPriorityOrdering\n--- PASS: TestPriorityOrdering (0.00s)\n=== RUN   TestWALPersistence\n--- PASS: TestWALPersistence (0.02s)\n=== RUN   TestWALRecovery\n--- PASS: TestWALRecovery (0.01s)\n=== RUN   TestDeadLetterQueue\n--- PASS: TestDeadLetterQueue (0.00s)\n=== RUN   TestAPIEnqueue\n--- PASS: TestAPIEnqueue (0.01s)\n=== RUN   TestAPIDequeue\n--- PASS: TestAPIDequeue (0.00s)\n=== RUN   TestRetryBackoff\n--- FAIL: TestRetryBackoff (0.05s)\n    queue_test.go:89: expected backoff 4s, got 2s\n\n7 passed, 1 failed",
    "language": "Go"
  }
]
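The Base62 encoder shown in devbench-001's key files can be exercised on its own; a minimal sketch reproducing the `encode`/`decode` pair from `src/encoder.py`, with a round-trip check of the kind annotators might look for in the test suite:

```python
import string

# 0-9, a-z, A-Z: 62 symbols, matching src/encoder.py in devbench-001
ALPHABET = string.digits + string.ascii_letters

def encode(num: int) -> str:
    """Convert a non-negative integer (e.g. a database row id) to a Base62 code."""
    if num == 0:
        return ALPHABET[0]
    result = []
    while num:
        num, rem = divmod(num, 62)
        result.append(ALPHABET[rem])
    return ''.join(reversed(result))

def decode(code: str) -> int:
    """Invert encode() by accumulating per-character digit values."""
    num = 0
    for char in code:
        num = num * 62 + ALPHABET.index(char)
    return num

# Round-trip: every id maps to a unique short code and back
for n in (0, 61, 62, 123456789):
    assert decode(encode(n)) == n
print(encode(123456789))  # → "8m0Kx"
```

Because each short code is just the Base62 rendering of an auto-incrementing id, no collision handling is needed, which is one reason this encoding is a common choice for URL shorteners.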

// ... and 6 more items
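The failing TestRetryBackoff in devbench-002 ("expected backoff 4s, got 2s") reflects the doubling schedule the spec's "configurable retry with exponential backoff" requirement implies. A minimal Python sketch of that schedule, assuming an illustrative 1s base delay and 60s cap (neither is given in the spec):

```python
def backoff_delay(retries: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2**retries, capped at `cap` seconds.

    With base=1.0 the schedule is 1s, 2s, 4s, 8s, ..., so the second
    retry (retries=2) should wait 4s — consistent with the expectation
    in queue_test.go:89. The base and cap values here are assumptions.
    """
    return min(base * (2 ** retries), cap)

print([backoff_delay(r) for r in range(5)])  # → [1.0, 2.0, 4.0, 8.0, 16.0]
```

A repository that returns 2s where 4s is expected is likely doubling from the wrong starting exponent (2**retries vs. 2**(retries-1)), which is exactly the kind of off-by-one an annotator can cite in the review comments field.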

Get This Design

View on GitHub

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/devbench-repo-eval
potato start config.yaml

Details

Annotation Types

multirate, radio, text

Domain

Software Engineering, Code Generation

Use Cases

Repository Evaluation, Code Review

Tags

devbench, repository-evaluation, code-quality, software-lifecycle, agentic-coding

Found an issue or want to improve this design?

Open an Issue