DevBench Repository Evaluation
Evaluate AI-generated repositories across the full software development lifecycle. Annotators rate architecture design, code quality, test coverage, documentation, and dependency management for generated projects.
Configuration File: config.yaml
# DevBench Repository Evaluation
# Based on "Prompting Large Language Models to Tackle the Full Software Development Lifecycle" (Li et al., arXiv 2024)
# Task: Evaluate AI-generated repositories for architecture, code quality, testing, documentation, and dependencies
annotation_task_name: "DevBench Repository Evaluation"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif; max-width: 1500px; margin: 0 auto;">
    <div style="background: #24292f; color: #ffffff; padding: 10px 16px; border-radius: 6px 6px 0 0; font-size: 14px; font-weight: 600;">
      Project: {{language}} Repository
    </div>
    <div style="border: 1px solid #d0d7de; border-radius: 6px; padding: 16px; margin: 8px 0; background: #f6f8fa;">
      <h3 style="margin-top: 0; color: #24292f;">Project Specification</h3>
      <div style="white-space: pre-wrap; font-size: 14px; line-height: 1.6; color: #1f2328;">{{text}}</div>
    </div>
    <div style="display: flex; gap: 12px; margin-top: 8px;">
      <div style="flex: 0 0 280px; border: 1px solid #d0d7de; border-radius: 6px; overflow: hidden;">
        <div style="background: #2d333b; color: #adbac7; padding: 8px 12px; font-weight: 600; font-size: 13px;">Repository Structure</div>
        <pre style="margin: 0; padding: 12px; background: #22272e; color: #adbac7; font-family: 'SFMono-Regular', Consolas, monospace; font-size: 12px; line-height: 1.6; overflow-x: auto; white-space: pre;">{{repo_structure}}</pre>
      </div>
      <div style="flex: 1; border: 1px solid #d0d7de; border-radius: 6px; overflow: hidden;">
        <div style="background: #2d333b; color: #adbac7; padding: 8px 12px; font-weight: 600; font-size: 13px;">Key Source Files</div>
        <pre style="margin: 0; padding: 12px; background: #22272e; color: #adbac7; font-family: 'SFMono-Regular', Consolas, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{key_files}}</pre>
      </div>
    </div>
    <div style="margin-top: 8px; border: 1px solid #30363d; border-radius: 6px; overflow: hidden;">
      <div style="background: #0d1117; color: #3fb950; padding: 8px 12px; font-weight: 600; font-size: 13px;">Test Output</div>
      <pre style="margin: 0; padding: 12px; background: #161b22; color: #c9d1d9; font-family: 'SFMono-Regular', Consolas, monospace; font-size: 12px; line-height: 1.5; overflow-x: auto; white-space: pre;">{{test_output}}</pre>
    </div>
  </div>
annotation_schemes:
  - name: "repo_criteria"
    description: "Rate the repository on each quality dimension."
    annotation_type: multirate
    labels:
      - "1 - Very Poor"
      - "2 - Poor"
      - "3 - Average"
      - "4 - Good"
      - "5 - Excellent"
    options:
      - "Architecture Design"
      - "Code Quality"
      - "Test Coverage"
      - "Documentation"
      - "Dependency Management"
  - name: "overall_grade"
    description: "Assign an overall letter grade to this repository"
    annotation_type: radio
    labels:
      - "A — Excellent"
      - "B — Good"
      - "C — Average"
      - "D — Below Average"
      - "F — Failing"
    keyboard_shortcuts:
      "A — Excellent": "1"
      "B — Good": "2"
      "C — Average": "3"
      "D — Below Average": "4"
      "F — Failing": "5"
  - name: "review_comments"
    description: "Detailed code review notes referencing specific files"
    annotation_type: text
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
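With `annotation_per_instance: 2`, each repository is rated by two annotators, so the multirate scores need to be aggregated downstream. A minimal sketch of that step, assuming a hypothetical per-annotator output shape with a `repo_criteria` dict keyed by dimension (not necessarily Potato's actual output format):

```python
# Hypothetical aggregation of DevBench multirate annotations.
# The input shape below is an assumption, not Potato's exact output format.

def mean_ratings(annotations):
    """Average the 1-5 scores each annotator gave per quality dimension."""
    totals = {}
    for ann in annotations:  # one dict per annotator
        for dimension, label in ann["repo_criteria"].items():
            score = int(label.split(" - ")[0])  # "4 - Good" -> 4
            totals.setdefault(dimension, []).append(score)
    return {dim: sum(s) / len(s) for dim, s in totals.items()}

annotator_a = {"repo_criteria": {"Architecture Design": "4 - Good",
                                 "Test Coverage": "5 - Excellent"}}
annotator_b = {"repo_criteria": {"Architecture Design": "3 - Average",
                                 "Test Coverage": "5 - Excellent"}}

print(mean_ratings([annotator_a, annotator_b]))
# {'Architecture Design': 3.5, 'Test Coverage': 5.0}
```

Parsing the numeric prefix out of the label string works because every multirate label in the config follows the `"N - Description"` pattern.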
Sample Data: sample-data.json
[
{
"id": "devbench-001",
"text": "Build a URL shortener service in Python. Requirements:\n- REST API with endpoints: POST /shorten (create short URL), GET /{code} (redirect), GET /stats/{code} (click analytics)\n- SQLite database for persistence\n- Base62 encoding for short codes\n- Rate limiting (100 requests/minute per IP)\n- Click tracking with timestamp, referrer, and user-agent\n- Input validation for URLs\n- Docker support",
"repo_structure": "url-shortener/\n├── README.md\n├── requirements.txt\n├── Dockerfile\n├── docker-compose.yml\n├── setup.py\n├── src/\n│ ├── __init__.py\n│ ├── app.py\n│ ├── models.py\n│ ├── routes.py\n│ ├── encoder.py\n│ └── middleware.py\n├── tests/\n│ ├── __init__.py\n│ ├── test_routes.py\n│ ├── test_encoder.py\n│ └── test_models.py\n└── migrations/\n └── 001_initial.sql",
"key_files": "# src/app.py\nfrom flask import Flask\nfrom src.models import db\nfrom src.routes import api_bp\nfrom src.middleware import RateLimiter\n\ndef create_app(config=None):\n app = Flask(__name__)\n app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite:///urls.db'\n if config:\n app.config.update(config)\n db.init_app(app)\n app.register_blueprint(api_bp)\n RateLimiter(app, limit=100, window=60)\n return app\n\n# src/encoder.py\nimport string\n\nALPHABET = string.digits + string.ascii_letters # 62 chars\n\ndef encode(num: int) -> str:\n if num == 0:\n return ALPHABET[0]\n result = []\n while num:\n num, rem = divmod(num, 62)\n result.append(ALPHABET[rem])\n return ''.join(reversed(result))\n\ndef decode(code: str) -> int:\n num = 0\n for char in code:\n num = num * 62 + ALPHABET.index(char)\n return num",
"test_output": "$ python -m pytest tests/ -v\ntests/test_encoder.py::test_encode_zero PASSED\ntests/test_encoder.py::test_encode_roundtrip PASSED\ntests/test_encoder.py::test_encode_large_number PASSED\ntests/test_routes.py::test_shorten_valid_url PASSED\ntests/test_routes.py::test_shorten_invalid_url PASSED\ntests/test_routes.py::test_redirect PASSED\ntests/test_routes.py::test_stats_endpoint PASSED\ntests/test_routes.py::test_rate_limiting PASSED\ntests/test_models.py::test_create_url PASSED\ntests/test_models.py::test_click_tracking PASSED\n\n10 passed in 0.84s",
"language": "Python"
},
{
"id": "devbench-002",
"text": "Build a task queue system in Go. Requirements:\n- In-memory priority queue with persistent WAL (write-ahead log)\n- HTTP API: POST /enqueue, GET /dequeue, GET /status\n- Support task priorities (low, normal, high, critical)\n- Configurable retry with exponential backoff\n- Dead letter queue for failed tasks\n- Graceful shutdown with in-flight task completion\n- Prometheus metrics endpoint",
"repo_structure": "taskqueue/\n├── README.md\n├── go.mod\n├── go.sum\n├── Makefile\n├── cmd/\n│ └── server/\n│ └── main.go\n├── internal/\n│ ├── queue/\n│ │ ├── priority_queue.go\n│ │ ├── wal.go\n│ │ └── dlq.go\n│ ├── handler/\n│ │ └── api.go\n│ └── metrics/\n│ └── prometheus.go\n├── pkg/\n│ └── models/\n│ └── task.go\n└── tests/\n ├── queue_test.go\n ├── wal_test.go\n └── api_test.go",
"key_files": "// internal/queue/priority_queue.go\npackage queue\n\nimport (\n \"container/heap\"\n \"sync\"\n \"taskqueue/pkg/models\"\n)\n\ntype PriorityQueue struct {\n mu sync.RWMutex\n items []*models.Task\n wal *WAL\n}\n\nfunc (pq *PriorityQueue) Len() int { return len(pq.items) }\nfunc (pq *PriorityQueue) Less(i, j int) bool {\n return pq.items[i].Priority > pq.items[j].Priority\n}\nfunc (pq *PriorityQueue) Swap(i, j int) {\n pq.items[i], pq.items[j] = pq.items[j], pq.items[i]\n}\n\nfunc (pq *PriorityQueue) Enqueue(task *models.Task) error {\n pq.mu.Lock()\n defer pq.mu.Unlock()\n if err := pq.wal.Append(task); err != nil {\n return err\n }\n heap.Push(pq, task)\n return nil\n}\n\n// pkg/models/task.go\npackage models\n\nimport \"time\"\n\ntype Priority int\nconst (\n Low Priority = 0\n Normal Priority = 1\n High Priority = 2\n Critical Priority = 3\n)\n\ntype Task struct {\n ID string `json:\"id\"`\n Payload []byte `json:\"payload\"`\n Priority Priority `json:\"priority\"`\n Retries int `json:\"retries\"`\n MaxRetry int `json:\"max_retry\"`\n CreatedAt time.Time `json:\"created_at\"`\n}",
"test_output": "$ go test ./... -v\n=== RUN TestEnqueueDequeue\n--- PASS: TestEnqueueDequeue (0.00s)\n=== RUN TestPriorityOrdering\n--- PASS: TestPriorityOrdering (0.00s)\n=== RUN TestWALPersistence\n--- PASS: TestWALPersistence (0.02s)\n=== RUN TestWALRecovery\n--- PASS: TestWALRecovery (0.01s)\n=== RUN TestDeadLetterQueue\n--- PASS: TestDeadLetterQueue (0.00s)\n=== RUN TestAPIEnqueue\n--- PASS: TestAPIEnqueue (0.01s)\n=== RUN TestAPIDequeue\n--- PASS: TestAPIDequeue (0.00s)\n=== RUN TestRetryBackoff\n--- FAIL: TestRetryBackoff (0.05s)\n queue_test.go:89: expected backoff 4s, got 2s\n\n7 passed, 1 failed",
"language": "Go"
}
]
// ... and 6 more items

Get This Design
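The Base62 encoder shown in devbench-001's key files is self-contained and can be exercised standalone; the sketch below copies that sample code verbatim and adds a small roundtrip check (the driver loop and printed code are additions for illustration):

```python
import string

ALPHABET = string.digits + string.ascii_letters  # 62 chars: 0-9, a-z, A-Z

def encode(num: int) -> str:
    """Convert a non-negative integer (e.g. a database row id) to a short code."""
    if num == 0:
        return ALPHABET[0]
    result = []
    while num:
        num, rem = divmod(num, 62)
        result.append(ALPHABET[rem])
    return ''.join(reversed(result))

def decode(code: str) -> int:
    """Invert encode(): map a Base62 short code back to its integer."""
    num = 0
    for char in code:
        num = num * 62 + ALPHABET.index(char)
    return num

# Roundtrip property the sample's test_encoder.py tests presumably cover.
for n in (0, 61, 62, 123456789):
    assert decode(encode(n)) == n

print(encode(123456789))
# → 8m0Kx
```

Running snippets like this during annotation is one way for reviewers to ground their "Code Quality" and "Test Coverage" ratings in observed behavior rather than appearance alone.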
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/devbench-repo-eval
potato start config.yaml
Related Designs
TrajEval Staged Evaluation
Evaluate code agent trajectories decomposed into search, edit, and verification stages, rating quality of each stage and determining overall pass/fail verdict.
AgentRewardBench Trajectory Scoring
Evaluate web agent trajectories by rating step-level quality across multiple dimensions, judging overall success, and identifying where automatic evaluators disagree with human judgment.
BigCodeBench Human Baseline Evaluation
Evaluate agent-generated code solutions for BigCodeBench tasks. Annotators assess correctness against test suites, rate task complexity, evaluate code quality, and provide notes on the solution approach.