
Code Review Annotation

Review AI coding agent output with GitHub PR-style inline diff comments, file-level correctness ratings, and approve or reject verdicts for code quality evaluation.


New in v2.4.0

Evaluating code changes produced by AI coding agents requires more than a binary pass/fail judgment. Researchers and engineering teams need to assess code quality at multiple granularities: individual lines may contain bugs or style violations, entire files may be correctly modified or unnecessary, and the overall change set may solve the problem but introduce technical debt. This is the same workflow that human code reviewers follow when reviewing pull requests on GitHub.

Potato's code review annotation mode brings the GitHub PR review experience to agent evaluation. Annotators see unified diffs for every file the agent modified. They can click any diff line to leave an inline comment with a category tag. Each file gets a correctness and quality rating. The annotator gives a final verdict: approve, request changes, or comment only. All of this is captured in structured annotation data ready for training code quality models.

Inline Comments

Annotators click any line in a diff to open an inline comment form. Each comment has a category, a severity, and free-text content. The comment appears anchored to the specific line, just like GitHub PR review comments.

Comment Categories

The default comment categories cover the most common code review feedback types:

Category      Description
bug           Functional bug -- the code will not work correctly
logic         Logic error -- the approach is flawed even if the syntax is valid
security      Security vulnerability or unsafe practice
performance   Performance issue -- unnecessary computation, memory leak, etc.
style         Style violation -- naming, formatting, idiomatic usage
suggestion    Alternative approach that would be better
question      Clarification needed -- the reviewer is unsure about the intent
praise        Positive feedback -- something the agent did well

Configuration

yaml
annotation_schemes:
  - name: inline_comments
    annotation_type: code_review_comments
    description: "Click any diff line to add an inline comment"
 
    inline_comments:
      # Comment categories
      categories:
        - value: bug
          display: "Bug"
          color: "#ef4444"
          icon: "bug"
        - value: logic
          display: "Logic Error"
          color: "#f97316"
          icon: "alert-triangle"
        - value: security
          display: "Security"
          color: "#dc2626"
          icon: "shield-alert"
        - value: performance
          display: "Performance"
          color: "#eab308"
          icon: "zap"
        - value: style
          display: "Style"
          color: "#6b7280"
          icon: "palette"
        - value: suggestion
          display: "Suggestion"
          color: "#3b82f6"
          icon: "lightbulb"
        - value: question
          display: "Question"
          color: "#8b5cf6"
          icon: "help-circle"
        - value: praise
          display: "Praise"
          color: "#22c55e"
          icon: "thumbs-up"
 
      # Severity levels (optional)
      severity:
        enabled: true
        levels:
          - value: critical
            display: "Critical"
          - value: major
            display: "Major"
          - value: minor
            display: "Minor"
          - value: nit
            display: "Nit"
 
      # Behavior
      require_category: true
      require_severity: false
      allow_multi_line: true       # comments can span a range of lines
      allow_suggestions: true      # annotator can write suggested replacement code
      min_comments: 0              # minimum comments required before submission

Suggested Code Changes

When allow_suggestions is enabled, annotators can write a suggested replacement for the code block they are commenting on. This mirrors GitHub's "suggestion" feature. The suggestion appears in a code block below the comment and can be used to train code repair models.

In the inline comment output, the suggestion is stored alongside the comment:

json
{
  "file": "src/parser.py",
  "line_start": 42,
  "line_end": 44,
  "category": "bug",
  "severity": "critical",
  "comment": "Off-by-one error: range should be inclusive of end",
  "suggestion": "for i in range(start, end + 1):\n    process(tokens[i])"
}
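Because line_start and line_end are 1-indexed and inclusive, consumers of this output can splice a suggestion back into the original source, for example to build code-repair training pairs. A minimal sketch (the helper name is illustrative, not part of Potato):

```python
def apply_suggestion(source_lines, comment):
    """Replace the commented line range with the suggested code.

    `comment` is one inline-comment record as shown above; its line
    numbers are 1-indexed and the range is inclusive.
    """
    start = comment["line_start"] - 1       # convert to 0-indexed
    end = comment["line_end"]               # inclusive end -> exclusive slice bound
    replacement = comment["suggestion"].split("\n")
    return source_lines[:start] + replacement + source_lines[end:]
```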

File-Level Ratings

Each file modified by the agent receives two independent ratings: correctness and code quality.

Configuration

yaml
annotation_schemes:
  - name: file_ratings
    annotation_type: code_review_file_ratings
    description: "Rate each modified file"
 
    file_ratings:
      dimensions:
        - name: correctness
          display: "Correctness"
          description: "Are the changes to this file functionally correct?"
          scale:
            min: 1
            max: 5
            labels:
              1: "Broken -- introduces bugs or breaks existing functionality"
              2: "Mostly broken -- significant functional issues"
              3: "Partially correct -- works but has edge cases or minor bugs"
              4: "Mostly correct -- minor issues only"
              5: "Fully correct -- changes work as intended"
 
        - name: quality
          display: "Code Quality"
          description: "How well-written are the changes to this file?"
          scale:
            min: 1
            max: 5
            labels:
              1: "Very poor -- unreadable, no structure"
              2: "Poor -- hard to follow, inconsistent style"
              3: "Acceptable -- works but could be cleaner"
              4: "Good -- clean, idiomatic, well-structured"
              5: "Excellent -- exemplary code, would merge as-is"
 
      # Files to rate
      include_unchanged: false     # only rate files the agent modified
      include_new_files: true      # include files the agent created
      include_deleted_files: true  # include files the agent deleted
 
      # Behavior
      require_all_files: true      # must rate every modified file

Output Format

json
{
  "file_ratings": {
    "src/parser.py": {
      "correctness": 4,
      "quality": 3
    },
    "tests/test_parser.py": {
      "correctness": 5,
      "quality": 4
    },
    "src/utils.py": {
      "correctness": 2,
      "quality": 2
    }
  }
}
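This per-file structure is convenient for downstream analysis. For instance, a sketch that averages a rating dimension and flags low-scoring files, using the example record above (the helper names are illustrative):

```python
ratings = {
    "src/parser.py": {"correctness": 4, "quality": 3},
    "tests/test_parser.py": {"correctness": 5, "quality": 4},
    "src/utils.py": {"correctness": 2, "quality": 2},
}

def mean_dimension(file_ratings, dimension):
    """Average one rating dimension across all rated files."""
    return sum(r[dimension] for r in file_ratings.values()) / len(file_ratings)

def files_below(file_ratings, dimension, threshold):
    """Files whose rating on `dimension` is at or below `threshold`."""
    return sorted(f for f, r in file_ratings.items() if r[dimension] <= threshold)
```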

Overall Verdict

After reviewing all files and leaving inline comments, the annotator gives an overall verdict for the entire change set.

Configuration

yaml
annotation_schemes:
  - name: verdict
    annotation_type: code_review_verdict
    description: "Give an overall verdict on the code changes"
 
    verdict:
      options:
        - value: approve
          display: "Approve"
          description: "Changes are correct and ready to merge"
          color: "#22c55e"
          icon: "check-circle"
        - value: request_changes
          display: "Request Changes"
          description: "Changes need fixes before merging"
          color: "#ef4444"
          icon: "x-circle"
        - value: comment_only
          display: "Comment Only"
          description: "Leaving feedback without a verdict"
          color: "#6b7280"
          icon: "message-circle"
 
      # Optional summary text
      require_summary: true
      summary_placeholder: "Summarize your review..."
      summary_min_length: 20

Configuration Reference

Here is a complete configuration for a code review annotation task:

yaml
task_name: "Coding Agent Code Review"
task_dir: "."
 
data_files:
  - "data/coding_traces.jsonl"
 
item_properties:
  id_key: id
  text_key: task_description
 
agentic:
  enabled: true
  trace_converter: claude_code
  display_type: coding_trace
 
  coding_trace_display:
    diff_style: unified
    diff_context_lines: 5
    syntax_highlight: true
    show_line_numbers: true
    terminal_theme: dark
    file_tree:
      enabled: true
      position: left
      show_operation_icons: true
      click_to_navigate: true
 
annotation_schemes:
  # Inline comments on diff lines
  - name: inline_comments
    annotation_type: code_review_comments
    inline_comments:
      categories:
        - { value: bug, display: "Bug", color: "#ef4444" }
        - { value: logic, display: "Logic Error", color: "#f97316" }
        - { value: security, display: "Security", color: "#dc2626" }
        - { value: performance, display: "Performance", color: "#eab308" }
        - { value: style, display: "Style", color: "#6b7280" }
        - { value: suggestion, display: "Suggestion", color: "#3b82f6" }
        - { value: question, display: "Question", color: "#8b5cf6" }
        - { value: praise, display: "Praise", color: "#22c55e" }
      severity:
        enabled: true
        levels:
          - { value: critical, display: "Critical" }
          - { value: major, display: "Major" }
          - { value: minor, display: "Minor" }
          - { value: nit, display: "Nit" }
      require_category: true
      allow_multi_line: true
      allow_suggestions: true
 
  # File-level correctness and quality
  - name: file_ratings
    annotation_type: code_review_file_ratings
    file_ratings:
      dimensions:
        - name: correctness
          display: "Correctness"
          scale: { min: 1, max: 5 }
        - name: quality
          display: "Code Quality"
          scale: { min: 1, max: 5 }
      require_all_files: true
 
  # Overall verdict
  - name: verdict
    annotation_type: code_review_verdict
    verdict:
      options:
        - { value: approve, display: "Approve", color: "#22c55e" }
        - { value: request_changes, display: "Request Changes", color: "#ef4444" }
        - { value: comment_only, display: "Comment Only", color: "#6b7280" }
      require_summary: true
      summary_min_length: 20
 
output_annotation_dir: "output/"
output_annotation_format: "jsonl"

The Annotation Workflow

Here is what annotators see and do when completing a code review annotation task:

  1. Task overview: The task description appears at the top, showing what the agent was asked to do (e.g., "Fix the failing test in test_parser.py").

  2. File tree navigation: The left sidebar shows all files the agent touched. Files are color-coded: green for new files, yellow for modified files, red for deleted files.

  3. Diff review: The main panel shows unified diffs for each file. Annotators scroll through the diffs, reading each change.

  4. Adding inline comments: Clicking a line number opens a comment form. The annotator selects a category (bug, suggestion, etc.), optionally selects a severity, writes their comment, and optionally adds a code suggestion.

  5. File ratings: After reviewing each file's diff, the annotator rates it on correctness (1-5) and code quality (1-5) using the rating widgets below each file's diff.

  6. Overall verdict: At the bottom, the annotator selects a verdict (approve, request changes, or comment only) and writes a summary of their review.

  7. Submission: The annotator clicks "Submit" to save all inline comments, file ratings, and the verdict as a single annotation record.
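The submission step enforces the scheme requirements from the configurations above (require_category, require_all_files, require_summary with summary_min_length). A hedged sketch of those checks, not Potato's actual implementation:

```python
def validate_review(record, modified_files, summary_min_length=20):
    """Collect submission-blocking errors for one review record.

    Mirrors the config options shown earlier; the function name and
    error strings are illustrative.
    """
    errors = []
    ann = record["annotations"]
    # require_category: every inline comment must carry a category
    for c in ann.get("inline_comments", []):
        if not c.get("category"):
            errors.append(f"comment on {c['file']}:{c['line_start']} lacks a category")
    # require_all_files: every modified file must be rated
    rated = set(ann.get("file_ratings", {}))
    missing = set(modified_files) - rated
    errors.extend(f"missing rating for {f}" for f in sorted(missing))
    # require_summary + summary_min_length: the verdict summary must be long enough
    summary = ann.get("verdict", {}).get("summary", "")
    if len(summary) < summary_min_length:
        errors.append("summary too short")
    return errors
```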

Data Format

The complete output for a single code review annotation:

json
{
  "id": "trace_042",
  "annotator": "reviewer_01",
  "timestamp": "2025-01-15T14:30:00Z",
  "annotations": {
    "inline_comments": [
      {
        "file": "src/parser.py",
        "line_start": 42,
        "line_end": 42,
        "category": "bug",
        "severity": "critical",
        "comment": "This will throw IndexError when tokens list is empty",
        "suggestion": "if tokens:\n    return tokens[0]\nreturn None"
      },
      {
        "file": "src/parser.py",
        "line_start": 15,
        "line_end": 15,
        "category": "style",
        "severity": "nit",
        "comment": "Variable name 'x' is not descriptive"
      },
      {
        "file": "tests/test_parser.py",
        "line_start": 28,
        "line_end": 30,
        "category": "praise",
        "comment": "Good edge case coverage for empty input"
      }
    ],
    "file_ratings": {
      "src/parser.py": { "correctness": 3, "quality": 2 },
      "tests/test_parser.py": { "correctness": 5, "quality": 4 }
    },
    "verdict": {
      "decision": "request_changes",
      "summary": "The core fix is on the right track but has an edge case bug with empty input. The test coverage is good. Fix the IndexError and clean up variable naming."
    }
  }
}
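A record in this shape is straightforward to aggregate. For example, tallying inline-comment fields such as category or severity (a sketch over records like the one above; the helper name is illustrative):

```python
from collections import Counter

def comment_counts(record, field="category"):
    """Tally one field (category, severity, file, ...) over a record's inline comments."""
    comments = record["annotations"]["inline_comments"]
    # .get() keeps comments that omit the field (e.g. severity on praise comments)
    return Counter(c.get(field) for c in comments)
```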

Export

Code review annotations can be exported in several formats:

bash
# Export as structured code review JSON
python -m potato.export \
  -i output/ \
  -f code_review \
  -o results/reviews.jsonl
 
# Export inline comments only (for training code comment models)
python -m potato.export \
  -i output/ \
  -f code_review_comments \
  -o results/comments.jsonl
 
# Export file ratings as a CSV (for analysis)
python -m potato.export \
  -i output/ \
  -f code_review_file_ratings \
  -o results/file_ratings.csv
 
# Export verdict distribution summary
python -m potato.export \
  -i output/ \
  -f code_review_verdicts \
  -o results/verdicts.json

The code_review_comments format is particularly useful for training models that generate code review comments or that predict the location and category of code issues.
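The exported JSONL can be consumed with a few lines of Python. A sketch of a line-by-line reader and a verdict tally, assuming records shaped like the Data Format section above (none of these helpers are part of Potato's API):

```python
import json
from collections import Counter

def load_jsonl(lines):
    """Parse one JSON record per line from any iterable of lines
    (e.g. an open file handle), skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

def verdict_distribution(records):
    """Count verdict decisions across exported review records."""
    return Counter(r["annotations"]["verdict"]["decision"] for r in records)
```

Typical usage would be `with open("results/reviews.jsonl") as f: records = load_jsonl(f)` followed by `verdict_distribution(records)`.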

See Also

For implementation details, see the source documentation.