Code Review Annotation
Review AI coding agent output with GitHub PR-style inline diff comments, file-level correctness ratings, and approve or reject verdicts for code quality evaluation.
New in v2.4.0
Evaluating code changes produced by AI coding agents requires more than a binary pass/fail judgment. Researchers and engineering teams need to assess code quality at multiple granularities: individual lines may contain bugs or style violations, entire files may be correctly modified or unnecessary, and the overall change set may solve the problem but introduce technical debt. This is the same workflow that human code reviewers follow when reviewing pull requests on GitHub.
Potato's code review annotation mode brings the GitHub PR review experience to agent evaluation. Annotators see unified diffs for every file the agent modified. They can click any diff line to leave an inline comment with a category tag. Each file gets a correctness and quality rating. The annotator gives a final verdict: approve, request changes, or comment only. All of this is captured in structured annotation data ready for training code quality models.
Inline Comments
Annotators click any line in a diff to open an inline comment form. Each comment has a category, a severity, and free-text content. The comment appears anchored to the specific line, just like GitHub PR review comments.
Comment Categories
The default comment categories cover the most common code review feedback types:
| Category | Description |
|---|---|
| bug | Functional bug -- the code will not work correctly |
| logic | Logic error -- the approach is flawed even if the syntax is valid |
| security | Security vulnerability or unsafe practice |
| performance | Performance issue -- unnecessary computation, memory leak, etc. |
| style | Style violation -- naming, formatting, idiomatic usage |
| suggestion | Alternative approach that would be better |
| question | Clarification needed -- the reviewer is unsure about the intent |
| praise | Positive feedback -- something the agent did well |
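Downstream consumers may want to sanity-check exported comments against the configured category and severity sets. A minimal sketch using the default values above (the `validate_comment` helper is illustrative, not part of Potato):

```python
# Default category and severity values from the configuration above.
VALID_CATEGORIES = {"bug", "logic", "security", "performance",
                    "style", "suggestion", "question", "praise"}
VALID_SEVERITIES = {"critical", "major", "minor", "nit"}

def validate_comment(comment: dict) -> list:
    """Return a list of problems with an inline-comment record (empty if valid)."""
    problems = []
    if comment.get("category") not in VALID_CATEGORIES:
        problems.append("unknown category: %r" % comment.get("category"))
    severity = comment.get("severity")
    if severity is not None and severity not in VALID_SEVERITIES:
        problems.append("unknown severity: %r" % severity)
    if comment.get("line_start", 1) > comment.get("line_end", 1):
        problems.append("line_start exceeds line_end")
    return problems
```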
Configuration
annotation_schemes:
- name: inline_comments
annotation_type: code_review_comments
description: "Click any diff line to add an inline comment"
inline_comments:
# Comment categories
categories:
- value: bug
display: "Bug"
color: "#ef4444"
icon: "bug"
- value: logic
display: "Logic Error"
color: "#f97316"
icon: "alert-triangle"
- value: security
display: "Security"
color: "#dc2626"
icon: "shield-alert"
- value: performance
display: "Performance"
color: "#eab308"
icon: "zap"
- value: style
display: "Style"
color: "#6b7280"
icon: "palette"
- value: suggestion
display: "Suggestion"
color: "#3b82f6"
icon: "lightbulb"
- value: question
display: "Question"
color: "#8b5cf6"
icon: "help-circle"
- value: praise
display: "Praise"
color: "#22c55e"
icon: "thumbs-up"
# Severity levels (optional)
severity:
enabled: true
levels:
- value: critical
display: "Critical"
- value: major
display: "Major"
- value: minor
display: "Minor"
- value: nit
display: "Nit"
# Behavior
require_category: true
require_severity: false
allow_multi_line: true # comments can span a range of lines
allow_suggestions: true # annotator can write suggested replacement code
min_comments: 0 # minimum comments required before submission

Suggested Code Changes
When allow_suggestions is enabled, annotators can write a suggested replacement for the code block they are commenting on. This mirrors GitHub's "suggestion" feature. The suggestion appears in a code block below the comment and can be used to train code repair models.
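When building a code repair dataset from these annotations, each suggestion can be spliced over the commented line range to produce a "fixed" version of the file. A minimal sketch (the `apply_suggestion` helper is illustrative, not part of Potato):

```python
def apply_suggestion(source: str, line_start: int, line_end: int,
                     suggestion: str) -> str:
    """Replace 1-indexed lines [line_start, line_end] of source with the
    suggested replacement code, returning the repaired file text."""
    lines = source.splitlines()
    repaired = (lines[:line_start - 1]
                + suggestion.splitlines()
                + lines[line_end:])
    return "\n".join(repaired)
```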
# In inline comment output:
{
"file": "src/parser.py",
"line_start": 42,
"line_end": 44,
"category": "bug",
"severity": "critical",
"comment": "Off-by-one error: range should be inclusive of end",
"suggestion": "for i in range(start, end + 1):\n process(tokens[i])"
}

File-Level Ratings
Each file modified by the agent receives two independent ratings: correctness and code quality.
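When several annotators review the same trace, the per-file ratings can be averaged per dimension for analysis. A sketch assuming records shaped like the file-rating output format shown later in this section (the helper name is hypothetical):

```python
from collections import defaultdict
from statistics import mean

def average_file_ratings(records):
    """Average each rating dimension per file across annotation records."""
    scores = defaultdict(lambda: defaultdict(list))
    for record in records:
        for path, dims in record["file_ratings"].items():
            for dim, value in dims.items():
                scores[path][dim].append(value)
    return {path: {dim: mean(values) for dim, values in dims.items()}
            for path, dims in scores.items()}
```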
Configuration
annotation_schemes:
- name: file_ratings
annotation_type: code_review_file_ratings
description: "Rate each modified file"
file_ratings:
dimensions:
- name: correctness
display: "Correctness"
description: "Are the changes to this file functionally correct?"
scale:
min: 1
max: 5
labels:
1: "Broken -- introduces bugs or breaks existing functionality"
2: "Mostly broken -- significant functional issues"
3: "Partially correct -- works but has edge cases or minor bugs"
4: "Mostly correct -- minor issues only"
5: "Fully correct -- changes work as intended"
- name: quality
display: "Code Quality"
description: "How well-written are the changes to this file?"
scale:
min: 1
max: 5
labels:
1: "Very poor -- unreadable, no structure"
2: "Poor -- hard to follow, inconsistent style"
3: "Acceptable -- works but could be cleaner"
4: "Good -- clean, idiomatic, well-structured"
5: "Excellent -- exemplary code, would merge as-is"
# Files to rate
include_unchanged: false # only rate files the agent modified
include_new_files: true # include files the agent created
include_deleted_files: true # include files the agent deleted
# Behavior
require_all_files: true # must rate every modified file

Output Format
{
"file_ratings": {
"src/parser.py": {
"correctness": 4,
"quality": 3
},
"tests/test_parser.py": {
"correctness": 5,
"quality": 4
},
"src/utils.py": {
"correctness": 2,
"quality": 2
}
}
}

Overall Verdict
After reviewing all files and leaving inline comments, the annotator gives an overall verdict for the entire change set.
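If verdicts are later used as a training signal, one common choice is to map them to a binary merge label, with comment-only reviews contributing no signal. An illustrative sketch (this mapping is an assumption, not a Potato API):

```python
def verdict_to_label(verdict):
    """Map a verdict record to a binary merge label:
    approve -> 1, request_changes -> 0, comment_only -> None (no signal)."""
    mapping = {"approve": 1, "request_changes": 0, "comment_only": None}
    return mapping[verdict["decision"]]
```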
Configuration
annotation_schemes:
- name: verdict
annotation_type: code_review_verdict
description: "Give an overall verdict on the code changes"
verdict:
options:
- value: approve
display: "Approve"
description: "Changes are correct and ready to merge"
color: "#22c55e"
icon: "check-circle"
- value: request_changes
display: "Request Changes"
description: "Changes need fixes before merging"
color: "#ef4444"
icon: "x-circle"
- value: comment_only
display: "Comment Only"
description: "Leaving feedback without a verdict"
color: "#6b7280"
icon: "message-circle"
# Optional summary text
require_summary: true
summary_placeholder: "Summarize your review..."
summary_min_length: 20

Configuration Reference
Here is a complete configuration for a code review annotation task:
task_name: "Coding Agent Code Review"
task_dir: "."
data_files:
- "data/coding_traces.jsonl"
item_properties:
id_key: id
text_key: task_description
agentic:
enabled: true
trace_converter: claude_code
display_type: coding_trace
coding_trace_display:
diff_style: unified
diff_context_lines: 5
syntax_highlight: true
show_line_numbers: true
terminal_theme: dark
file_tree:
enabled: true
position: left
show_operation_icons: true
click_to_navigate: true
annotation_schemes:
# Inline comments on diff lines
- name: inline_comments
annotation_type: code_review_comments
inline_comments:
categories:
- { value: bug, display: "Bug", color: "#ef4444" }
- { value: logic, display: "Logic Error", color: "#f97316" }
- { value: security, display: "Security", color: "#dc2626" }
- { value: performance, display: "Performance", color: "#eab308" }
- { value: style, display: "Style", color: "#6b7280" }
- { value: suggestion, display: "Suggestion", color: "#3b82f6" }
- { value: question, display: "Question", color: "#8b5cf6" }
- { value: praise, display: "Praise", color: "#22c55e" }
severity:
enabled: true
levels:
- { value: critical, display: "Critical" }
- { value: major, display: "Major" }
- { value: minor, display: "Minor" }
- { value: nit, display: "Nit" }
require_category: true
allow_multi_line: true
allow_suggestions: true
# File-level correctness and quality
- name: file_ratings
annotation_type: code_review_file_ratings
file_ratings:
dimensions:
- name: correctness
display: "Correctness"
scale: { min: 1, max: 5 }
- name: quality
display: "Code Quality"
scale: { min: 1, max: 5 }
require_all_files: true
# Overall verdict
- name: verdict
annotation_type: code_review_verdict
verdict:
options:
- { value: approve, display: "Approve", color: "#22c55e" }
- { value: request_changes, display: "Request Changes", color: "#ef4444" }
- { value: comment_only, display: "Comment Only", color: "#6b7280" }
require_summary: true
summary_min_length: 20
output_annotation_dir: "output/"
output_annotation_format: "jsonl"

The Annotation Workflow
Here is what annotators see and do when completing a code review annotation task:
- Task overview: The task description appears at the top, showing what the agent was asked to do (e.g., "Fix the failing test in test_parser.py").
- File tree navigation: The left sidebar shows all files the agent touched. Files are color-coded: green for new files, yellow for modified files, red for deleted files.
- Diff review: The main panel shows unified diffs for each file. Annotators scroll through the diffs, reading each change.
- Adding inline comments: Clicking a line number opens a comment form. The annotator selects a category (bug, suggestion, etc.), optionally selects a severity, writes their comment, and optionally adds a code suggestion.
- File ratings: After reviewing each file's diff, the annotator rates it on correctness (1-5) and code quality (1-5) using the rating widgets below each file's diff.
- Overall verdict: At the bottom, the annotator selects a verdict (approve, request changes, or comment only) and writes a summary of their review.
- Submission: The annotator clicks "Submit" to save all inline comments, file ratings, and the verdict as a single annotation record.
Data Format
The complete output for a single code review annotation:
{
"id": "trace_042",
"annotator": "reviewer_01",
"timestamp": "2025-01-15T14:30:00Z",
"annotations": {
"inline_comments": [
{
"file": "src/parser.py",
"line_start": 42,
"line_end": 42,
"category": "bug",
"severity": "critical",
"comment": "This will throw IndexError when tokens list is empty",
"suggestion": "if tokens:\n return tokens[0]\nreturn None"
},
{
"file": "src/parser.py",
"line_start": 15,
"line_end": 15,
"category": "style",
"severity": "nit",
"comment": "Variable name 'x' is not descriptive"
},
{
"file": "tests/test_parser.py",
"line_start": 28,
"line_end": 30,
"category": "praise",
"comment": "Good edge case coverage for empty input"
}
],
"file_ratings": {
"src/parser.py": { "correctness": 3, "quality": 2 },
"tests/test_parser.py": { "correctness": 5, "quality": 4 }
},
"verdict": {
"decision": "request_changes",
"summary": "The core fix is on the right track but has an edge case bug with empty input. The test coverage is good. Fix the IndexError and clean up variable naming."
}
}
}

Export
Code review annotations can be exported in several formats:
# Export as structured code review JSON
python -m potato.export \
-i output/ \
-f code_review \
-o results/reviews.jsonl
# Export inline comments only (for training code comment models)
python -m potato.export \
-i output/ \
-f code_review_comments \
-o results/comments.jsonl
# Export file ratings as a CSV (for analysis)
python -m potato.export \
-i output/ \
-f code_review_file_ratings \
-o results/file_ratings.csv
# Export verdict distribution summary
python -m potato.export \
-i output/ \
-f code_review_verdicts \
-o results/verdicts.json

The code_review_comments format is particularly useful for training models that generate code review comments or that predict the location and category of code issues.
See Also
- Coding Agent Annotation -- display coding agent traces with diff rendering and file trees
- Process Reward Annotation -- per-step reward signals for PRM training
- Live Coding Agent Observation -- observe and interact with coding agents in real time
- Agentic Annotation -- general-purpose agent trace annotation
- Export Formats -- all supported export formats
For implementation details, see the source documentation.