코딩 에이전트 어노테이션이 중요한 이유

Claude Code, Aider, SWE-Agent 같은 코딩 에이전트는 빠르게 좋아졌고, 이제 사람들은 실제로 그 작업물을 평가해야 합니다. 한 번의 실행은 지저분한 궤적입니다. 코드 편집, 터미널 명령, 파일 읽기, 추론 단계가 줄줄이 엮여 있습니다. 더 나은 에이전트를 훈련하려면 그러한 실행에 대한 사람의 피드백이 필요한데, 대부분의 팀이 가진 어노테이션 도구는 이런 종류의 데이터를 위해 만들어진 적이 없습니다.

일반 텍스트 어노테이션 인터페이스는 통합 diff를 렌더링하거나, 터미널 출력을 포맷하거나, 에이전트 트레이스의 중첩된 구조를 다룰 수 없습니다. 그래서 연구소들은 결국 각자의 평가 UI를 작성하게 되고, 같은 작업을 반복하며 서로 호환되지 않는 데이터셋을 만들게 됩니다.

이제 Potato는 코딩 에이전트 어노테이션을 직접 처리합니다. 트레이스를 위해 만들어진 렌더링 컴포넌트, 이런 종류의 평가를 위한 어노테이션 스키마, 그리고 곧장 훈련에 투입되는 내보내기를 제공합니다. 전체 기능 레퍼런스는 코딩 에이전트 어노테이션 문서와 더 폭넓은 에이전트 평가 가이드를 참고하세요.

CodingTraceDisplay: 트레이스 뷰어

대부분의 어노테이션 경험은 CodingTraceDisplay 컴포넌트를 통해 이루어집니다. 이 컴포넌트는 에이전트 궤적의 각 단계를 해당 단계 유형에 맞는 시각화로 렌더링합니다.

Potato에서 코딩 에이전트 어노테이션 인터페이스가 어떻게 보이는지 살펴보겠습니다:

diff 렌더링과 파일 트리를 보여주는 코딩 에이전트 트레이스 표시 CodingTraceDisplay는 코드 diff, 터미널 출력, 파일 읽기를 적절한 포맷으로 렌더링합니다

통합 Diff 보기

코드 편집은 제거된 줄과 추가된 줄에 대해 빨강/초록 강조가 들어간 통합 diff로 렌더링됩니다. diff 보기에는 줄 번호, 파일 경로 헤더, 변경 사항 주위의 컨텍스트 줄이 포함됩니다. 이는 대부분의 개발자가 이미 익숙한 GitHub 풀 리퀘스트 경험을 그대로 반영합니다.

yaml

# The diff rendering is automatic when your trace data includes tool_use
# steps with file edit operations. No special config is needed.
coding_agent:
  display:
    diff_style: "unified"         # "unified" or "split" side-by-side
    context_lines: 3              # Lines of context around changes
    syntax_highlighting: true     # Language-aware highlighting
    collapse_large_diffs: true    # Auto-collapse diffs > 100 lines
    large_diff_threshold: 100

어두운 터미널 블록

Bash 명령과 그 출력은 고정폭 글꼴, 적절한 ANSI 색상 지원, 긴 결과에 대한 스크롤 가능한 출력을 갖춘 어두운 터미널 블록으로 렌더링됩니다. 터미널 블록은 실행된 명령, 작업 디렉터리, 종료 코드를 보여줍니다.

yaml

coding_agent:
  display:
    terminal_theme: "dark"        # "dark" or "light"
    max_terminal_height: 400      # pixels, scrollable beyond this
    show_exit_codes: true
    show_working_directory: true
    ansi_colors: true             # Render ANSI escape sequences

줄 번호가 매겨진 코드 블록

파일 읽기 작업은 줄 번호가 매겨진 구문 강조 코드 블록으로 표시됩니다. 에이전트가 특정 줄 범위를 읽으면, 해당 줄만 원래 줄 번호를 그대로 유지한 채 표시되어 실제 파일과 상호 참조하기가 쉽습니다.

파일 트리 사이드바

접을 수 있는 사이드바는 궤적 동안 건드린 모든 파일을 트리 구조로 정리하여 보여줍니다. 각 파일에는 생성, 수정, 읽기, 삭제 중 어떤 작업이 이루어졌는지 나타내는 아이콘이 표시됩니다. 트리에서 파일을 클릭하면 트레이스에서 그 파일이 처음 등장하는 위치로 스크롤됩니다.

yaml

coding_agent:
  display:
    file_tree:
      enabled: true
      position: "left"            # "left" or "right"
      show_change_icons: true     # Icons for created/modified/deleted
      group_by: "directory"       # "directory" or "chronological"

접을 수 있는 출력

어떤 단계 유형이든 긴 출력은 트레이스를 읽기 쉽게 유지하기 위해 접을 수 있습니다. 어노테이터는 필요에 따라 개별 단계를 펼치거나, "Expand All" / "Collapse All" 컨트롤을 사용할 수 있습니다. 에이전트의 사고/추론 블록은 기본적으로 접혀 있지만 검토할 수 있도록 제공됩니다.

yaml

coding_agent:
  display:
    collapsible:
      auto_collapse_thinking: true
      auto_collapse_long_output: true
      long_output_threshold: 50   # lines
      default_expanded_types:     # These step types start expanded
        - "file_edit"
        - "bash_command"

프로세스 보상 모델(PRM) 스키마

프로세스 보상 모델은 최종 결과만 평가하는 대신 단계 수준에서 점수를 부여합니다. Potato는 속도와 정확도 사이의 서로 다른 절충을 위해 설계된 두 가지 PRM 어노테이션 모드를 지원합니다.

첫 오류 모드

첫 오류 모드에서는 어노테이터가 궤적을 스크롤하며 에이전트가 처음으로 잘못된 단계를 클릭합니다. 클릭한 단계 이전의 모든 단계는 자동으로 정답으로 표시되고, 그 이후의 모든 단계(클릭한 단계 포함)는 자동으로 오답으로 표시됩니다. 어노테이터가 단 하나의 지점만 식별하면 되기 때문에 어노테이션 속도가 크게 빨라집니다.

yaml

annotation_schemes:
  - annotation_type: process_reward
    name: prm_first_error
    mode: "first_error"
    labels:
      correct: "Correct"
      incorrect: "Incorrect"
    description: "Click the first step where the agent makes an error"
    allow_all_correct: true       # Button to mark entire trace as correct
    allow_all_incorrect: true     # Button to mark entire trace as wrong from step 1
    highlight_clicked_step: true
    auto_scroll_on_click: true

단계별 모드

단계별 모드에서는 모든 단계가 독립적인 평가를 받습니다. 이는 더 상세한 훈련 데이터를 만들어 내지만 트레이스당 시간이 더 걸립니다. 어노테이터는 각 단계를 정답, 오답, 또는 부분 정답으로 평가합니다.

yaml

annotation_schemes:
  - annotation_type: process_reward
    name: prm_per_step
    mode: "per_step"
    labels:
      correct:
        text: "Correct"
        description: "This step is logically sound and makes progress"
        keyboard_shortcut: "1"
      partially_correct:
        text: "Partially Correct"
        description: "Right direction but flawed execution"
        keyboard_shortcut: "2"
      incorrect:
        text: "Incorrect"
        description: "This step is wrong or counterproductive"
        keyboard_shortcut: "3"
    require_all_steps: true       # Cannot submit until all steps rated
    show_progress_bar: true

코드 리뷰 스키마

코드 리뷰 인터페이스는 GitHub PR 스타일의 어노테이션 컨트롤을 제공합니다:

diff 줄에 인라인 코멘트가 달린 코드 리뷰 어노테이션 어노테이터는 diff 줄을 클릭해 인라인 코멘트를 추가하고, 파일을 평가하고, 승인/거부 판정을 내릴 수 있습니다

코드 리뷰 스키마는 GitHub PR 스타일의 어노테이션을 에이전트 트레이스로 가져옵니다. 어노테이터는 diff 안의 특정 줄에 인라인 코멘트를 남기고, 개별 파일을 평가하고, 전체 판정을 제공할 수 있습니다.

yaml

annotation_schemes:
  - annotation_type: code_review
    name: agent_review
    inline_comments:
      enabled: true
      categories:                 # Optional categorization for comments
        - "Bug"
        - "Style"
        - "Logic Error"
        - "Unnecessary Change"
        - "Missing Error Handling"
    file_ratings:
      enabled: true
      scale: [1, 2, 3, 4, 5]
      labels: ["Poor", "Below Average", "Acceptable", "Good", "Excellent"]
    verdict:
      enabled: true
      options:
        - value: "approve"
          text: "Approve"
          description: "Changes are correct and complete"
        - value: "request_changes"
          text: "Request Changes"
          description: "Changes need fixes before merging"
        - value: "comment"
          text: "Comment"
          description: "General feedback, no strong opinion"
    require_comment_on_reject: true

트레이스 변환기: 어떤 에이전트에서든 가져오기

Potato에는 가장 인기 있는 세 가지 코딩 에이전트 형식을 위한 내장 변환기가 포함되어 있습니다. 변환기는 각 형식을 Potato의 내부 구조화된 트레이스 표현으로 정규화합니다.

Claude Code (Anthropic 메시지 API)

Claude Code 트레이스는 tool_use와 tool_result 콘텐츠 블록이 있는 Anthropic 메시지 API 형식을 사용합니다. 변환기는 도구 호출에서 파일 편집, bash 명령, 파일 읽기를 추출하고 어시스턴트의 추론 텍스트를 보존합니다.

bash

# Convert Claude Code traces to Potato format
potato convert-traces \
  --format claude_code \
  --input ./claude_traces/ \
  --output ./potato_data/traces.jsonl

Aider (편집 블록이 있는 Markdown 채팅)

Aider는 SEARCH/REPLACE 편집 블록이 있는 markdown 형식의 채팅 로그를 생성합니다. 변환기는 이 블록을 파싱하여 파일 편집을 재구성하고 펜스로 둘러싼 코드 블록에서 셸 명령을 추출합니다.

bash

# Convert Aider chat logs
potato convert-traces \
  --format aider \
  --input ./aider_logs/ \
  --output ./potato_data/traces.jsonl

SWE-Agent (사고/행동/관찰)

SWE-Agent는 사고/행동/관찰 루프 형식을 사용합니다. 변환기는 행동을 적절한 단계 유형(편집, bash, 읽기)에 매핑하고 에이전트의 사고 연쇄 추론을 접을 수 있는 사고 블록으로 보존합니다.

bash

# Convert SWE-Agent trajectories
potato convert-traces \
  --format swe_agent \
  --input ./swe_agent_trajectories/ \
  --output ./potato_data/traces.jsonl

자동 감지

여러 에이전트의 트레이스가 있다면, Potato는 각 파일의 구조를 기반으로 형식을 자동 감지할 수 있습니다:

bash

# Auto-detect format for mixed trace directories
potato convert-traces \
  --format auto \
  --input ./mixed_traces/ \
  --output ./potato_data/traces.jsonl

훈련 파이프라인 내보내기

어노테이션된 트레이스는 모델 훈련에 바로 쓸 수 있는 형식으로 내보낼 수 있습니다.

PRM 형식

프로세스 보상 모델 훈련을 위한 단계 수준 보상 레이블:

python

# Exported PRM format (one line per trace)
{
  "trace_id": "trace_001",
  "steps": [
    {"step_idx": 0, "content": "Read file src/main.py", "label": "correct"},
    {"step_idx": 1, "content": "Edit src/main.py: fix import", "label": "correct"},
    {"step_idx": 2, "content": "Run tests", "label": "correct"},
    {"step_idx": 3, "content": "Edit src/utils.py: wrong fix", "label": "incorrect"},
    {"step_idx": 4, "content": "Run tests again", "label": "incorrect"}
  ],
  "first_error_step": 3
}

DPO/RLHF 선호 쌍

쌍별 비교 어노테이션과 결합하면, Potato는 직접 선호 최적화(DPO)나 RLHF 훈련에 적합한 선호 쌍을 생성합니다:

python

# Exported preference pair format
{
  "prompt": "Fix the failing test in src/test_utils.py",
  "chosen": {"trace_id": "trace_001", "steps": [...]},
  "rejected": {"trace_id": "trace_002", "steps": [...]},
  "preference_strength": 0.85
}

SWE-bench 호환 결과

게시된 벤치마크와 직접 비교할 수 있도록 SWE-bench 평가 harness와 호환되는 형식으로 어노테이션을 내보내세요:

bash

# Export to SWE-bench format
potato export \
  --format swe_bench \
  --project ./my_project/ \
  --output ./swe_bench_results.json

빠른 시작

처음부터 실행 중인 어노테이션 서버까지 가는 데 약 5분이 걸립니다.

설치

bash

pip install potato-annotation[coding-agents]

트레이스 변환

bash

# Convert traces from your coding agent
potato convert-traces \
  --format auto \
  --input ./my_agent_traces/ \
  --output ./data/traces.jsonl

구성 만들기

다음은 PRM과 코드 리뷰 스키마를 모두 사용하는 코딩 에이전트 평가 프로젝트를 위한 완전한 구성입니다:

yaml

# config.yaml
project_name: "Coding Agent Evaluation"
port: 8000
 
data:
  source: "local"
  input_path: "./data/traces.jsonl"
  data_format: "coding_trace"
 
coding_agent:
  display:
    diff_style: "unified"
    context_lines: 3
    syntax_highlighting: true
    collapse_large_diffs: true
    terminal_theme: "dark"
    max_terminal_height: 400
    show_exit_codes: true
    file_tree:
      enabled: true
      position: "left"
      show_change_icons: true
    collapsible:
      auto_collapse_thinking: true
      auto_collapse_long_output: true
 
annotation_schemes:
  - annotation_type: process_reward
    name: prm_evaluation
    mode: "first_error"
    labels:
      correct: "Correct"
      incorrect: "Incorrect"
    allow_all_correct: true
    description: "Click the first step where the agent makes a mistake"
 
  - annotation_type: code_review
    name: code_quality
    inline_comments:
      enabled: true
      categories: ["Bug", "Logic Error", "Style", "Missing Error Handling"]
    file_ratings:
      enabled: true
      scale: [1, 2, 3, 4, 5]
    verdict:
      enabled: true
      options:
        - value: "approve"
          text: "Approve"
        - value: "request_changes"
          text: "Request Changes"
        - value: "comment"
          text: "Comment"
 
  - annotation_type: text_input
    name: overall_notes
    label: "Additional Notes"
    placeholder: "Any other observations about this trace..."
    required: false
 
output:
  path: "./output/"
  format: "jsonl"
  export_formats:
    - "prm"
    - "swe_bench"
 
quality_control:
  inter_annotator_agreement: true
  overlap_percentage: 20
  minimum_time_per_instance: 30  # seconds
 
annotators:
  - username: "annotator1"
    password: "secure_password_1"
  - username: "annotator2"
    password: "secure_password_2"

서버 실행

bash

potato start config.yaml -p 8000

브라우저에서 http://localhost:8000을 열고, 로그인한 뒤 어노테이션을 시작하세요. 위에서 설명한 전체 diff 렌더링, 터미널 출력, 프로세스 보상 어노테이션을 모두 사용할 수 있습니다.

다음 단계

이것은 첫 번째 릴리스이며, 하고 싶은 일이 더 많습니다. 목록에는 더 많은 에이전트 형식 지원, 다중 파일 리팩터링을 위한 더 나은 시각화, OpenRLHF와 TRL 같은 훈련 프레임워크와의 더 긴밀한 통합이 있습니다.

새로운 트레이스 변환기, 스키마, 또는 내보내기 형식을 작성하신다면 기여를 환영합니다. 그리고 팀에서 코딩 에이전트를 평가하다가 이 구성으로 다루지 못하는 부분을 만나면, 저희 GitHub 저장소에 이슈를 열어 주세요.