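Live Agent Evaluation

Watch AI agents work in real time and annotate their behavior as they run, using pause, instruct, and takeover controls. Web and coding agents are supported through Anthropic, Ollama, and the Claude SDK.

New in v2.4.0

Live agent evaluation lets annotators watch an AI agent browse the web in real time and annotate its behavior while it runs, rather than after the fact. The agent captures a screenshot, sends it to a vision LLM, receives an action, and executes that action in a headless browser. Every step is streamed live to the annotator's screen.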
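
The capture, decide, act cycle described above can be sketched as a plain loop. This is a minimal illustration and not Potato's implementation: `run_agent`, `Step`, the `capture`, `decide`, `execute`, and `on_step` callables, and the `"done"` action name are all hypothetical stand-ins for the headless browser, the vision LLM, and the streaming layer.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    index: int
    screenshot: bytes
    thought: str
    action: str


def run_agent(
    capture: Callable[[], bytes],
    decide: Callable[[bytes, List[Step]], Tuple[str, str]],
    execute: Callable[[str], None],
    on_step: Callable[[Step], None],
    max_steps: int = 30,
) -> List[Step]:
    """Hypothetical loop: screenshot -> LLM -> action -> browser, once per step."""
    steps: List[Step] = []
    for index in range(max_steps):
        screenshot = capture()                       # 1. capture the current page
        thought, action = decide(screenshot, steps)  # 2. ask the vision LLM
        step = Step(index, screenshot, thought, action)
        steps.append(step)
        on_step(step)                                # 3. stream the step to the annotator
        if action == "done":                         # assumed terminal action name
            break
        execute(action)                              # 4. run the action in the browser
    return steps


# Stubbed usage: a fake "LLM" that finishes on its third decision.
def fake_decide(screenshot, history):
    return ("thinking", "done" if len(history) == 2 else "click")


executed = []
streamed = []
trace = run_agent(lambda: b"png", fake_decide, executed.append, streamed.append)
```

Passing the browser and the model in as callables keeps the loop testable without a network connection or an API key.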

Requirements

bash

pip install playwright anthropic
playwright install chromium
export ANTHROPIC_API_KEY=your_key_here
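
Before starting a study it can help to confirm that these prerequisites are in place. The following is a small preflight sketch; `missing_requirements` is a hypothetical helper and not part of Potato, and it only checks that the Python packages can be found and that the key is set, not that the Chromium download succeeded.

```python
import importlib.util
import os


def missing_requirements(env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    for package in ("playwright", "anthropic"):
        # find_spec returns None when a top-level package cannot be located.
        if importlib.util.find_spec(package) is None:
            problems.append(f"Python package not installed: {package}")
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in missing_requirements():
        print(problem)
```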

Configuration

yaml

live_agent:
  endpoint_type: anthropic_vision
  ai_config:
    model: claude-sonnet-4-20250514
    api_key: ${ANTHROPIC_API_KEY}
    max_tokens: 4096
    temperature: 0.3
  system_prompt: |
    You are a web browsing agent. Complete the given task efficiently.
    At each step, describe your thought, then output an action.
  max_steps: 30
  step_delay: 1.0
  viewport:
    width: 1280
    height: 720
  allow_takeover: true
  allow_instructions: true
 
instance_display:
  fields:
    - key: task_description
      type: text
      label: "Task"
    - key: agent_trace
      type: live_agent
      label: "Live Agent Session"
      display_options:
        show_overlays: true
        show_filmstrip: true
        show_thought: true
        show_controls: true

Configuration Reference

Option	Type	Default	Description
`endpoint_type`	string	`anthropic_vision`	LLM provider for the agent
`ai_config.model`	string	`claude-sonnet-4-20250514`	Model to use
`ai_config.api_key`	string	env var	API key (use the `${VAR}` syntax)
`ai_config.max_tokens`	int	`4096`	Maximum tokens per LLM response
`ai_config.temperature`	float	`0.3`	Sampling temperature
`system_prompt`	string	built-in	System prompt for the agent
`max_steps`	int	`30`	Maximum number of steps before the agent stops
`step_delay`	float	`1.0`	Delay between steps, in seconds
`viewport.width`	int	`1280`	Browser viewport width, in pixels
`viewport.height`	int	`720`	Browser viewport height, in pixels
`allow_takeover`	bool	`true`	Allow the annotator to take manual control
`allow_instructions`	bool	`true`	Allow the annotator to send instructions mid-run
`history_window`	int	`5`	Number of recent steps included in the LLM context
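
To make the defaults in the table concrete, here is a sketch that models them as Python dataclasses and applies overrides from a parsed `live_agent` mapping. `Viewport`, `LiveAgentSettings`, and `load_settings` are hypothetical names chosen for illustration; the range checks are assumptions, and how Potato actually parses and validates the configuration may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Viewport:
    width: int = 1280
    height: int = 720


@dataclass
class LiveAgentSettings:
    """Hypothetical container; defaults copied from the reference table."""
    endpoint_type: str = "anthropic_vision"
    max_steps: int = 30
    step_delay: float = 1.0
    allow_takeover: bool = True
    allow_instructions: bool = True
    history_window: int = 5
    viewport: Viewport = field(default_factory=Viewport)


def load_settings(raw: dict) -> LiveAgentSettings:
    """Build settings from a parsed `live_agent` mapping, rejecting unknown keys."""
    raw = dict(raw)  # copy so the caller's mapping is not modified
    viewport = Viewport(**raw.pop("viewport", {}))
    # ai_config and system_prompt are ignored in this sketch.
    raw.pop("ai_config", None)
    raw.pop("system_prompt", None)
    # Unknown keys make the dataclass constructor raise TypeError.
    settings = LiveAgentSettings(viewport=viewport, **raw)
    if settings.max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    if settings.step_delay < 0:
        raise ValueError("step_delay cannot be negative")
    return settings


settings = load_settings({"max_steps": 25, "step_delay": 1.5,
                          "viewport": {"width": 1280, "height": 720}})
```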

Data Format

Each instance provides a task and a start URL:

json

{
  "id": "task_001",
  "task_description": "Search for climate change on Wikipedia and find the year it was first described",
  "start_url": "https://en.wikipedia.org"
}
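
A data file with many such instances is easy to get subtly wrong, so a quick check before launching can save a session. The sketch below validates JSONL lines shaped like the example above. `validate_task_lines` is a hypothetical helper, and treating all three keys as required (and ids as unique) is an assumption made for this illustration.

```python
import json
from urllib.parse import urlparse

REQUIRED_KEYS = ("id", "task_description", "start_url")


def validate_task_lines(lines):
    """Parse JSONL lines and return the task dicts, raising ValueError on bad input."""
    tasks, seen_ids = [], set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        task = json.loads(line)  # JSONDecodeError is a ValueError subclass
        missing = [key for key in REQUIRED_KEYS if key not in task]
        if missing:
            raise ValueError(f"line {number}: missing {missing}")
        if urlparse(task["start_url"]).scheme not in ("http", "https"):
            raise ValueError(f"line {number}: start_url must be http(s)")
        if task["id"] in seen_ids:
            raise ValueError(f"line {number}: duplicate id {task['id']}")
        seen_ids.add(task["id"])
        tasks.append(task)
    return tasks


example = json.dumps({
    "id": "task_001",
    "task_description": "Search for climate change on Wikipedia and find the year it was first described",
    "start_url": "https://en.wikipedia.org",
})
tasks = validate_task_lines([example])
```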

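Annotator Workflow

The annotator reads the task description and clicks Start Agent
A headless Chromium browser starts and connects to the LLM
As the agent browses, screenshots stream to the viewer in real time; each step shows the screenshot, the agent's thought, and the action taken
The annotator can interact through the control panel:
- Pause / Resume: halts the agent between steps
- Send Instructions: injects a message into the agent's context mid-run
- Take Over: switches to manual browsing control
- Stop: ends the session early
When the session ends (success, failure, or max_steps reached), the trace is saved and the screen switches to review mode
The annotator fills in the annotation schemes to rate the agent's performance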
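
One way to reason about the controls in this workflow is as a small state machine. The sketch below is an assumption-laden model, not Potato's code: the `State` names, the `TRANSITIONS` table, and the `Session` class are hypothetical, and it does not model handing control back to the agent after a takeover.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    PAUSED = "paused"
    TAKEOVER = "takeover"
    REVIEW = "review"


# Hypothetical transition table: (current state, control) -> next state.
TRANSITIONS = {
    (State.RUNNING, "pause"): State.PAUSED,
    (State.PAUSED, "resume"): State.RUNNING,
    (State.RUNNING, "take_over"): State.TAKEOVER,
    (State.PAUSED, "take_over"): State.TAKEOVER,
    (State.RUNNING, "stop"): State.REVIEW,
    (State.PAUSED, "stop"): State.REVIEW,
    (State.TAKEOVER, "stop"): State.REVIEW,
}


class Session:
    def __init__(self):
        self.state = State.RUNNING
        self.instructions = []

    def control(self, name):
        """Apply a control; invalid combinations raise ValueError."""
        key = (self.state, name)
        if key not in TRANSITIONS:
            raise ValueError(f"{name!r} is not allowed while {self.state.value}")
        self.state = TRANSITIONS[key]

    def send_instruction(self, text):
        """Queue a message for the agent's context; rejected once in review."""
        if self.state is State.REVIEW:
            raise ValueError("session has ended")
        self.instructions.append(text)


session = Session()
session.control("pause")
session.send_instruction("Use the search box instead of the menu")
session.control("resume")
session.control("stop")
```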

Keyboard Shortcuts

Key	Action
`Space`	Pause / Resume
`Escape`	Stop the session

Adding Annotation Schemes

Combine the live agent display with any Potato annotation scheme:

yaml

annotation_schemes:
  - annotation_type: radio
    name: task_success
    question: "Did the agent complete the task?"
    labels:
      - name: "Yes, fully"
      - name: "Partially"
      - name: "No"
  - annotation_type: likert
    name: efficiency
    question: "How efficiently did the agent work?"
    min_label: "Very inefficient"
    max_label: "Very efficient"
    scale: 5
  - annotation_type: text
    name: errors_observed
    question: "Describe any errors or unnecessary steps"
  - annotation_type: span
    name: error_steps
    question: "Mark any steps where the agent made an error"
    labels:
      - name: hallucination
      - name: wrong_target
      - name: unnecessary_action

Full Example

yaml

task_name: "Live Agent Evaluation Study"
task_dir: "."
 
live_agent:
  endpoint_type: anthropic_vision
  ai_config:
    model: claude-sonnet-4-20250514
    api_key: ${ANTHROPIC_API_KEY}
    max_tokens: 4096
    temperature: 0.3
  max_steps: 25
  step_delay: 1.5
  viewport:
    width: 1280
    height: 720
  allow_takeover: true
  allow_instructions: true
  history_window: 5
 
data_files:
  - "tasks.jsonl"
 
instance_display:
  fields:
    - key: task_description
      type: text
      label: "Task"
    - key: agent_trace
      type: live_agent
      label: "Live Session"
      display_options:
        show_overlays: true
        show_filmstrip: true
        show_thought: true
        show_controls: true
 
annotation_schemes:
  - annotation_type: radio
    name: task_success
    question: "Did the agent complete the task?"
    labels:
      - name: "Yes"
      - name: "Partially"
      - name: "No"
  - annotation_type: likert
    name: efficiency
    question: "Rate the agent's efficiency"
    scale: 5
    min_label: "Very inefficient"
    max_label: "Very efficient"
  - annotation_type: text
    name: notes
    question: "Notes on agent behavior"
 
output_annotation_dir: "output/"
output_annotation_format: "jsonl"

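Architecture

The live agent runs in a background thread inside Flask. Screenshots and state changes are streamed to the browser over Server-Sent Events (SSE). The annotator controls (pause, instruct, take over, stop) call REST endpoints that synchronize with the background thread.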

text

Annotator (browser)  <── SSE stream ──  Flask Server  ── Playwright ──► Headless Browser
                     ──► REST control ─►              ◄── LLM API ────► Claude Vision

Screenshots are saved to {task_dir}/live_sessions/ and served through the API for the filmstrip view.
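
The thread-plus-SSE arrangement can be illustrated with the standard library alone. This is a sketch under stated assumptions: `AgentWorker` and `format_sse` are hypothetical names, the worker emits placeholder events instead of driving a browser, and in a real server the queue would be drained by a streaming response while `pause`, `resume`, and `stop` would be called from REST handlers.

```python
import json
import queue
import threading


def format_sse(event: str, data: dict) -> str:
    """Encode one Server-Sent Events message (event line, data line, blank line)."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


class AgentWorker(threading.Thread):
    """Hypothetical background worker; a real one would drive the browser and LLM."""

    def __init__(self, max_steps: int):
        super().__init__(daemon=True)
        self.max_steps = max_steps
        self.events = queue.Queue()        # consumed by the SSE response
        self.running = threading.Event()   # cleared while paused
        self.stopped = threading.Event()
        self.running.set()

    def pause(self):
        self.running.clear()

    def resume(self):
        self.running.set()

    def stop(self):
        self.stopped.set()
        self.running.set()  # wake a paused worker so it can exit

    def run(self):
        for index in range(self.max_steps):
            self.running.wait()            # block between steps while paused
            if self.stopped.is_set():
                break
            self.events.put(format_sse("step", {"index": index}))
        self.events.put(format_sse("done", {}))


worker = AgentWorker(max_steps=3)
worker.start()
worker.join(timeout=5)
messages = []
while not worker.events.empty():
    messages.append(worker.events.get())
```

Using `threading.Event` for pause means the worker only halts between steps, which matches the Pause / Resume behavior described in the workflow.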

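Trace Export

When a session completes, Potato automatically exports the full trace as web_agent_trace-compatible JSON. The export includes:

Every step, with its screenshot, action, thought, and observation
Every instruction the annotator sent during the run
Timestamps and agent configuration metadata
Annotator takeover events

As a result, completed live sessions can be reviewed later in the standard web agent annotation viewer.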
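
Exported traces can also be post-processed outside the viewer. The sketch below summarizes one trace, but note that the layout it reads is invented for illustration: the `steps`, `instructions`, and `source` field names are hypothetical, and the real web_agent_trace schema should be checked before adapting it.

```python
import json

# Hypothetical trace layout; the real web_agent_trace field names may differ.
trace_json = json.dumps({
    "steps": [
        {"action": "click", "thought": "Open search", "source": "agent"},
        {"action": "type", "thought": "Enter query", "source": "agent"},
        {"action": "click", "thought": "", "source": "annotator"},
    ],
    "instructions": [{"after_step": 1, "text": "Try the search box"}],
})


def summarize_trace(raw: str) -> dict:
    """Count agent steps, annotator takeover steps, and instructions in one trace."""
    trace = json.loads(raw)
    steps = trace.get("steps", [])
    return {
        "agent_steps": sum(1 for s in steps if s.get("source") == "agent"),
        "takeover_steps": sum(1 for s in steps if s.get("source") == "annotator"),
        "instructions": len(trace.get("instructions", [])),
    }


summary = summarize_trace(trace_json)
```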

Troubleshooting

"Playwright is not installed": run pip install playwright && playwright install chromium.

"Anthropic API key required": set the ANTHROPIC_API_KEY environment variable, or use api_key: ${ANTHROPIC_API_KEY} in the configuration.

Agent seems slow: each step requires an LLM API call (typically 3 to 10 seconds). A thinking indicator is shown while the LLM is processing. To speed up long sessions, reduce history_window.

Screenshots not loading: check that task_dir is writable and that the server has free disk space.
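
Reducing history_window helps with slow sessions because fewer past steps are sent with every LLM call. A minimal sketch of that windowing, assuming the context is built by keeping only the most recent steps (`build_context` is a hypothetical helper, not a Potato function):

```python
def build_context(steps, history_window=5):
    """Keep only the most recent steps for the next LLM call."""
    if history_window <= 0:
        # Guard needed because steps[-0:] would return the whole list.
        return []
    return steps[-history_window:]


all_steps = [f"step {i}" for i in range(12)]
context = build_context(all_steps, history_window=5)
```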

Coding Agent Backends

Beyond web browsing agents, Potato supports live observation of coding agents. Three backends are available:

Ollama (local, no API key required)

Run coding agent evaluations entirely on local models, with no API key needed.

yaml

live_agent:
  endpoint_type: coding_agent
  backend: ollama
  ai_config:
    model: qwen2.5-coder:7b
    host: "http://localhost:11434"
  max_steps: 50
  project_dir: "./workspace"

Anthropic API

Use Claude with tool use for coding agent evaluation.

yaml

live_agent:
  endpoint_type: coding_agent
  backend: anthropic
  ai_config:
    model: claude-sonnet-4-20250514
    api_key: ${ANTHROPIC_API_KEY}
    max_tokens: 8192
  max_steps: 50
  project_dir: "./workspace"

Claude Agent SDK

Full Claude Code capabilities for advanced coding agent sessions.

yaml

live_agent:
  endpoint_type: coding_agent
  backend: claude_agent_sdk
  ai_config:
    max_turns: 50
  project_dir: "./workspace"

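For the full reference, including rollback, branching, and trajectory export, see Live Coding Agents.

Rollback and Checkpoints

For coding agent sessions, Potato creates a git commit on every file change. This enables:

One-click rollback to any earlier checkpoint
Branch and replay: try a different approach from any checkpoint
A complete history of every file state for review

Checkpoints are managed automatically through a dedicated git branch per session.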
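
The commit-per-change idea can be reproduced with ordinary git commands. The sketch below is an illustration of the technique, not Potato's checkpoint code: the `checkpoint` and `rollback` helpers, the `session-demo` branch name, and the commit identity are all hypothetical, and it requires `git` to be available on the PATH.

```python
import subprocess
import tempfile
from pathlib import Path

# Identity flags so commits work even without a global git config.
GIT = ["git", "-c", "user.name=potato-sketch", "-c", "user.email=sketch@example.invalid"]


def git(repo: Path, *args: str) -> str:
    """Run one git command inside the repo and return its trimmed stdout."""
    result = subprocess.run(GIT + list(args), cwd=repo, check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


def checkpoint(repo: Path, message: str) -> str:
    """Commit every change in the working tree and return the commit hash."""
    git(repo, "add", "-A")
    git(repo, "commit", "--allow-empty", "-m", message)
    return git(repo, "rev-parse", "HEAD")


def rollback(repo: Path, commit: str) -> None:
    """Restore the branch and working tree to an earlier checkpoint."""
    git(repo, "reset", "--hard", commit)


repo = Path(tempfile.mkdtemp())
git(repo, "init")
git(repo, "checkout", "-b", "session-demo")  # hypothetical per-session branch name
target = repo / "app.py"
target.write_text("print('v1')\n")
first = checkpoint(repo, "step 1")
target.write_text("print('v2')\n")
second = checkpoint(repo, "step 2")
rollback(repo, first)
restored = target.read_text()
```

Note that `git reset --hard` discards the later commit from the branch; a tool that also wants to keep the abandoned work would create a new branch at the checkpoint instead.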

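Branching Trajectories

When an annotator rolls back and tries a different approach, Potato creates a branching trajectory. Both branches are preserved in the output, which produces rich training data for:

Process reward models: step-level correctness labels across branches
Preference learning: which branch led to the better outcome
Code review datasets: code quality comparisons across approaches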
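
To show how preserved branches could feed preference learning, the sketch below turns rated branches into (preferred, rejected) pairs. The `Branch` record, its fields, and the numeric `rating` are hypothetical and chosen for illustration; Potato's actual output format for branching trajectories may look different.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Branch:
    """Hypothetical record of one trajectory branch."""
    name: str
    forked_at: Optional[int]  # checkpoint index, None for the original run
    steps: List[str] = field(default_factory=list)
    rating: int = 0           # annotator's outcome score for the branch


def preference_pairs(branches: List[Branch]) -> List[Tuple[str, str]]:
    """Return (preferred, rejected) name pairs for branches with different ratings."""
    pairs = []
    for i, a in enumerate(branches):
        for b in branches[i + 1:]:
            if a.rating > b.rating:
                pairs.append((a.name, b.name))
            elif b.rating > a.rating:
                pairs.append((b.name, a.name))
            # Equal ratings express no preference and are skipped.
    return pairs


original = Branch("main", None, ["edit parser.py", "run tests (fail)"], rating=2)
retry = Branch("retry-1", 1,
               ["edit parser.py", "add test fixture", "run tests (pass)"], rating=5)
pairs = preference_pairs([original, retry])
```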

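Further Reading

Live Coding Agents: coding agent observation with Ollama, Anthropic, and the Claude SDK
Web Agent Annotation: reviewing pre-recorded agent traces
Agent Annotation: overview of agent trace formats and converters
Process Reward Annotation: collecting PRM training data
AI Support: LLM integration for annotation assistance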

For implementation details, see the original documentation.