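Visual AI Support

Use vision models (GPT-4 Vision, Claude Vision, Gemini, and YOLO) to pre-annotate images, generate bounding box suggestions, and assist with visual tasks in Potato.

New in v2.1.0

Potato provides AI-powered assistance for image and video annotation tasks using a range of vision models, including YOLO for object detection and vision-language models (VLLMs) such as GPT-4o, Claude, and Ollama vision models.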

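Overview

Visual AI support enables:

Object detection: Automatically detect and locate objects in images using YOLO or a VLLM
Pre-annotation: Automatically detect all objects for human review
Classification: Classify images or regions within images
Hints: Provide guidance without revealing exact locations
Scene detection: Identify temporal segments in videos
Keyframe detection: Find significant moments in videos
Object tracking: Track objects across video frames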

Supported Endpoints

YOLO Endpoint

Best for fast, accurate object detection with local inference.

yaml

ai_support:
  enabled: true
  endpoint_type: "yolo"
  ai_config:
    model: "yolov8m.pt"  # or yolov8n, yolov8l, yolov8x, yolo-world
    confidence_threshold: 0.5
    iou_threshold: 0.45
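
The two thresholds in the configuration above follow the usual object-detection convention: boxes scoring below `confidence_threshold` are dropped, and overlapping boxes are merged by non-maximum suppression using `iou_threshold`. The sketch below illustrates that generic technique in plain Python. It is not Potato's or Ultralytics' implementation, and the `Box` layout and function names are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical box layout: (x1, y1, x2, y2, confidence) in corner format.
Box = Tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(
    boxes: List[Box],
    confidence_threshold: float = 0.5,
    iou_threshold: float = 0.45,
) -> List[Box]:
    """Drop low-confidence boxes, then apply greedy non-maximum suppression."""
    candidates = sorted(
        (b for b in boxes if b[4] >= confidence_threshold),
        key=lambda b: b[4],
        reverse=True,
    )
    kept: List[Box] = []
    for box in candidates:
        # Keep a box only if it does not overlap a stronger kept box too much.
        if all(iou(box, k) <= iou_threshold for k in kept):
            kept.append(box)
    return kept
```

Lowering `confidence_threshold` admits more candidate boxes. Lowering `iou_threshold` suppresses overlapping boxes more aggressively.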

Supported models:

YOLOv8 (n/s/m/l/x variants)
YOLO-World (open-vocabulary detection)
Custom-trained models

Ollama Vision Endpoint

For local vision-language model inference.

yaml

ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"  # or llava-llama3, bakllava, llama3.2-vision, qwen2.5-vl
    base_url: "http://localhost:11434"
    max_tokens: 500
    temperature: 0.1

Supported models:

LLaVA (7B, 13B, 34B)
LLaVA-LLaMA3
BakLLaVA
Llama 3.2 Vision (11B, 90B)
Qwen2.5-VL
Moondream

OpenAI Vision Endpoint

For cloud-based visual analysis with GPT-4o.

yaml

ai_support:
  enabled: true
  endpoint_type: "openai_vision"
  ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"  # or gpt-4o-mini
    max_tokens: 1000
    detail: "auto"  # low, high, or auto

Anthropic Vision Endpoint

For Claude with vision capabilities.

yaml

ai_support:
  enabled: true
  endpoint_type: "anthropic_vision"
  ai_config:
    api_key: "${ANTHROPIC_API_KEY}"
    model: "claude-sonnet-4-20250514"
    max_tokens: 1024

Endpoint Capabilities

Each endpoint has different strengths:

Endpoint	Text generation	Vision	Bbox output	Keywords	Rationale
`ollama_vision`	Yes	Yes	No	No	Yes
`openai_vision`	Yes	Yes	No	No	Yes
`anthropic_vision`	Yes	Yes	No	No	Yes
`yolo`	No	Yes	Yes	No	No

Best practices:

Use the yolo endpoint for precise object detection
Use a VLLM, such as ollama_vision with Qwen-VL or LLaVA, for image classification with explanations
Configure both a text endpoint and a visual endpoint for combined workflows
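
The capability table can be encoded as a lookup to help choose an endpoint for a given task. The sketch below mirrors the table values. The dictionary, the feature keys, and the `endpoints_supporting` helper are hypothetical names for illustration and are not part of Potato.

```python
from typing import Dict, List

# Values copied from the capability table; key names are illustrative.
CAPABILITIES: Dict[str, Dict[str, bool]] = {
    "ollama_vision": {"text_generation": True, "vision": True,
                      "bbox_output": False, "keywords": False, "rationale": True},
    "openai_vision": {"text_generation": True, "vision": True,
                      "bbox_output": False, "keywords": False, "rationale": True},
    "anthropic_vision": {"text_generation": True, "vision": True,
                         "bbox_output": False, "keywords": False, "rationale": True},
    "yolo": {"text_generation": False, "vision": True,
             "bbox_output": True, "keywords": False, "rationale": False},
}


def endpoints_supporting(*features: str) -> List[str]:
    """Return the endpoint types that support every requested feature."""
    return [
        name
        for name, caps in CAPABILITIES.items()
        if all(caps[feature] for feature in features)
    ]
```

For example, `endpoints_supporting("bbox_output")` selects only `yolo`, while `endpoints_supporting("vision", "rationale")` selects the three VLLM endpoints.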

Image Annotation with AI

Configure AI-assisted image annotation with detection, pre-annotation, classification, and hint features:

yaml

annotation_schemes:
  - annotation_type: image_annotation
    name: object_detection
    description: "Detect and label objects in the image"
    tools:
      - bbox
      - polygon
    labels:
      - name: "person"
        color: "#FF6B6B"
      - name: "car"
        color: "#4ECDC4"
      - name: "dog"
        color: "#45B7D1"
 
    ai_support:
      enabled: true
      features:
        detection: true      # "Detect" button - find objects
        pre_annotate: true   # "Auto" button - detect all
        classification: false # "Classify" button - classify region
        hint: true           # "Hint" button - get guidance
 
ai_support:
  enabled: true
  endpoint_type: "yolo"
  ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5

Video Annotation with AI

yaml

annotation_schemes:
  - annotation_type: video_annotation
    name: scene_segmentation
    description: "Segment video into scenes"
    mode: segment
    labels:
      - name: "intro"
        color: "#4ECDC4"
      - name: "action"
        color: "#FF6B6B"
      - name: "outro"
        color: "#45B7D1"
 
    ai_support:
      enabled: true
      features:
        scene_detection: true     # Detect scene boundaries
        keyframe_detection: false
        tracking: false
        pre_annotate: true        # Auto-segment entire video
        hint: true
 
ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"
    max_frames: 10  # Frames to sample for video analysis
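
The `max_frames` setting caps how many frames are sampled for video analysis. The source does not say how the frames are chosen. One plausible strategy is to spread the samples evenly across the video, as in the sketch below. The `sample_frame_indices` function is a hypothetical helper and does not reflect Potato's actual sampling code.

```python
from typing import List


def sample_frame_indices(total_frames: int, max_frames: int = 10) -> List[int]:
    """Pick up to max_frames evenly spaced frame indices from a video."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short video: use every frame.
        return list(range(total_frames))
    step = total_frames / max_frames
    # Take the middle frame of each equal-width window.
    return [int(i * step + step / 2) for i in range(max_frames)]
```

Under this assumption, a larger `max_frames` gives the model finer temporal resolution at the cost of more inference time per video.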

Separate Visual and Text Endpoints

You can configure a separate endpoint for visual tasks so that each content type uses the model best suited to it:

yaml

ai_support:
  enabled: true
  endpoint_type: "openai"  # For text annotations
  ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o-mini"
 
  # Separate visual endpoint
  visual_endpoint_type: "yolo"
  visual_ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5

Or use a vision-language model alongside a text model:

yaml

ai_support:
  enabled: true
  endpoint_type: "ollama"  # Main endpoint for text
  visual_endpoint_type: "ollama_vision"  # Visual endpoint for images
  ai_config:
    model: "llama3.2"
    include:
      all: true
  visual_ai_config:
    model: "qwen2.5-vl:7b"

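AI Features

Detection

Finds objects that match the configured labels and draws suggested bounding boxes. Suggestions appear as dashed overlays that you can accept or reject.

Pre-annotation (Auto)

Automatically detects all objects in the image or video and creates suggestions for human review. Useful for speeding up annotation of large datasets.

Classification

Classifies a selected region or the entire image. Returns a suggested label with a confidence score and reasoning.

Hint

Provides guidance without revealing the exact answer. Useful for training annotators or when you want human judgment alongside AI assistance.

Scene Detection (Video)

Analyzes video frames to identify scene boundaries and suggests labeled temporal segments.

Keyframe Detection (Video)

Identifies significant moments in a video that would make good annotation points.

Object Tracking (Video)

Suggests object positions across frames for consistent tracking annotations.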

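Using AI Suggestions

Click an AI assistance button (Detect, Auto, Hint, and so on)
Wait for the suggestions to appear as dashed overlays
Accept a suggestion: double-click the suggestion overlay
Reject a suggestion: right-click the suggestion overlay
Accept all: click "Accept All" in the toolbar
Clear all: click "Clear" to remove all suggestions

Detection API Response Format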

json

{
  "detections": [
    {
      "label": "person",
      "bbox": {"x": 0.1, "y": 0.2, "width": 0.3, "height": 0.5},
      "confidence": 0.95
    }
  ]
}
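
To draw a returned box on an image, the `bbox` values need to be mapped to pixels. The sketch below assumes the values are fractions of the image width and height with `x` and `y` at the top-left corner. The sample values are consistent with that reading, but the source does not state it. The `to_pixel_boxes` function is a hypothetical helper.

```python
import json
from typing import Any, Dict, List


def to_pixel_boxes(response_text: str, image_width: int,
                   image_height: int) -> List[Dict[str, Any]]:
    """Convert a detection response into pixel-space boxes.

    Assumes bbox values are normalized to the range 0..1.
    """
    response = json.loads(response_text)
    boxes = []
    for det in response.get("detections", []):
        bbox = det["bbox"]
        boxes.append({
            "label": det["label"],
            "confidence": det["confidence"],
            "left": round(bbox["x"] * image_width),
            "top": round(bbox["y"] * image_height),
            "width": round(bbox["width"] * image_width),
            "height": round(bbox["height"] * image_height),
        })
    return boxes
```

For the sample response above and a 1000 by 500 pixel image, this yields a 300 by 250 pixel box whose top-left corner is at (100, 100).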

For hints:

json

{
  "hint": "Look for objects in the lower right corner",
  "suggestive_choice": "Focus on overlapping regions"
}

For video segments:

json

{
  "segments": [
    {
      "start_time": 0.0,
      "end_time": 5.5,
      "suggested_label": "intro",
      "confidence": 0.85
    }
  ]
}
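
A consumer of the segment response may want to discard suggestions that are malformed, use an unknown label, or fall below a confidence cutoff before showing them to annotators. The sketch below shows one way to do that. The `accept_segments` function and its `min_confidence` parameter are hypothetical and not part of Potato.

```python
from typing import Any, Dict, Iterable, List


def accept_segments(response: Dict[str, Any], allowed_labels: Iterable[str],
                    min_confidence: float = 0.5) -> List[Dict[str, Any]]:
    """Return valid, sufficiently confident segments ordered by start time."""
    allowed = set(allowed_labels)
    accepted = []
    for seg in sorted(response.get("segments", []),
                      key=lambda s: s["start_time"]):
        if seg["end_time"] <= seg["start_time"]:
            continue  # empty or reversed interval
        if seg["suggested_label"] not in allowed:
            continue  # label is not in the annotation scheme
        if seg["confidence"] < min_confidence:
            continue  # too uncertain to suggest
        accepted.append(seg)
    return accepted
```

With the `intro`, `action`, and `outro` labels from the video scheme shown earlier, the sample segment above passes all three checks.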

Requirements

For the YOLO Endpoint

bash

pip install ultralytics opencv-python

For Ollama Vision

Install Ollama from ollama.ai
Download a vision model: ollama pull llava
Start the Ollama server (runs on http://localhost:11434 by default)

For OpenAI/Anthropic Vision

Set your API key in the environment or in the configuration
Make sure you have access to a vision-capable model

Troubleshooting

"No visual AI endpoint configured"

Check that you have:

Set ai_support.enabled: true
Set a valid endpoint_type that supports vision (yolo, ollama_vision, openai_vision, anthropic_vision)
Installed the required dependencies for the chosen endpoint
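
The first two checks can be automated against a configuration that has already been loaded into a dictionary. The sketch below assumes that `visual_endpoint_type` takes precedence over `endpoint_type` for visual tasks. That precedence is a guess based on the separate-endpoint examples, and `check_visual_config` is a hypothetical helper, not a Potato function.

```python
from typing import Any, Dict, List

VISION_ENDPOINTS = {"yolo", "ollama_vision", "openai_vision", "anthropic_vision"}


def check_visual_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in the ai_support section."""
    problems = []
    ai = config.get("ai_support") or {}
    if not ai.get("enabled"):
        problems.append("ai_support.enabled is not true")
    # Assumption: a dedicated visual endpoint overrides the main endpoint.
    endpoint = ai.get("visual_endpoint_type") or ai.get("endpoint_type")
    if endpoint not in VISION_ENDPOINTS:
        problems.append(f"endpoint type {endpoint!r} does not support vision")
    return problems
```

An empty list means both checks passed. Installed dependencies still need to be verified separately.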

YOLO does not detect the expected objects

Try lowering confidence_threshold
Check that your labels match YOLO's class names (or use YOLO-World for custom vocabularies)
Verify that the model file exists and is valid

Ollama Vision errors

Check that Ollama is running: curl http://localhost:11434/api/tags
Check that you have downloaded a vision model: ollama list
Check that the model supports vision (llava, bakllava, llama3.2-vision, and so on)
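
The first two Ollama checks can also be scripted. The sketch below queries the same `/api/tags` endpoint used by the curl command and looks for a model name in the returned `models` list. The function names and the name-matching rule are illustrative choices, not part of Potato or Ollama.

```python
import json
import urllib.request
from typing import Any, Dict


def fetch_tags(base_url: str = "http://localhost:11434") -> Dict[str, Any]:
    """Fetch the list of locally available models from an Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


def has_model(tags: Dict[str, Any], model: str) -> bool:
    """Check whether a model appears in an /api/tags response.

    A name without a tag (for example "llava") matches any tag of that model.
    """
    names = [m.get("name", "") for m in tags.get("models", [])]
    if ":" in model:
        return model in names
    return any(name.split(":", 1)[0] == model for name in names)
```

With the server running, `has_model(fetch_tags(), "llava")` reports whether a LLaVA model has been pulled. A connection error from `fetch_tags` indicates that Ollama is not reachable at the given URL.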

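Further Reading

AI Support - Text-based AI assistance (hints, keywords, rationales)
Image Annotation - Image annotation tools and configuration
Instance Display - Content display configuration

For implementation details, see the source documentation.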