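Potato 2.1 introduces visual AI support, which brings AI-powered assistance directly into image and video annotation workflows. Instead of drawing every bounding box from scratch, you can have YOLO detect objects automatically and then review its suggestions, or ask a vision-language model to classify an image and explain its reasoning.

This guide covers how to set up each visual AI endpoint, configure the different assistance modes, and combine visual AI with Potato's text-based AI features.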

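What You'll Learn

Setting up YOLO for fast local object detection
Running Ollama Vision models for local image understanding
Using the OpenAI and Anthropic cloud vision APIs
Configuring the detection, pre-annotation, classification, and hint modes
Combining visual and text AI endpoints in a single project
Using the accept/reject workflow to review AI suggestions

Prerequisites

Potato 2.1.0 or later is required: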

bash

pip install --upgrade potato-annotation

You also need one of the following, depending on the endpoint you choose:

YOLO: pip install ultralytics opencv-python
Ollama: install from ollama.ai and pull a vision model
OpenAI: an API key with access to GPT-4o
Anthropic: an API key with access to a Claude vision model

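Option 1: YOLO for Object Detection

YOLO is the best fit when you need fast, precise bounding-box detection that runs entirely on your local machine. It is good at detecting common objects (people, cars, animals, furniture) and can process an image in milliseconds on suitable hardware.

Setup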

bash

pip install ultralytics opencv-python

Configuration

yaml

annotation_task_name: "Object Detection with YOLO"
 
data_files:
  - data/images.json
 
item_properties:
  id_key: id
  text_key: image_url
 
instance_display:
  fields:
    - key: image_url
      type: image
      display_options:
        max_width: 800
        zoomable: true
 
annotation_schemes:
  - annotation_type: image_annotation
    name: objects
    description: "Detect and label objects"
    source_field: "image_url"
    tools:
      - bbox
    labels:
      - name: "person"
        color: "#FF6B6B"
      - name: "car"
        color: "#4ECDC4"
      - name: "dog"
        color: "#45B7D1"
      - name: "cat"
        color: "#96CEB4"
 
    ai_support:
      enabled: true
      features:
        detection: true
        pre_annotate: true
        hint: true
 
ai_support:
  enabled: true
  endpoint_type: "yolo"
  ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
    iou_threshold: 0.45
 
output_annotation_dir: "annotation_output/"
user_config:
  allow_all_users: true
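
The configuration above sets `iou_threshold` without explaining it. In YOLO-style detectors this value conventionally governs non-maximum suppression: when two candidate boxes overlap by more than the threshold, the lower-scoring one is dropped. The sketch below shows only the overlap measure, assuming boxes are `(x1, y1, x2, y2)` tuples in pixels. The `iou` helper is illustrative and is not part of Potato or Ultralytics.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    # Corners of the overlapping rectangle, if there is one.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp to zero so disjoint boxes give an empty intersection.
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0
```

With `iou_threshold: 0.45`, two boxes whose `iou` exceeds 0.45 would be treated as duplicates of the same object under this convention.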

Data Format

Create data/images.json in JSONL format, with one JSON object per line:

json

{"id": "img_001", "image_url": "images/street_scene_1.jpg"}
{"id": "img_002", "image_url": "images/park_photo.jpg"}
{"id": "img_003", "image_url": "https://example.com/images/office.jpg"}
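
For a folder of images, a file in this format can be generated rather than written by hand. The following is a minimal sketch, assuming the `id` and `image_url` keys shown above. The `build_manifest` helper, the `img_NNN` id scheme, and the list of suffixes are illustrative choices. How Potato resolves relative image paths is not covered here, so check the paths against your own setup.

```python
import json
from pathlib import Path

# Hypothetical set of extensions to treat as images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def build_manifest(image_dir, manifest_path):
    """Write one JSON object per image file and return the record count."""
    image_dir = Path(image_dir)
    images = sorted(
        p for p in image_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )
    with open(manifest_path, "w", encoding="utf-8") as out:
        for index, path in enumerate(images, start=1):
            record = {"id": f"img_{index:03d}", "image_url": path.as_posix()}
            # JSONL: each record is serialized on its own line.
            out.write(json.dumps(record) + "\n")
    return len(images)
```

A call such as `build_manifest("images", "data/images.json")` would write one line per image found directly inside `images/`.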

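Choosing a YOLO Model

Model	Size	Speed	Accuracy	Best for
`yolov8n.pt`	6 MB	Fastest	Lower	Quick prototyping
`yolov8s.pt`	22 MB	Fast	Good	Balanced workloads
`yolov8m.pt`	50 MB	Medium	Better	General use
`yolov8l.pt`	84 MB	Slower	High	When accuracy matters
`yolov8x.pt`	131 MB	Slowest	Highest	Maximum accuracy

To detect objects outside YOLO's built-in classes, use YOLO-World for open-vocabulary detection: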

yaml

ai_config:
  model: "yolo-world"
  confidence_threshold: 0.3

Tuning Detection

If YOLO is missing objects, lower the confidence threshold:

yaml

ai_config:
  confidence_threshold: 0.3  # More detections, more false positives

If there are too many false positives, raise the threshold:

yaml

ai_config:
  confidence_threshold: 0.7  # Fewer detections, higher precision
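
The effect of the threshold can be seen on a toy list of detections. This sketch assumes each detection carries a `confidence` score and that scores at or above the threshold are kept. The scores are made-up illustrative values, and `filter_detections` is a hypothetical helper rather than a Potato function.

```python
def filter_detections(detections, confidence_threshold):
    """Keep detections whose score is at or above the threshold."""
    return [d for d in detections if d["confidence"] >= confidence_threshold]


# Made-up scores, for illustration only.
detections = [
    {"label": "person", "confidence": 0.91},
    {"label": "car", "confidence": 0.62},
    {"label": "dog", "confidence": 0.34},
]

for threshold in (0.3, 0.5, 0.7):
    kept = filter_detections(detections, threshold)
    print(threshold, [d["label"] for d in kept])
```

At 0.3 all three boxes survive, at 0.5 the low-scoring dog is dropped, and at 0.7 only the person remains, which is the recall versus precision trade-off described above.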

Option 2: Ollama Vision for Local VLMs

Ollama Vision runs vision-language models (VLMs) locally. Unlike YOLO, these models can understand the context of an image, classify scenes, and generate text descriptions, all without sending data to a cloud API.

Setup

bash

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
 
# Pull a vision model
ollama pull llava
 
# Or for better performance:
ollama pull qwen2.5-vl:7b

Configuration

yaml

annotation_task_name: "Image Classification with Ollama Vision"
 
data_files:
  - data/images.json
 
item_properties:
  id_key: id
  text_key: image_url
 
instance_display:
  fields:
    - key: image_url
      type: image
      display_options:
        max_width: 600
        zoomable: true
 
annotation_schemes:
  - annotation_type: radio
    name: scene_type
    description: "What type of scene is shown?"
    labels:
      - indoor
      - outdoor_urban
      - outdoor_nature
      - aerial
      - underwater
 
    ai_support:
      enabled: true
      features:
        hint: true
        classification: true
 
ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"
    base_url: "http://localhost:11434"
    max_tokens: 500
    temperature: 0.1
 
output_annotation_dir: "annotation_output/"
user_config:
  allow_all_users: true

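Supported Models

Model	Parameters	Strengths
`llava:7b`	7B	Fast, good general understanding
`llava:13b`	13B	Higher accuracy
`llava-llama3`	8B	Strong reasoning
`bakllava`	7B	Good visual detail
`llama3.2-vision:11b`	11B	Latest Llama vision model
`qwen2.5-vl:7b`	7B	Strong multilingual and vision support
`moondream`	1.8B	Very fast, lightweight

Option 3: OpenAI Vision

OpenAI Vision provides high-quality image understanding through GPT-4o. It is the best fit when you need the most capable vision model and cloud API cost is not a concern.

Configuration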

yaml

ai_support:
  enabled: true
  endpoint_type: "openai_vision"
  ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"
    max_tokens: 1000
    detail: "auto"  # "low" for faster/cheaper, "high" for detail

Set your API key:

bash

export OPENAI_API_KEY="sk-..."
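
A missing key usually surfaces only when the first request fails, so a pre-flight check can save time. The sketch below expands `${NAME}` placeholders the way the config above writes them and fails early if a variable is unset. It uses Python's standard `string.Template`. How Potato itself resolves these placeholders is not shown in this guide, so `expand_env` is a hypothetical stand-alone helper.

```python
import os
from string import Template


def expand_env(value, env=None):
    """Replace ${NAME} placeholders; raise a clear error if NAME is unset."""
    env = os.environ if env is None else env
    try:
        return Template(value).substitute(env)
    except KeyError as missing:
        # Template.substitute raises KeyError with the placeholder name.
        raise RuntimeError(
            f"environment variable {missing.args[0]} is not set"
        ) from None
```

Calling `expand_env("${OPENAI_API_KEY}")` before starting the server raises a `RuntimeError` if the `export` step was skipped in the current shell.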

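The detail parameter controls the image resolution sent to the API:

low: faster and cheaper, suitable for classification
high: full resolution, suitable for finding small objects
auto: lets the API decide

Option 4: Anthropic Vision

Claude's vision capabilities are strong at understanding image context and providing detailed explanations.

Configuration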

yaml

ai_support:
  enabled: true
  endpoint_type: "anthropic_vision"
  ai_config:
    api_key: "${ANTHROPIC_API_KEY}"
    model: "claude-sonnet-4-20250514"
    max_tokens: 1024

bash

export ANTHROPIC_API_KEY="sk-ant-..."

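AI Assistance Modes

Each visual AI endpoint supports different assistance modes. Enable only the ones you need for each annotation scheme.

Detection Mode

Finds objects that match the configured labels and displays them as dashed bounding-box overlays: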

yaml

ai_support:
  enabled: true
  features:
    detection: true

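When the annotator clicks "Detect", AI suggestions appear on the image as dashed overlays. Double-click a suggestion to accept it, or right-click to reject it.

Pre-annotation (Auto) Mode

Automatically detects all objects and creates suggestions in one pass. Ideal for bootstrapping large datasets: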

yaml

ai_support:
  enabled: true
  features:
    pre_annotate: true

Classification Mode

Classifies a selected region or the whole image and returns suggested labels with confidence scores:

yaml

ai_support:
  enabled: true
  features:
    classification: true

Hint Mode

Provides guidance text without revealing the answer. Useful for training new annotators:

yaml

ai_support:
  enabled: true
  features:
    hint: true

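Accept/Reject Workflow

When an annotator clicks an AI assistance button, suggestions appear as dashed overlays:

Accept a suggestion: double-click the dashed overlay to convert it into a real annotation
Reject a suggestion: right-click the overlay to dismiss it
Accept all: click "Accept All" in the toolbar to accept every suggestion at once
Clear all: click "Clear" to dismiss every suggestion

This keeps annotators in control while reducing the manual work of drawing boxes from scratch.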
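
The four actions above amount to moving items between a list of pending suggestions and a list of real annotations. The following sketch models that state change only. `SuggestionReview` is a hypothetical class for illustration and does not reflect how Potato stores suggestions internally.

```python
from dataclasses import dataclass, field


@dataclass
class SuggestionReview:
    """Tracks pending AI suggestions and the annotations accepted from them."""

    pending: list = field(default_factory=list)
    annotations: list = field(default_factory=list)

    def accept(self, index):
        """Double-click: promote one suggestion to a real annotation."""
        self.annotations.append(self.pending.pop(index))

    def reject(self, index):
        """Right-click: discard one suggestion."""
        self.pending.pop(index)

    def accept_all(self):
        """'Accept All': promote every remaining suggestion."""
        self.annotations.extend(self.pending)
        self.pending.clear()

    def clear(self):
        """'Clear': discard every remaining suggestion."""
        self.pending.clear()
```

The point of the model is that nothing becomes an annotation until the annotator acts on it: rejected and cleared suggestions never reach `annotations`.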

Video Annotation with Visual AI

Visual AI also works with video annotation tasks. You can enable scene detection, keyframe detection, and object tracking:

yaml

annotation_schemes:
  - annotation_type: video_annotation
    name: scenes
    description: "Segment this video into scenes"
    mode: segment
    labels:
      - name: "intro"
        color: "#4ECDC4"
      - name: "main_content"
        color: "#FF6B6B"
      - name: "outro"
        color: "#45B7D1"
 
    ai_support:
      enabled: true
      features:
        scene_detection: true
        pre_annotate: true
        hint: true
 
ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"
    max_frames: 10  # Number of frames to sample

The max_frames parameter controls how many frames the AI samples from the video for analysis. More frames improve accuracy but slow down processing.
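
This guide does not say which frames are chosen. One common approach is to spread the samples evenly across the video, sketched below. Uniform sampling is an assumption here, and `sample_frame_indices` is a hypothetical helper, not Potato's implementation.

```python
def sample_frame_indices(total_frames, max_frames):
    """Return up to max_frames indices spread evenly across the video."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short video: every frame fits within the budget.
        return list(range(total_frames))
    step = total_frames / max_frames
    # Take the midpoint of each of the max_frames equal-length segments.
    return [int(step * i + step / 2) for i in range(max_frames)]
```

Under this scheme a 100-frame clip with `max_frames: 10` yields frames 5, 15, and so on up to 95, so doubling `max_frames` halves the gap between sampled frames and roughly doubles the number of model calls.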

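Combining Visual and Text AI Endpoints

If your project has both text and image annotation, you can configure a separate endpoint for each. Use a text-optimized model for hints and keywords, and a vision model for detection: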

yaml

ai_support:
  enabled: true
 
  # Text AI for radio buttons, text schemes, etc.
  endpoint_type: "ollama"
  ai_config:
    model: "llama3.2"
    include:
      all: true
 
  # Visual AI for image/video schemes
  visual_endpoint_type: "yolo"
  visual_ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5

Or combine a cloud vision model with a local text model:

yaml

ai_support:
  enabled: true
  endpoint_type: "ollama"
  visual_endpoint_type: "openai_vision"
  ai_config:
    model: "llama3.2"
  visual_ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"

Complete Example: Product Photo Annotation

A production-ready configuration for product photo annotation that uses YOLO detection and text-based AI hints:

yaml

annotation_task_name: "Product Photo Annotation"
 
data_files:
  - data/product_photos.json
 
item_properties:
  id_key: sku
  text_key: photo_url
 
instance_display:
  layout:
    direction: horizontal
    gap: 24px
  fields:
    - key: photo_url
      type: image
      label: "Product Photo"
      display_options:
        max_width: 600
        zoomable: true
    - key: product_description
      type: text
      label: "Product Details"
 
annotation_schemes:
  - annotation_type: image_annotation
    name: product_regions
    description: "Draw boxes around products and defects"
    source_field: "photo_url"
    tools:
      - bbox
    labels:
      - name: "product"
        color: "#4ECDC4"
      - name: "defect"
        color: "#FF6B6B"
      - name: "label"
        color: "#45B7D1"
      - name: "packaging"
        color: "#96CEB4"
 
    ai_support:
      enabled: true
      features:
        detection: true
        pre_annotate: true
 
  - annotation_type: radio
    name: photo_quality
    description: "Is this photo suitable for the product listing?"
    labels:
      - Approved
      - Needs editing
      - Reshoot required
 
  - annotation_type: multiselect
    name: quality_issues
    description: "Select any issues present"
    labels:
      - Blurry
      - Poor lighting
      - Wrong angle
      - Background clutter
      - Color inaccurate
 
ai_support:
  enabled: true
  endpoint_type: "ollama"
  visual_endpoint_type: "yolo"
 
  ai_config:
    model: "llama3.2"
    include:
      all: true
 
  visual_ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
 
output_annotation_dir: "annotation_output/"
export_annotation_format: "json"
user_config:
  allow_all_users: true

Sample data (data/product_photos.json):

json

{"sku": "SKU-001", "photo_url": "images/products/laptop_front.jpg", "product_description": "15-inch laptop, silver finish"}
{"sku": "SKU-002", "photo_url": "images/products/headphones_side.jpg", "product_description": "Over-ear wireless headphones, black"}
{"sku": "SKU-003", "photo_url": "images/products/backpack_full.jpg", "product_description": "40L hiking backpack, navy blue"}
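
Because the config refers to `sku`, `photo_url`, and `product_description` by name, a record missing any of them may fail to display. A quick check before loading the data might look like the sketch below. `find_missing_fields` is a hypothetical helper, and the required keys are simply the ones the config above references.

```python
import json


def find_missing_fields(jsonl_text, required_keys):
    """Return (line_number, missing_keys) pairs for incomplete records."""
    problems = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [key for key in required_keys if key not in record]
        if missing:
            problems.append((line_number, missing))
    return problems
```

An empty result for `("sku", "photo_url", "product_description")` means every record carries the fields that `item_properties` and `instance_display` expect.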

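Tips for Visual AI Annotation

Start with pre-annotation for large datasets: use the Auto button to generate suggestions for all objects, then have annotators review and correct them instead of drawing from scratch
Match the endpoint to the task: YOLO for precise detection, VLMs for classification and understanding
Tune the confidence threshold: start at 0.5 and adjust based on the false positive/false negative trade-off you observe
Use hints to train annotators: hint mode guides annotators without biasing them toward a specific answer
Combine endpoints: pair a YOLO visual endpoint for detection with an Ollama text endpoint for hints to get the benefits of both
Cache AI results: enable the disk cache to avoid re-running detection on the same image

Troubleshooting

"No visual AI endpoint configured"

Confirm that ai_support.enabled is true and that endpoint_type (or visual_endpoint_type, if you combine text and visual endpoints) is set to a vision-capable endpoint: yolo, ollama_vision, openai_vision, or anthropic_vision.

YOLO Detects No Objects

YOLO's built-in classes cover 80 common objects. If your labels don't match YOLO's class names, try YOLO-World for open-vocabulary detection or lower confidence_threshold.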
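
A label mismatch is easy to check for before annotation starts. The sketch below compares scheme labels against a list of detector class names, ignoring case. `unmatched_labels` is a hypothetical helper, and the class list in the example is only a small subset of the 80 built-in names.

```python
def unmatched_labels(scheme_labels, model_class_names):
    """Return scheme labels that the detector has no class for."""
    known = {name.lower() for name in model_class_names}
    return [label for label in scheme_labels if label.lower() not in known]


# A few of the built-in class names, for illustration only.
some_yolo_classes = ["person", "car", "dog", "cat", "laptop", "backpack"]

print(unmatched_labels(["person", "car", "dog", "cat"], some_yolo_classes))
print(unmatched_labels(["product", "defect", "label", "packaging"], some_yolo_classes))
```

The labels from the YOLO example all match, while the labels from the product photo example do not, so that project would likely need YOLO-World or a custom-trained model to get useful suggestions. With Ultralytics installed, the full class list should be available from the loaded model's `names` mapping.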

Ollama Returns Errors

Confirm that Ollama is running and that a vision model has been pulled:

bash

curl http://localhost:11434/api/tags  # Check Ollama is running
ollama list                           # Check installed models
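
The same check can be scripted, for example as part of a start-up routine. This sketch reads Ollama's `/api/tags` response, which lists installed models under a `models` array with a `name` field per entry. The helper names are hypothetical, and the base URL is the default one used in the config above.

```python
import json
from urllib.request import urlopen


def installed_models(tags_payload):
    """Extract model names from the JSON body returned by /api/tags."""
    return [model["name"] for model in tags_payload.get("models", [])]


def has_model(tags_payload, wanted):
    """True if `wanted` matches an installed name, with or without a tag."""
    names = installed_models(tags_payload)
    return any(
        name == wanted or name.split(":")[0] == wanted for name in names
    )


def fetch_tags(base_url="http://localhost:11434", timeout=5):
    """Query a running Ollama server; raises URLError if it is unreachable."""
    with urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
        return json.load(response)
```

Calling `has_model(fetch_tags(), "llava")` distinguishes the two failure cases: a connection error means Ollama is not running, and a `False` result means the model still needs to be pulled.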

Cloud API Responses Are Slow

Enable caching so that the same image is not analyzed twice:

yaml

ai_support:
  cache_config:
    disk_cache:
      enabled: true
      path: "ai_cache/visual_cache.json"
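
To make the idea concrete, here is a minimal sketch of a JSON-backed disk cache keyed by model name and image content hash. Potato's actual cache format and key scheme are not described in this guide, so `DiskCache` and its key layout are assumptions for illustration only.

```python
import hashlib
import json
from pathlib import Path


class DiskCache:
    """Stores model results in a JSON file, keyed by model and image hash."""

    def __init__(self, path):
        self.path = Path(path)
        # Reload earlier results so the cache survives restarts.
        if self.path.exists():
            self.entries = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.entries = {}

    @staticmethod
    def key_for(image_bytes, model):
        # Hash the content so a renamed copy of an image still hits.
        return f"{model}:{hashlib.sha256(image_bytes).hexdigest()}"

    def get_or_compute(self, image_bytes, model, compute):
        """Return the cached result, calling compute(image_bytes) on a miss."""
        key = self.key_for(image_bytes, model)
        if key not in self.entries:
            self.entries[key] = compute(image_bytes)
            self.path.parent.mkdir(parents=True, exist_ok=True)
            self.path.write_text(json.dumps(self.entries), encoding="utf-8")
        return self.entries[key]
```

Including the model name in the key means that switching from one model to another does not return stale results from the previous one.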

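Next Steps

Read the full visual AI support documentation for API reference details
Configure instance display to show images alongside other content types
Explore text-based AI support for hints and keyword highlighting

See /docs/features/visual-ai-support for the complete documentation.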