Potato 2.1 引入了视觉 AI 支持，将 AI 驱动的辅助功能直接带入图像和视频标注工作流。您可以让 YOLO 自动检测目标然后审核其建议，或者让视觉语言模型对图像进行分类并解释其推理，而不必从头标注每个边界框。

本指南将逐步介绍每个视觉 AI 端点的设置、不同辅助模式的配置，以及如何将视觉 AI 与 Potato 的文本 AI 功能相结合。

您将学到

设置 YOLO 进行快速本地目标检测
运行 Ollama Vision 模型进行本地图像理解
使用 OpenAI 和 Anthropic 云视觉 API
配置检测、预标注、分类和提示模式
在单个项目中结合视觉和文本 AI 端点
审核 AI 建议的接受/拒绝工作流

前提条件

您需要 Potato 2.1.0 或更高版本：

bash

pip install --upgrade potato-annotation

根据您选择的端点，还需要以下之一：

YOLO：pip install ultralytics opencv-python
Ollama：从 ollama.ai 安装并拉取视觉模型
OpenAI：具有 GPT-4o 访问权限的 API 密钥
Anthropic：具有 Claude 视觉模型访问权限的 API 密钥

选项 1：使用 YOLO 进行目标检测

当您需要完全在本地机器上运行快速、精确的边界框检测时，YOLO 是最佳选择。它擅长检测常见目标（人、车、动物、家具），可以在毫秒内处理图像。

设置

bash

pip install ultralytics opencv-python

配置

yaml

annotation_task_name: "Object Detection with YOLO"
 
data_files:
  - data/images.json
 
item_properties:
  id_key: id
  text_key: image_url
 
instance_display:
  fields:
    - key: image_url
      type: image
      display_options:
        max_width: 800
        zoomable: true
 
annotation_schemes:
  - annotation_type: image_annotation
    name: objects
    description: "Detect and label objects"
    source_field: "image_url"
    tools:
      - bbox
    labels:
      - name: "person"
        color: "#FF6B6B"
      - name: "car"
        color: "#4ECDC4"
      - name: "dog"
        color: "#45B7D1"
      - name: "cat"
        color: "#96CEB4"
 
    ai_support:
      enabled: true
      features:
        detection: true
        pre_annotate: true
        hint: true
 
ai_support:
  enabled: true
  endpoint_type: "yolo"
  ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
    iou_threshold: 0.45
 
output_annotation_dir: "annotation_output/"
user_config:
  allow_all_users: true

数据格式

以 JSONL 格式创建 data/images.json（每行一个 JSON 对象）：

json

{"id": "img_001", "image_url": "images/street_scene_1.jpg"}
{"id": "img_002", "image_url": "images/park_photo.jpg"}
{"id": "img_003", "image_url": "https://example.com/images/office.jpg"}

选择 YOLO 模型

模型	大小	速度	精度	最适合
`yolov8n.pt`	6 MB	最快	较低	快速原型
`yolov8s.pt`	22 MB	快	良好	均衡工作负载
`yolov8m.pt`	50 MB	中等	较好	通用场景
`yolov8l.pt`	84 MB	较慢	高	精度优先
`yolov8x.pt`	131 MB	最慢	最高	最大精度

对于检测不在 YOLO 内置类别中的目标，使用 YOLO-World 进行开放词汇检测：

yaml

ai_config:
  model: "yolo-world"
  confidence_threshold: 0.3

调整检测

如果 YOLO 遗漏了目标，降低置信度阈值：

yaml

ai_config:
  confidence_threshold: 0.3  # More detections, more false positives

如果误检太多，提高阈值：

yaml

ai_config:
  confidence_threshold: 0.7  # Fewer detections, higher precision

选项 2：使用 Ollama Vision 的本地视觉语言模型

Ollama Vision 让您能够在本地运行视觉语言模型。与 YOLO 不同，这些模型可以理解图像上下文、分类场景并生成文本解释 — 所有这些都无需将数据发送到云 API。

设置

bash

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
 
# Pull a vision model
ollama pull llava
 
# Or for better performance:
ollama pull qwen2.5-vl:7b

配置

yaml

annotation_task_name: "Image Classification with Ollama Vision"
 
data_files:
  - data/images.json
 
item_properties:
  id_key: id
  text_key: image_url
 
instance_display:
  fields:
    - key: image_url
      type: image
      display_options:
        max_width: 600
        zoomable: true
 
annotation_schemes:
  - annotation_type: radio
    name: scene_type
    description: "What type of scene is shown?"
    labels:
      - indoor
      - outdoor_urban
      - outdoor_nature
      - aerial
      - underwater
 
    ai_support:
      enabled: true
      features:
        hint: true
        classification: true
 
ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"
    base_url: "http://localhost:11434"
    max_tokens: 500
    temperature: 0.1
 
output_annotation_dir: "annotation_output/"
user_config:
  allow_all_users: true

支持的模型

模型	参数量	优势
`llava:7b`	7B	快速，通用理解良好
`llava:13b`	13B	更好的准确性
`llava-llama3`	8B	强推理能力
`bakllava`	7B	视觉细节好
`llama3.2-vision:11b`	11B	最新 Llama 视觉模型
`qwen2.5-vl:7b`	7B	强多语言 + 视觉
`moondream`	1.8B	非常快速，轻量级

选项 3：OpenAI Vision

OpenAI Vision 通过 GPT-4o 提供高质量的图像理解。当您需要最强大的视觉模型且不介意云 API 成本时最为合适。

配置

yaml

ai_support:
  enabled: true
  endpoint_type: "openai_vision"
  ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"
    max_tokens: 1000
    detail: "auto"  # "low" for faster/cheaper, "high" for detail

设置您的 API 密钥：

bash

export OPENAI_API_KEY="sk-..."

detail 参数控制发送到 API 的图像分辨率：

low — 更快、更便宜，适合分类
high — 全分辨率，更适合查找小目标
auto — 让 API 自行决定

选项 4：Anthropic Vision

Claude 的视觉能力在理解图像上下文和提供详细解释方面表现出色。

配置

yaml

ai_support:
  enabled: true
  endpoint_type: "anthropic_vision"
  ai_config:
    api_key: "${ANTHROPIC_API_KEY}"
    model: "claude-sonnet-4-20250514"
    max_tokens: 1024

bash

export ANTHROPIC_API_KEY="sk-ant-..."

AI 辅助模式

每个视觉 AI 端点支持不同的辅助模式。为每个标注方案仅启用所需的模式。

检测模式

查找与您配置的标签匹配的目标，并以虚线边界框叠加层显示：

yaml

ai_support:
  enabled: true
  features:
    detection: true

标注者点击"检测"后，AI 建议以虚线叠加层形式出现在图像上。双击接受，右击拒绝。

预标注（自动）模式

自动检测所有目标并一次性创建建议。最适合引导大型数据集：

yaml

ai_support:
  enabled: true
  features:
    pre_annotate: true

分类模式

对选定区域或整个图像进行分类，返回带有置信度分数的建议标签：

yaml

ai_support:
  enabled: true
  features:
    classification: true

提示模式

提供指导文本而不直接给出答案。适合培训新标注者：

yaml

ai_support:
  enabled: true
  features:
    hint: true

接受/拒绝工作流

当标注者点击 AI 辅助按钮时，建议以虚线叠加层形式出现：

接受建议 — 双击虚线叠加层将其转换为真实标注
拒绝建议 — 右击叠加层将其消除
全部接受 — 点击工具栏中的"全部接受"一次性接受所有建议
全部清除 — 点击"清除"消除所有建议

这使标注者保持控制权，同时减少了从头绘制边界框的手动工作。

视频标注中的视觉 AI

视觉 AI 也适用于视频标注任务。您可以启用场景检测、关键帧检测和目标跟踪：

yaml

annotation_schemes:
  - annotation_type: video_annotation
    name: scenes
    description: "Segment this video into scenes"
    mode: segment
    labels:
      - name: "intro"
        color: "#4ECDC4"
      - name: "main_content"
        color: "#FF6B6B"
      - name: "outro"
        color: "#45B7D1"
 
    ai_support:
      enabled: true
      features:
        scene_detection: true
        pre_annotate: true
        hint: true
 
ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"
    max_frames: 10  # Number of frames to sample

max_frames 参数控制 AI 从视频中采样多少帧进行分析。帧数越多意味着准确性越高，但处理速度越慢。

结合视觉和文本 AI 端点

如果您的项目同时包含文本和图像标注，可以为每种类型配置不同的端点。使用文本优化的模型处理提示和关键词，使用视觉模型处理检测：

yaml

ai_support:
  enabled: true
 
  # Text AI for radio buttons, text schemes, etc.
  endpoint_type: "ollama"
  ai_config:
    model: "llama3.2"
    include:
      all: true
 
  # Visual AI for image/video schemes
  visual_endpoint_type: "yolo"
  visual_ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5

或者使用云视觉模型搭配本地文本模型：

yaml

ai_support:
  enabled: true
  endpoint_type: "ollama"
  visual_endpoint_type: "openai_vision"
  ai_config:
    model: "llama3.2"
  visual_ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"

完整示例：产品照片标注

以下是一个生产就绪的配置，用于使用 YOLO 检测和文本 AI 提示标注产品照片：

yaml

annotation_task_name: "Product Photo Annotation"
 
data_files:
  - data/product_photos.json
 
item_properties:
  id_key: sku
  text_key: photo_url
 
instance_display:
  layout:
    direction: horizontal
    gap: 24px
  fields:
    - key: photo_url
      type: image
      label: "Product Photo"
      display_options:
        max_width: 600
        zoomable: true
    - key: product_description
      type: text
      label: "Product Details"
 
annotation_schemes:
  - annotation_type: image_annotation
    name: product_regions
    description: "Draw boxes around products and defects"
    source_field: "photo_url"
    tools:
      - bbox
    labels:
      - name: "product"
        color: "#4ECDC4"
      - name: "defect"
        color: "#FF6B6B"
      - name: "label"
        color: "#45B7D1"
      - name: "packaging"
        color: "#96CEB4"
 
    ai_support:
      enabled: true
      features:
        detection: true
        pre_annotate: true
 
  - annotation_type: radio
    name: photo_quality
    description: "Is this photo suitable for the product listing?"
    labels:
      - Approved
      - Needs editing
      - Reshoot required
 
  - annotation_type: multiselect
    name: quality_issues
    description: "Select any issues present"
    labels:
      - Blurry
      - Poor lighting
      - Wrong angle
      - Background clutter
      - Color inaccurate
 
ai_support:
  enabled: true
  endpoint_type: "ollama"
  visual_endpoint_type: "yolo"
 
  ai_config:
    model: "llama3.2"
    include:
      all: true
 
  visual_ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
 
output_annotation_dir: "annotation_output/"
export_annotation_format: "json"
user_config:
  allow_all_users: true

示例数据（data/product_photos.json）：

json

{"sku": "SKU-001", "photo_url": "images/products/laptop_front.jpg", "product_description": "15-inch laptop, silver finish"}
{"sku": "SKU-002", "photo_url": "images/products/headphones_side.jpg", "product_description": "Over-ear wireless headphones, black"}
{"sku": "SKU-003", "photo_url": "images/products/backpack_full.jpg", "product_description": "40L hiking backpack, navy blue"}

视觉 AI 标注技巧

对大型数据集使用预标注 — 使用自动按钮为所有目标生成建议，然后让标注者审核和纠正，而不是从头绘制
将端点与任务匹配 — YOLO 用于精确检测，视觉语言模型用于分类和理解
调整置信度阈值 — 从 0.5 开始，根据您观察到的误检/漏检权衡进行调整
使用提示进行标注者培训 — 提示模式引导标注者而不会使其偏向特定答案
组合端点 — 用 YOLO 视觉端点检测加上 Ollama 文本端点提示，可以两全其美
缓存 AI 结果 — 启用磁盘缓存以避免对相同图像重复运行检测

故障排除

"未配置视觉 AI 端点"

确保 ai_support.enabled 为 true，并且您设置了支持视觉的 endpoint_type：yolo、ollama_vision、openai_vision 或 anthropic_vision。

YOLO 检测不到您的目标

YOLO 的内置类别覆盖 80 种常见目标。如果您的标签与 YOLO 的类名不匹配，请尝试使用 YOLO-World 进行开放词汇检测，或降低 confidence_threshold。

Ollama 返回错误

验证 Ollama 正在运行且您已拉取了视觉模型：

bash

curl http://localhost:11434/api/tags  # Check Ollama is running
ollama list                           # Check installed models

云 API 响应缓慢

启用缓存，使同一图像不会被分析两次：

yaml

ai_support:
  cache_config:
    disk_cache:
      enabled: true
      path: "ai_cache/visual_cache.json"

下一步

阅读完整的视觉 AI 支持文档了解 API 参考详情
设置实例显示以将图像与其他内容类型一起显示
探索文本 AI 支持了解提示和关键词高亮

完整文档请见 /docs/features/visual-ai-support。