Visual AI Support

AI-powered assistance for image and video annotation using vision models.

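New in v2.1.0

Potato provides AI-powered assistance for image and video annotation tasks using a range of vision models, including YOLO for object detection and vision-language models (VLLMs) such as GPT-4o, Claude, and Ollama vision models.

Overview

Visual AI support enables:

Object detection: Automatically detect and locate objects in images using YOLO or VLLMs
Pre-annotation: Automatically detect all objects for human review
Classification: Classify images or regions within images
Hints: Provide guidance without revealing exact locations
Scene detection: Identify temporal segments in videos
Keyframe detection: Find significant moments in videos
Object tracking: Track objects across video frames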

Supported Endpoints

YOLO Endpoint

Best for fast, accurate object detection using local inference.

yaml

ai_support:
  enabled: true
  endpoint_type: "yolo"
  ai_config:
    model: "yolov8m.pt"  # or yolov8n, yolov8l, yolov8x, yolo-world
    confidence_threshold: 0.5
    iou_threshold: 0.45
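
To make the two thresholds above concrete, here is a minimal, standard-library sketch of what they typically control: low-confidence boxes are dropped first, then overlapping duplicates are removed by greedy non-maximum suppression. This is not Potato's or Ultralytics' implementation; the function names and the `(x, y, width, height)` box layout are assumptions for illustration, and the suppression here ignores class labels.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, width, height) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def filter_detections(detections, confidence_threshold=0.5, iou_threshold=0.45):
    """Drop low-confidence boxes, then suppress overlapping duplicates.

    `detections` is a list of (box, confidence) pairs (hypothetical layout).
    """
    # Keep only boxes at or above the confidence threshold, best first.
    candidates = sorted(
        (d for d in detections if d[1] >= confidence_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, conf in candidates:
        # A box survives only if it does not overlap a better box too much.
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, conf))
    return kept
```

Under this reading, lowering `confidence_threshold` lets more tentative boxes through, while lowering `iou_threshold` makes the suppression of overlapping boxes more aggressive.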

Supported models:

YOLOv8 (n/s/m/l/x variants)
YOLO-World (open-vocabulary detection)
Custom-trained models

Ollama Vision Endpoint

For local vision-language model inference.

yaml

ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"  # or llava-llama3, bakllava, llama3.2-vision, qwen2.5-vl
    base_url: "http://localhost:11434"
    max_tokens: 500
    temperature: 0.1

Supported models:

LLaVA (7B, 13B, 34B)
LLaVA-LLaMA3
BakLLaVA
Llama 3.2 Vision (11B, 90B)
Qwen2.5-VL
Moondream

OpenAI Vision Endpoint

For cloud-based vision analysis using GPT-4o.

yaml

ai_support:
  enabled: true
  endpoint_type: "openai_vision"
  ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"  # or gpt-4o-mini
    max_tokens: 1000
    detail: "auto"  # low, high, or auto

Anthropic Vision Endpoint

For Claude models with vision capabilities.

yaml

ai_support:
  enabled: true
  endpoint_type: "anthropic_vision"
  ai_config:
    api_key: "${ANTHROPIC_API_KEY}"
    model: "claude-sonnet-4-20250514"
    max_tokens: 1024

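Endpoint Capabilities

Each endpoint has different strengths:

Endpoint	Text generation	Vision	Bounding box output	Keywords	Rationale
`ollama_vision`	Yes	Yes	No	No	Yes
`openai_vision`	Yes	Yes	No	No	Yes
`anthropic_vision`	Yes	Yes	No	No	Yes
`yolo`	No	Yes	Yes	No	No

Best practices:

For precise object detection, use the yolo endpoint
For image classification with explanations, use a VLLM endpoint such as ollama_vision with Qwen-VL or LLaVA
For combined workflows, configure both a text endpoint and a visual endpoint

Image Annotation with AI

Configure AI-assisted image annotation with detection, pre-annotation, classification, and hints: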

yaml

annotation_schemes:
  - annotation_type: image_annotation
    name: object_detection
    description: "Detect and label objects in the image"
    tools:
      - bbox
      - polygon
    labels:
      - name: "person"
        color: "#FF6B6B"
      - name: "car"
        color: "#4ECDC4"
      - name: "dog"
        color: "#45B7D1"
 
    ai_support:
      enabled: true
      features:
        detection: true      # "Detect" button - find objects
        pre_annotate: true   # "Auto" button - detect all
        classification: false # "Classify" button - classify region
        hint: true           # "Hint" button - get guidance
 
ai_support:
  enabled: true
  endpoint_type: "yolo"
  ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5

Video Annotation with AI

yaml

annotation_schemes:
  - annotation_type: video_annotation
    name: scene_segmentation
    description: "Segment video into scenes"
    mode: segment
    labels:
      - name: "intro"
        color: "#4ECDC4"
      - name: "action"
        color: "#FF6B6B"
      - name: "outro"
        color: "#45B7D1"
 
    ai_support:
      enabled: true
      features:
        scene_detection: true     # Detect scene boundaries
        keyframe_detection: false
        tracking: false
        pre_annotate: true        # Auto-segment entire video
        hint: true
 
ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"
    max_frames: 10  # Frames to sample for video analysis
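
The source does not say how frames are chosen when `max_frames` caps the sample. One plausible strategy, shown here purely as an assumption, is to take evenly spaced frames from the middle of equal-length slices of the video. The function name is hypothetical.

```python
def sample_frame_indices(total_frames, max_frames=10):
    """Pick up to `max_frames` evenly spaced frame indices (assumed strategy)."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: every frame fits within the budget.
        return list(range(total_frames))
    step = total_frames / max_frames
    # Take the midpoint of each of the `max_frames` equal slices.
    return [int(step * i + step / 2) for i in range(max_frames)]
```

A larger `max_frames` gives the model more temporal coverage at the cost of more image inputs per request.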

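Separate Visual and Text Endpoints

You can configure a separate endpoint for visual tasks, so that each content type uses the model best suited to it: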

yaml

ai_support:
  enabled: true
  endpoint_type: "openai"  # For text annotations
  ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o-mini"
 
  # Separate visual endpoint
  visual_endpoint_type: "yolo"
  visual_ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5

Or use a vision-language model alongside a text model:

yaml

ai_support:
  enabled: true
  endpoint_type: "ollama"  # Main endpoint for text
  visual_endpoint_type: "ollama_vision"  # Visual endpoint for images
  ai_config:
    model: "llama3.2"
    include:
      all: true
  visual_ai_config:
    model: "qwen2.5-vl:7b"

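AI Features

Detection

Finds objects matching the configured labels and draws suggested bounding boxes. Suggestions appear as dashed overlays that can be accepted or rejected.

Pre-annotation (Auto)

Automatically detects all objects in an image or video and creates suggestions for human review. Useful for speeding up annotation of large datasets.

Classification

Classifies a selected region or the entire image. Returns a suggested label with a confidence score and reasoning.

Hints

Provides guidance without revealing the exact answer. Useful for training annotators or when you want AI assistance while keeping the judgment human.

Scene Detection (Video)

Analyzes video frames to identify scene boundaries and suggests labeled temporal segments.

Keyframe Detection (Video)

Identifies significant moments in a video that make good annotation points.

Object Tracking (Video)

Suggests object positions across frames for consistent tracking annotations.

Using AI Suggestions

Click an AI assistance button (Detect, Auto, Hint, etc.)
Wait for suggestions to appear as dashed overlays
Accept a suggestion: double-click the suggestion overlay
Reject a suggestion: right-click the suggestion overlay
Accept all: click "Accept All" in the toolbar
Clear all: click "Clear" to remove all suggestions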

Detection API Response Format

json

{
  "detections": [
    {
      "label": "person",
      "bbox": {"x": 0.1, "y": 0.2, "width": 0.3, "height": 0.5},
      "confidence": 0.95
    }
  ]
}
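
The `bbox` values in the sample above look like fractions of the image size. Assuming that is the case, and that `x`/`y` mark the top-left corner (neither is stated in the source), a consumer could convert detections to pixel coordinates roughly like this. The function name and output keys are hypothetical.

```python
import json


def to_pixel_boxes(response_text, image_width, image_height):
    """Convert a detection response to pixel boxes.

    Assumes bbox values are normalized to [0, 1] with (x, y) at the top-left.
    """
    payload = json.loads(response_text)
    boxes = []
    for det in payload.get("detections", []):
        bbox = det["bbox"]
        boxes.append({
            "label": det["label"],
            "confidence": det["confidence"],
            "left": round(bbox["x"] * image_width),
            "top": round(bbox["y"] * image_height),
            "width": round(bbox["width"] * image_width),
            "height": round(bbox["height"] * image_height),
        })
    return boxes
```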

Hint format:

json

{
  "hint": "Look for objects in the lower right corner",
  "suggestive_choice": "Focus on overlapping regions"
}

Video segment format:

json

{
  "segments": [
    {
      "start_time": 0.0,
      "end_time": 5.5,
      "suggested_label": "intro",
      "confidence": 0.85
    }
  ]
}
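
As a rough illustration of consuming the segment format above, the sketch below drops malformed or low-confidence segments and orders the rest by start time. The `min_confidence` cutoff and the function name are assumptions, not part of Potato.

```python
def select_segments(payload, min_confidence=0.5):
    """Return valid segments at or above `min_confidence`, sorted by start time."""
    segments = []
    for seg in payload.get("segments", []):
        # Skip segments whose end does not come after their start.
        if seg["end_time"] <= seg["start_time"]:
            continue
        if seg["confidence"] < min_confidence:
            continue
        segments.append(seg)
    return sorted(segments, key=lambda s: s["start_time"])
```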

Requirements

YOLO Endpoint

bash

pip install ultralytics opencv-python

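Ollama Vision

Install Ollama from ollama.ai
Pull a vision model: ollama pull llava
Start the Ollama server (runs on http://localhost:11434 by default)

OpenAI/Anthropic Vision

Set your API key in the environment or in the configuration
Make sure you have access to a vision-capable model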
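
The configuration examples reference keys as `${OPENAI_API_KEY}` and `${ANTHROPIC_API_KEY}`. How Potato resolves these placeholders is not described here; the sketch below only shows one way to check, before launching, that such a placeholder would resolve from the environment. It uses Python's `os.path.expandvars`, and the function name is hypothetical.

```python
import os


def resolve_api_key(value):
    """Expand a "${VAR}" placeholder and fail loudly if it is unset."""
    expanded = os.path.expandvars(value)
    # expandvars leaves unknown variables untouched, so "${" means "not set".
    if not expanded or "${" in expanded:
        raise ValueError(f"environment variable not set for {value!r}")
    return expanded
```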

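Troubleshooting

"No visual AI endpoint configured"

Make sure you have:

Set ai_support.enabled: true
Set a valid vision-capable endpoint_type (yolo, ollama_vision, openai_vision, anthropic_vision)
Installed the dependencies required by the chosen endpoint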
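
The first two items of this checklist can be expressed as a small check over an already-parsed configuration dictionary. This is an illustrative sketch, not Potato's own validation logic; the function name is hypothetical, and treating `visual_endpoint_type` as taking precedence over `endpoint_type` is an assumption.

```python
VISION_ENDPOINTS = {"yolo", "ollama_vision", "openai_vision", "anthropic_vision"}


def check_visual_config(config):
    """Return a list of human-readable problems found in `config`."""
    problems = []
    ai = config.get("ai_support") or {}
    if not ai.get("enabled"):
        problems.append("ai_support.enabled is not true")
    # Assumption: a dedicated visual endpoint overrides the main endpoint.
    endpoint = ai.get("visual_endpoint_type") or ai.get("endpoint_type")
    if endpoint not in VISION_ENDPOINTS:
        problems.append(f"endpoint type {endpoint!r} is not vision-capable")
    return problems
```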

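YOLO does not detect the expected objects

Try lowering confidence_threshold
Make sure your labels match YOLO's class names (or use YOLO-World for custom vocabularies)
Check that the model file exists and is valid

Ollama Vision errors

Verify that Ollama is running: curl http://localhost:11434/api/tags
Make sure the vision model has been pulled: ollama list
Check that the model supports vision (llava, bakllava, llama3.2-vision, etc.)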
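
The first two checks can also be scripted. The sketch below assumes the `/api/tags` response is a JSON object with a `models` array whose entries carry a `name` field such as `llava:latest`; the helper names are hypothetical, and `fetch_tags` needs a running Ollama server.

```python
import json
import urllib.request


def fetch_tags(base_url="http://localhost:11434", timeout=5):
    """Fetch the raw /api/tags response body from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def installed_models(tags_json):
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in json.loads(tags_json).get("models", [])]


def has_model(tags_json, wanted):
    """True if `wanted` matches an installed model, with or without its tag."""
    return any(
        name == wanted or name.split(":")[0] == wanted
        for name in installed_models(tags_json)
    )
```

For example, `has_model(fetch_tags(), "llava")` reports whether any `llava` tag has been pulled; a connection error from `fetch_tags` suggests the server is not running.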
