
Visual AI Support

AI-powered assistance for image and video annotation using vision models.


New in v2.1.0

Potato provides AI-powered assistance for image and video annotation tasks using a range of vision models, including YOLO for object detection and vision-language models (VLMs) such as GPT-4o, Claude, and Ollama vision models.

Overview

Visual AI support enables:

  • Object Detection: Automatically detect and locate objects in images using YOLO or VLMs
  • Pre-annotation: Auto-detect all objects for human review
  • Classification: Classify images or regions within images
  • Hints: Get guidance without revealing exact locations
  • Scene Detection: Identify temporal segments in videos
  • Keyframe Detection: Find significant moments in videos
  • Object Tracking: Track objects across video frames

Supported Endpoints

YOLO Endpoint

Best for fast, accurate object detection using local inference.

yaml
ai_support:
  enabled: true
  endpoint_type: "yolo"
  ai_config:
    model: "yolov8m.pt"  # or yolov8n, yolov8l, yolov8x, yolo-world
    confidence_threshold: 0.5
    iou_threshold: 0.45

Supported models:

  • YOLOv8 (n/s/m/l/x variants)
  • YOLO-World (open-vocabulary detection)
  • Custom trained models
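
As a point of reference, detection with these settings can be reproduced directly with the ultralytics package. A minimal sketch (the image path is a placeholder; this mirrors the endpoint's behavior rather than reusing Potato's internal code):

python
from ultralytics import YOLO

# Load the same weights the endpoint is configured with.
model = YOLO("yolov8m.pt")

# Apply the confidence and IoU thresholds from ai_config.
results = model.predict("street.jpg", conf=0.5, iou=0.45)

# Each box carries a class id, a confidence score, and
# normalized centre-based xywh coordinates.
for box in results[0].boxes:
    label = model.names[int(box.cls)]
    x, y, w, h = box.xywhn[0].tolist()
    print(f"{label}: conf={float(box.conf):.2f} "
          f"bbox=({x:.2f}, {y:.2f}, {w:.2f}, {h:.2f})")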

Ollama Vision Endpoint

For local vision-language model inference.

yaml
ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"  # or llava-llama3, bakllava, llama3.2-vision, qwen2.5-vl
    base_url: "http://localhost:11434"
    max_tokens: 500
    temperature: 0.1

Supported models:

  • LLaVA (7B, 13B, 34B)
  • LLaVA-LLaMA3
  • BakLLaVA
  • Llama 3.2 Vision (11B, 90B)
  • Qwen2.5-VL
  • Moondream
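
These models are served through Ollama's HTTP API. A minimal sketch of an equivalent direct request with the settings above (the prompt and image path are placeholders; the exact prompt Potato sends is not documented here):

python
import base64
import requests

# Ollama's /api/generate expects images as base64 strings.
with open("street.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava:latest",
        "prompt": "List the objects visible in this image.",
        "images": [image_b64],
        "stream": False,
        # num_predict and temperature correspond to max_tokens
        # and temperature in ai_config.
        "options": {"num_predict": 500, "temperature": 0.1},
    },
)
print(response.json()["response"])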

OpenAI Vision Endpoint

For cloud-based vision analysis using GPT-4o.

yaml
ai_support:
  enabled: true
  endpoint_type: "openai_vision"
  ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"  # or gpt-4o-mini
    max_tokens: 1000
    detail: "auto"  # low, high, or auto
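
A direct request with the openai Python package looks roughly like this. A minimal sketch with a placeholder prompt and image (not Potato's internal code):

python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("street.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=1000,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "List the objects visible in this image."},
            # detail maps to the detail setting in ai_config.
            {"type": "image_url", "image_url": {
                "url": f"data:image/jpeg;base64,{image_b64}",
                "detail": "auto",
            }},
        ],
    }],
)
print(response.choices[0].message.content)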

Anthropic Vision Endpoint

For Claude with vision capabilities.

yaml
ai_support:
  enabled: true
  endpoint_type: "anthropic_vision"
  ai_config:
    api_key: "${ANTHROPIC_API_KEY}"
    model: "claude-sonnet-4-20250514"
    max_tokens: 1024
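
The anthropic Python package follows the same pattern; again a minimal sketch with a placeholder prompt and image:

python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("street.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            # Claude accepts images as base64 content blocks.
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": image_b64,
            }},
            {"type": "text", "text": "List the objects visible in this image."},
        ],
    }],
)
print(message.content[0].text)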

Endpoint Capabilities

Each endpoint has different strengths:

Endpoint          Text Gen  Vision  Bbox Output  Keyword  Rationale
ollama_vision     Yes       Yes     No           No       Yes
openai_vision     Yes       Yes     No           No       Yes
anthropic_vision  Yes       Yes     No           No       Yes
yolo              No        Yes     Yes          No       No

Best practices:

  • For precise object detection, use the yolo endpoint
  • For image classification with explanations, use a VLM like ollama_vision with Qwen-VL or LLaVA
  • For combined workflows, configure both a text endpoint and a visual endpoint (see Separate Visual and Text Endpoints below)

Image Annotation with AI

Configure AI-assisted image annotation with detection, pre-annotation, classification, and hint features:

yaml
annotation_schemes:
  - annotation_type: image_annotation
    name: object_detection
    description: "Detect and label objects in the image"
    tools:
      - bbox
      - polygon
    labels:
      - name: "person"
        color: "#FF6B6B"
      - name: "car"
        color: "#4ECDC4"
      - name: "dog"
        color: "#45B7D1"
 
    ai_support:
      enabled: true
      features:
        detection: true      # "Detect" button - find objects
        pre_annotate: true   # "Auto" button - detect all
        classification: false # "Classify" button - classify region
        hint: true           # "Hint" button - get guidance
 
ai_support:
  enabled: true
  endpoint_type: "yolo"
  ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5

Video Annotation with AI

yaml
annotation_schemes:
  - annotation_type: video_annotation
    name: scene_segmentation
    description: "Segment video into scenes"
    mode: segment
    labels:
      - name: "intro"
        color: "#4ECDC4"
      - name: "action"
        color: "#FF6B6B"
      - name: "outro"
        color: "#45B7D1"
 
    ai_support:
      enabled: true
      features:
        scene_detection: true     # Detect scene boundaries
        keyframe_detection: false
        tracking: false
        pre_annotate: true        # Auto-segment entire video
        hint: true
 
ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"
    max_frames: 10  # Frames to sample for video analysis

Separate Visual and Text Endpoints

You can configure a separate endpoint for visual tasks, using the best model for each content type:

yaml
ai_support:
  enabled: true
  endpoint_type: "openai"  # For text annotations
  ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o-mini"
 
  # Separate visual endpoint
  visual_endpoint_type: "yolo"
  visual_ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5

Or using a vision-language model alongside a text model:

yaml
ai_support:
  enabled: true
  endpoint_type: "ollama"  # Main endpoint for text
  visual_endpoint_type: "ollama_vision"  # Visual endpoint for images
  ai_config:
    model: "llama3.2"
    include:
      all: true
  visual_ai_config:
    model: "qwen2.5-vl:7b"

AI Features

Detection

Finds objects matching the configured labels and draws suggestion bounding boxes. Suggestions appear as dashed overlays that can be accepted or rejected.

Pre-annotation (Auto)

Automatically detects all objects in the image/video and creates suggestions for human review. Useful for speeding up annotation of large datasets.

Classification

Classifies a selected region or the entire image. Returns a suggested label with confidence score and reasoning.

Hints

Provides guidance without revealing exact answers. Good for training annotators or when you want human judgment with AI assistance.

Scene Detection (Video)

Analyzes video frames to identify scene boundaries and suggests temporal segments with labels.

Keyframe Detection (Video)

Identifies significant moments in a video that would make good annotation points.

Object Tracking (Video)

Suggests object positions across frames for consistent tracking annotation.

Using AI Suggestions

  1. Click the AI assistance button (Detect, Auto, Hint, etc.)
  2. Wait for suggestions to appear as dashed overlays
  3. Accept a suggestion: Double-click the suggestion overlay
  4. Reject a suggestion: Right-click the suggestion overlay
  5. Accept all: Click "Accept All" in the toolbar
  6. Clear all: Click "Clear" to remove all suggestions

Detection API Response Format

json
{
  "detections": [
    {
      "label": "person",
      "bbox": {"x": 0.1, "y": 0.2, "width": 0.3, "height": 0.5},
      "confidence": 0.95
    }
  ]
}
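
The bbox values in this example appear to be fractions of the image size. A minimal sketch of converting a detection to pixel coordinates, assuming normalized values with x/y as the top-left corner (that convention is an assumption; check the source documentation):

python
def bbox_to_pixels(bbox, image_width, image_height):
    """Convert a detection bbox from normalized to pixel coordinates.

    Assumes x/y is the top-left corner and all values are fractions
    of the image size.
    """
    return {
        "x": bbox["x"] * image_width,
        "y": bbox["y"] * image_height,
        "width": bbox["width"] * image_width,
        "height": bbox["height"] * image_height,
    }

# The detection above, mapped onto a 1920x1080 image:
print(bbox_to_pixels({"x": 0.1, "y": 0.2, "width": 0.3, "height": 0.5}, 1920, 1080))
# {'x': 192.0, 'y': 216.0, 'width': 576.0, 'height': 540.0}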

For hints:

json
{
  "hint": "Look for objects in the lower right corner",
  "suggestive_choice": "Focus on overlapping regions"
}

For video segments:

json
{
  "segments": [
    {
      "start_time": 0.0,
      "end_time": 5.5,
      "suggested_label": "intro",
      "confidence": 0.85
    }
  ]
}

Requirements

For YOLO endpoint

bash
pip install ultralytics opencv-python

For Ollama Vision

  1. Install Ollama from ollama.ai
  2. Pull a vision model: ollama pull llava
  3. Start Ollama server (runs on http://localhost:11434 by default)

For OpenAI/Anthropic Vision

  • Set API key in environment or config
  • Ensure you have access to vision-capable models

Troubleshooting

"No visual AI endpoint configured"

Ensure you have:

  1. Set ai_support.enabled: true
  2. Set a valid endpoint_type that supports vision (yolo, ollama_vision, openai_vision, anthropic_vision)
  3. Installed required dependencies for your chosen endpoint

YOLO not detecting expected objects

  • Try lowering confidence_threshold
  • Ensure your labels match YOLO's class names (or use YOLO-World for custom vocabularies)
  • Check that the model file exists and is valid

Ollama Vision errors

  • Verify Ollama is running: curl http://localhost:11434/api/tags
  • Ensure you've pulled a vision model: ollama list
  • Check model supports vision (llava, bakllava, llama3.2-vision, etc.)

Further Reading

For implementation details, see the source documentation.