# Visual AI Support

*New in v2.1.0*

Potato provides AI-powered assistance for image and video annotation tasks using various vision models, including YOLO for object detection and vision-language models (VLLMs) such as GPT-4o, Claude, and Ollama vision models.
## Overview
Visual AI support enables:
- **Object Detection**: Automatically detect and locate objects in images using YOLO or VLLMs
- **Pre-annotation**: Auto-detect all objects for human review
- **Classification**: Classify images or regions within images
- **Hints**: Get guidance without revealing exact locations
- **Scene Detection**: Identify temporal segments in videos
- **Keyframe Detection**: Find significant moments in videos
- **Object Tracking**: Track objects across video frames
## Supported Endpoints

### YOLO Endpoint
Best for fast, accurate object detection using local inference.
```yaml
ai_support:
  enabled: true
  endpoint_type: "yolo"
  ai_config:
    model: "yolov8m.pt"  # or yolov8n, yolov8l, yolov8x, yolo-world
    confidence_threshold: 0.5
    iou_threshold: 0.45
```

Supported models:

- YOLOv8 (n/s/m/l/x variants)
- YOLO-World (open-vocabulary detection)
- Custom trained models
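The `iou_threshold` governs non-maximum suppression: when two candidate boxes overlap more than this intersection-over-union value, the lower-confidence one is discarded. A minimal IoU computation over `(x, y, width, height)` boxes (helper name is illustrative, not part of Potato's API):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    # Width/height of the overlapping region (clamped at zero when disjoint)
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0
```

With the default `iou_threshold: 0.45`, two half-overlapping boxes of equal size (IoU ≈ 0.33) would both survive suppression.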
### Ollama Vision Endpoint
For local vision-language model inference.
```yaml
ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"  # or llava-llama3, bakllava, llama3.2-vision, qwen2.5-vl
    base_url: "http://localhost:11434"
    max_tokens: 500
    temperature: 0.1
```

Supported models:

- LLaVA (7B, 13B, 34B)
- LLaVA-LLaMA3
- BakLLaVA
- Llama 3.2 Vision (11B, 90B)
- Qwen2.5-VL
- Moondream
### OpenAI Vision Endpoint
For cloud-based vision analysis using GPT-4o.
```yaml
ai_support:
  enabled: true
  endpoint_type: "openai_vision"
  ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"  # or gpt-4o-mini
    max_tokens: 1000
    detail: "auto"  # low, high, or auto
```

### Anthropic Vision Endpoint
For Claude with vision capabilities.
```yaml
ai_support:
  enabled: true
  endpoint_type: "anthropic_vision"
  ai_config:
    api_key: "${ANTHROPIC_API_KEY}"
    model: "claude-sonnet-4-20250514"
    max_tokens: 1024
```

## Endpoint Capabilities
Each endpoint has different strengths:
| Endpoint | Text Gen | Vision | Bbox Output | Keyword | Rationale |
|---|---|---|---|---|---|
| `ollama_vision` | Yes | Yes | No | No | Yes |
| `openai_vision` | Yes | Yes | No | No | Yes |
| `anthropic_vision` | Yes | Yes | No | No | Yes |
| `yolo` | No | Yes | Yes | No | No |
Best practices:
- For precise object detection, use the `yolo` endpoint
- For image classification with explanations, use a VLLM like `ollama_vision` with Qwen-VL or LLaVA
- For combined workflows, configure both a text endpoint and a visual endpoint
## Image Annotation with AI
Configure AI-assisted image annotation with detection, pre-annotation, classification, and hint features:
```yaml
annotation_schemes:
  - annotation_type: image_annotation
    name: object_detection
    description: "Detect and label objects in the image"
    tools:
      - bbox
      - polygon
    labels:
      - name: "person"
        color: "#FF6B6B"
      - name: "car"
        color: "#4ECDC4"
      - name: "dog"
        color: "#45B7D1"
    ai_support:
      enabled: true
      features:
        detection: true        # "Detect" button - find objects
        pre_annotate: true     # "Auto" button - detect all
        classification: false  # "Classify" button - classify region
        hint: true             # "Hint" button - get guidance

ai_support:
  enabled: true
  endpoint_type: "yolo"
  ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
```

## Video Annotation with AI
```yaml
annotation_schemes:
  - annotation_type: video_annotation
    name: scene_segmentation
    description: "Segment video into scenes"
    mode: segment
    labels:
      - name: "intro"
        color: "#4ECDC4"
      - name: "action"
        color: "#FF6B6B"
      - name: "outro"
        color: "#45B7D1"
    ai_support:
      enabled: true
      features:
        scene_detection: true  # Detect scene boundaries
        keyframe_detection: false
        tracking: false
        pre_annotate: true     # Auto-segment entire video
        hint: true

ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"
    max_frames: 10  # Frames to sample for video analysis
```

## Separate Visual and Text Endpoints
You can configure a separate endpoint for visual tasks, using the best model for each content type:
```yaml
ai_support:
  enabled: true
  endpoint_type: "openai"  # For text annotations
  ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o-mini"
  # Separate visual endpoint
  visual_endpoint_type: "yolo"
  visual_ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
```

Or using a vision-language model alongside a text model:
```yaml
ai_support:
  enabled: true
  endpoint_type: "ollama"                # Main endpoint for text
  visual_endpoint_type: "ollama_vision"  # Visual endpoint for images
  ai_config:
    model: "llama3.2"
    include:
      all: true
  visual_ai_config:
    model: "qwen2.5-vl:7b"
```

## AI Features
### Detection
Finds objects matching the configured labels and draws suggestion bounding boxes. Suggestions appear as dashed overlays that can be accepted or rejected.
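Suggestion boxes come back in normalized coordinates (see the response format below), so they must be scaled by the image dimensions before drawing. A hypothetical helper, assuming the `{x, y, width, height}` bbox shape from the detection response:

```python
def bbox_to_pixels(bbox: dict, image_width: int, image_height: int) -> tuple:
    """Convert a normalized bbox dict (values in the 0-1 range) to
    integer pixel coordinates (left, top, width, height)."""
    return (
        round(bbox["x"] * image_width),
        round(bbox["y"] * image_height),
        round(bbox["width"] * image_width),
        round(bbox["height"] * image_height),
    )
```

For example, the bbox `{"x": 0.1, "y": 0.2, "width": 0.3, "height": 0.5}` on a 640x480 image maps to a 192x240-pixel box at (64, 96).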
### Pre-annotation (Auto)
Automatically detects all objects in the image/video and creates suggestions for human review. Useful for speeding up annotation of large datasets.
### Classification
Classifies a selected region or the entire image. Returns a suggested label with confidence score and reasoning.
### Hints
Provides guidance without revealing exact answers. Good for training annotators or when you want human judgment with AI assistance.
### Scene Detection (Video)
Analyzes video frames to identify scene boundaries and suggests temporal segments with labels.
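Because VLLMs can only look at a limited number of frames (the `max_frames` setting above), the video has to be subsampled. One plausible scheme is even spacing across the duration; this is an illustrative sketch, not necessarily Potato's exact sampling strategy:

```python
def sample_timestamps(duration: float, max_frames: int) -> list:
    """Evenly spaced timestamps (in seconds) spanning the whole video,
    including the first and last frame positions."""
    if max_frames <= 1:
        return [0.0]
    step = duration / (max_frames - 1)
    return [round(i * step, 3) for i in range(max_frames)]
```

A 10-second clip with `max_frames: 5` would be sampled at 0, 2.5, 5, 7.5, and 10 seconds; increasing `max_frames` improves boundary precision at the cost of slower inference.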
### Keyframe Detection (Video)
Identifies significant moments in a video that would make good annotation points.
### Object Tracking (Video)
Suggests object positions across frames for consistent tracking annotation.
## Using AI Suggestions
- Click the AI assistance button (Detect, Auto, Hint, etc.)
- Wait for suggestions to appear as dashed overlays
- **Accept a suggestion**: Double-click the suggestion overlay
- **Reject a suggestion**: Right-click the suggestion overlay
- **Accept all**: Click "Accept All" in the toolbar
- **Clear all**: Click "Clear" to remove all suggestions
## Detection API Response Format
```json
{
  "detections": [
    {
      "label": "person",
      "bbox": {"x": 0.1, "y": 0.2, "width": 0.3, "height": 0.5},
      "confidence": 0.95
    }
  ]
}
```

For hints:
```json
{
  "hint": "Look for objects in the lower right corner",
  "suggestive_choice": "Focus on overlapping regions"
}
```

For video segments:
```json
{
  "segments": [
    {
      "start_time": 0.0,
      "end_time": 5.5,
      "suggested_label": "intro",
      "confidence": 0.85
    }
  ]
}
```

## Requirements
### For YOLO endpoint

```bash
pip install ultralytics opencv-python
```

### For Ollama Vision
- Install Ollama from ollama.ai
- Pull a vision model: `ollama pull llava`
- Start the Ollama server (runs on `http://localhost:11434` by default)
### For OpenAI/Anthropic Vision
- Set API key in environment or config
- Ensure you have access to vision-capable models
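The `${OPENAI_API_KEY}` placeholders in the configs above are resolved from the environment at load time. Python's standard-library `string.Template` reproduces the same `${VAR}` expansion, which is handy for checking a config value outside Potato (the helper name is illustrative):

```python
import os
from string import Template


def expand_env(value: str) -> str:
    """Expand ${VAR} placeholders in a config value from the process
    environment, leaving unknown placeholders untouched."""
    return Template(value).safe_substitute(os.environ)
```

For instance, with `OPENAI_API_KEY=sk-...` exported, `expand_env("${OPENAI_API_KEY}")` returns the key itself; an unset variable is left as-is rather than raising, mirroring lenient config loading.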
## Troubleshooting

### "No visual AI endpoint configured"
Ensure you have:
- Set `ai_support.enabled: true`
- Set a valid `endpoint_type` that supports vision (`yolo`, `ollama_vision`, `openai_vision`, `anthropic_vision`)
- Installed required dependencies for your chosen endpoint
### YOLO not detecting expected objects
- Try lowering `confidence_threshold`
- Ensure your labels match YOLO's class names (or use YOLO-World for custom vocabularies)
- Check that the model file exists and is valid
### Ollama Vision errors
- Verify Ollama is running: `curl http://localhost:11434/api/tags`
- Ensure you've pulled a vision model: `ollama list`
- Check that the model supports vision (llava, bakllava, llama3.2-vision, etc.)
## Further Reading
- AI Support - Text-based AI assistance (hints, keywords, rationales)
- Image Annotation - Image annotation tools and configuration
- Instance Display - Configure content display
For implementation details, see the source documentation.