Using Visual AI to Speed Up Image and Video Annotation
Set up AI-powered object detection, pre-annotation, and classification for image and video tasks with YOLO, Ollama, OpenAI, and Claude.
Potato 2.1 introduces visual AI support that brings AI-powered assistance directly into image and video annotation workflows. Instead of annotating every bounding box from scratch, you can have YOLO detect objects automatically and then review its suggestions, or ask a vision-language model to classify images and explain its reasoning.
This guide walks through setting up each visual AI endpoint, configuring the different assistance modes, and combining visual AI with Potato's text-based AI features.
What You'll Learn
- Setting up YOLO for fast local object detection
- Running Ollama Vision models for local image understanding
- Using OpenAI and Anthropic cloud vision APIs
- Configuring detection, pre-annotation, classification, and hint modes
- Combining visual and text AI endpoints in a single project
- The accept/reject workflow for reviewing AI suggestions
Prerequisites
You'll need Potato 2.1.0 or later:
```bash
pip install --upgrade potato-annotation
```

And depending on which endpoint you choose, you'll need one of these:
- YOLO: `pip install ultralytics opencv-python`
- Ollama: Install from ollama.ai and pull a vision model
- OpenAI: An API key with access to GPT-4o
- Anthropic: An API key with access to Claude vision models
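If you're not sure which of the optional dependencies are already present in your environment, a quick probe like the following will tell you. This is a hypothetical helper, not part of Potato; the endpoint-to-module mapping is an assumption based on the install commands above:

```python
import importlib.util

def check_visual_ai_deps() -> dict:
    """Report which optional visual-AI dependencies are importable.

    Mapping is illustrative: endpoint name -> module to probe.
    """
    modules = {
        "yolo": "ultralytics",
        "yolo (opencv)": "cv2",
        "openai_vision": "openai",
        "anthropic_vision": "anthropic",
    }
    return {name: importlib.util.find_spec(mod) is not None
            for name, mod in modules.items()}

print(check_visual_ai_deps())
```

Any endpoint reported as `False` just needs its `pip install` from the list above.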
Option 1: YOLO for Object Detection
YOLO is the best choice when you need fast, precise bounding box detection running entirely on your local machine. It excels at detecting common objects (people, cars, animals, furniture) and can process images in milliseconds.
Setup
```bash
pip install ultralytics opencv-python
```

Configuration
```yaml
annotation_task_name: "Object Detection with YOLO"

data_files:
  - data/images.json

item_properties:
  id_key: id
  text_key: image_url

instance_display:
  fields:
    - key: image_url
      type: image
      display_options:
        max_width: 800
        zoomable: true

annotation_schemes:
  - annotation_type: image_annotation
    name: objects
    description: "Detect and label objects"
    source_field: "image_url"
    tools:
      - bbox
    labels:
      - name: "person"
        color: "#FF6B6B"
      - name: "car"
        color: "#4ECDC4"
      - name: "dog"
        color: "#45B7D1"
      - name: "cat"
        color: "#96CEB4"
    ai_support:
      enabled: true
      features:
        detection: true
        pre_annotate: true
        hint: true

ai_support:
  enabled: true
  endpoint_type: "yolo"
  ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
    iou_threshold: 0.45

output_annotation_dir: "annotation_output/"

user_config:
  allow_all_users: true
```

Data Format
Create data/images.json in JSONL format (one JSON object per line):

```json
{"id": "img_001", "image_url": "images/street_scene_1.jpg"}
{"id": "img_002", "image_url": "images/park_photo.jpg"}
{"id": "img_003", "image_url": "https://example.com/images/office.jpg"}
```

Choosing a YOLO Model
| Model | Size | Speed | Accuracy | Best For |
|---|---|---|---|---|
| yolov8n.pt | 6 MB | Fastest | Lower | Quick prototyping |
| yolov8s.pt | 22 MB | Fast | Good | Balanced workloads |
| yolov8m.pt | 50 MB | Medium | Better | General use |
| yolov8l.pt | 84 MB | Slower | High | When accuracy matters |
| yolov8x.pt | 131 MB | Slowest | Highest | Maximum precision |
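The confidence_threshold and iou_threshold settings in the config above govern which raw detections survive filtering. As a rough illustration of the mechanics (not Potato's actual implementation), confidence filtering followed by IoU-based duplicate suppression works like this:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def filter_detections(dets, conf_thresh=0.5, iou_thresh=0.45):
    """Drop low-confidence boxes, then suppress overlapping duplicates
    (a simplified non-max suppression). dets: list of (box, score)."""
    dets = sorted((d for d in dets if d[1] >= conf_thresh),
                  key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in dets:
        # Keep a box only if it doesn't heavily overlap one already kept
        if all(iou(box, k[0]) < iou_thresh for k in kept):
            kept.append((box, score))
    return kept
```

Raising conf_thresh shrinks the candidate pool before suppression even starts, which is why it is the first knob to turn when tuning.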
For detecting objects not in YOLO's built-in classes, use YOLO-World for open-vocabulary detection:
```yaml
ai_config:
  model: "yolo-world"
  confidence_threshold: 0.3
```

Tuning Detection
If YOLO is missing objects, lower the confidence threshold:

```yaml
ai_config:
  confidence_threshold: 0.3  # More detections, more false positives
```

If you're getting too many false positives, raise it:

```yaml
ai_config:
  confidence_threshold: 0.7  # Fewer detections, higher precision
```

Option 2: Ollama Vision for Local VLLMs
Ollama Vision gives you the power of vision-language models running locally. Unlike YOLO, these models can understand image context, classify scenes, and generate textual explanations — all without sending data to a cloud API.
Setup
```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a vision model
ollama pull llava

# Or for better performance:
ollama pull qwen2.5-vl:7b
```

Configuration
```yaml
annotation_task_name: "Image Classification with Ollama Vision"

data_files:
  - data/images.json

item_properties:
  id_key: id
  text_key: image_url

instance_display:
  fields:
    - key: image_url
      type: image
      display_options:
        max_width: 600
        zoomable: true

annotation_schemes:
  - annotation_type: radio
    name: scene_type
    description: "What type of scene is shown?"
    labels:
      - indoor
      - outdoor_urban
      - outdoor_nature
      - aerial
      - underwater
    ai_support:
      enabled: true
      features:
        hint: true
        classification: true

ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"
    base_url: "http://localhost:11434"
    max_tokens: 500
    temperature: 0.1

output_annotation_dir: "annotation_output/"

user_config:
  allow_all_users: true
```

Supported Models
| Model | Parameters | Strengths |
|---|---|---|
| llava:7b | 7B | Fast, good general understanding |
| llava:13b | 13B | Better accuracy |
| llava-llama3 | 8B | Strong reasoning |
| bakllava | 7B | Good visual detail |
| llama3.2-vision:11b | 11B | Latest Llama vision |
| qwen2.5-vl:7b | 7B | Strong multilingual + vision |
| moondream | 1.8B | Very fast, lightweight |
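Under the hood, a vision request to Ollama is an HTTP POST to /api/generate with the image base64-encoded in an images array. A minimal sketch of constructing such a request body (the prompt wording here is my own, not the prompt Potato sends):

```python
import base64
import json

def build_ollama_vision_request(model: str, prompt: str, image_bytes: bytes) -> str:
    """JSON body for Ollama's /api/generate endpoint with one attached image."""
    payload = {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # return one complete response instead of a stream
    }
    return json.dumps(payload)

fake_image = b"\x89PNG..."  # stand-in for real image bytes
body = build_ollama_vision_request(
    "llava:latest", "What type of scene is shown?", fake_image)
```

Sending the same image to a different model is then just a matter of swapping the model string, which is why trying several entries from the table above is cheap.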
Option 3: OpenAI Vision
OpenAI Vision provides high-quality image understanding through GPT-4o. Best when you need the most capable vision model and don't mind cloud API costs.
Configuration
```yaml
ai_support:
  enabled: true
  endpoint_type: "openai_vision"
  ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"
    max_tokens: 1000
    detail: "auto"  # "low" for faster/cheaper, "high" for detail
```

Set your API key:
```bash
export OPENAI_API_KEY="sk-..."
```

The detail parameter controls the image resolution sent to the API:

- low — Faster and cheaper, good for classification
- high — Full resolution, better for finding small objects
- auto — Let the API decide
Option 4: Anthropic Vision
Claude's vision capabilities are strong at understanding image context and providing detailed explanations.
Configuration
```yaml
ai_support:
  enabled: true
  endpoint_type: "anthropic_vision"
  ai_config:
    api_key: "${ANTHROPIC_API_KEY}"
    model: "claude-sonnet-4-20250514"
    max_tokens: 1024
```

Set your API key:

```bash
export ANTHROPIC_API_KEY="sk-ant-..."
```

AI Assistance Modes
Each visual AI endpoint supports different assistance modes. Enable only the ones you need per annotation scheme.
Detection Mode
Finds objects matching your configured labels and shows them as dashed bounding box overlays:
```yaml
ai_support:
  enabled: true
  features:
    detection: true
```

The annotator clicks "Detect", and AI suggestions appear as dashed overlays on the image. Double-click to accept, right-click to reject.
Pre-annotation (Auto) Mode
Automatically detects all objects and creates suggestions in one pass. Best for bootstrapping large datasets:
```yaml
ai_support:
  enabled: true
  features:
    pre_annotate: true
```

Classification Mode
Classifies a selected region or the entire image, returning a suggested label with a confidence score:
```yaml
ai_support:
  enabled: true
  features:
    classification: true
```

Hint Mode
Provides guidance text without giving away the answer. Good for training new annotators:
```yaml
ai_support:
  enabled: true
  features:
    hint: true
```

The Accept/Reject Workflow
When an annotator clicks an AI assistance button, suggestions appear as dashed overlays:
- Accept a suggestion — Double-click the dashed overlay to convert it into a real annotation
- Reject a suggestion — Right-click the overlay to dismiss it
- Accept all — Click "Accept All" in the toolbar to accept every suggestion at once
- Clear all — Click "Clear" to dismiss all suggestions
This keeps annotators in control while reducing the manual work of drawing boxes from scratch.
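Conceptually, suggestions live in a layer separate from committed annotations until the annotator acts on them. A simplified model of that state (illustrative only; Potato's internal representation will differ):

```python
class SuggestionReview:
    """Holds AI suggestions separately from accepted annotations."""

    def __init__(self, suggestions):
        self.pending = list(suggestions)   # dashed overlays awaiting review
        self.accepted = []                 # real, committed annotations

    def accept(self, index):
        """Double-click: promote one suggestion to a real annotation."""
        self.accepted.append(self.pending.pop(index))

    def reject(self, index):
        """Right-click: dismiss one suggestion."""
        self.pending.pop(index)

    def accept_all(self):
        """Toolbar 'Accept All': commit every remaining suggestion."""
        self.accepted.extend(self.pending)
        self.pending.clear()

    def clear(self):
        """Toolbar 'Clear': dismiss all remaining suggestions."""
        self.pending.clear()
```

The key property is that nothing becomes a real annotation without an explicit action, which is what keeps the human in the loop.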
Video Annotation with Visual AI
Visual AI also works with video annotation tasks. You can enable scene detection, keyframe detection, and object tracking:
```yaml
annotation_schemes:
  - annotation_type: video_annotation
    name: scenes
    description: "Segment this video into scenes"
    mode: segment
    labels:
      - name: "intro"
        color: "#4ECDC4"
      - name: "main_content"
        color: "#FF6B6B"
      - name: "outro"
        color: "#45B7D1"
    ai_support:
      enabled: true
      features:
        scene_detection: true
        pre_annotate: true
        hint: true

ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"
    max_frames: 10  # Number of frames to sample
```

The max_frames parameter controls how many frames the AI samples from the video for analysis. More frames means better accuracy but slower processing.
Combining Visual and Text AI Endpoints
If your project has both text and image annotation, you can configure separate endpoints for each. Use a text-optimized model for hints and keywords, and a vision model for detection:
```yaml
ai_support:
  enabled: true

  # Text AI for radio buttons, text schemes, etc.
  endpoint_type: "ollama"
  ai_config:
    model: "llama3.2"
    include:
      all: true

  # Visual AI for image/video schemes
  visual_endpoint_type: "yolo"
  visual_ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
```

Or use a cloud vision model alongside a local text model:
```yaml
ai_support:
  enabled: true
  endpoint_type: "ollama"
  visual_endpoint_type: "openai_vision"
  ai_config:
    model: "llama3.2"
  visual_ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"
```

Complete Example: Product Photo Annotation
Here's a production-ready configuration for annotating product photos with YOLO detection and text-based AI hints:
```yaml
annotation_task_name: "Product Photo Annotation"

data_files:
  - data/product_photos.json

item_properties:
  id_key: sku
  text_key: photo_url

instance_display:
  layout:
    direction: horizontal
    gap: 24px
  fields:
    - key: photo_url
      type: image
      label: "Product Photo"
      display_options:
        max_width: 600
        zoomable: true
    - key: product_description
      type: text
      label: "Product Details"

annotation_schemes:
  - annotation_type: image_annotation
    name: product_regions
    description: "Draw boxes around products and defects"
    source_field: "photo_url"
    tools:
      - bbox
    labels:
      - name: "product"
        color: "#4ECDC4"
      - name: "defect"
        color: "#FF6B6B"
      - name: "label"
        color: "#45B7D1"
      - name: "packaging"
        color: "#96CEB4"
    ai_support:
      enabled: true
      features:
        detection: true
        pre_annotate: true

  - annotation_type: radio
    name: photo_quality
    description: "Is this photo suitable for the product listing?"
    labels:
      - Approved
      - Needs editing
      - Reshoot required

  - annotation_type: multiselect
    name: quality_issues
    description: "Select any issues present"
    labels:
      - Blurry
      - Poor lighting
      - Wrong angle
      - Background clutter
      - Color inaccurate

ai_support:
  enabled: true
  endpoint_type: "ollama"
  visual_endpoint_type: "yolo"
  ai_config:
    model: "llama3.2"
    include:
      all: true
  visual_ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

user_config:
  allow_all_users: true
```

Sample data (data/product_photos.json):
```json
{"sku": "SKU-001", "photo_url": "images/products/laptop_front.jpg", "product_description": "15-inch laptop, silver finish"}
{"sku": "SKU-002", "photo_url": "images/products/headphones_side.jpg", "product_description": "Over-ear wireless headphones, black"}
{"sku": "SKU-003", "photo_url": "images/products/backpack_full.jpg", "product_description": "40L hiking backpack, navy blue"}
```

Tips for Visual AI Annotation
- Start with pre-annotation for large datasets — Use the Auto button to generate suggestions for all objects, then have annotators review and correct rather than drawing from scratch
- Match the endpoint to your task — YOLO for precise detection, VLLMs for classification and understanding
- Tune confidence thresholds — Start at 0.5 and adjust based on the false positive/negative trade-off you see
- Use hints for annotator training — The hint mode guides annotators without biasing them toward a specific answer
- Combine endpoints — A YOLO visual endpoint for detection plus an Ollama text endpoint for hints gives you the best of both worlds
- Cache AI results — Enable disk caching to avoid re-running detection on the same images
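The idea behind that last tip is content-addressed caching: key each result by a hash of the image bytes, so a renamed or re-uploaded copy of the same image still hits the cache. Potato's cache_config handles this for you; the sketch below just illustrates the concept and is not Potato's implementation:

```python
import hashlib
import json
import os

class VisualResultCache:
    """Tiny JSON-file cache keyed by image content hash (illustrative only)."""

    def __init__(self, path: str):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    @staticmethod
    def _key(image_bytes: bytes) -> str:
        # Content hash, so identical images share one cache entry
        return hashlib.sha256(image_bytes).hexdigest()

    def get(self, image_bytes: bytes):
        return self.data.get(self._key(image_bytes))

    def put(self, image_bytes: bytes, result) -> None:
        self.data[self._key(image_bytes)] = result
        with open(self.path, "w") as f:
            json.dump(self.data, f)
```

With cloud endpoints this matters twice over: a cache hit saves both latency and the per-image API cost.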
Troubleshooting
"No visual AI endpoint configured"
Make sure ai_support.enabled is true and you've set an endpoint_type that supports vision: yolo, ollama_vision, openai_vision, or anthropic_vision.
YOLO not detecting your objects
YOLO's built-in classes cover 80 common objects. If your labels don't match YOLO's class names, try YOLO-World for open-vocabulary detection, or lower the confidence_threshold.
Ollama returning errors
Verify Ollama is running and you've pulled a vision model:
```bash
curl http://localhost:11434/api/tags   # Check Ollama is running
ollama list                            # Check installed models
```

Slow response from cloud APIs
Enable caching so the same image isn't analyzed twice:
```yaml
ai_support:
  cache_config:
    disk_cache:
      enabled: true
      path: "ai_cache/visual_cache.json"
```

Next Steps
- Read the full Visual AI Support documentation for API reference details
- Set up instance display to show images alongside other content types
- Explore text-based AI support for hints and keyword highlighting
Full documentation at /docs/features/visual-ai-support.