# Using Visual AI to Speed Up Image and Video Annotation

Source: https://www.potatoannotator.com/blog/visual-ai-annotation-guide

Potato 2.1 adds visual AI support, which puts AI assistance right inside image and video annotation. Instead of drawing every bounding box by hand, you can let YOLO detect objects and review what it found, or hand an image to a vision-language model and have it classify the image and explain why.

This guide walks through setting up each visual AI endpoint, the assistance modes, and how to pair visual AI with Potato's text-based AI. The [visual AI source documentation](https://github.com/davidjurgens/potato/blob/master/docs/ai-intelligence/visual_ai_support.md) has the full API reference.

## What you'll learn

- Setting up YOLO for fast local object detection
- Running Ollama Vision models for local image understanding
- Using OpenAI and Anthropic cloud vision APIs
- Configuring detection, pre-annotation, classification, and hint modes
- Combining visual and text AI endpoints in a single project
- The accept/reject workflow for reviewing AI suggestions

## Prerequisites

You'll need Potato 2.1.0 or later:

```bash
pip install --upgrade potato-annotation
```

And depending on which endpoint you choose, you'll need one of these:

- **YOLO**: `pip install ultralytics opencv-python`
- **Ollama**: Install from [ollama.ai](https://ollama.ai) and pull a vision model
- **OpenAI**: An API key with access to GPT-4o
- **Anthropic**: An API key with access to Claude vision models

## Option 1: YOLO for Object Detection

Reach for YOLO when you want fast, precise bounding box detection that runs entirely on your own machine. The pretrained models cover 80 common object classes (people, cars, animals, furniture) and typically process an image in milliseconds on a GPU.

### Setup

```bash
pip install ultralytics opencv-python
```

### Configuration

```yaml
annotation_task_name: "Object Detection with YOLO"

data_files:
  - data/images.json

item_properties:
  id_key: id
  text_key: image_url

instance_display:
  fields:
    - key: image_url
      type: image
      display_options:
        max_width: 800
        zoomable: true

annotation_schemes:
  - annotation_type: image_annotation
    name: objects
    description: "Detect and label objects"
    source_field: "image_url"
    tools:
      - bbox
    labels:
      - name: "person"
        color: "#FF6B6B"
      - name: "car"
        color: "#4ECDC4"
      - name: "dog"
        color: "#45B7D1"
      - name: "cat"
        color: "#96CEB4"

    ai_support:
      enabled: true
      features:
        detection: true
        pre_annotate: true
        hint: true

ai_support:
  enabled: true
  endpoint_type: "yolo"
  ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
    iou_threshold: 0.45

output_annotation_dir: "annotation_output/"
user_config:
  allow_all_users: true
```
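
The configuration above sets `iou_threshold` without explaining it. In YOLO-style detectors this value is conventionally the overlap cutoff used during non-maximum suppression (NMS): when two boxes overlap by more than the threshold, only the higher-scoring one is kept. The sketch below illustrates that idea, assuming boxes are `(x1, y1, x2, y2)` tuples. The `iou` and `nms` helpers are hypothetical and simplified (real detectors usually apply NMS per class); this is not Potato's or Ultralytics' code.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.45):
    """Greedy, class-agnostic NMS over (box, score) pairs."""
    kept = []
    # Visit boxes from highest to lowest score
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep a box only if it does not overlap a kept box too much
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, score))
    return kept


# Hypothetical detections for one image
detections = [
    ((10, 10, 110, 110), 0.90),
    ((20, 20, 120, 120), 0.80),    # overlaps the first box heavily
    ((300, 300, 400, 400), 0.70),  # a separate object
]
kept = nms(detections)  # the second box is suppressed
```

A lower `iou_threshold` suppresses more overlapping boxes, which helps with duplicate detections but can merge objects that genuinely sit close together.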

### Data Format

Create `data/images.json` in JSONL format (one JSON object per line):

```json
{"id": "img_001", "image_url": "images/street_scene_1.jpg"}
{"id": "img_002", "image_url": "images/park_photo.jpg"}
{"id": "img_003", "image_url": "https://example.com/images/office.jpg"}
```
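
For a large image folder, writing this file by hand gets tedious. Here is a minimal sketch that generates it, assuming your images live in a local directory and that sequential `img_001`-style IDs are acceptable. The helper names and the suffix list are illustrative, not part of Potato.

```python
import json
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set; extend as needed


def manifest_lines(image_paths):
    """Build one JSON line per image with the `id` and `image_url` keys used above."""
    lines = []
    for n, path in enumerate(sorted(image_paths), start=1):
        record = {"id": f"img_{n:03d}", "image_url": path}
        lines.append(json.dumps(record))
    return lines


def write_manifest(image_dir, out_path):
    """Scan image_dir and write the manifest to out_path; returns the image count."""
    paths = [
        p.as_posix()
        for p in Path(image_dir).iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    ]
    lines = manifest_lines(paths)
    Path(out_path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)
```

Calling `write_manifest("images", "data/images.json")` would then produce a file in the format shown above, provided the `data/` directory already exists.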

### Choosing a YOLO Model

| Model | Size | Speed | Accuracy | Best For |
|-------|------|-------|----------|----------|
| `yolov8n.pt` | 6 MB | Fastest | Lower | Quick prototyping |
| `yolov8s.pt` | 22 MB | Fast | Good | Balanced workloads |
| `yolov8m.pt` | 50 MB | Medium | Better | General use |
| `yolov8l.pt` | 84 MB | Slower | High | When accuracy matters |
| `yolov8x.pt` | 131 MB | Slowest | Highest | Maximum precision |

To detect objects outside YOLO's built-in classes, use **YOLO-World** for open-vocabulary detection:

```yaml
ai_config:
  model: "yolo-world"
  confidence_threshold: 0.3
```

### Tuning Detection

If YOLO is missing objects, lower the confidence threshold:

```yaml
ai_config:
  confidence_threshold: 0.3  # More detections, more false positives
```

If you're getting too many false positives, raise it:

```yaml
ai_config:
  confidence_threshold: 0.7  # Fewer detections, higher precision
```
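
To get a feel for the trade-off before changing the config, it can help to count how many detections survive at each threshold. The sketch below uses made-up confidence scores for a single image; `count_by_threshold` is a hypothetical helper, not a Potato function.

```python
def count_by_threshold(scores, thresholds=(0.3, 0.5, 0.7)):
    """Return {threshold: number of detections at or above that confidence}."""
    return {t: sum(s >= t for s in scores) for t in thresholds}


# Hypothetical confidence scores returned for one image
scores = [0.92, 0.81, 0.64, 0.48, 0.35, 0.22]
counts = count_by_threshold(scores)
# counts == {0.3: 5, 0.5: 3, 0.7: 2}
```

If the extra boxes that appear at 0.3 are mostly real objects, the lower threshold is worth it. If they are mostly noise, stay at 0.5 or higher.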

## Option 2: Ollama Vision for Local VLMs

Ollama Vision runs vision-language models locally. Unlike YOLO, these models can interpret image context, classify scenes, and write explanations. Like YOLO, they run on your own machine, so none of your data leaves it.

### Setup

```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a vision model
ollama pull llava

# Or, for better output quality:
ollama pull qwen2.5vl:7b
```

### Configuration

```yaml
annotation_task_name: "Image Classification with Ollama Vision"

data_files:
  - data/images.json

item_properties:
  id_key: id
  text_key: image_url

instance_display:
  fields:
    - key: image_url
      type: image
      display_options:
        max_width: 600
        zoomable: true

annotation_schemes:
  - annotation_type: radio
    name: scene_type
    description: "What type of scene is shown?"
    labels:
      - indoor
      - outdoor_urban
      - outdoor_nature
      - aerial
      - underwater

    ai_support:
      enabled: true
      features:
        hint: true
        classification: true

ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"
    base_url: "http://localhost:11434"
    max_tokens: 500
    temperature: 0.1

output_annotation_dir: "annotation_output/"
user_config:
  allow_all_users: true
```

### Supported Models

| Model | Parameters | Strengths |
|-------|-----------|-----------|
| `llava:7b` | 7B | Fast, good general understanding |
| `llava:13b` | 13B | Better accuracy |
| `llava-llama3` | 8B | Strong reasoning |
| `bakllava` | 7B | Good visual detail |
| `llama3.2-vision:11b` | 11B | Latest Llama vision |
| `qwen2.5vl:7b` | 7B | Strong multilingual + vision |
| `moondream` | 1.8B | Very fast, lightweight |

## Option 3: OpenAI Vision

OpenAI Vision gives you image understanding through GPT-4o. Use it when you want a strong hosted vision model and don't mind paying per API call.

### Configuration

```yaml
ai_support:
  enabled: true
  endpoint_type: "openai_vision"
  ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"
    max_tokens: 1000
    detail: "auto"  # "low" for faster/cheaper, "high" for detail
```

Set your API key:

```bash
export OPENAI_API_KEY="sk-..."
```

The `detail` parameter controls the image resolution sent to the API:
- `low`: faster and cheaper, good for classification
- `high`: full resolution, better for finding small objects
- `auto`: let the API decide

## Option 4: Anthropic Vision

Claude's vision models are good at reading image context and writing detailed explanations.

### Configuration

```yaml
ai_support:
  enabled: true
  endpoint_type: "anthropic_vision"
  ai_config:
    api_key: "${ANTHROPIC_API_KEY}"
    model: "claude-sonnet-4-20250514"
    max_tokens: 1024
```

Set your API key:

```bash
export ANTHROPIC_API_KEY="sk-ant-..."
```
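
A missing key usually only shows up when the first AI request fails. A small pre-flight check can catch it earlier. This sketch assumes the two environment variable names used in the cloud configs above; `missing_api_keys` is a hypothetical helper you would run yourself before starting the server.

```python
import os

# Environment variable referenced by each cloud endpoint in the configs above
REQUIRED_KEYS = {
    "openai_vision": "OPENAI_API_KEY",
    "anthropic_vision": "ANTHROPIC_API_KEY",
}


def missing_api_keys(endpoint_types, env=None):
    """Return the names of required environment variables that are unset or empty."""
    env = os.environ if env is None else env
    missing = []
    for endpoint in endpoint_types:
        var = REQUIRED_KEYS.get(endpoint)  # local endpoints need no key
        if var and not env.get(var):
            missing.append(var)
    return missing
```

For example, `missing_api_keys(["yolo", "openai_vision"])` returns `["OPENAI_API_KEY"]` when that variable has not been exported, and an empty list once it has.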

## AI assistance modes

Each visual AI endpoint supports a few assistance modes. Turn on only the ones you need, per annotation scheme.

### Detection mode

Finds objects matching your configured labels and shows them as dashed bounding box overlays:

```yaml
ai_support:
  enabled: true
  features:
    detection: true
```

The annotator clicks "Detect" and the AI's suggestions show up as dashed overlays on the image. Double-click to accept, right-click to reject.

### Pre-annotation (auto) mode

Detects all objects and creates suggestions in one pass. Best for bootstrapping large datasets:

```yaml
ai_support:
  enabled: true
  features:
    pre_annotate: true
```

### Classification mode

Classifies a selected region or the whole image and returns a suggested label with a confidence score:

```yaml
ai_support:
  enabled: true
  features:
    classification: true
```

### Hint mode

Gives guidance text without giving away the answer. Good for training new annotators:

```yaml
ai_support:
  enabled: true
  features:
    hint: true
```

## The accept/reject workflow

When an annotator clicks an AI assistance button, the suggestions appear as dashed overlays:

1. **Accept a suggestion**: Double-click the dashed overlay to convert it into a real annotation
2. **Reject a suggestion**: Right-click the overlay to dismiss it
3. **Accept all**: Click "Accept All" in the toolbar to accept every suggestion at once
4. **Clear all**: Click "Clear" to dismiss all suggestions

The annotator stays in control, but draws far fewer boxes by hand.
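
The four actions above amount to moving items between a pending list and a final list. The sketch below models that behavior for illustration only. `SuggestionReview` is a hypothetical class and says nothing about how Potato stores suggestions internally.

```python
from dataclasses import dataclass, field


@dataclass
class SuggestionReview:
    suggestions: list = field(default_factory=list)  # pending (dashed) overlays
    annotations: list = field(default_factory=list)  # accepted annotations

    def accept(self, index):
        """Double-click: promote one suggestion to a real annotation."""
        self.annotations.append(self.suggestions.pop(index))

    def reject(self, index):
        """Right-click: dismiss one suggestion."""
        self.suggestions.pop(index)

    def accept_all(self):
        """Accept All: promote every remaining suggestion."""
        self.annotations.extend(self.suggestions)
        self.suggestions.clear()

    def clear(self):
        """Clear: dismiss every remaining suggestion."""
        self.suggestions.clear()


review = SuggestionReview(suggestions=["person", "car", "dog"])
review.accept(0)     # "person" becomes an annotation
review.reject(0)     # "car" is dismissed
review.accept_all()  # the remaining "dog" is accepted
```

Nothing reaches `annotations` without an explicit accept, which is the sense in which the annotator stays in control.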

## Video annotation with visual AI

Visual AI works on video tasks too. You can turn on scene detection, keyframe detection, and object tracking:

```yaml
annotation_schemes:
  - annotation_type: video_annotation
    name: scenes
    description: "Segment this video into scenes"
    mode: segment
    labels:
      - name: "intro"
        color: "#4ECDC4"
      - name: "main_content"
        color: "#FF6B6B"
      - name: "outro"
        color: "#45B7D1"

    ai_support:
      enabled: true
      features:
        scene_detection: true
        pre_annotate: true
        hint: true

ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"
    max_frames: 10  # Number of frames to sample
```

The `max_frames` parameter sets how many frames the AI samples from the video. Sampling more frames generally improves accuracy but slows processing.
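
The source does not say how frames are chosen, so the following is only one plausible strategy: pick evenly spaced frames, one from the middle of each equal-length window. `sample_frame_indices` is a hypothetical helper that shows what a cap of 10 frames means for a 300-frame clip.

```python
def sample_frame_indices(total_frames, max_frames=10):
    """Evenly spaced frame indices, at most max_frames of them."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: use every frame
        return list(range(total_frames))
    step = total_frames / max_frames
    # Take the middle frame of each of max_frames equal-length windows
    return [int(step * i + step / 2) for i in range(max_frames)]


indices = sample_frame_indices(300, max_frames=10)
# indices == [15, 45, 75, 105, 135, 165, 195, 225, 255, 285]
```

Under this assumption, a 10-second clip at 30 fps is judged from one frame per second, so short events between samples can be missed. That is the accuracy cost of a low `max_frames`.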

## Combining visual and text AI endpoints

If your project mixes text and image annotation, you can set up a separate endpoint for each: a text model for hints and keywords, a vision model for detection.

```yaml
ai_support:
  enabled: true

  # Text AI for radio buttons, text schemes, etc.
  endpoint_type: "ollama"
  ai_config:
    model: "llama3.2"
    include:
      all: true

  # Visual AI for image/video schemes
  visual_endpoint_type: "yolo"
  visual_ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
```

Or use a cloud vision model alongside a local text model:

```yaml
ai_support:
  enabled: true
  endpoint_type: "ollama"
  visual_endpoint_type: "openai_vision"
  ai_config:
    model: "llama3.2"
  visual_ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"
```
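
Conceptually, the split works like a lookup on the scheme's `annotation_type`. The sketch below is an assumption about that routing, written to make the two configs above easier to reason about. `endpoint_for` and `VISUAL_SCHEMES` are hypothetical names, and Potato's actual dispatch logic may differ.

```python
# Scheme types assumed to be handled by the visual endpoint
VISUAL_SCHEMES = {"image_annotation", "video_annotation"}


def endpoint_for(annotation_type, ai_support):
    """Pick the endpoint type that would serve a given annotation scheme."""
    if annotation_type in VISUAL_SCHEMES:
        # Fall back to the main endpoint when no separate visual endpoint is set
        return ai_support.get("visual_endpoint_type", ai_support["endpoint_type"])
    return ai_support["endpoint_type"]


ai_support = {"endpoint_type": "ollama", "visual_endpoint_type": "yolo"}
image_endpoint = endpoint_for("image_annotation", ai_support)  # "yolo"
radio_endpoint = endpoint_for("radio", ai_support)             # "ollama"
```

The fallback mirrors the single-endpoint configs earlier in this guide, where `endpoint_type: "yolo"` alone serves the image scheme.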

## Complete example: product photo annotation

Here's a configuration you could ship for annotating product photos, with YOLO detection and text-based AI hints:

```yaml
annotation_task_name: "Product Photo Annotation"

data_files:
  - data/product_photos.json

item_properties:
  id_key: sku
  text_key: photo_url

instance_display:
  layout:
    direction: horizontal
    gap: 24px
  fields:
    - key: photo_url
      type: image
      label: "Product Photo"
      display_options:
        max_width: 600
        zoomable: true
    - key: product_description
      type: text
      label: "Product Details"

annotation_schemes:
  - annotation_type: image_annotation
    name: product_regions
    description: "Draw boxes around products and defects"
    source_field: "photo_url"
    tools:
      - bbox
    labels:
      - name: "product"
        color: "#4ECDC4"
      - name: "defect"
        color: "#FF6B6B"
      - name: "label"
        color: "#45B7D1"
      - name: "packaging"
        color: "#96CEB4"

    ai_support:
      enabled: true
      features:
        detection: true
        pre_annotate: true

  - annotation_type: radio
    name: photo_quality
    description: "Is this photo suitable for the product listing?"
    labels:
      - Approved
      - Needs editing
      - Reshoot required

  - annotation_type: multiselect
    name: quality_issues
    description: "Select any issues present"
    labels:
      - Blurry
      - Poor lighting
      - Wrong angle
      - Background clutter
      - Color inaccurate

ai_support:
  enabled: true
  endpoint_type: "ollama"
  visual_endpoint_type: "yolo"

  ai_config:
    model: "llama3.2"
    include:
      all: true

  visual_ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5

output_annotation_dir: "annotation_output/"
export_annotation_format: "json"
user_config:
  allow_all_users: true
```

Sample data (`data/product_photos.json`):

```json
{"sku": "SKU-001", "photo_url": "images/products/laptop_front.jpg", "product_description": "15-inch laptop, silver finish"}
{"sku": "SKU-002", "photo_url": "images/products/headphones_side.jpg", "product_description": "Over-ear wireless headphones, black"}
{"sku": "SKU-003", "photo_url": "images/products/backpack_full.jpg", "product_description": "40L hiking backpack, navy blue"}
```
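
Because the config refers to `sku`, `photo_url`, and `product_description` by name, a record that lacks one of them will not display correctly. A quick check before launching can save a debugging round. This is a minimal sketch, and `find_invalid_records` is a hypothetical helper rather than a Potato utility.

```python
import json


def find_invalid_records(lines, required_keys):
    """Return (line_number, problem) pairs for lines that fail to parse or lack a key."""
    problems = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((n, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((n, "not a JSON object"))
            continue
        missing = [key for key in required_keys if key not in record]
        if missing:
            problems.append((n, "missing keys: " + ", ".join(missing)))
    return problems
```

To check the sample file, read it with `open("data/product_photos.json", encoding="utf-8")` and pass the file object along with `["sku", "photo_url", "product_description"]`. An empty result means every line is usable.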

## Tips for visual AI annotation

1. **Start with pre-annotation for large datasets**: hit the Auto button to generate suggestions for everything, then have annotators review and correct instead of drawing from scratch
2. **Match the endpoint to your task**: YOLO for precise detection, VLMs for classification and understanding
3. **Tune confidence thresholds**: start at 0.5 and move it based on the false positive/negative trade-off you're seeing
4. **Use hints for annotator training**: hint mode guides annotators without steering them toward a specific answer
5. **Combine endpoints**: a YOLO endpoint for detection plus an Ollama text endpoint for hints covers both sides
6. **Cache AI results**: turn on disk caching so you don't re-run detection on the same images

## Troubleshooting

### "No visual AI endpoint configured"

Make sure `ai_support.enabled` is `true` and that `endpoint_type` (or `visual_endpoint_type`, if you configure text and visual endpoints separately) is set to one that supports vision: `yolo`, `ollama_vision`, `openai_vision`, or `anthropic_vision`.

### YOLO not detecting your objects

YOLO's built-in classes cover 80 common objects. If your labels don't match YOLO's class names, try YOLO-World for open-vocabulary detection. If the labels do match but objects are still being missed, lower the `confidence_threshold`.
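
To see which of your labels fall outside the built-in classes, compare them against the COCO class names that the standard pretrained YOLOv8 detection checkpoints use. This sketch assumes labels are matched by exact, case-insensitive name, which may not be Potato's actual rule. `labels_outside_coco` is a hypothetical helper.

```python
# The 80 COCO class names used by the standard pretrained YOLOv8 detection models
COCO_CLASSES = {
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck",
    "boat", "traffic light", "fire hydrant", "stop sign", "parking meter", "bench",
    "bird", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra",
    "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee",
    "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
    "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup",
    "fork", "knife", "spoon", "bowl", "banana", "apple", "sandwich", "orange",
    "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair", "couch",
    "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
    "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink",
    "refrigerator", "book", "clock", "vase", "scissors", "teddy bear",
    "hair drier", "toothbrush",
}


def labels_outside_coco(labels):
    """Return the labels with no same-named COCO class (case-insensitive)."""
    return [label for label in labels if label.lower() not in COCO_CLASSES]


street_labels = labels_outside_coco(["person", "car", "dog", "cat"])
product_labels = labels_outside_coco(["product", "defect", "label", "packaging"])
```

Here `street_labels` is empty, while `product_labels` contains all four product labels. Under the matching assumption above, a label set like the second one is a candidate for YOLO-World or a custom-trained model.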

### Ollama returning errors

Verify Ollama is running and you've pulled a vision model:

```bash
curl http://localhost:11434/api/tags  # Check Ollama is running
ollama list                           # Check installed models
```
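
The same check can be scripted. Ollama's `/api/tags` endpoint returns a JSON object with a `models` list, and each entry has a `name` such as `llava:latest`. The sketch below parses that payload to confirm the model named in your config is installed. The helper names are hypothetical, and `fetch_tags` assumes the default `base_url` from the config above.

```python
import json
from urllib.request import urlopen


def installed_models(tags_payload):
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def has_model(tags_payload, wanted):
    """True if `wanted` is installed; a bare name such as "llava" matches any tag."""
    for name in installed_models(tags_payload):
        if name == wanted:
            return True
        if ":" not in wanted and name.split(":")[0] == wanted:
            return True
    return False


def fetch_tags(base_url="http://localhost:11434", timeout=5):
    """Query a running Ollama server; raises URLError if it is unreachable."""
    with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


# Example payload in the shape Ollama returns (fields trimmed)
example = {"models": [{"name": "llava:latest"}, {"name": "moondream:latest"}]}
```

With Ollama running, `has_model(fetch_tags(), "llava:latest")` tells you whether the model from the configuration is available before any annotator hits an error.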

### Slow response from cloud APIs

Enable caching so the same image isn't analyzed twice:

```yaml
ai_support:
  cache_config:
    disk_cache:
      enabled: true
      path: "ai_cache/visual_cache.json"
```
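
To make the idea concrete, here is a minimal sketch of a disk cache keyed by image content, model, and mode, so a repeated request for the same image skips the API call. It illustrates the technique only. `JsonDiskCache` and `cache_key` are hypothetical, and Potato's own cache format and keying are not documented in this guide.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(image_bytes, model, mode):
    """Key on the image content plus model and mode, so changing either one misses."""
    digest = hashlib.sha256(image_bytes).hexdigest()
    return f"{model}:{mode}:{digest}"


class JsonDiskCache:
    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            self.data = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.data = {}

    def get_or_compute(self, key, compute):
        """Return the cached value, or call compute() once and persist its result."""
        if key not in self.data:
            self.data[key] = compute()
            self.path.parent.mkdir(parents=True, exist_ok=True)
            self.path.write_text(json.dumps(self.data), encoding="utf-8")
        return self.data[key]


calls = []


def fake_detect():
    """Stand-in for a slow cloud API call."""
    calls.append(1)
    return [{"label": "person", "confidence": 0.9}]


with tempfile.TemporaryDirectory() as tmp:
    cache = JsonDiskCache(Path(tmp) / "ai_cache" / "visual_cache.json")
    key = cache_key(b"fake image bytes", "gpt-4o", "detection")
    first = cache.get_or_compute(key, fake_detect)
    second = cache.get_or_compute(key, fake_detect)  # served from cache
```

The second lookup returns the stored result without calling `fake_detect` again, which is the saving you get on every repeat view of an image.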

## Next Steps

- Read the full [Visual AI Support documentation](/docs/features/visual-ai-support) for API reference details
- Set up [instance display](/docs/core-concepts/instance-display) to show images alongside other content types
- Explore [text-based AI support](/blog/llm-integration-guide) for hints and keyword highlighting

---

*Full documentation at [/docs/features/visual-ai-support](/docs/features/visual-ai-support).*