# Visual AI Support

Source: https://www.potatoannotator.com/docs/features/visual-ai-support

*New in v2.1.0*

Potato provides AI-powered assistance for image and video annotation tasks using various vision models including YOLO for object detection and vision-language models (VLLMs) like GPT-4o, Claude, and Ollama vision models.

## Overview

Visual AI support enables:

- **Object Detection**: Automatically detect and locate objects in images using YOLO or VLLMs
- **Pre-annotation**: Auto-detect all objects for human review
- **Classification**: Classify images or regions within images
- **Hints**: Get guidance without revealing exact locations
- **Scene Detection**: Identify temporal segments in videos
- **Keyframe Detection**: Find significant moments in videos
- **Object Tracking**: Track objects across video frames

## Supported Endpoints

### YOLO Endpoint

Best for fast, accurate object detection using local inference.

```yaml
ai_support:
  enabled: true
  endpoint_type: "yolo"
  ai_config:
    model: "yolov8m.pt"  # or yolov8n, yolov8l, yolov8x, yolo-world
    confidence_threshold: 0.5
    iou_threshold: 0.45
```

Supported models:
- YOLOv8 (n/s/m/l/x variants)
- YOLO-World (open-vocabulary detection)
- Custom trained models

### Ollama Vision Endpoint

For local vision-language model inference.

```yaml
ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"  # or llava-llama3, bakllava, llama3.2-vision, qwen2.5-vl
    base_url: "http://localhost:11434"
    max_tokens: 500
    temperature: 0.1
```

Supported models:
- LLaVA (7B, 13B, 34B)
- LLaVA-LLaMA3
- BakLLaVA
- Llama 3.2 Vision (11B, 90B)
- Qwen2.5-VL
- Moondream

### OpenAI Vision Endpoint

For cloud-based vision analysis using GPT-4o.

```yaml
ai_support:
  enabled: true
  endpoint_type: "openai_vision"
  ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"  # or gpt-4o-mini
    max_tokens: 1000
    detail: "auto"  # low, high, or auto
```

### Anthropic Vision Endpoint

For Claude with vision capabilities.

```yaml
ai_support:
  enabled: true
  endpoint_type: "anthropic_vision"
  ai_config:
    api_key: "${ANTHROPIC_API_KEY}"
    model: "claude-sonnet-4-20250514"
    max_tokens: 1024
```

## Endpoint Capabilities

Each endpoint has different strengths:

| Endpoint | Text Gen | Vision | Bbox Output | Keyword | Rationale |
|----------|----------|--------|-------------|---------|-----------|
| `ollama_vision` | Yes | Yes | No | No | Yes |
| `openai_vision` | Yes | Yes | No | No | Yes |
| `anthropic_vision` | Yes | Yes | No | No | Yes |
| `yolo` | No | Yes | Yes | No | No |

**Best practices:**
- For **precise object detection**, use the `yolo` endpoint
- For **image classification with explanations**, use a VLLM like `ollama_vision` with Qwen-VL or LLaVA
- For **combined workflows**, configure both a text endpoint and a visual endpoint

## Image Annotation with AI

Configure AI-assisted image annotation with detection, pre-annotation, classification, and hint features:

```yaml
annotation_schemes:
  - annotation_type: image_annotation
    name: object_detection
    description: "Detect and label objects in the image"
    tools:
      - bbox
      - polygon
    labels:
      - name: "person"
        color: "#FF6B6B"
      - name: "car"
        color: "#4ECDC4"
      - name: "dog"
        color: "#45B7D1"

    ai_support:
      enabled: true
      features:
        detection: true      # "Detect" button - find objects
        pre_annotate: true   # "Auto" button - detect all
        classification: false # "Classify" button - classify region
        hint: true           # "Hint" button - get guidance

ai_support:
  enabled: true
  endpoint_type: "yolo"
  ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
```

## Video Annotation with AI

```yaml
annotation_schemes:
  - annotation_type: video_annotation
    name: scene_segmentation
    description: "Segment video into scenes"
    mode: segment
    labels:
      - name: "intro"
        color: "#4ECDC4"
      - name: "action"
        color: "#FF6B6B"
      - name: "outro"
        color: "#45B7D1"

    ai_support:
      enabled: true
      features:
        scene_detection: true     # Detect scene boundaries
        keyframe_detection: false
        tracking: false
        pre_annotate: true        # Auto-segment entire video
        hint: true

ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"
    max_frames: 10  # Frames to sample for video analysis
```

## Separate Visual and Text Endpoints

You can configure a separate endpoint for visual tasks, using the best model for each content type:

```yaml
ai_support:
  enabled: true
  endpoint_type: "openai"  # For text annotations
  ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o-mini"

  # Separate visual endpoint
  visual_endpoint_type: "yolo"
  visual_ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
```

Or using a vision-language model alongside a text model:

```yaml
ai_support:
  enabled: true
  endpoint_type: "ollama"  # Main endpoint for text
  visual_endpoint_type: "ollama_vision"  # Visual endpoint for images
  ai_config:
    model: "llama3.2"
    include:
      all: true
  visual_ai_config:
    model: "qwen2.5-vl:7b"
```

## AI Features

### Detection

Finds objects matching the configured labels and draws suggestion bounding boxes. Suggestions appear as dashed overlays that can be accepted or rejected.

### Pre-annotation (Auto)

Automatically detects all objects in the image/video and creates suggestions for human review. Useful for speeding up annotation of large datasets.

### Classification

Classifies a selected region or the entire image. Returns a suggested label with confidence score and reasoning.

### Hints

Provides guidance without revealing exact answers. Good for training annotators or when you want human judgment with AI assistance.

### Scene Detection (Video)

Analyzes video frames to identify scene boundaries and suggests temporal segments with labels.

### Keyframe Detection (Video)

Identifies significant moments in a video that would make good annotation points.

### Object Tracking (Video)

Suggests object positions across frames for consistent tracking annotation.

## Using AI Suggestions

1. Click the AI assistance button (Detect, Auto, Hint, etc.)
2. Wait for suggestions to appear as dashed overlays
3. **Accept a suggestion**: Double-click the suggestion overlay
4. **Reject a suggestion**: Right-click the suggestion overlay
5. **Accept all**: Click "Accept All" in the toolbar
6. **Clear all**: Click "Clear" to remove all suggestions

## Detection API Response Format

```json
{
  "detections": [
    {
      "label": "person",
      "bbox": {"x": 0.1, "y": 0.2, "width": 0.3, "height": 0.5},
      "confidence": 0.95
    }
  ]
}
```

For hints:

```json
{
  "hint": "Look for objects in the lower right corner",
  "suggestive_choice": "Focus on overlapping regions"
}
```

For video segments:

```json
{
  "segments": [
    {
      "start_time": 0.0,
      "end_time": 5.5,
      "suggested_label": "intro",
      "confidence": 0.85
    }
  ]
}
```

## Requirements

### For YOLO endpoint

```bash
pip install ultralytics opencv-python
```

### For Ollama Vision

1. Install Ollama from [ollama.ai](https://ollama.ai)
2. Pull a vision model: `ollama pull llava`
3. Start Ollama server (runs on `http://localhost:11434` by default)

### For OpenAI/Anthropic Vision

- Set API key in environment or config
- Ensure you have access to vision-capable models

## Troubleshooting

### "No visual AI endpoint configured"

Ensure you have:
1. Set `ai_support.enabled: true`
2. Set a valid `endpoint_type` that supports vision (`yolo`, `ollama_vision`, `openai_vision`, `anthropic_vision`)
3. Installed required dependencies for your chosen endpoint

### YOLO not detecting expected objects

- Try lowering `confidence_threshold`
- Ensure your labels match YOLO's class names (or use YOLO-World for custom vocabularies)
- Check that the model file exists and is valid

### Ollama Vision errors

- Verify Ollama is running: `curl http://localhost:11434/api/tags`
- Ensure you've pulled a vision model: `ollama list`
- Check model supports vision (llava, bakllava, llama3.2-vision, etc.)

## Further Reading

- [AI Support](/docs/features/ai-support) - Text-based AI assistance (hints, keywords, rationales)
- [Image Annotation](/docs/features/image-annotation) - Image annotation tools and configuration
- [Instance Display](/docs/core-concepts/instance-display) - Configure content display

For implementation details, see the [source documentation](https://github.com/davidjurgens/potato/blob/main/docs/visual_ai_support.md).