Image और Video Annotation को तेज़ करने के लिए Visual AI का उपयोग

Potato 2.1 visual AI support प्रस्तुत करता है जो AI-powered assistance को सीधे image और video annotation workflows में लाता है। हर bounding box को scratch से annotate करने के बजाय, आप YOLO को automatically objects detect करवा सकते हैं और फिर उसके suggestions की समीक्षा कर सकते हैं, या किसी vision-language model से images classify करने और उसके reasoning को explain करने के लिए कह सकते हैं।

यह गाइड प्रत्येक visual AI endpoint सेट करने, विभिन्न assistance modes configure करने, और Potato की text-based AI features के साथ visual AI को combine करने की प्रक्रिया बताती है।

आप क्या सीखेंगे

Fast local object detection के लिए YOLO सेट करना
Local image understanding के लिए Ollama Vision models चलाना
OpenAI और Anthropic cloud vision APIs का उपयोग करना
Detection, pre-annotation, classification, और hint modes configure करना
एक single project में visual और text AI endpoints combine करना
AI suggestions review करने के लिए accept/reject workflow

पूर्व आवश्यकताएँ

आपको Potato 2.1.0 या बाद का संस्करण चाहिए:

bash

pip install --upgrade potato-annotation

और आपने जो endpoint choose किया है उसके आधार पर, आपको इनमें से एक की आवश्यकता होगी:

YOLO: pip install ultralytics opencv-python
Ollama: ollama.ai से install करें और एक vision model pull करें
OpenAI: GPT-4o तक access के साथ एक API key
Anthropic: Claude vision models तक access के साथ एक API key

Option 1: Object Detection के लिए YOLO

YOLO सबसे अच्छा विकल्प है जब आपको आपकी local machine पर पूरी तरह चलने वाला fast, precise bounding box detection चाहिए। यह common objects (लोग, कारें, जानवर, फर्नीचर) detect करने में उत्कृष्ट है और milliseconds में images process कर सकता है।

Setup

bash

pip install ultralytics opencv-python

Configuration

yaml

annotation_task_name: "Object Detection with YOLO"
 
data_files:
  - data/images.json
 
item_properties:
  id_key: id
  text_key: image_url
 
instance_display:
  fields:
    - key: image_url
      type: image
      display_options:
        max_width: 800
        zoomable: true
 
annotation_schemes:
  - annotation_type: image_annotation
    name: objects
    description: "Detect and label objects"
    source_field: "image_url"
    tools:
      - bbox
    labels:
      - name: "person"
        color: "#FF6B6B"
      - name: "car"
        color: "#4ECDC4"
      - name: "dog"
        color: "#45B7D1"
      - name: "cat"
        color: "#96CEB4"
 
    ai_support:
      enabled: true
      features:
        detection: true
        pre_annotate: true
        hint: true
 
ai_support:
  enabled: true
  endpoint_type: "yolo"
  ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
    iou_threshold: 0.45
 
output_annotation_dir: "annotation_output/"
user_config:
  allow_all_users: true

Data Format

JSONL format में data/images.json बनाएं (प्रति पंक्ति एक JSON object):

json

{"id": "img_001", "image_url": "images/street_scene_1.jpg"}
{"id": "img_002", "image_url": "images/park_photo.jpg"}
{"id": "img_003", "image_url": "https://example.com/images/office.jpg"}

YOLO Model चुनना

Model	Size	Speed	Accuracy	सबसे अच्छा
`yolov8n.pt`	6 MB	सबसे तेज़	कम	Quick prototyping
`yolov8s.pt`	22 MB	तेज़	अच्छी	Balanced workloads
`yolov8m.pt`	50 MB	मध्यम	बेहतर	General use
`yolov8l.pt`	84 MB	धीमा	उच्च	जब accuracy मायने रखती है
`yolov8x.pt`	131 MB	सबसे धीमा	सर्वोच्च	Maximum precision

YOLO के built-in classes में नहीं होने वाले objects detect करने के लिए, open-vocabulary detection के लिए YOLO-World का उपयोग करें:

yaml

ai_config:
  model: "yolo-world"
  confidence_threshold: 0.3

Detection Tune करना

यदि YOLO objects miss कर रहा है, confidence threshold कम करें:

yaml

ai_config:
  confidence_threshold: 0.3  # More detections, more false positives

यदि आपको बहुत अधिक false positives मिल रहे हैं, इसे बढ़ाएं:

yaml

ai_config:
  confidence_threshold: 0.7  # Fewer detections, higher precision

Option 2: Local VLLMs के लिए Ollama Vision

Ollama Vision आपको locally चलने वाले vision-language models की शक्ति देता है। YOLO के विपरीत, ये models image context समझ सकते हैं, scenes classify कर सकते हैं, और textual explanations generate कर सकते हैं -- सभी cloud API पर data भेजे बिना।

Setup

bash

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
 
# Pull a vision model
ollama pull llava
 
# Or for better performance:
ollama pull qwen2.5-vl:7b

Configuration

yaml

annotation_task_name: "Image Classification with Ollama Vision"
 
data_files:
  - data/images.json
 
item_properties:
  id_key: id
  text_key: image_url
 
instance_display:
  fields:
    - key: image_url
      type: image
      display_options:
        max_width: 600
        zoomable: true
 
annotation_schemes:
  - annotation_type: radio
    name: scene_type
    description: "What type of scene is shown?"
    labels:
      - indoor
      - outdoor_urban
      - outdoor_nature
      - aerial
      - underwater
 
    ai_support:
      enabled: true
      features:
        hint: true
        classification: true
 
ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"
    base_url: "http://localhost:11434"
    max_tokens: 500
    temperature: 0.1
 
output_annotation_dir: "annotation_output/"
user_config:
  allow_all_users: true

Supported Models

Model	Parameters	Strengths
`llava:7b`	7B	तेज़, अच्छी general understanding
`llava:13b`	13B	बेहतर accuracy
`llava-llama3`	8B	Strong reasoning
`bakllava`	7B	अच्छा visual detail
`llama3.2-vision:11b`	11B	Latest Llama vision
`qwen2.5-vl:7b`	7B	Strong multilingual + vision
`moondream`	1.8B	बहुत तेज़, lightweight

Option 3: OpenAI Vision

OpenAI Vision GPT-4o के माध्यम से उच्च-गुणवत्ता वाली image understanding प्रदान करता है। सबसे अच्छा जब आपको सबसे capable vision model चाहिए और cloud API costs से कोई आपत्ति नहीं।

Configuration

yaml

ai_support:
  enabled: true
  endpoint_type: "openai_vision"
  ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"
    max_tokens: 1000
    detail: "auto"  # "low" for faster/cheaper, "high" for detail

अपना API key set करें:

bash

export OPENAI_API_KEY="sk-..."

detail parameter API को भेजे गए image resolution को नियंत्रित करता है:

low — तेज़ और सस्ता, classification के लिए अच्छा
high — Full resolution, छोटे objects खोजने के लिए बेहतर
auto — API को तय करने दें

Option 4: Anthropic Vision

Claude की vision capabilities image context समझने और detailed explanations प्रदान करने में मजबूत हैं।

Configuration

yaml

ai_support:
  enabled: true
  endpoint_type: "anthropic_vision"
  ai_config:
    api_key: "${ANTHROPIC_API_KEY}"
    model: "claude-sonnet-4-20250514"
    max_tokens: 1024

bash

export ANTHROPIC_API_KEY="sk-ant-..."

AI Assistance Modes

प्रत्येक visual AI endpoint विभिन्न assistance modes support करता है। केवल वे ही enable करें जो आपको प्रति annotation scheme चाहिए।

Detection Mode

आपके configured labels से मेल खाने वाले objects ढूंढता है और उन्हें dashed bounding box overlays के रूप में दिखाता है:

yaml

ai_support:
  enabled: true
  features:
    detection: true

Annotator "Detect" click करता है, और AI suggestions image पर dashed overlays के रूप में दिखाई देते हैं। Accept करने के लिए double-click, reject के लिए right-click करें।

Pre-annotation (Auto) Mode

सभी objects automatically detect करता है और एक pass में suggestions बनाता है। बड़े datasets bootstrap करने के लिए सबसे अच्छा:

yaml

ai_support:
  enabled: true
  features:
    pre_annotate: true

Classification Mode

एक selected region या पूरी image classify करता है, confidence score के साथ एक suggested label return करता है:

yaml

ai_support:
  enabled: true
  features:
    classification: true

Hint Mode

उत्तर बताए बिना guidance text प्रदान करता है। नए annotators को train करने के लिए अच्छा:

yaml

ai_support:
  enabled: true
  features:
    hint: true

Accept/Reject Workflow

जब कोई annotator किसी AI assistance button पर click करता है, तो suggestions dashed overlays के रूप में दिखाई देते हैं:

Suggestion accept करें — Real annotation में convert करने के लिए dashed overlay पर double-click करें
Suggestion reject करें — इसे dismiss करने के लिए overlay पर right-click करें
सभी accept करें — Toolbar में "Accept All" click करके हर suggestion accept करें
सब clear करें — सभी suggestions dismiss करने के लिए "Clear" click करें

यह annotators को control में रखता है जबकि scratch से boxes draw करने का manual काम कम करता है।

Visual AI के साथ Video Annotation

Visual AI video annotation tasks के साथ भी काम करता है। आप scene detection, keyframe detection, और object tracking enable कर सकते हैं:

yaml

annotation_schemes:
  - annotation_type: video_annotation
    name: scenes
    description: "Segment this video into scenes"
    mode: segment
    labels:
      - name: "intro"
        color: "#4ECDC4"
      - name: "main_content"
        color: "#FF6B6B"
      - name: "outro"
        color: "#45B7D1"
 
    ai_support:
      enabled: true
      features:
        scene_detection: true
        pre_annotate: true
        hint: true
 
ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"
    max_frames: 10  # Number of frames to sample

max_frames parameter नियंत्रित करता है कि AI विश्लेषण के लिए video से कितने frames sample करता है। अधिक frames बेहतर accuracy का मतलब है लेकिन slower processing।

Visual और Text AI Endpoints को Combine करना

यदि आपके project में text और image annotation दोनों हैं, तो आप प्रत्येक के लिए अलग endpoints configure कर सकते हैं। Hints और keywords के लिए text-optimized model का उपयोग करें, और detection के लिए vision model:

yaml

ai_support:
  enabled: true
 
  # Text AI for radio buttons, text schemes, etc.
  endpoint_type: "ollama"
  ai_config:
    model: "llama3.2"
    include:
      all: true
 
  # Visual AI for image/video schemes
  visual_endpoint_type: "yolo"
  visual_ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5

या local text model के साथ cloud vision model:

yaml

ai_support:
  enabled: true
  endpoint_type: "ollama"
  visual_endpoint_type: "openai_vision"
  ai_config:
    model: "llama3.2"
  visual_ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"

Complete Example: Product Photo Annotation

YOLO detection और text-based AI hints के साथ product photos annotate करने के लिए production-ready configuration:

yaml

annotation_task_name: "Product Photo Annotation"
 
data_files:
  - data/product_photos.json
 
item_properties:
  id_key: sku
  text_key: photo_url
 
instance_display:
  layout:
    direction: horizontal
    gap: 24px
  fields:
    - key: photo_url
      type: image
      label: "Product Photo"
      display_options:
        max_width: 600
        zoomable: true
    - key: product_description
      type: text
      label: "Product Details"
 
annotation_schemes:
  - annotation_type: image_annotation
    name: product_regions
    description: "Draw boxes around products and defects"
    source_field: "photo_url"
    tools:
      - bbox
    labels:
      - name: "product"
        color: "#4ECDC4"
      - name: "defect"
        color: "#FF6B6B"
      - name: "label"
        color: "#45B7D1"
      - name: "packaging"
        color: "#96CEB4"
 
    ai_support:
      enabled: true
      features:
        detection: true
        pre_annotate: true
 
  - annotation_type: radio
    name: photo_quality
    description: "Is this photo suitable for the product listing?"
    labels:
      - Approved
      - Needs editing
      - Reshoot required
 
  - annotation_type: multiselect
    name: quality_issues
    description: "Select any issues present"
    labels:
      - Blurry
      - Poor lighting
      - Wrong angle
      - Background clutter
      - Color inaccurate
 
ai_support:
  enabled: true
  endpoint_type: "ollama"
  visual_endpoint_type: "yolo"
 
  ai_config:
    model: "llama3.2"
    include:
      all: true
 
  visual_ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
 
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
user_config:
  allow_all_users: true

Sample data (data/product_photos.json):

json

{"sku": "SKU-001", "photo_url": "images/products/laptop_front.jpg", "product_description": "15-inch laptop, silver finish"}
{"sku": "SKU-002", "photo_url": "images/products/headphones_side.jpg", "product_description": "Over-ear wireless headphones, black"}
{"sku": "SKU-003", "photo_url": "images/products/backpack_full.jpg", "product_description": "40L hiking backpack, navy blue"}

Visual AI Annotation के लिए सुझाव

बड़े datasets के लिए pre-annotation से शुरू करें — सभी objects के लिए suggestions generate करने के लिए Auto button का उपयोग करें, फिर annotators को scratch से draw करने के बजाय review और correct करने दें
Endpoint को आपके task के अनुसार match करें — Precise detection के लिए YOLO, classification और understanding के लिए VLLMs
Confidence thresholds tune करें — 0.5 से शुरू करें और जो false positive/negative trade-off आप देखते हैं उसके आधार पर adjust करें
Annotator training के लिए hints का उपयोग करें — Hint mode annotators को किसी specific answer की ओर bias किए बिना guide करता है
Endpoints combine करें — Detection के लिए YOLO visual endpoint और hints के लिए Ollama text endpoint दोनों दुनियाओं का सबसे अच्छा देता है
AI results cache करें — Same images पर detection re-run करने से बचने के लिए disk caching enable करें

Troubleshooting

"No visual AI endpoint configured"

सुनिश्चित करें कि ai_support.enabled true है और आपने एक endpoint_type set किया है जो vision support करता है: yolo, ollama_vision, openai_vision, या anthropic_vision।

YOLO आपके objects detect नहीं कर रहा

YOLO के built-in classes 80 common objects cover करते हैं। यदि आपके labels YOLO के class names से match नहीं करते, तो open-vocabulary detection के लिए YOLO-World try करें, या confidence_threshold कम करें।

Ollama errors return कर रहा है

Verify करें कि Ollama चल रहा है और आपने एक vision model pull किया है:

bash

curl http://localhost:11434/api/tags  # Check Ollama is running
ollama list                           # Check installed models

Cloud APIs से slow response

Caching enable करें ताकि same image को दो बार analyze न किया जाए:

yaml

ai_support:
  cache_config:
    disk_cache:
      enabled: true
      path: "ai_cache/visual_cache.json"

अगले कदम

API reference details के लिए पूर्ण Visual AI Support documentation पढ़ें
अन्य content types के साथ images दिखाने के लिए instance display सेट करें
Hints और keyword highlighting के लिए text-based AI support explore करें

पूर्ण documentation /docs/features/visual-ai-support पर।