Potato 2.1: Instance Display, Visual AI, and Span Linking
Potato 2.1.0 brings the instance display system, visual AI support for image and video annotation, span linking, multi-field spans, and layout customization.
We're excited to announce Potato 2.1.0, a feature-packed release that brings five major capabilities to the annotation platform. This update focuses on multi-modal content display, AI-powered visual annotation, and richer relationship annotation.
Instance Display System
The headline feature of v2.1 is the new instance_display configuration block. Previously, displaying an image alongside radio buttons required awkward workarounds like creating an image_annotation schema with min_annotations: 0. Now you can explicitly separate what content to show from what annotations to collect.
```yaml
instance_display:
  layout:
    direction: horizontal
    gap: 24px
  fields:
    - key: image_url
      type: image
      label: "Image to Classify"
      display_options:
        max_width: 600
        zoomable: true
    - key: description
      type: text
      label: "Context"

annotation_schemes:
  - annotation_type: radio
    name: category
    labels: [nature, urban, people, objects]
```

Instance display supports 11 content types: text, html, image, video, audio, dialogue, pairwise, code, spreadsheet, document, and pdf. You can combine multiple display fields with any annotation scheme, arrange them horizontally or vertically, and enable span annotation on text fields with span_target: true.
A standout feature is per-turn dialogue ratings — you can add inline Likert-scale rating widgets to individual conversation turns, allowing annotators to rate specific speakers without leaving the conversation view.
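Since the release notes don't show the per-turn rating syntax, here is a hedged sketch of how such a configuration might look. The turn_rating key and its sub-options are illustrative assumptions, not confirmed API; check the Instance Display documentation for the exact keys.

```yaml
# Hypothetical sketch of per-turn dialogue ratings.
# The "turn_rating" block and its option names are assumptions
# for illustration only; consult the docs for the real syntax.
instance_display:
  fields:
    - key: conversation
      type: dialogue
      label: "Conversation"
      turn_rating:
        annotation_type: likert
        name: turn_quality
        scale: 5
```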
Read the full Instance Display documentation →
Multi-Field Span Annotation
Span annotation now supports a target_field option, enabling annotation across multiple text fields in the same data instance. This is essential for tasks like summarization evaluation where you need to annotate entities in both a source document and its summary.
```yaml
annotation_schemes:
  - annotation_type: span
    name: source_entities
    target_field: "source_text"
    labels: [PERSON, ORGANIZATION, LOCATION]
  - annotation_type: span
    name: summary_entities
    target_field: "summary"
    labels: [PERSON, ORGANIZATION, LOCATION]
```

Output annotations are keyed by field name, making it clear which text field each span belongs to.
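To make the field keying concrete, exported annotations for two schemes like those above might look roughly like this. The exact export schema isn't shown in the release notes, so treat the structure and key names below as an illustrative assumption.

```yaml
# Plausible output shape (illustrative assumption; the actual
# export format may differ). Spans are grouped per scheme and
# keyed by the target field they were drawn on.
source_entities:
  source_text:
    - start: 0
      end: 5
      label: PERSON
      text: "Alice"
summary_entities:
  summary:
    - start: 12
      end: 17
      label: PERSON
      text: "Alice"
```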
Read the updated Span Annotation documentation →
Span Linking
The new span_link annotation type enables relation extraction by creating typed relationships between annotated spans. This unlocks tasks like knowledge graph construction, coreference resolution, and discourse analysis.
```yaml
annotation_schemes:
  - annotation_type: span
    name: entities
    labels:
      - name: "PERSON"
        color: "#3b82f6"
      - name: "ORGANIZATION"
        color: "#22c55e"
  - annotation_type: span_link
    name: relations
    span_schema: entities
    link_types:
      - name: "WORKS_FOR"
        directed: true
        allowed_source_labels: ["PERSON"]
        allowed_target_labels: ["ORGANIZATION"]
        color: "#dc2626"
      - name: "COLLABORATES_WITH"
        directed: false
        allowed_source_labels: ["PERSON"]
        allowed_target_labels: ["PERSON"]
        color: "#06b6d4"
```

Key capabilities include directed and undirected links, n-ary relationships (links between more than two spans), visual arc display above the text, and label constraints that restrict which entity types can participate in each relationship type.
Read the full Span Linking documentation →
Visual AI Support
Potato 2.1 introduces four new vision endpoints that bring AI-powered assistance to image and video annotation tasks. This is a major expansion of Potato's AI capabilities beyond text.
Four Vision Endpoints
YOLO — Best for fast, precise object detection using local inference. Supports YOLOv8 variants and YOLO-World for open-vocabulary detection.
```yaml
ai_support:
  enabled: true
  endpoint_type: "yolo"
  ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
    iou_threshold: 0.45
```

Ollama Vision — Run vision-language models locally with Ollama. Supports LLaVA, Llama 3.2 Vision, Qwen2.5-VL, BakLLaVA, and Moondream.
```yaml
ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"
    base_url: "http://localhost:11434"
```

OpenAI Vision — Cloud-based vision analysis using GPT-4o with configurable detail levels.
```yaml
ai_support:
  enabled: true
  endpoint_type: "openai_vision"
  ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"
    detail: "auto"
```

Anthropic Vision — Claude with vision capabilities for image understanding and classification.
```yaml
ai_support:
  enabled: true
  endpoint_type: "anthropic_vision"
  ai_config:
    api_key: "${ANTHROPIC_API_KEY}"
    model: "claude-sonnet-4-20250514"
```

Image AI Features
For image annotation tasks, visual AI provides four assistance modes:
- Detection — Finds objects matching your configured labels and draws suggestion bounding boxes as dashed overlays
- Pre-annotation (Auto) — Automatically detects all objects in the image and creates suggestions for human review
- Classification — Classifies a selected region or the entire image with a confidence score
- Hints — Provides guidance without revealing exact locations, useful for annotator training
```yaml
annotation_schemes:
  - annotation_type: image_annotation
    name: object_detection
    tools: [bbox, polygon]
    labels:
      - name: "person"
        color: "#FF6B6B"
      - name: "car"
        color: "#4ECDC4"

ai_support:
  enabled: true
  features:
    detection: true
    pre_annotate: true
    classification: false
    hint: true
```

Video AI Features
For video tasks, visual AI adds scene detection (identifying scene boundaries and suggesting temporal segments), keyframe detection (finding significant moments), and object tracking (suggesting positions across frames).
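A hedged sketch of what enabling these video features might look like in config. The feature flag names below (scene_detection, keyframes, tracking) are assumptions for illustration and may not match the actual keys; see the Visual AI documentation.

```yaml
# Hypothetical sketch of enabling video AI assistance.
# Flag names are illustrative assumptions, not confirmed API.
ai_support:
  enabled: true
  visual_endpoint_type: "yolo"
  features:
    scene_detection: true   # suggest temporal segments at scene boundaries
    keyframes: true         # flag significant moments for review
    tracking: true          # suggest object positions across frames
```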
Accept/Reject Workflow
AI suggestions appear as dashed overlays that annotators can accept (double-click), reject (right-click), accept all, or clear all — keeping humans in the loop while accelerating annotation.
Separate Visual and Text Endpoints
You can configure different AI endpoints for text and visual tasks, using the best model for each content type:
```yaml
ai_support:
  enabled: true
  endpoint_type: "ollama"        # Text annotations
  visual_endpoint_type: "yolo"   # Image/video annotations
  ai_config:
    model: "llama3.2"
  visual_ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
```

Read the full Visual AI Support documentation →
Layout Customization
Potato 2.1 adds support for sophisticated custom visual layouts. Potato generates an editable layouts/task_layout.html file by default, and you can provide a fully custom HTML template with CSS grid layouts, color-coded options, and section styling.
```yaml
task_layout: layouts/custom_task_layout.html
```

Three example layouts are included in project-hub/layout-examples/:
- Content moderation — Warning banner, 2-column grid, color-coded severity
- Dialogue QA — Case metadata, circular Likert ratings, grouped assessments
- Medical review — Professional medical styling, structured reporting
Custom layouts work alongside the new instance_display system — display content renders above your custom annotation forms.
Read the full Layout Customization documentation →
Other Improvements
Label Rationales
A fourth AI capability joins hints, keyword highlighting, and label suggestions. Rationales generate balanced explanations for why each label might apply, helping annotators understand the reasoning behind different classifications.
```yaml
ai_support:
  features:
    rationales:
      enabled: true
```

Bug Fixes and Testing
- 50+ new tests for improved reliability
- Responsive design improvements across annotation types
- Enhanced project-hub organization with layout examples
Upgrading to v2.1
```bash
pip install --upgrade potato-annotation
```

Existing v2.0 configurations work without changes — all new features are opt-in through additional config blocks like instance_display, span_link schemes, and visual AI endpoints.
Getting Started
- What's New — Full v2.1 feature overview
- Instance Display — Multi-modal content display
- Visual AI Support — AI for image and video annotation
- Span Linking — Entity relationship annotation
- Layout Customization — Custom HTML templates
Have questions or feedback? Join our Discord or open an issue on GitHub.