Potato 2.1: Instance Display, Visual AI, and Span Linking
Potato 2.1.0 brings the instance display system, visual AI support for image and video annotation, span linking, multi-field spans, and layout customization.
Note: This post describes Potato 2.1 as it was at release. Some configuration keys and features have been updated in later versions. See the current documentation for up-to-date configuration syntax.
Potato 2.1.0 is out. It adds five things: a proper instance display system, AI assistance for image and video annotation, span linking, multi-field spans, and custom layouts.
Instance display system
The big addition in 2.1 is the instance_display config block. Before this, showing an image next to a set of radio buttons meant an awkward workaround, like creating an image_annotation schema with min_annotations: 0. Now you can say what to show and what to collect separately.
instance_display:
  layout:
    direction: horizontal
    gap: 24px
  fields:
    - key: image_url
      type: image
      label: "Image to Classify"
      display_options:
        max_width: 600
        zoomable: true
    - key: description
      type: text
      label: "Context"
annotation_schemes:
  - annotation_type: radio
    name: category
    labels: [nature, urban, people, objects]

It supports 11 content types: text, html, image, video, audio, dialogue, pairwise, code, spreadsheet, document, and pdf. You can mix several display fields with any annotation scheme, lay them out horizontally or vertically, and turn on span annotation for text fields with span_target: true.
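Because display fields are declared by key, it can help to check a data record against the config before starting a task. Here is a minimal Python sketch of that idea, assuming each key names a field in the data record and that the config has already been loaded into a dict shaped like the example above. The helper name missing_display_keys and the sample record are hypothetical and not part of Potato.

```python
# Hypothetical helper, not part of Potato: list the display-field keys
# that a data record does not provide.
def missing_display_keys(config: dict, record: dict) -> list[str]:
    fields = config.get("instance_display", {}).get("fields", [])
    return [f["key"] for f in fields if f["key"] not in record]


# Config dict mirroring the YAML example (only the parts the check reads).
config = {
    "instance_display": {
        "fields": [
            {"key": "image_url", "type": "image"},
            {"key": "description", "type": "text"},
        ]
    }
}

# Made-up record that lacks the "description" field.
record = {"id": "item_1", "image_url": "https://example.com/cat.jpg"}

missing = missing_display_keys(config, record)
print(missing)
```

A check like this only catches missing keys. It says nothing about whether the value is a valid image URL or non-empty text.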
One feature people have been asking for is per-turn dialogue ratings. You can drop an inline Likert widget onto individual conversation turns, so annotators rate a specific speaker without leaving the conversation view.
Read the full instance display documentation →
Multi-field span annotation
Span annotation now takes a target_field option, which lets you annotate across more than one text field in the same instance. You need this for tasks like summarization evaluation, where you mark up entities in both the source document and the summary.
annotation_schemes:
  - annotation_type: span
    name: source_entities
    target_field: "source_text"
    labels: [PERSON, ORGANIZATION, LOCATION]
  - annotation_type: span
    name: summary_entities
    target_field: "summary"
    labels: [PERSON, ORGANIZATION, LOCATION]

Output annotations are keyed by field name, so it is clear which text field each span came from.
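To illustrate what keying by field makes possible, here is a small Python sketch that compares the entities marked in the summary with those marked in the source. The dictionary shape and the per-span keys (start, end, label, text) are assumptions made for illustration only. Check the span annotation documentation for the actual output format.

```python
# Assumed, illustrative output shape: spans grouped under the field they
# were drawn on. Potato's real output format may differ.
annotations = {
    "source_text": [
        {"start": 0, "end": 8, "label": "PERSON", "text": "Dana Lee"},
        {"start": 19, "end": 23, "label": "ORGANIZATION", "text": "Acme"},
    ],
    "summary": [
        {"start": 0, "end": 8, "label": "PERSON", "text": "Dana Lee"},
        {"start": 16, "end": 21, "label": "LOCATION", "text": "Paris"},
    ],
}


def entity_set(spans: list[dict]) -> set[tuple[str, str]]:
    """Collapse spans to (label, text) pairs, ignoring character offsets."""
    return {(s["label"], s["text"]) for s in spans}


source = entity_set(annotations["source_text"])
summary = entity_set(annotations["summary"])

# Entities marked in the summary that were never marked in the source.
unsupported = sorted(summary - source)
print(unsupported)
```

In a summarization evaluation, entities that appear only in the summary are candidates for a closer look, since nothing in the source annotation backs them up.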
Read the updated span annotation documentation →
Span linking
The new span_link annotation type handles relation extraction: you create typed relationships between spans you have already annotated. That covers tasks like knowledge graph construction, coreference resolution, and discourse analysis.
annotation_schemes:
  - annotation_type: span
    name: entities
    labels:
      - name: "PERSON"
        color: "#3b82f6"
      - name: "ORGANIZATION"
        color: "#22c55e"
  - annotation_type: span_link
    name: relations
    span_schema: entities
    link_types:
      - name: "WORKS_FOR"
        directed: true
        allowed_source_labels: ["PERSON"]
        allowed_target_labels: ["ORGANIZATION"]
        color: "#dc2626"
      - name: "COLLABORATES_WITH"
        directed: false
        allowed_source_labels: ["PERSON"]
        allowed_target_labels: ["PERSON"]
        color: "#06b6d4"

Links can be directed or undirected, can be n-ary (joining more than two spans), and are drawn as arcs above the text. Label constraints let you say which entity types can take part in each relationship type.
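The label constraints can be read as a simple membership check. Here is a hedged Python sketch of that logic for binary links, using the two link types from the example config. The function link_allowed is hypothetical, and the rule that an undirected link may match in either order is an assumption, not documented Potato behavior. N-ary links are not covered.

```python
# Link types mirroring the YAML example (colors omitted).
LINK_TYPES = {
    "WORKS_FOR": {
        "directed": True,
        "allowed_source_labels": ["PERSON"],
        "allowed_target_labels": ["ORGANIZATION"],
    },
    "COLLABORATES_WITH": {
        "directed": False,
        "allowed_source_labels": ["PERSON"],
        "allowed_target_labels": ["PERSON"],
    },
}


def link_allowed(link_type: str, source_label: str, target_label: str) -> bool:
    """Hypothetical check, not Potato's code: does this pair fit the constraints?"""
    spec = LINK_TYPES[link_type]
    forward = (
        source_label in spec["allowed_source_labels"]
        and target_label in spec["allowed_target_labels"]
    )
    if spec["directed"]:
        return forward
    # Assumption: an undirected link may match in either order.
    backward = (
        target_label in spec["allowed_source_labels"]
        and source_label in spec["allowed_target_labels"]
    )
    return forward or backward


print(link_allowed("WORKS_FOR", "PERSON", "ORGANIZATION"))
print(link_allowed("WORKS_FOR", "ORGANIZATION", "PERSON"))
```

With these constraints, a WORKS_FOR link from an organization to a person is refused, which is the kind of mistake the constraints are meant to prevent.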
Read the full span linking documentation →
Visual AI support
Potato 2.1 adds four vision endpoints, which bring AI assistance to image and video annotation for the first time. Until now the AI features only worked on text.
Four vision endpoints
YOLO is the one to use for fast, precise object detection with local inference. It supports YOLOv8 variants and YOLO-World for open-vocabulary detection.
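The YOLO configuration below takes a confidence_threshold and an iou_threshold. As a rough illustration of what these two settings conventionally mean in object detection, here is a self-contained Python sketch: boxes scoring under the confidence threshold are dropped, then greedy non-maximum suppression removes boxes that overlap a higher-scoring box by more than the IoU threshold. This is the textbook technique in a class-agnostic form, not a description of how Potato or the YOLO library implements it. The function names and sample detections are made up.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(dets, confidence_threshold=0.5, iou_threshold=0.45):
    """Drop low-confidence boxes, then apply greedy non-maximum suppression."""
    candidates = [d for d in dets if d["score"] >= confidence_threshold]
    kept = []
    for det in sorted(candidates, key=lambda d: d["score"], reverse=True):
        # Keep a box only if it does not overlap a kept box too much.
        if all(iou(det["box"], k["box"]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


detections = [
    {"box": (10, 10, 110, 110), "score": 0.90},
    {"box": (15, 15, 115, 115), "score": 0.80},    # heavy overlap with the first
    {"box": (200, 200, 260, 260), "score": 0.70},
    {"box": (300, 300, 340, 340), "score": 0.30},  # below the confidence threshold
]
kept = filter_detections(detections)
print([d["score"] for d in kept])
```

Raising the confidence threshold gives annotators fewer, surer suggestions. Lowering the IoU threshold merges near-duplicate boxes more aggressively.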
ai_support:
  enabled: true
  endpoint_type: "yolo"
  ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
    iou_threshold: 0.45

Ollama Vision runs vision-language models locally through Ollama. It supports LLaVA, Llama 3.2 Vision, Qwen2.5-VL, BakLLaVA, and Moondream.
ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"
    base_url: "http://localhost:11434"

OpenAI Vision does cloud-based vision analysis with GPT-4o and configurable detail levels.
ai_support:
  enabled: true
  endpoint_type: "openai_vision"
  ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"
    detail: "auto"

Anthropic Vision uses Claude's vision capabilities for image understanding and classification.
ai_support:
  enabled: true
  endpoint_type: "anthropic_vision"
  ai_config:
    api_key: "${ANTHROPIC_API_KEY}"
    model: "claude-sonnet-4-20250514"

Image AI features
On image tasks, visual AI gives you four assistance modes:
- Detection finds objects matching your labels and draws suggestion bounding boxes as dashed overlays
- Pre-annotation (Auto) detects every object in the image and creates suggestions for a human to review
- Classification labels a selected region or the whole image and gives a confidence score
- Hints point annotators in the right direction without revealing the exact location, which is handy for training
annotation_schemes:
  - annotation_type: image_annotation
    name: object_detection
    tools: [bbox, polygon]
    labels:
      - name: "person"
        color: "#FF6B6B"
      - name: "car"
        color: "#4ECDC4"
    ai_support:
      enabled: true
      features:
        detection: true
        pre_annotate: true
        classification: false
        hint: true

Video AI features
On video tasks, visual AI adds scene detection (it finds scene boundaries and suggests temporal segments), keyframe detection (it picks out significant moments), and object tracking (it suggests where an object sits across frames).
Accept and reject workflow
AI suggestions show up as dashed overlays. Annotators can accept one (double-click), reject one (right-click), accept all, or clear all. The human stays in the loop; the AI just does the tedious first pass.
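One way to picture this workflow is as a small state model in which every suggestion starts as pending and only accepted ones become annotations. The Python sketch below is a hypothetical model of the four actions, not Potato's internals. The class names, the status values, and the choice that "clear all" discards only pending suggestions are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Suggestion:
    label: str
    box: tuple[int, int, int, int]
    status: str = "pending"  # "pending", "accepted", or "rejected"


@dataclass
class SuggestionQueue:
    """Hypothetical model of the review actions; not Potato's internals."""
    items: list[Suggestion] = field(default_factory=list)

    def accept(self, index: int) -> None:
        self.items[index].status = "accepted"

    def reject(self, index: int) -> None:
        self.items[index].status = "rejected"

    def accept_all(self) -> None:
        for s in self.items:
            if s.status == "pending":
                s.status = "accepted"

    def clear_all(self) -> None:
        # Assumption: clearing discards every suggestion still pending.
        self.items = [s for s in self.items if s.status != "pending"]

    def annotations(self) -> list[Suggestion]:
        """Only accepted suggestions become annotations."""
        return [s for s in self.items if s.status == "accepted"]


queue = SuggestionQueue([
    Suggestion("person", (10, 10, 50, 90)),
    Suggestion("car", (60, 40, 200, 120)),
    Suggestion("car", (210, 45, 330, 118)),
])
queue.accept(0)    # double-click in the UI
queue.reject(1)    # right-click in the UI
queue.clear_all()  # drops the one suggestion still pending
print([s.label for s in queue.annotations()])
```

Nothing reaches the output unless a person accepted it, which is the property the workflow is built around.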
Separate visual and text endpoints
You can point text tasks and visual tasks at different AI endpoints, so each content type uses the model that suits it:
ai_support:
  enabled: true
  endpoint_type: "ollama"  # Text annotations
  visual_endpoint_type: "yolo"  # Image/video annotations
  ai_config:
    model: "llama3.2"
  visual_ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5

Read the full visual AI support documentation →
Layout customization
Potato 2.1 lets you build custom visual layouts. By default it generates an editable layouts/task_layout.html file, and you can swap in your own HTML template with CSS grid layouts, color-coded options, and section styling.
task_layout: layouts/custom_task_layout.html

There are three example layouts in project-hub/layout-examples/. The content moderation one has a warning banner, a two-column grid, and color-coded severity. The dialogue QA one shows case metadata, circular Likert ratings, and grouped assessments. The medical review one uses a structured reporting style.
Custom layouts play nicely with the new instance_display system: display content renders above your custom annotation forms.
Read the full layout customization documentation →
Other improvements
Label rationales
Rationales join hints, keyword highlighting, and label suggestions as a fourth AI capability. They give a balanced explanation of why each label might apply, so annotators can see the reasoning behind a classification rather than guess.
ai_support:
  features:
    rationales:
      enabled: true

Bug fixes and testing
- Over 50 new tests for better reliability
- Responsive design fixes across the annotation types
- A tidier project-hub, now with layout examples
Upgrading to v2.1
pip install --upgrade potato-annotation

Your existing v2.0 configs keep working without changes. Everything new is opt-in, through extra config blocks like instance_display, span_link schemes, and visual AI endpoints.
Getting started
- What's New, the full v2.1 feature overview
- Instance Display, multi-modal content display
- Visual AI Support, AI for image and video annotation
- Span Linking, entity relationship annotation
- Layout Customization, custom HTML templates
For the full changelog and any config keys that changed, see the v2.1.0 release notes in the repository.
Have questions or feedback? Join our Discord or open an issue on GitHub.