Potato 2.1: Instance Display, Visual AI, and Span Linking
Potato 2.1.0 brings the instance display system, visual AI support for image and video annotation, span linking, multi-field spans, and layout customization.
We're excited to announce Potato 2.1.0, a feature-packed release that brings five major capabilities to the annotation platform. This update focuses on multi-modal content display, AI-powered visual annotation, and richer relationship annotation.
Instance Display System
The headline feature of v2.1 is the new instance_display configuration block. Previously, displaying an image alongside radio buttons required awkward workarounds like creating an image_annotation schema with min_annotations: 0. Now you can explicitly separate what content to show from what annotations to collect.
```yaml
instance_display:
  layout:
    direction: horizontal
    gap: 24px
  fields:
    - key: image_url
      type: image
      label: "Image to Classify"
      display_options:
        max_width: 600
        zoomable: true
    - key: description
      type: text
      label: "Context"

annotation_schemes:
  - annotation_type: radio
    name: category
    labels: [nature, urban, people, objects]
```

Instance display supports 11 content types: text, html, image, video, audio, dialogue, pairwise, code, spreadsheet, document, and pdf. You can combine multiple display fields with any annotation scheme, arrange them horizontally or vertically, and enable span annotation on text fields with span_target: true.
A standout feature is per-turn dialogue ratings — you can add inline Likert-scale rating widgets to individual conversation turns, allowing annotators to rate specific speakers without leaving the conversation view.
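Since the release notes don't show the per-turn rating syntax, here is a hedged sketch of how such a configuration might look. The turn_rating key and its sub-options are illustrative assumptions, not confirmed API; check the Instance Display documentation for the exact keys.

```yaml
# Hypothetical sketch of per-turn dialogue ratings.
# The "turn_rating" block and its option names are assumptions
# for illustration only; consult the docs for the real syntax.
instance_display:
  fields:
    - key: conversation
      type: dialogue
      label: "Conversation"
      turn_rating:
        annotation_type: likert
        name: turn_quality
        scale: 5
```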
Read the full Instance Display documentation →
Multi-Field Span Annotation
Span annotation now supports a target_field option, enabling annotation across multiple text fields in the same data instance. This is essential for tasks like summarization evaluation where you need to annotate entities in both a source document and its summary.
```yaml
annotation_schemes:
  - annotation_type: span
    name: source_entities
    target_field: "source_text"
    labels: [PERSON, ORGANIZATION, LOCATION]
  - annotation_type: span
    name: summary_entities
    target_field: "summary"
    labels: [PERSON, ORGANIZATION, LOCATION]
```

Output annotations are keyed by field name, making it clear which text field each span belongs to.
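To make the field keying concrete, exported annotations for two schemes like those above might look roughly like this. The exact export schema isn't shown in the release notes, so treat the structure and key names below as an illustrative assumption.

```yaml
# Plausible output shape (illustrative assumption; the actual
# export format may differ). Spans are grouped per scheme and
# keyed by the target field they were drawn on.
source_entities:
  source_text:
    - start: 0
      end: 5
      label: PERSON
      text: "Alice"
summary_entities:
  summary:
    - start: 12
      end: 17
      label: PERSON
      text: "Alice"
```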
Read the updated Span Annotation documentation →
Span Linking
The new span_link annotation type enables relation extraction by creating typed relationships between annotated spans. This unlocks tasks like knowledge graph construction, coreference resolution, and discourse analysis.
```yaml
annotation_schemes:
  - annotation_type: span
    name: entities
    labels:
      - name: "PERSON"
        color: "#3b82f6"
      - name: "ORGANIZATION"
        color: "#22c55e"
  - annotation_type: span_link
    name: relations
    span_schema: entities
    link_types:
      - name: "WORKS_FOR"
        directed: true
        allowed_source_labels: ["PERSON"]
        allowed_target_labels: ["ORGANIZATION"]
        color: "#dc2626"
      - name: "COLLABORATES_WITH"
        directed: false
        allowed_source_labels: ["PERSON"]
        allowed_target_labels: ["PERSON"]
        color: "#06b6d4"
```

Key capabilities include directed and undirected links, n-ary relationships (links between more than two spans), visual arc display above the text, and label constraints that restrict which entity types can participate in each relationship type.
Read the full Span Linking documentation →
Visual AI Support
Potato 2.1 introduces four new vision endpoints that bring AI-powered assistance to image and video annotation tasks. This is a major expansion of Potato's AI capabilities beyond text.
Four Vision Endpoints
YOLO — Best for fast, precise object detection using local inference. Supports YOLOv8 variants and YOLO-World for open-vocabulary detection.
```yaml
ai_support:
  enabled: true
  endpoint_type: "yolo"
  ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
    iou_threshold: 0.45
```

Ollama Vision — Run vision-language models locally with Ollama. Supports LLaVA, Llama 3.2 Vision, Qwen2.5-VL, BakLLaVA, and Moondream.
```yaml
ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"
    base_url: "http://localhost:11434"
```

OpenAI Vision — Cloud-based vision analysis using GPT-4o with configurable detail levels.
```yaml
ai_support:
  enabled: true
  endpoint_type: "openai_vision"
  ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"
    detail: "auto"
```

Anthropic Vision — Claude with vision capabilities for image understanding and classification.
```yaml
ai_support:
  enabled: true
  endpoint_type: "anthropic_vision"
  ai_config:
    api_key: "${ANTHROPIC_API_KEY}"
    model: "claude-sonnet-4-20250514"
```

Image AI Features
For image annotation tasks, visual AI provides four assistance modes:
- Detection — Finds objects matching your configured labels and draws suggestion bounding boxes as dashed overlays
- Pre-annotation (Auto) — Automatically detects all objects in the image and creates suggestions for human review
- Classification — Classifies a selected region or the entire image with a confidence score
- Hints — Provides guidance without revealing exact locations, useful for annotator training
```yaml
annotation_schemes:
  - annotation_type: image_annotation
    name: object_detection
    tools: [bbox, polygon]
    labels:
      - name: "person"
        color: "#FF6B6B"
      - name: "car"
        color: "#4ECDC4"

ai_support:
  enabled: true
  features:
    detection: true
    pre_annotate: true
    classification: false
    hint: true
```

Video AI Features
For video tasks, visual AI adds scene detection (identifying scene boundaries and suggesting temporal segments), keyframe detection (finding significant moments), and object tracking (suggesting positions across frames).
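A hedged sketch of what enabling these video features might look like in config. The feature flag names below (scene_detection, keyframes, tracking) are assumptions for illustration and may not match the actual keys; see the Visual AI documentation.

```yaml
# Hypothetical sketch of enabling video AI assistance.
# Flag names are illustrative assumptions, not confirmed API.
ai_support:
  enabled: true
  visual_endpoint_type: "yolo"
  features:
    scene_detection: true   # suggest temporal segments at scene boundaries
    keyframes: true         # flag significant moments for review
    tracking: true          # suggest object positions across frames
```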
Accept/Reject Workflow
AI suggestions appear as dashed overlays that annotators can accept (double-click), reject (right-click), accept all, or clear all — keeping humans in the loop while accelerating annotation.
Separate Visual and Text Endpoints
You can configure different AI endpoints for text and visual tasks, using the best model for each content type:
```yaml
ai_support:
  enabled: true
  endpoint_type: "ollama"        # Text annotations
  visual_endpoint_type: "yolo"   # Image/video annotations
  ai_config:
    model: "llama3.2"
  visual_ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
```

Read the full Visual AI Support documentation →
Layout Customization
Potato 2.1 adds support for sophisticated custom visual layouts. Potato generates an editable layouts/task_layout.html file by default, and you can provide a fully custom HTML template with CSS grid layouts, color-coded options, and section styling.
```yaml
task_layout: layouts/custom_task_layout.html
```

Three example layouts are included in project-hub/layout-examples/:
- Content moderation — Warning banner, 2-column grid, color-coded severity
- Dialogue QA — Case metadata, circular Likert ratings, grouped assessments
- Medical review — Professional medical styling, structured reporting
Custom layouts work alongside the new instance_display system — display content renders above your custom annotation forms.
Read the full Layout Customization documentation →
Other Improvements
Label Rationales
A fourth AI capability joins hints, keyword highlighting, and label suggestions. Rationales generate balanced explanations for why each label might apply, helping annotators understand the reasoning behind different classifications.
```yaml
ai_support:
  features:
    rationales:
      enabled: true
```

Bug Fixes and Testing
- 50+ new tests for improved reliability
- Responsive design improvements across annotation types
- Enhanced project-hub organization with layout examples
Upgrading to v2.1
```bash
pip install --upgrade potato-annotation
```

Existing v2.0 configurations work without changes — all new features are opt-in through additional config blocks like instance_display, span_link schemes, and visual AI endpoints.
Getting Started
- What's New — Full v2.1 feature overview
- Instance Display — Multi-modal content display
- Visual AI Support — AI for image and video annotation
- Span Linking — Entity relationship annotation
- Layout Customization — Custom HTML templates
Have questions or feedback? Join our Discord or open an issue on GitHub.