# Potato 2.1: Instance Display, Visual AI, and Span Linking

Source: https://www.potatoannotator.com/blog/potato-2-1-release

> **Note:** This post describes Potato 2.1 as it was at release. Some configuration keys and features have been updated in later versions. See the [current documentation](/docs) for up-to-date configuration syntax.

Potato 2.1.0 is out. It adds five things: a proper instance display system, AI assistance for image and video annotation, span linking, multi-field spans, and custom layouts.

## Instance display system

The big addition in 2.1 is the `instance_display` config block. Before this, showing an image next to a set of radio buttons meant an awkward workaround, like creating an `image_annotation` schema with `min_annotations: 0`. Now you can say what to show and what to collect separately.

```yaml
instance_display:
  layout:
    direction: horizontal
    gap: 24px
  fields:
    - key: image_url
      type: image
      label: "Image to Classify"
      display_options:
        max_width: 600
        zoomable: true
    - key: description
      type: text
      label: "Context"

annotation_schemes:
  - annotation_type: radio
    name: category
    labels: [nature, urban, people, objects]
```

The `instance_display` block supports 11 content types: `text`, `html`, `image`, `video`, `audio`, `dialogue`, `pairwise`, `code`, `spreadsheet`, `document`, and `pdf`. You can mix several display fields with any annotation scheme, lay them out horizontally or vertically, and turn on span annotation for text fields with `span_target: true`.
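
The `key` values in the config above presumably name fields in each data record. As a rough sketch of what a matching data file could contain, the Python below serializes two records as JSON Lines. Only `image_url` and `description` come from the config; the `id` field, the example URLs, and the helper name `to_jsonl` are assumptions made for illustration, so check the data format docs for what your setup expects.

```python
import json

# Hypothetical instances. Only the `image_url` and `description` keys are taken
# from the instance_display config; the `id` key and all values are assumptions.
instances = [
    {
        "id": "img_001",
        "image_url": "https://example.com/images/park.jpg",
        "description": "Photo taken in a city park at noon.",
    },
    {
        "id": "img_002",
        "image_url": "https://example.com/images/street.jpg",
        "description": "Street scene with parked cars.",
    },
]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records) + "\n"


jsonl_text = to_jsonl(instances)
print(jsonl_text, end="")
```

Write `jsonl_text` to whichever data file your config points at.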

One feature people have been asking for is per-turn dialogue ratings. You can drop an inline Likert widget onto individual conversation turns, so annotators rate a specific speaker without leaving the conversation view.

[Read the full instance display documentation →](/docs/core-concepts/instance-display)

## Multi-field span annotation

Span annotation now takes a `target_field` option that binds a span scheme to one text field, so you can annotate several text fields in the same instance. This matters for tasks like summarization evaluation, where you mark up entities in both the source document and the summary.

```yaml
annotation_schemes:
  - annotation_type: span
    name: source_entities
    target_field: "source_text"
    labels: [PERSON, ORGANIZATION, LOCATION]

  - annotation_type: span
    name: summary_entities
    target_field: "summary"
    labels: [PERSON, ORGANIZATION, LOCATION]
```

Output annotations are keyed by field name, so it is clear which text field each span came from.
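
To show why per-field keys are useful downstream, here is a small sketch for the summarization case: it lists entities marked in the summary that were never marked in the source. The exact output schema is not shown in this post, so the dictionary shape below (field name mapped to a list of span dicts with `label` and `text`) and the function name `unsupported_entities` are assumptions. Adapt them to the real output format.

```python
# Hypothetical shape: field name -> list of span dicts.
# Potato's real output schema may differ; this is for illustration only.
spans_by_field = {
    "source_text": [
        {"label": "PERSON", "text": "Ada Lovelace"},
        {"label": "ORGANIZATION", "text": "Royal Society"},
    ],
    "summary": [
        {"label": "PERSON", "text": "Ada Lovelace"},
        {"label": "LOCATION", "text": "Paris"},
    ],
}


def unsupported_entities(spans, source_field="source_text", summary_field="summary"):
    """Return sorted (label, text) pairs marked in the summary but not in the source."""
    source = {(s["label"], s["text"]) for s in spans.get(source_field, [])}
    summary = {(s["label"], s["text"]) for s in spans.get(summary_field, [])}
    return sorted(summary - source)


missing = unsupported_entities(spans_by_field)
print(missing)
```

Matching on exact text is deliberately naive. A real evaluation would probably normalize case and aliases first.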

[Read the updated span annotation documentation →](/docs/annotation-types/span-annotation)

## Span linking

The new `span_link` annotation type handles relation extraction: you create typed relationships between spans you have already annotated. That covers tasks like knowledge graph construction, coreference resolution, and discourse analysis.

```yaml
annotation_schemes:
  - annotation_type: span
    name: entities
    labels:
      - name: "PERSON"
        color: "#3b82f6"
      - name: "ORGANIZATION"
        color: "#22c55e"

  - annotation_type: span_link
    name: relations
    span_schema: entities
    link_types:
      - name: "WORKS_FOR"
        directed: true
        allowed_source_labels: ["PERSON"]
        allowed_target_labels: ["ORGANIZATION"]
        color: "#dc2626"
      - name: "COLLABORATES_WITH"
        directed: false
        allowed_source_labels: ["PERSON"]
        allowed_target_labels: ["PERSON"]
        color: "#06b6d4"
```

Links can be directed or undirected, and n-ary links can join more than two spans. They show up as arcs drawn above the text. Label constraints let you say which entity types can take part in each relationship type.
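
The label constraints amount to a simple membership check. The sketch below mirrors the two link types from the config and tests whether a two-span link is allowed. It is not Potato's implementation: the `LinkType` class and `link_allowed` function are hypothetical, and trying the reversed pair for undirected links is an assumption about how such constraints would reasonably behave.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkType:
    """Illustrative mirror of one entry under `link_types` (not a Potato class)."""
    name: str
    directed: bool
    allowed_source_labels: frozenset
    allowed_target_labels: frozenset


LINK_TYPES = {
    "WORKS_FOR": LinkType(
        "WORKS_FOR", True, frozenset({"PERSON"}), frozenset({"ORGANIZATION"})
    ),
    "COLLABORATES_WITH": LinkType(
        "COLLABORATES_WITH", False, frozenset({"PERSON"}), frozenset({"PERSON"})
    ),
}


def link_allowed(link_type, source_label, target_label):
    """Check a two-span link against the label constraints of its type."""

    def fits(src, tgt):
        return (
            src in link_type.allowed_source_labels
            and tgt in link_type.allowed_target_labels
        )

    if fits(source_label, target_label):
        return True
    # Assumption: an undirected link has no fixed order, so try the reverse too.
    return (not link_type.directed) and fits(target_label, source_label)


print(link_allowed(LINK_TYPES["WORKS_FOR"], "PERSON", "ORGANIZATION"))
print(link_allowed(LINK_TYPES["WORKS_FOR"], "ORGANIZATION", "PERSON"))
```

With the config above, a `WORKS_FOR` link from a `PERSON` to an `ORGANIZATION` passes the check, and the reverse direction does not.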

[Read the full span linking documentation →](/docs/annotation-types/span-linking)

## Visual AI support

Potato 2.1 adds four vision endpoints, which bring AI assistance to image and video annotation for the first time. Until now the AI features only worked on text.

### Four vision endpoints

YOLO is the one to use for fast, precise object detection with local inference. It supports YOLOv8 variants and YOLO-World for open-vocabulary detection.

```yaml
ai_support:
  enabled: true
  endpoint_type: "yolo"
  ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
    iou_threshold: 0.45
```
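
If the two thresholds are unfamiliar: in YOLO-style detectors, `confidence_threshold` usually drops low-scoring boxes, and `iou_threshold` usually controls non-maximum suppression, where a box is discarded if it overlaps a higher-scoring box too much. The sketch below shows that standard technique in plain Python. How Potato hands these values to the model is not covered here, the function names and sample boxes are made up, and the suppression is class-agnostic for brevity.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(detections, confidence_threshold=0.5, iou_threshold=0.45):
    """Drop low-confidence boxes, then greedily suppress overlapping ones."""
    candidates = sorted(
        (d for d in detections if d["score"] >= confidence_threshold),
        key=lambda d: d["score"],
        reverse=True,
    )
    kept = []
    for det in candidates:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(det["box"], k["box"]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


# Made-up detections for illustration.
detections = [
    {"box": (10, 10, 110, 110), "score": 0.92},
    {"box": (15, 12, 112, 108), "score": 0.80},    # near-duplicate of the first
    {"box": (200, 50, 260, 120), "score": 0.65},
    {"box": (300, 300, 340, 340), "score": 0.30},  # below the confidence threshold
]
kept = filter_detections(detections)
print([d["score"] for d in kept])
```

Raising `confidence_threshold` gives annotators fewer but more reliable suggestions. Lowering `iou_threshold` suppresses overlapping boxes more aggressively.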

Ollama Vision runs vision-language models locally through Ollama. It supports LLaVA, Llama 3.2 Vision, Qwen2.5-VL, BakLLaVA, and Moondream.

```yaml
ai_support:
  enabled: true
  endpoint_type: "ollama_vision"
  ai_config:
    model: "llava:latest"
    base_url: "http://localhost:11434"
```

OpenAI Vision does cloud-based vision analysis with GPT-4o and configurable detail levels.

```yaml
ai_support:
  enabled: true
  endpoint_type: "openai_vision"
  ai_config:
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"
    detail: "auto"
```

Anthropic Vision uses Claude's vision capabilities for image understanding and classification.

```yaml
ai_support:
  enabled: true
  endpoint_type: "anthropic_vision"
  ai_config:
    api_key: "${ANTHROPIC_API_KEY}"
    model: "claude-sonnet-4-20250514"
```

### Image AI features

On image tasks, visual AI gives you four assistance modes:

- Detection finds objects matching your labels and draws suggestion bounding boxes as dashed overlays
- Pre-annotation (Auto) detects every object in the image and creates suggestions for a human to review
- Classification labels a selected region or the whole image and gives a confidence score
- Hints point annotators in the right direction without revealing the exact location, which is handy for training

```yaml
annotation_schemes:
  - annotation_type: image_annotation
    name: object_detection
    tools: [bbox, polygon]
    labels:
      - name: "person"
        color: "#FF6B6B"
      - name: "car"
        color: "#4ECDC4"
    ai_support:
      enabled: true
      features:
        detection: true
        pre_annotate: true
        classification: false
        hint: true
```

### Video AI features

On video tasks, visual AI adds scene detection (it finds scene boundaries and suggests temporal segments), keyframe detection (it picks out significant moments), and object tracking (it suggests where an object sits across frames).

### Accept and reject workflow

AI suggestions show up as dashed overlays. Annotators can accept one (double-click), reject one (right-click), accept all, or clear all. The human stays in the loop; the AI just does the tedious first pass.

### Separate visual and text endpoints

You can point text tasks and visual tasks at different AI endpoints, so each content type uses the model that suits it:

```yaml
ai_support:
  enabled: true
  endpoint_type: "ollama"          # Text annotations
  visual_endpoint_type: "yolo"     # Image/video annotations
  ai_config:
    model: "llama3.2"
  visual_ai_config:
    model: "yolov8m.pt"
    confidence_threshold: 0.5
```

[Read the full visual AI support documentation →](/docs/features/visual-ai-support)

## Layout customization

Potato 2.1 lets you build custom visual layouts. By default it generates an editable `layouts/task_layout.html` file, and you can swap in your own HTML template with CSS grid layouts, color-coded options, and section styling.

```yaml
task_layout: layouts/custom_task_layout.html
```

There are three example layouts in `project-hub/layout-examples/`. The content moderation one has a warning banner, a two-column grid, and color-coded severity. The dialogue QA one shows case metadata, circular Likert ratings, and grouped assessments. The medical review one uses a structured reporting style.

Custom layouts play nicely with the new `instance_display` system: display content renders above your custom annotation forms.

[Read the full layout customization documentation →](/docs/features/layout-customization)

## Other improvements

### Label rationales

Label rationales are a fourth AI capability, joining hints, keyword highlighting, and label suggestions. A rationale gives a balanced explanation of why each label might apply, which helps annotators see the reasoning behind a classification rather than just guessing.

```yaml
ai_support:
  features:
    rationales:
      enabled: true
```

### Bug fixes and testing

- Over 50 new tests for better reliability
- Responsive design fixes across the annotation types
- A tidier project-hub, now with layout examples

## Upgrading to v2.1

```bash
pip install --upgrade potato-annotation
```

Your existing v2.0 configs keep working without changes. Everything new is opt-in, through extra config blocks like `instance_display`, `span_link` schemes, and visual AI endpoints.

## Getting started

- [What's New](/docs/getting-started/whats-new-v2): the full v2.1 feature overview
- [Instance Display](/docs/core-concepts/instance-display): multi-modal content display
- [Visual AI Support](/docs/features/visual-ai-support): AI for image and video annotation
- [Span Linking](/docs/annotation-types/span-linking): entity relationship annotation
- [Layout Customization](/docs/features/layout-customization): custom HTML templates

For the full changelog and any config keys that changed, see the [v2.1.0 release notes](https://github.com/davidjurgens/potato/blob/master/docs/releasenotes/v2.1.0.md) in the repository.

---

*Have questions or feedback? Join our [Discord](https://discord.gg/TDWQAqU3) or open an issue on [GitHub](https://github.com/davidjurgens/potato/issues).*