RefCOCO - Referring Expression Grounding
Visual grounding task where annotators draw bounding boxes around objects referred to by natural language expressions. Based on the RefCOCO dataset (Yu et al., ECCV 2016), this task links referring expressions to spatial regions in images.
Configuration file: config.yaml
# RefCOCO - Referring Expression Grounding
# Based on Yu et al., ECCV 2016
# Paper: https://arxiv.org/abs/1608.00272
# Dataset: https://github.com/lichengunc/refer
#
# Visual grounding task: given an image and a referring expression, annotators
# draw a bounding box around the object described by the expression. This tests
# the ability to link natural language to spatial image regions.
#
# Annotation Guidelines:
# 1. Read the referring expression carefully
# 2. Examine the image and identify the object being described
# 3. Draw a tight bounding box around the referred object
# 4. If the expression is ambiguous, optionally write a clarification
# 5. The bounding box should be as tight as possible around the object
annotation_task_name: "RefCOCO - Referring Expression Grounding"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  # Step 1: Draw bounding box around the referred object
  - annotation_type: image_annotation
    name: referred_object
    description: "Draw a bounding box around the object described by the referring expression."
    tools:
      - bbox
    labels:
      - "Referred Object"

  # Step 2: Optionally provide a clarifying expression
  - annotation_type: text
    name: clarification
    description: "If the original expression is ambiguous, write a more precise referring expression."
    textarea: false
    required: false
    placeholder: "Optional: write a clearer referring expression..."
annotation_instructions: |
  You will see an image and a referring expression that describes one specific object in the image.

  For each item:
  1. Read the referring expression below the image.
  2. Identify which object in the image is being described.
  3. Use the bounding box tool to draw a tight box around that object.
  4. If the expression is ambiguous or could refer to multiple objects,
     draw the box around the most likely referent and optionally write
     a clearer expression in the text field.

  Tips:
  - Draw the box as tightly as possible around the entire object.
  - Referring expressions often use spatial relationships (e.g., "the dog on the left").
  - Consider attributes like color, size, and position to disambiguate.
html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="text-align: center; margin-bottom: 16px; background: #1a1a1a; padding: 12px; border-radius: 8px;">
      <img src="{{image_url}}" style="max-width: 100%; max-height: 500px; border-radius: 4px;" alt="Image for grounding" />
    </div>
    <div style="background: #fef3c7; border: 1px solid #fbbf24; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #92400e;">Referring Expression:</strong>
      <p style="font-size: 18px; line-height: 1.7; margin: 8px 0 0 0; font-style: italic;">{{text}}</p>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
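Because each item is labeled by two annotators (`annotation_per_instance: 2`), inter-annotator agreement on the boxes can be measured with intersection-over-union. A minimal sketch, assuming boxes are stored as `[x, y, w, h]` pixel coordinates (the exact serialized box format depends on the Potato version):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as [x, y, w, h]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Intersection rectangle (clamped to zero when boxes don't overlap)
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Two annotators' boxes for the same referred object
score = iou([10, 10, 100, 100], [20, 20, 100, 100])
```

An IoU threshold of 0.5 is the conventional cutoff in the grounding literature for counting two boxes as the same referent.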
Sample data: sample-data.json
[
{
"id": "refcoco_001",
"text": "the woman in the red dress standing near the fountain",
"image_url": "https://example.com/refcoco/park_scene_001.jpg"
},
{
"id": "refcoco_002",
"text": "the small brown dog on the left side of the bench",
"image_url": "https://example.com/refcoco/bench_dogs_002.jpg"
}
]
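Before launching the task, it is worth checking that every item carries the fields the config references (`id_key`, `text_key`, and the `{{image_url}}` used in `html_layout`). A small sanity-check sketch; the filename matches the sample data above:

```python
import json

# Keys referenced by config.yaml: id_key, text_key, and the html_layout template
REQUIRED_KEYS = {"id", "text", "image_url"}

def validate_items(path="sample-data.json"):
    """Raise ValueError if any item is missing a required key; return item count."""
    with open(path) as f:
        items = json.load(f)
    for item in items:
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"item {item.get('id', '?')} missing {sorted(missing)}")
    return len(items)
```

Running this against the full data file should report 10 items (the two shown above plus the 8 elided ones).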
// ... and 8 more items
Get this design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/image/visual-grounding/refcoco-expression
potato start config.yaml
Found a problem or want to improve this design?
Open an issue
Related designs
TextVQA - Reading Text in Images
Visual question answering that requires reading and reasoning about text present in images. Based on the TextVQA dataset (Singh et al., CVPR 2019), annotators answer questions about images where understanding scene text (signs, labels, menus, etc.) is essential.
Visual Question Answering
Answer questions about images for VQA dataset creation.
ActivityNet Captions Dense Annotation
Dense temporal annotation with natural language descriptions. Annotators segment videos into events and write descriptive captions for each temporal segment.