intermediate · image

TextVQA - Reading Text in Images

Visual question answering that requires reading and reasoning about text present in images. Based on the TextVQA dataset (Singh et al., CVPR 2019), annotators answer questions about images where understanding scene text (signs, labels, menus, etc.) is essential.

Labels: outdoor, nature, urban, people, animal

Configuration file: config.yaml

# TextVQA - Reading Text in Images
# Based on Singh et al., CVPR 2019
# Paper: https://arxiv.org/abs/1904.08920
# Dataset: https://textvqa.org/
#
# Visual question answering task requiring reading and reasoning about
# text present in natural images (signs, labels, book covers, menus, etc.).
# Annotators provide a text answer to each question and indicate answerability.
#
# Annotation Guidelines:
# 1. Examine the image carefully, paying attention to any visible text
# 2. Read the question and determine what text information is being asked about
# 3. Type your answer in the text field -- answers should be concise
# 4. Indicate whether the question is answerable from the image
# 5. If text in the image is too blurry or occluded, mark as Unanswerable

annotation_task_name: "TextVQA - Reading Text in Images"
task_dir: "."

data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

port: 8000
server_name: localhost

annotation_schemes:
  # Step 1: Provide the answer
  - annotation_type: text
    name: answer
    description: "Type the answer to the question based on text visible in the image."
    textarea: false
    required: true
    placeholder: "Type your answer here..."

  # Step 2: Answerability judgment
  - annotation_type: radio
    name: answerability
    description: "Can this question be answered from the text visible in the image?"
    labels:
      - "Answerable"
      - "Unanswerable"
      - "Ambiguous"
    keyboard_shortcuts:
      "Answerable": "1"
      "Unanswerable": "2"
      "Ambiguous": "3"
    tooltips:
      "Answerable": "The question can be clearly answered using text visible in the image"
      "Unanswerable": "The required text is not visible, too blurry, or not present in the image"
      "Ambiguous": "The answer is unclear due to partial occlusion, multiple possible readings, or vague question"

annotation_instructions: |
  You will be shown an image along with a question that requires reading text in the image.

  For each item:
  1. Look at the image and identify any text (signs, labels, logos, numbers, etc.).
  2. Read the question carefully.
  3. Type a concise answer based on the text you can read in the image.
  4. Indicate whether the question is answerable, unanswerable, or ambiguous.

  Tips:
  - Answers should be short and precise (usually 1-3 words).
  - If the text is partially visible, provide your best reading.
  - Mark as "Unanswerable" if the relevant text is not in the image at all.

html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="text-align: center; margin-bottom: 16px; background: #1a1a1a; padding: 12px; border-radius: 8px;">
      <img src="{{image_url}}" style="max-width: 100%; max-height: 500px; border-radius: 4px;" alt="Image for TextVQA" />
      <p style="color: #999; font-size: 12px; margin-top: 8px;">{{image_description}}</p>
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Question:</strong>
      <p style="font-size: 18px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
  </div>

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
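To make the two-step workflow above concrete, here is a small sketch of how a single completed annotation could be assembled and validated against the schemes defined in the config. This is illustrative only: the `make_annotation` helper and the record layout are assumptions for this sketch, not Potato's actual output schema, and the example answer "Maple Street" is invented.

```python
# Labels and shortcuts mirror the "answerability" radio scheme in config.yaml.
ANSWERABILITY_LABELS = ["Answerable", "Unanswerable", "Ambiguous"]
KEYBOARD_SHORTCUTS = {"Answerable": "1", "Unanswerable": "2", "Ambiguous": "3"}

def make_annotation(item_id: str, answer: str, answerability: str) -> dict:
    """Bundle the required free-text answer (step 1) and the radio
    judgment (step 2) for one item, enforcing the config's constraints.
    Hypothetical helper; Potato itself handles this server-side."""
    if not answer.strip():
        # Mirrors `required: true` on the text scheme.
        raise ValueError("'answer' is required and must be non-empty")
    if answerability not in ANSWERABILITY_LABELS:
        raise ValueError(f"unknown answerability label: {answerability!r}")
    return {
        "id": item_id,
        "answer": answer.strip(),
        "answerability": answerability,
    }

# Example record for the first sample item (answer invented for illustration).
record = make_annotation("textvqa_001", "Maple Street", "Answerable")
```

Note that both schemes are captured in one record per item, matching how the config pairs a required text field with a required answerability judgment.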

Sample data: sample-data.json

[
  {
    "id": "textvqa_001",
    "text": "What is the name of the street shown on the green street sign?",
    "image_url": "https://example.com/textvqa/street_sign_001.jpg",
    "image_description": "A green street sign at an intersection in a suburban neighborhood"
  },
  {
    "id": "textvqa_002",
    "text": "What time does the store close according to the hours posted on the door?",
    "image_url": "https://example.com/textvqa/store_hours_002.jpg",
    "image_description": "Glass door of a retail store with posted business hours"
  }
]

// ... and 8 more items
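Each item must carry the fields referenced by `item_properties` (`id`, `text`) and by the `html_layout` template (`image_url`, `image_description`). The following sketch, which is not part of Potato itself, shows one way to check a data file before starting the server:

```python
# Fields referenced by item_properties and the html_layout template.
REQUIRED_KEYS = {"id", "text", "image_url", "image_description"}

def validate_items(items):
    """Return a list of problem descriptions; an empty list means the
    data looks usable. Illustrative helper, not a Potato API."""
    problems = []
    seen_ids = set()
    for i, item in enumerate(items):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append(f"item {i}: missing keys {sorted(missing)}")
        item_id = item.get("id")
        if item_id in seen_ids:
            problems.append(f"item {i}: duplicate id {item_id!r}")
        seen_ids.add(item_id)
    return problems

# The two sample items shown above pass the check.
items = [
    {
        "id": "textvqa_001",
        "text": "What is the name of the street shown on the green street sign?",
        "image_url": "https://example.com/textvqa/street_sign_001.jpg",
        "image_description": "A green street sign at an intersection in a suburban neighborhood",
    },
    {
        "id": "textvqa_002",
        "text": "What time does the store close according to the hours posted on the door?",
        "image_url": "https://example.com/textvqa/store_hours_002.jpg",
        "image_description": "Glass door of a retail store with posted business hours",
    },
]
```

In practice you would load the list with `json.load(open("sample-data.json"))` and run `validate_items` on the result.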

Get this layout

View on GitHub

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/image/visual-qa/textvqa-reading-in-images
potato start config.yaml

Details

Annotation types

text, radio

Domain

Computer Vision, NLP

Use cases

Visual Question Answering, Scene Text Understanding, OCR

Tags

textvqa, vqa, ocr, scene-text, reading, cvpr2019, multimodal

Found a problem, or want to improve this layout?

Open an issue