TextVQA - Reading Text in Images
Visual question answering that requires reading and reasoning about text present in images. Based on the TextVQA dataset (Singh et al., CVPR 2019), annotators answer questions about images where understanding scene text (signs, labels, menus, etc.) is essential.
Configuration file: config.yaml
# TextVQA - Reading Text in Images
# Based on Singh et al., CVPR 2019
# Paper: https://arxiv.org/abs/1904.08920
# Dataset: https://textvqa.org/
#
# Visual question answering task requiring reading and reasoning about
# text present in natural images (signs, labels, book covers, menus, etc.).
# Annotators provide a text answer to each question and indicate answerability.
#
# Annotation Guidelines:
# 1. Examine the image carefully, paying attention to any visible text
# 2. Read the question and determine what text information is being asked about
# 3. Type your answer in the text field -- answers should be concise
# 4. Indicate whether the question is answerable from the image
# 5. If text in the image is too blurry or occluded, mark as Unanswerable
annotation_task_name: "TextVQA - Reading Text in Images"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  # Step 1: Provide the answer
  - annotation_type: text
    name: answer
    description: "Type the answer to the question based on text visible in the image."
    textarea: false
    required: true
    placeholder: "Type your answer here..."
  # Step 2: Answerability judgment
  - annotation_type: radio
    name: answerability
    description: "Can this question be answered from the text visible in the image?"
    labels:
      - "Answerable"
      - "Unanswerable"
      - "Ambiguous"
    keyboard_shortcuts:
      "Answerable": "1"
      "Unanswerable": "2"
      "Ambiguous": "3"
    tooltips:
      "Answerable": "The question can be clearly answered using text visible in the image"
      "Unanswerable": "The required text is not visible, too blurry, or not present in the image"
      "Ambiguous": "The answer is unclear due to partial occlusion, multiple possible readings, or vague question"
annotation_instructions: |
  You will be shown an image along with a question that requires reading text in the image.
  For each item:
  1. Look at the image and identify any text (signs, labels, logos, numbers, etc.).
  2. Read the question carefully.
  3. Type a concise answer based on the text you can read in the image.
  4. Indicate whether the question is answerable, unanswerable, or ambiguous.
  Tips:
  - Answers should be short and precise (usually 1-3 words).
  - If the text is partially visible, provide your best reading.
  - Mark as "Unanswerable" if the relevant text is not in the image at all.
html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="text-align: center; margin-bottom: 16px; background: #1a1a1a; padding: 12px; border-radius: 8px;">
      <img src="{{image_url}}" style="max-width: 100%; max-height: 500px; border-radius: 4px;" alt="Image for TextVQA" />
      <p style="color: #999; font-size: 12px; margin-top: 8px;">{{image_description}}</p>
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Question:</strong>
      <p style="font-size: 18px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
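Before launching the task, it can help to confirm that every item in the data file carries the fields the config and HTML layout reference. A minimal sketch in Python (the required key set and default file path are assumptions read off the config above, not part of Potato itself):

```python
import json

# Keys the config above relies on: item_properties (id_key="id",
# text_key="text") plus the {{image_url}} and {{image_description}}
# template variables used in html_layout. This set is an assumption
# derived from this specific config.
REQUIRED_KEYS = {"id", "text", "image_url", "image_description"}

def validate_items(items):
    """Return (item_id, sorted missing keys) for each malformed item."""
    problems = []
    for i, item in enumerate(items):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append((item.get("id", f"index {i}"), sorted(missing)))
    return problems

def check_file(path="sample-data.json"):
    """Load the data file the config points at and report any gaps.
    (Path assumes you run this from the task directory.)"""
    with open(path) as f:
        return validate_items(json.load(f))
```

An empty result from `check_file()` means every item has all four fields; otherwise each tuple names the offending item and what it lacks.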
Sample data: sample-data.json
[
  {
    "id": "textvqa_001",
    "text": "What is the name of the street shown on the green street sign?",
    "image_url": "https://example.com/textvqa/street_sign_001.jpg",
    "image_description": "A green street sign at an intersection in a suburban neighborhood"
  },
  {
    "id": "textvqa_002",
    "text": "What time does the store close according to the hours posted on the door?",
    "image_url": "https://example.com/textvqa/store_hours_002.jpg",
    "image_description": "Glass door of a retail store with posted business hours"
  }
]
// ... and 8 more items
Get this design
View on GitHub
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/image/visual-qa/textvqa-reading-in-images
potato start config.yaml
Details
Annotation types
text, radio
Domain
Computer Vision, NLP
Use cases
Visual Question Answering, Scene Text Understanding, OCR
Tags
textvqa, vqa, ocr, scene-text, reading, cvpr2019, multimodal
Found a problem or want to improve this design?
Open an issue
Related designs
Visual Question Answering
Answer questions about images for VQA dataset creation.
radio, text
CUB-200-2011 Fine-Grained Bird Classification
Fine-grained visual categorization of 200 bird species (Wah et al., 2011). Annotate bird images with species labels, part locations, and attribute annotations.
multiselect, radio
EPIC-KITCHENS Egocentric Action Annotation
Annotate fine-grained actions in egocentric kitchen videos with verb-noun pairs. Identify cooking actions from a first-person perspective.
radio, text