VQA v2.0 - Visual Question Answering

A visual question answering task in which annotators answer natural language questions about images, based on the VQA v2.0 dataset (Goyal et al., CVPR 2017). Supports yes/no, number, and open-ended question types.

Configuration File: config.yaml

# VQA v2.0 - Visual Question Answering
# Based on Goyal et al., CVPR 2017
# Paper: https://arxiv.org/abs/1612.00837
# Dataset: https://visualqa.org/
#
# This task presents an image and a natural language question about it.
# Annotators provide a free-form answer and also select from a radio
# button group: Yes or No for yes/no questions, Not Applicable otherwise.
#
# Question Types:
# - yes/no: Questions that can be answered with yes or no
# - number: Questions asking about counts or quantities
# - other: Open-ended questions requiring descriptive answers
#
# Annotation Guidelines:
# 1. Look at the image carefully
# 2. Read the question
# 3. Provide a concise, accurate answer in the text field
# 4. For yes/no questions, also select Yes or No; otherwise select Not Applicable
# 5. Answer based only on what you can see in the image

annotation_task_name: "VQA v2.0 - Visual Question Answering"
task_dir: "."

data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

port: 8000
server_name: localhost

annotation_schemes:
  # Step 1: Free-form answer
  - annotation_type: text
    name: answer
    description: "Provide a concise answer to the question about the image"

  # Step 2: Yes/No/NA for applicable questions
  - annotation_type: radio
    name: yes_no_answer
    description: "For yes/no questions, select the appropriate answer. For other question types, select Not Applicable."
    labels:
      - "Yes"
      - "No"
      - "Not Applicable"
    keyboard_shortcuts:
      "Yes": "1"
      "No": "2"
      "Not Applicable": "3"
    tooltips:
      "Yes": "The answer to the yes/no question is yes"
      "No": "The answer to the yes/no question is no"
      "Not Applicable": "The question is not a yes/no question"

annotation_instructions: |
  You will be shown an image and a question about it. Your task is to:
  1. Study the image carefully.
  2. Read the question.
  3. Type a concise answer in the text field (1-3 words preferred).
  4. If the question is a yes/no question, also select Yes or No. Otherwise, select Not Applicable.

  Answer based only on what you can see in the image. Keep answers brief and specific.

html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 16px; margin-bottom: 16px; text-align: center;">
      <img src="{{image_url}}" style="max-width: 100%; max-height: 500px; border-radius: 4px;" alt="Image for question answering" />
    </div>
    <div style="background: #fefce8; border: 1px solid #fde68a; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #a16207; font-size: 18px;">Question:</strong>
      <p style="font-size: 17px; line-height: 1.6; margin: 8px 0 0 0; font-weight: 500;">{{text}}</p>
      <p style="font-size: 13px; color: #6b7280; margin: 8px 0 0 0;">Question type: {{question_type}}</p>
    </div>
  </div>

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
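
The two annotation schemes above are meant to agree with each other: a yes/no question should get Yes or No on the radio group, and every other question type should get Not Applicable. The following is a minimal sketch of a post-hoc consistency check, assuming each annotation has already been loaded into plain Python values. The function name `check_annotation` and its arguments are hypothetical and are not part of Potato; the label strings and question types are taken from the config above.

```python
def check_annotation(question_type, answer, yes_no_answer):
    """Return a list of problems for one annotation (empty list means consistent).

    Hypothetical helper: question_type is "yes/no", "number", or "other";
    answer is the free-form text; yes_no_answer is the selected radio label.
    """
    problems = []
    text = answer.strip().lower()

    # The instructions ask for a concise answer, 1-3 words preferred.
    if not text:
        problems.append("empty free-form answer")
    elif len(text.split()) > 3:
        problems.append("answer longer than the preferred 1-3 words")

    if question_type == "yes/no":
        # Yes/no questions need an explicit Yes or No on the radio group.
        if yes_no_answer not in ("Yes", "No"):
            problems.append("yes/no question needs Yes or No")
        elif text in ("yes", "no") and text != yes_no_answer.lower():
            problems.append("free-form answer contradicts the radio selection")
    elif yes_no_answer != "Not Applicable":
        # All other question types should be marked Not Applicable.
        problems.append("non-yes/no question should use Not Applicable")

    return problems


print(check_annotation("yes/no", "no", "Yes"))
print(check_annotation("other", "red", "Not Applicable"))
```

The word-count check is a soft warning only, since the instructions state a preference rather than a hard limit.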

Sample Data: sample-data.json

[
  {
    "id": "vqa_001",
    "text": "What color is the fire hydrant?",
    "image_url": "images/vqa_001.jpg",
    "question_type": "other"
  },
  {
    "id": "vqa_002",
    "text": "Is the person wearing a hat?",
    "image_url": "images/vqa_002.jpg",
    "question_type": "yes/no"
  }
]

// ... and 8 more items
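
Each item needs the fields that the config and the HTML layout reference: `id`, `text`, `image_url`, and `question_type`. The following is a minimal sketch of a pre-flight check over items shaped like the two shown above, assuming the question types listed in the config comments are the only valid ones. The helper `validate_items` is hypothetical and not part of Potato.

```python
import json

REQUIRED_FIELDS = ("id", "text", "image_url", "question_type")
QUESTION_TYPES = {"yes/no", "number", "other"}  # from the config comments


def validate_items(items):
    """Return a list of human-readable problems found in the data items."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        label = item.get("id") or f"item #{index}"

        # Every field used by the config or html_layout must be present.
        for field in REQUIRED_FIELDS:
            if not item.get(field):
                problems.append(f"{label}: missing or empty '{field}'")

        qtype = item.get("question_type")
        if qtype and qtype not in QUESTION_TYPES:
            problems.append(f"{label}: unknown question_type '{qtype}'")

        # Item ids should be unique across the file.
        item_id = item.get("id")
        if item_id:
            if item_id in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(item_id)
    return problems


# The two items shown in the sample above.
sample = json.loads("""
[
  {"id": "vqa_001", "text": "What color is the fire hydrant?",
   "image_url": "images/vqa_001.jpg", "question_type": "other"},
  {"id": "vqa_002", "text": "Is the person wearing a hat?",
   "image_url": "images/vqa_002.jpg", "question_type": "yes/no"}
]
""")
print(validate_items(sample))
```

To check the real file, replace the inline string with `json.load(open("sample-data.json"))`.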

Get This Design

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/image/visual-qa/vqav2-visual-question-answering
potato start config.yaml

Details

Annotation Types

text, radio

Domain

Computer Vision, Visual Question Answering

Use Cases

Visual QA, Image Understanding, Multimodal Reasoning

Related Designs

CUB-200-2011 Fine-Grained Bird Classification

Fine-grained visual categorization of 200 bird species (Wah et al., 2011). Annotate bird images with species labels, part locations, and attribute annotations.

multiselect, radio

EPIC-KITCHENS Egocentric Action Annotation

Annotate fine-grained actions in egocentric kitchen videos with verb-noun pairs. Identify cooking actions from a first-person perspective.

radio, text

FLAIR: French Land Cover from Aerospace Imagery

Land use and land cover classification from high-resolution aerial imagery. Annotators classify the primary land use category of aerial image patches and identify any secondary land uses present. Based on the FLAIR dataset from the French National Institute of Geographic and Forest Information (IGN).