# Extractive QA

Source: https://www.potatoannotator.com/docs/annotation-types/extractive-qa

The extractive QA annotation schema provides a question-answering interface in which annotators highlight answer spans directly in a text passage. It suits reading comprehension dataset creation, SQuAD-style QA annotation, fact verification, and any other task where answers are extracted verbatim from source text.

## Overview

The extractive QA schema presents:
- **A question** displayed prominently above the passage
- **A text passage** where annotators select answer spans by highlighting
- **Color-coded highlights** marking selected answer text
- **An unanswerable option** for questions that cannot be answered from the passage

## Quick Start

```yaml
annotation_schemes:
  - annotation_type: extractive_qa
    name: answer_span
    description: Highlight the answer to the question in the passage below.
    question_field: question
    passage_field: passage
    allow_unanswerable: true
```

## Configuration Options

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `annotation_type` | string | Required | Must be `"extractive_qa"` |
| `name` | string | Required | Unique identifier for this schema |
| `description` | string | Required | Instructions displayed to annotators |
| `question_field` | string | `"question"` | Field in the data JSON containing the question text |
| `passage_field` | string | `""` | Field in the data JSON containing the passage text (empty string uses the default text field) |
| `allow_unanswerable` | boolean | `true` | Show a checkbox for marking questions as unanswerable |
| `highlight_color` | string | `"#FFEB3B"` | CSS color for the answer highlight |
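
The table above can be read as a set of required keys plus defaults. The sketch below applies those defaults to a scheme that has already been loaded into a Python dict. `normalize_scheme` is a hypothetical helper written for illustration; it is not part of Potato and does not claim to mirror how Potato validates its configuration.

```python
# Hypothetical helper (not Potato's API): applies the defaults from the table above.
REQUIRED = ("annotation_type", "name", "description")
DEFAULTS = {
    "question_field": "question",
    "passage_field": "",          # empty string: fall back to the default text field
    "allow_unanswerable": True,
    "highlight_color": "#FFEB3B",
}


def normalize_scheme(scheme: dict) -> dict:
    """Return a copy of `scheme` with documented defaults filled in."""
    missing = [key for key in REQUIRED if key not in scheme]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")
    if scheme["annotation_type"] != "extractive_qa":
        raise ValueError("annotation_type must be 'extractive_qa'")
    # Explicit values in `scheme` take precedence over the defaults.
    return {**DEFAULTS, **scheme}


scheme = normalize_scheme({
    "annotation_type": "extractive_qa",
    "name": "answer_span",
    "description": "Highlight the answer to the question in the passage below.",
})
print(scheme["question_field"], scheme["highlight_color"])
```

A check like this can catch a missing `name` or `description` before the annotation server is started.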

## Examples

### SQuAD-Style QA

```yaml
annotation_schemes:
  - annotation_type: extractive_qa
    name: squad_answer
    description: >
      Select the shortest span in the passage that answers the question.
      If the question cannot be answered from the passage, mark it as unanswerable.
    question_field: question
    passage_field: context
    allow_unanswerable: true
    highlight_color: "#FFEB3B"
```

With sample data:
```json
{
  "id": "q001",
  "question": "When was the university founded?",
  "context": "The University of Michigan was founded in 1817 in Detroit and moved to Ann Arbor in 1837. It is one of the oldest public universities in the United States."
}
```
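
Because `question_field` and `passage_field` must name keys that exist in every data item, it can help to check the data before annotation begins. The following is a minimal sketch that assumes the items are already loaded as Python dicts; `check_items` and the second, deliberately broken item are hypothetical.

```python
def check_items(items, question_field="question", passage_field="context"):
    """Return (item id, problem) pairs; an empty list means every item has both fields."""
    problems = []
    for item in items:
        item_id = item.get("id", "<no id>")
        for field in (question_field, passage_field):
            value = item.get(field)
            # A usable field is a non-empty string.
            if not isinstance(value, str) or not value.strip():
                problems.append((item_id, f"missing or empty '{field}'"))
    return problems


items = [
    {
        "id": "q001",
        "question": "When was the university founded?",
        "context": "The University of Michigan was founded in 1817 in Detroit "
                   "and moved to Ann Arbor in 1837.",
    },
    # Hypothetical broken item: no "context" key.
    {"id": "q002", "question": "Where did the university move?"},
]

for item_id, problem in check_items(items):
    print(f"{item_id}: {problem}")
```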

### Fact Verification

```yaml
annotation_schemes:
  - annotation_type: extractive_qa
    name: evidence_span
    description: >
      Highlight the evidence in the passage that supports or refutes the claim.
      Mark as unanswerable if the passage contains no relevant evidence.
    question_field: claim
    passage_field: document
    allow_unanswerable: true
    highlight_color: "#81C784"
```

### Answer Extraction Without Unanswerable

```yaml
annotation_schemes:
  - annotation_type: extractive_qa
    name: definition_extraction
    description: >
      Highlight the definition of the term in the passage.
      Every passage contains a definition; select the most precise span.
    question_field: term
    passage_field: text
    allow_unanswerable: false
    highlight_color: "#64B5F6"
```

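## Output Format

The examples in this section use the `answer_span` schema from the Quick Start; the top-level key matches the schema `name`. When the annotator highlights a span: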

```json
{
  "answer_span": {
    "labels": {
      "answer_start": 45,
      "answer_end": 49,
      "answer_text": "1817",
      "unanswerable": false
    }
  }
}
```

When the annotator marks a question as unanswerable:
```json
{
  "answer_span": {
    "labels": {
      "unanswerable": true
    }
  }
}
```
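
For downstream use, the two output shapes above can be mapped to a SQuAD 2.0-style answer record. The sketch below is illustrative only: `to_squad_answer` is a hypothetical helper, and it assumes that `answer_start` and `answer_end` are character offsets into the passage with an exclusive end, which this page does not state explicitly. To avoid depending on that assumption in the example data, the offsets are computed from the passage rather than hard-coded.

```python
def to_squad_answer(annotation: dict, passage: str, scheme_name: str = "answer_span") -> dict:
    """Convert one annotation into a SQuAD 2.0-style answer record."""
    labels = annotation[scheme_name]["labels"]
    if labels.get("unanswerable", False):
        return {"is_impossible": True, "answers": []}

    start, end = labels["answer_start"], labels["answer_end"]
    text = labels["answer_text"]
    # Assumption: character offsets, end-exclusive. Fail loudly if that does not hold.
    if passage[start:end] != text:
        raise ValueError(f"offsets {start}:{end} do not match answer_text {text!r}")
    return {"is_impossible": False, "answers": [{"text": text, "answer_start": start}]}


passage = "The University of Michigan was founded in 1817 in Detroit."
start = passage.index("1817")
answered = {
    "answer_span": {
        "labels": {
            "answer_start": start,
            "answer_end": start + len("1817"),
            "answer_text": "1817",
            "unanswerable": False,
        }
    }
}
unanswerable = {"answer_span": {"labels": {"unanswerable": True}}}

print(to_squad_answer(answered, passage))
print(to_squad_answer(unanswerable, passage))
```

Verifying `answer_text` against the passage slice is a cheap way to detect an offset convention that differs from the one assumed here.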

## Best Practices

1. **Instruct annotators to select minimal spans** - the shortest text that fully answers the question produces cleaner training data
2. **Use `allow_unanswerable` for realistic tasks** - real-world QA often includes unanswerable questions; disabling this option forces annotators to guess
3. **Choose readable highlight colors** - ensure the highlight color has sufficient contrast with the text for easy reading
4. **Keep passages at a reasonable length** - 100-500 words per passage works well; very long passages make span selection tedious
5. **Provide clear question formatting** - ensure questions are well-formed and unambiguous to reduce annotator confusion
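
Two of the practices above, highlight contrast and passage length, can be checked automatically. The sketch below uses the WCAG 2.x contrast-ratio formula and a whitespace word count. It assumes the passage text is rendered in black (`#000000`) and that colors are given as `#RRGGBB`; both are assumptions, since the actual text color depends on the page styling and `highlight_color` accepts any CSS color. All function names are hypothetical.

```python
def relative_luminance(hex_color: str) -> float:
    """WCAG 2.x relative luminance of a #RRGGBB color."""
    hex_color = hex_color.lstrip("#")
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Convert each sRGB channel to linear light.
    r, g, b = [
        c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a: str, color_b: str) -> float:
    """WCAG contrast ratio between two colors, from 1.0 (identical) to 21.0."""
    lum_a, lum_b = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def passage_length_ok(passage: str, low: int = 100, high: int = 500) -> bool:
    """True if the whitespace-delimited word count is within the suggested range."""
    return low <= len(passage.split()) <= high


TEXT_COLOR = "#000000"  # assumption: black passage text
for highlight in ("#FFEB3B", "#81C784", "#64B5F6"):
    ratio = contrast_ratio(highlight, TEXT_COLOR)
    # 4.5:1 is the WCAG AA threshold for normal-size text.
    print(highlight, round(ratio, 2), "ok" if ratio >= 4.5 else "low contrast")
```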

## Further Reading

- [Span Annotation](/docs/annotation-types/span-annotation) - General span labeling
- [Text & Number Input](/docs/annotation-types/text-number-input) - Free-text answer entry
- [AI Support](/docs/features/ai-support) - AI-assisted annotation hints

For implementation details, see the [source documentation](https://github.com/davidjurgens/potato/blob/main/docs/annotation_schemes.md).
