LLMs4Subjects - Automated Subject Tagging
Automated subject classification of academic texts, requiring annotators to assign subject categories and determine whether texts span single or multiple disciplines. Based on SemEval-2025 Task 5.
Configuration File: config.yaml
# LLMs4Subjects - Automated Subject Tagging
# Based on Sinhababu et al., SemEval 2025
# Paper: https://aclanthology.org/volumes/2025.semeval-1/
# Dataset: https://github.com/SemEval/SemEval2025-Task5
#
# This task involves assigning subject categories to academic text
# passages. Annotators select all applicable subject areas and
# indicate whether the text belongs to a single discipline or
# spans multiple fields.
#
# Subject Categories:
# - Computer Science, Mathematics, Physics, Biology, Medicine,
# Engineering, Social Science, Humanities, Law, Economics
#
# Classification:
# - Single Subject: Text belongs to one clear discipline
# - Multi-Subject: Text spans multiple disciplines
# - Unclear: Subject classification is ambiguous
annotation_task_name: "LLMs4Subjects - Automated Subject Tagging"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  - annotation_type: multiselect
    name: subject_categories
    description: "Select all subject areas that apply to this text."
    labels:
      - "Computer Science"
      - "Mathematics"
      - "Physics"
      - "Biology"
      - "Medicine"
      - "Engineering"
      - "Social Science"
      - "Humanities"
      - "Law"
      - "Economics"
    tooltips:
      "Computer Science": "Algorithms, programming, AI, databases, software, etc."
      "Mathematics": "Pure or applied mathematics, statistics, logic"
      "Physics": "Classical, quantum, astrophysics, particle physics, etc."
      "Biology": "Molecular biology, ecology, genetics, evolution, etc."
      "Medicine": "Clinical medicine, pharmacology, public health, etc."
      "Engineering": "Mechanical, electrical, civil, chemical engineering, etc."
      "Social Science": "Psychology, sociology, political science, anthropology, etc."
      "Humanities": "History, philosophy, literature, linguistics, etc."
      "Law": "Legal theory, constitutional law, international law, etc."
      "Economics": "Micro/macroeconomics, finance, econometrics, etc."
  - annotation_type: radio
    name: subject_scope
    description: "Does this text belong to a single subject or multiple subjects?"
    labels:
      - "Single Subject"
      - "Multi-Subject"
      - "Unclear"
    keyboard_shortcuts:
      "Single Subject": "1"
      "Multi-Subject": "2"
      "Unclear": "3"
    tooltips:
      "Single Subject": "The text clearly belongs to one academic discipline"
      "Multi-Subject": "The text spans multiple academic disciplines"
      "Unclear": "The subject classification is ambiguous or hard to determine"
annotation_instructions: |
  You will be shown an academic text passage with its title. Your tasks are:
  1. Read the passage carefully and identify the subject area(s).
  2. Select all applicable subject categories from the list.
  3. Indicate whether the text is primarily about one subject or multiple subjects.
html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="background: #fefce8; border: 1px solid #fde68a; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #a16207;">Title:</strong>
      <p style="font-size: 17px; font-weight: 600; line-height: 1.5; margin: 8px 0 0 0;">{{title}}</p>
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Text:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
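The two annotation schemes above rely on their tooltip and keyboard-shortcut keys matching the label strings exactly. The following is a minimal sketch of a consistency check, not part of Potato itself: `SCHEMES` is a hand-written Python mirror of the config (with most tooltips omitted for brevity), and `check_scheme` is a hypothetical helper name introduced only for this illustration.

```python
# Hypothetical in-memory mirror of the two schemes in config.yaml.
# Loading the real file would need a YAML parser, which is left out here.
SCHEMES = [
    {
        "name": "subject_categories",
        "annotation_type": "multiselect",
        "labels": [
            "Computer Science", "Mathematics", "Physics", "Biology", "Medicine",
            "Engineering", "Social Science", "Humanities", "Law", "Economics",
        ],
        # Only two of the ten tooltips are reproduced in this sketch.
        "tooltips": {
            "Mathematics": "Pure or applied mathematics, statistics, logic",
            "Law": "Legal theory, constitutional law, international law, etc.",
        },
    },
    {
        "name": "subject_scope",
        "annotation_type": "radio",
        "labels": ["Single Subject", "Multi-Subject", "Unclear"],
        "keyboard_shortcuts": {"Single Subject": "1", "Multi-Subject": "2", "Unclear": "3"},
    },
]


def check_scheme(scheme):
    """Return a list of human-readable problems found in one scheme."""
    problems = []
    labels = scheme.get("labels", [])
    if len(set(labels)) != len(labels):
        problems.append(f"{scheme['name']}: duplicate labels")
    # Every tooltip or shortcut key should be spelled exactly like a label.
    for field in ("tooltips", "keyboard_shortcuts"):
        for key in scheme.get(field, {}):
            if key not in labels:
                problems.append(f"{scheme['name']}: {field} key {key!r} is not a label")
    # Two labels bound to the same key would be ambiguous for annotators.
    shortcuts = list(scheme.get("keyboard_shortcuts", {}).values())
    if len(set(shortcuts)) != len(shortcuts):
        problems.append(f"{scheme['name']}: duplicate keyboard shortcuts")
    return problems


all_problems = [p for s in SCHEMES for p in check_scheme(s)]
print(all_problems or "schemes look consistent")
```

A check like this is mainly useful after editing the label list, since a renamed label with a stale tooltip key is easy to miss by eye.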
Sample Data: sample-data.json
[
  {
    "id": "subj_001",
    "text": "We propose a novel transformer architecture for protein folding prediction that achieves state-of-the-art results on the CASP14 benchmark. Our model combines attention mechanisms with geometric deep learning to capture spatial relationships between amino acid residues.",
    "title": "Deep Learning Approaches to Protein Structure Prediction"
  },
  {
    "id": "subj_002",
    "text": "This paper examines the impact of monetary policy on income inequality across OECD countries from 2000 to 2020. Using panel data regression with fixed effects, we find that expansionary monetary policy disproportionately benefits asset holders.",
    "title": "Monetary Policy and Income Inequality: A Cross-Country Analysis"
  }
]
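Each item needs the fields named by `id_key` and `text_key`, plus any field referenced by a `{{...}}` placeholder in `html_layout` (here `title` and `text`). The sketch below checks this for the two items shown; the texts are shortened, the layout is reduced to its placeholders, and `missing_fields` and `duplicate_ids` are hypothetical helper names. The placeholder regex is an assumption of this sketch, not a description of how Potato parses templates.

```python
import json
import re

# Two items from sample-data.json, with the passage texts shortened for space.
SAMPLE = json.loads("""
[
  {"id": "subj_001",
   "text": "We propose a novel transformer architecture for protein folding prediction.",
   "title": "Deep Learning Approaches to Protein Structure Prediction"},
  {"id": "subj_002",
   "text": "This paper examines the impact of monetary policy on income inequality.",
   "title": "Monetary Policy and Income Inequality: A Cross-Country Analysis"}
]
""")

# html_layout reduced to the parts that matter for this check.
LAYOUT = "<p>{{title}}</p><p>{{text}}</p>"
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def missing_fields(item, layout=LAYOUT, id_key="id", text_key="text"):
    """Return the sorted names of required fields that are absent or blank."""
    required = {id_key, text_key} | set(PLACEHOLDER.findall(layout))
    return sorted(k for k in required if not str(item.get(k, "")).strip())


def duplicate_ids(items, id_key="id"):
    """Return the sorted ids that occur more than once."""
    seen, dupes = set(), set()
    for item in items:
        ident = item.get(id_key)
        if ident in seen:
            dupes.add(ident)
        seen.add(ident)
    return sorted(dupes)


for item in SAMPLE:
    print(item["id"], missing_fields(item))
print("duplicate ids:", duplicate_ids(SAMPLE))
```

Running the same two functions over the full data file before starting the server would catch an item without a title, which would otherwise leave the title box empty.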
// ... and 8 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/semeval/2025/task05-llms4subjects
potato start config.yaml
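Once annotations have been collected, the multiselect answer (`subject_categories`) and the radio answer (`subject_scope`) can be compared for consistency. The sketch below is a review heuristic only: the record shape is a hypothetical flattened form, not Potato's actual output format, the example answers are made up for illustration, and `scope_conflict` is a name introduced here. Because the instructions ask whether a text is "primarily" about one subject, a flagged record is a candidate for a second look, not necessarily an error.

```python
def scope_conflict(categories, scope):
    """Return a message if the scope answer disagrees with the number of
    selected categories, or None if the two answers are compatible."""
    n = len(set(categories))
    if n == 0:
        return "no subject category selected"
    if scope == "Single Subject" and n != 1:
        return f"'Single Subject' chosen but {n} categories selected"
    if scope == "Multi-Subject" and n < 2:
        return "'Multi-Subject' chosen but fewer than two categories selected"
    return None  # "Unclear" is treated as compatible with any non-empty selection


# Made-up answers in an assumed record shape, for illustration only.
records = [
    {"id": "subj_001",
     "subject_categories": ["Computer Science", "Biology"],
     "subject_scope": "Multi-Subject"},
    {"id": "subj_002",
     "subject_categories": ["Economics"],
     "subject_scope": "Multi-Subject"},
]

flags = {r["id"]: scope_conflict(r["subject_categories"], r["subject_scope"]) for r in records}
for item_id, message in flags.items():
    print(item_id, message or "ok")
```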
Related Designs
ADMIRE - Multimodal Idiomaticity Recognition
Multimodal idiomaticity detection task requiring annotators to identify whether expressions are used idiomatically or literally, with supporting cue analysis. Based on SemEval-2025 Task 1 (ADMIRE).
Food Hazard Detection
Food safety hazard detection task requiring annotators to identify hazards, products, and risk levels in food incident reports, and classify the type of contamination. Based on SemEval-2025 Task 9.
MAMI - Multimedia Automatic Misogyny Identification
Detection and fine-grained classification of misogynistic content in memes, combining text and image description analysis. Sub-types include stereotyping, shaming, objectification, and violence. Based on SemEval-2022 Task 5 (Fersini et al.).