LLMs4Subjects - Automated Subject Tagging

An annotation task for automated subject classification of academic texts: annotators assign subject categories to each text and indicate whether it belongs to a single discipline or spans several. Based on SemEval-2025 Task 5.

Configuration File: config.yaml

This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.

yaml

# LLMs4Subjects - Automated Subject Tagging
# Based on Sinhababu et al., SemEval 2025
# Paper: https://aclanthology.org/volumes/2025.semeval-1/
# Dataset: https://github.com/SemEval/SemEval2025-Task5
#
# This task involves assigning subject categories to academic text
# passages. Annotators select all applicable subject areas and
# indicate whether the text belongs to a single discipline or
# spans multiple fields.
#
# Subject Categories:
# - Computer Science, Mathematics, Physics, Biology, Medicine,
#   Engineering, Social Science, Humanities, Law, Economics
#
# Classification:
# - Single Subject: Text belongs to one clear discipline
# - Multi-Subject: Text spans multiple disciplines
# - Unclear: Subject classification is ambiguous

annotation_task_name: "LLMs4Subjects - Automated Subject Tagging"
task_dir: "."

data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

port: 8000
server_name: localhost

annotation_schemes:
  - annotation_type: multiselect
    name: subject_categories
    description: "Select all subject areas that apply to this text."
    labels:
      - "Computer Science"
      - "Mathematics"
      - "Physics"
      - "Biology"
      - "Medicine"
      - "Engineering"
      - "Social Science"
      - "Humanities"
      - "Law"
      - "Economics"
    tooltips:
      "Computer Science": "Algorithms, programming, AI, databases, software, etc."
      "Mathematics": "Pure or applied mathematics, statistics, logic"
      "Physics": "Classical, quantum, astrophysics, particle physics, etc."
      "Biology": "Molecular biology, ecology, genetics, evolution, etc."
      "Medicine": "Clinical medicine, pharmacology, public health, etc."
      "Engineering": "Mechanical, electrical, civil, chemical engineering, etc."
      "Social Science": "Psychology, sociology, political science, anthropology, etc."
      "Humanities": "History, philosophy, literature, linguistics, etc."
      "Law": "Legal theory, constitutional law, international law, etc."
      "Economics": "Micro/macroeconomics, finance, econometrics, etc."

  - annotation_type: radio
    name: subject_scope
    description: "Does this text belong to a single subject or multiple subjects?"
    labels:
      - "Single Subject"
      - "Multi-Subject"
      - "Unclear"
    keyboard_shortcuts:
      "Single Subject": "1"
      "Multi-Subject": "2"
      "Unclear": "3"
    tooltips:
      "Single Subject": "The text clearly belongs to one academic discipline"
      "Multi-Subject": "The text spans multiple academic disciplines"
      "Unclear": "The subject classification is ambiguous or hard to determine"

annotation_instructions: |
  You will be shown an academic text passage with its title. Your tasks are:
  1. Read the passage carefully and identify the subject area(s).
  2. Select all applicable subject categories from the list.
  3. Indicate whether the text is primarily about one subject or multiple subjects.

html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="background: #fefce8; border: 1px solid #fde68a; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #a16207;">Title:</strong>
      <p style="font-size: 17px; font-weight: 600; line-height: 1.5; margin: 8px 0 0 0;">{{title}}</p>
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Text:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
  </div>

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
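
The two schemes above are collected independently, so an annotator can pick "Single Subject" while ticking several categories. The following is a minimal sketch of a post-hoc consistency check, assuming each annotation has already been loaded as a list of selected category labels plus one scope label. The function name `check_scope_consistency` and the input shape are hypothetical and are not part of Potato or its output format.

```python
from typing import List, Optional

# Scope labels copied from the `subject_scope` radio scheme in config.yaml.
SCOPES = {"Single Subject", "Multi-Subject", "Unclear"}


def check_scope_consistency(categories: List[str], scope: str) -> Optional[str]:
    """Return a warning if the two answers disagree, else None.

    Hypothetical helper: `categories` holds the labels ticked in
    `subject_categories`, `scope` the label chosen in `subject_scope`.
    """
    if scope not in SCOPES:
        return f"unknown scope label: {scope!r}"
    n = len(set(categories))  # ignore accidental duplicates
    if scope == "Single Subject" and n != 1:
        return f"'Single Subject' chosen but {n} categories selected"
    if scope == "Multi-Subject" and n < 2:
        return f"'Multi-Subject' chosen but {n} categories selected"
    return None  # "Unclear" is compatible with any selection
```

A call such as `check_scope_consistency(["Economics", "Social Science"], "Single Subject")` returns a warning string, while a consistent pair returns `None`. Whether to reject or merely flag such items is a project-level decision.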

Sample Data: sample-data.json

json

[
  {
    "id": "subj_001",
    "text": "We propose a novel transformer architecture for protein folding prediction that achieves state-of-the-art results on the CASP14 benchmark. Our model combines attention mechanisms with geometric deep learning to capture spatial relationships between amino acid residues.",
    "title": "Deep Learning Approaches to Protein Structure Prediction"
  },
  {
    "id": "subj_002",
    "text": "This paper examines the impact of monetary policy on income inequality across OECD countries from 2000 to 2020. Using panel data regression with fixed effects, we find that expansionary monetary policy disproportionately benefits asset holders.",
    "title": "Monetary Policy and Income Inequality: A Cross-Country Analysis"
  }
]

// ... and 8 more items
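
Each item needs the keys named in `item_properties` (`id`, `text`) and every field referenced by a `{{...}}` placeholder in `html_layout` (here `title` and `text`). The sketch below checks a data file against those requirements before the server is started. It is an illustrative standalone script and not a Potato feature: the helper names and the shortened `LAYOUT` string are assumptions.

```python
import json
import re
from typing import Dict, List

ID_KEY = "id"      # item_properties.id_key in config.yaml
TEXT_KEY = "text"  # item_properties.text_key in config.yaml
# Shortened stand-in for html_layout; only the placeholders matter here.
LAYOUT = "<p>{{title}}</p><p>{{text}}</p>"


def layout_fields(layout: str) -> List[str]:
    """Return the field names used as {{placeholders}} in the layout."""
    return re.findall(r"\{\{\s*(\w+)\s*\}\}", layout)


def validate_items(items: List[Dict[str, str]], layout: str = LAYOUT) -> List[str]:
    """Return a list of human-readable problems; empty if the data looks fine."""
    required = {ID_KEY, TEXT_KEY, *layout_fields(layout)}
    problems: List[str] = []
    seen = set()
    for i, item in enumerate(items):
        missing = sorted(required - set(item))
        if missing:
            problems.append(f"item {i}: missing {missing}")
        item_id = item.get(ID_KEY)
        if item_id is not None:
            if item_id in seen:
                problems.append(f"item {i}: duplicate id {item_id!r}")
            seen.add(item_id)
    return problems


def validate_file(path: str) -> List[str]:
    """Load a JSON array of items from `path` and validate it."""
    with open(path, encoding="utf-8") as f:
        return validate_items(json.load(f))
```

Running `validate_file("sample-data.json")` on a well-formed file should return an empty list. An item without a `title` would otherwise leave the title box in the layout unfilled.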

Get This Design

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/semeval/2025/task05-llms4subjects
potato start config.yaml

Dataset & paper

Sinhababu et al., SemEval 2025

Official dataset: https://github.com/SemEval/SemEval2025-Task5
Paper: https://aclanthology.org/volumes/2025.semeval-1/

Citation (BibTeX)

bibtex

@inproceedings{sinhababu-etal-2025-llms4subjects,
    title = "{LLM}s4Subjects: Automated Subject Tagging",
    author = "Sinhababu, Amit and others",
    booktitle = "Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)",
    year = "2025",
    publisher = "Association for Computational Linguistics"
}

Details

Annotation Types

multiselect, radio

Domain

SemEval, NLP, Text Classification, Academic

Use Cases

Subject Tagging, Document Classification, Library Science

Related Designs

ADMIRE - Multimodal Idiomaticity Recognition

Multimodal idiomaticity detection task requiring annotators to identify whether expressions are used idiomatically or literally, with supporting cue analysis. Based on SemEval-2025 Task 1 (ADMIRE).

radio, multiselect

Food Hazard Detection

Food safety hazard detection task requiring annotators to identify hazards, products, and risk levels in food incident reports, and classify the type of contamination. Based on SemEval-2025 Task 9.

span, radio

Memotion Analysis - Sentiment and Type Classification of Memes

Classify the overall sentiment of internet memes and identify their communicative types (sarcastic, humorous, offensive, motivational), based on SemEval-2020 Task 8 (Sharma et al.). Annotators analyze both text and image descriptions of memes.