MMLU-Pro - Tiered Multi-Subject Evaluation

Tiered evaluation for multi-subject question answering, based on MMLU-Pro (Wang et al., NeurIPS 2024). Annotators select the correct answer to challenging 10-option multiple-choice questions across STEM and humanities subjects, and classify each question by topic and subtopic using a tiered annotation scheme.

Configuration file: config.yaml

# MMLU-Pro - Tiered Multi-Subject Evaluation
# Based on Wang et al., NeurIPS 2024
# Paper: https://arxiv.org/abs/2406.01574
# Dataset: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro
#
# MMLU-Pro extends MMLU with 10 answer options (A-J) instead of 4,
# making it significantly more challenging. This task uses a tiered
# annotation scheme to categorize questions by topic and subtopic,
# alongside answer selection.
#
# The tiered annotation allows organizing questions hierarchically:
# - Topic: The broad subject area (e.g., Biology, Physics, History)
# - Subtopic: A more specific area within the topic
#
# Answer options: A through J (10 choices per question)

annotation_task_name: "MMLU-Pro: Tiered Multi-Subject Evaluation"
task_dir: "."

data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

port: 8000
server_name: localhost

annotation_schemes:
  - annotation_type: tiered_annotation
    name: subject_classification
    description: "Classify the question by topic and subtopic using the tiered hierarchy"
    source_field: "audio_url"
    media_type: "audio"
    tiers:
      - name: "topic"
        tier_type: "independent"
      - name: "subtopic"
        tier_type: "dependent"
        parent_tier: "topic"
        constraint_type: "symbolic_association"

  - annotation_type: radio
    name: correct_answer
    description: "Select the correct answer from the 10 options (A-J)"
    labels:
      - "A"
      - "B"
      - "C"
      - "D"
      - "E"
      - "F"
      - "G"
      - "H"
      - "I"
      - "J"
    keyboard_shortcuts:
      "A": "1"
      "B": "2"
      "C": "3"
      "D": "4"
      "E": "5"
      "F": "6"
      "G": "7"
      "H": "8"
      "I": "9"
      "J": "0"

annotation_instructions: |
  You will evaluate challenging multiple-choice questions from MMLU-Pro.
  1. Read the question and all 10 answer options carefully.
  2. Classify the question by topic and subtopic using the tiered scheme.
  3. Select the single correct answer (A through J).
  4. These questions are intentionally difficult and may require expert knowledge.

html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <span style="display: inline-block; background: #0369a1; color: white; padding: 2px 10px; border-radius: 12px; font-size: 13px; margin-bottom: 8px;">{{subject}}</span>
      <p style="font-size: 16px; font-weight: 600; line-height: 1.6; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #475569;">Answer Options:</strong>
      <p style="font-size: 15px; line-height: 1.8; margin: 8px 0 0 0; white-space: pre-line;">{{options}}</p>
    </div>
  </div>

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
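
The configuration above names fields in two places: `item_properties` (`id`, `text`) and the `{{...}}` placeholders in `html_layout` (`subject`, `text`, `options`). The following Python sketch is a pre-flight check that every data item carries those fields. It is not part of the original design and does not call any Potato API. It assumes each placeholder is filled from the data item key of the same name, and the helper names `required_fields` and `missing_fields` are hypothetical.

```python
import re

# Trimmed copy of the html_layout above; only the placeholders matter here.
HTML_LAYOUT = "<span>{{subject}}</span><p>{{text}}</p><p>{{options}}</p>"
ID_KEY, TEXT_KEY = "id", "text"  # item_properties from the config


def required_fields(layout: str) -> set[str]:
    """Return the field names the layout and item_properties refer to."""
    placeholders = set(re.findall(r"\{\{\s*(\w+)\s*\}\}", layout))
    return placeholders | {ID_KEY, TEXT_KEY}


def missing_fields(item: dict, layout: str = HTML_LAYOUT) -> list[str]:
    """List required fields that are absent from one data item, sorted."""
    return sorted(required_fields(layout) - set(item))


complete = {"id": "mmlu_pro_001", "text": "question text",
            "subject": "Biology", "options": "A. first\nB. second"}
print(missing_fields(complete))      # []
print(missing_fields({"id": "x"}))   # ['options', 'subject', 'text']
```

Running a check like this over `sample-data.json` before starting the server may catch a renamed or missing key earlier than a blank panel in the annotation page would.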

Sample data: sample-data.json

[
  {
    "id": "mmlu_pro_001",
    "text": "Which of the following best describes the role of topoisomerase II in DNA replication?",
    "options": "A. It unwinds the double helix ahead of the replication fork\nB. It synthesizes RNA primers for Okazaki fragments\nC. It relieves positive supercoiling by making transient double-strand breaks\nD. It joins Okazaki fragments on the lagging strand\nE. It proofreads newly synthesized DNA\nF. It degrades RNA primers after replication\nG. It adds telomeric sequences to chromosome ends\nH. It methylates newly synthesized DNA strands\nI. It prevents re-replication by licensing origins\nJ. It stabilizes single-stranded DNA at the replication fork",
    "subject": "Biology",
    "audio_url": ""
  },
  {
    "id": "mmlu_pro_002",
    "text": "A projectile is launched at an angle of 60 degrees above the horizontal with an initial speed of 50 m/s. Ignoring air resistance, what is the maximum height reached by the projectile?",
    "options": "A. 45.9 m\nB. 55.7 m\nC. 63.8 m\nD. 76.5 m\nE. 85.3 m\nF. 95.7 m\nG. 102.4 m\nH. 110.2 m\nI. 127.6 m\nJ. 143.1 m",
    "subject": "Physics",
    "audio_url": ""
  }
]

// ... and 8 more items
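
Each item stores its ten choices in a single newline-separated `options` string. The Python sketch below, which is not part of the original design, splits that string into per-letter entries and works through the physics item as a check: with no air resistance the maximum height is h = (v0 sin θ)² / (2g). The value g = 9.8 m/s² is an assumption (the item does not state it), and the helper names `parse_options` and `max_height` are hypothetical.

```python
import math
import re

# "options" value copied from sample item mmlu_pro_002.
OPTIONS = (
    "A. 45.9 m\nB. 55.7 m\nC. 63.8 m\nD. 76.5 m\nE. 85.3 m\n"
    "F. 95.7 m\nG. 102.4 m\nH. 110.2 m\nI. 127.6 m\nJ. 143.1 m"
)


def parse_options(options: str) -> dict[str, str]:
    """Split 'A. text' lines into a {letter: text} mapping."""
    parsed = {}
    for line in options.splitlines():
        match = re.match(r"([A-J])\.\s+(.*)", line)
        if match:
            parsed[match.group(1)] = match.group(2)
    return parsed


def max_height(speed: float, angle_deg: float, g: float = 9.8) -> float:
    """Peak height of a projectile without air resistance; g is assumed."""
    vertical_speed = speed * math.sin(math.radians(angle_deg))
    return vertical_speed ** 2 / (2 * g)


parsed = parse_options(OPTIONS)
height = max_height(50, 60)
# Pick the option whose numeric value lies closest to the computed height.
closest = min(parsed, key=lambda k: abs(float(parsed[k].split()[0]) - height))
print(len(parsed), closest, round(height, 1))  # 10 F 95.7
```

Under these assumptions the computed height of about 95.7 m matches option F, and the parse confirms that the item carries all ten labels A through J expected by the `correct_answer` radio scheme.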

Get this design

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/preference-learning/mmlu-pro-tiered-eval
potato start config.yaml

Details

Annotation types

tiered_annotation, radio

Domain

NLP, Education

Use cases

Question Answering, Benchmark Evaluation, Multi-Subject Assessment

Tags

mmlu-pro, multiple-choice, tiered, evaluation, stem, humanities, neurips2024

Related designs

MathDial - Tutoring Dialogue Quality Annotation

Annotate math tutoring dialogues for guidance correctness, tutoring strategies, and key concepts, based on the MathDial dataset (Macina et al., Findings ACL 2023). Supports evaluation of AI-generated tutoring interactions for K-12 math problems.

radio, multiselect

Student Essay Discourse Element Classification

Discourse element annotation of student essays based on Song et al. (COLING 2020). Annotators identify argumentative discourse units, classify essay types, and tag rhetorical strategies used in student writing.

span, radio

#HashtagWars - Learning a Sense of Humor

Humor ranking of tweets submitted to Comedy Central's @midnight #HashtagWars, classifying comedic quality. Based on SemEval-2017 Task 6.

radio