MMLU Knowledge Evaluation
Multiple-choice knowledge evaluation across diverse academic subjects, based on the Massive Multitask Language Understanding benchmark (Hendrycks et al., ICLR 2021). Annotators select the correct answer from four options and may optionally explain their reasoning.
Configuration File: config.yaml
# MMLU Knowledge Evaluation
# Based on Hendrycks et al., ICLR 2021
# Paper: https://arxiv.org/abs/2009.03300
# Dataset: https://huggingface.co/datasets/cais/mmlu
#
# Multiple-choice knowledge evaluation based on the Massive Multitask
# Language Understanding benchmark. Each question covers one of 57
# academic subjects spanning STEM, humanities, social sciences, and more.
# Annotators select the correct answer and provide an explanation.
#
# Answer Options:
# - A, B, C, D: Four possible answers; exactly one is correct
#
# Annotation Guidelines:
# 1. Read the question carefully
# 2. Consider all four answer options before selecting
# 3. Choose the single best answer
# 4. Provide a brief explanation of your reasoning
annotation_task_name: "MMLU Knowledge Evaluation"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  # Step 1: Select the correct answer
  - annotation_type: radio
    name: answer
    description: "Select the correct answer to this question."
    labels:
      - "A"
      - "B"
      - "C"
      - "D"
    keyboard_shortcuts:
      "A": "1"
      "B": "2"
      "C": "3"
      "D": "4"
    tooltips:
      "A": "Select option A"
      "B": "Select option B"
      "C": "Select option C"
      "D": "Select option D"
  # Step 2: Explanation
  - annotation_type: text
    name: explanation
    description: "Briefly explain why you chose this answer."
    textarea: true
    required: false
    placeholder: "Explain your reasoning..."
annotation_instructions: |
  You will answer multiple-choice knowledge questions from the MMLU benchmark.

  For each item:
  1. Read the question and note the subject area.
  2. Read all four answer options (A, B, C, D) carefully.
  3. Select the single correct answer.
  4. Optionally, provide a brief explanation of your reasoning.

  Tips:
  - Questions span many subjects; use your best knowledge.
  - Eliminate clearly wrong options first to narrow your choice.
  - If unsure, make your best guess rather than skipping.
html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="background: #e8eaf6; padding: 8px 15px; border-radius: 8px; margin-bottom: 16px;">
      <strong>Subject:</strong> {{subject}}
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Question:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 10px;">
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">A:</strong> {{option_a}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">B:</strong> {{option_b}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">C:</strong> {{option_c}}
      </div>
      <div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 8px; padding: 12px;">
        <strong style="color: #475569;">D:</strong> {{option_d}}
      </div>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
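The html_layout above references item fields through {{...}} placeholders, so every data item must carry matching keys. A quick way to catch mismatches before launching is to compare the placeholders in the layout against an item's keys. This is a minimal sketch; check_placeholders is a hypothetical helper, not part of Potato:

```python
import re

def check_placeholders(html_layout: str, item: dict) -> list:
    """Return placeholder names used in the layout but missing from a data item."""
    placeholders = set(re.findall(r"\{\{(\w+)\}\}", html_layout))
    return sorted(placeholders - item.keys())

layout = "<p>{{text}}</p> <div>A: {{option_a}}</div> <div>Subject: {{subject}}</div>"
item = {"id": "mmlu_001", "text": "...", "option_a": "...", "subject": "Biology"}
print(check_placeholders(layout, item))  # → []
print(check_placeholders(layout, {"text": "..."}))  # → ['option_a', 'subject']
```

Running this over every item in sample-data.json before starting the server avoids discovering a missing option_c field halfway through annotation.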
Sample Data: sample-data.json
[
  {
    "id": "mmlu_001",
    "text": "What is the primary function of mitochondria in eukaryotic cells?",
    "option_a": "Protein synthesis",
    "option_b": "ATP production through cellular respiration",
    "option_c": "DNA replication",
    "option_d": "Lipid storage",
    "subject": "Biology"
  },
  {
    "id": "mmlu_002",
    "text": "In economics, what does GDP stand for?",
    "option_a": "General Domestic Product",
    "option_b": "Gross Domestic Product",
    "option_c": "Global Development Program",
    "option_d": "Gross Development Percentage",
    "subject": "Economics"
  }
]
// ... and 8 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/mmlu-knowledge-eval
potato start config.yaml
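With annotation_per_instance: 2, each question is labeled by two annotators, so a natural first sanity check on the results is raw agreement on the answer field. The sketch below assumes the per-annotator output under annotation_output/ has been flattened into records with id, annotator, and answer keys; that flattening and the answer_agreement helper are assumptions, not Potato's documented output schema:

```python
from collections import defaultdict
from itertools import combinations

def answer_agreement(records):
    """Fraction of multiply-annotated items where all annotators chose the same answer."""
    by_item = defaultdict(list)
    for r in records:
        by_item[r["id"]].append(r["answer"])
    # Only items with at least two annotations can be scored for agreement.
    multi = [answers for answers in by_item.values() if len(answers) >= 2]
    if not multi:
        return 0.0
    agree = sum(1 for answers in multi
                if all(a == b for a, b in combinations(answers, 2)))
    return agree / len(multi)

records = [
    {"id": "mmlu_001", "annotator": "a1", "answer": "B"},
    {"id": "mmlu_001", "annotator": "a2", "answer": "B"},
    {"id": "mmlu_002", "annotator": "a1", "answer": "B"},
    {"id": "mmlu_002", "annotator": "a2", "answer": "A"},
]
print(answer_agreement(records))  # → 0.5
```

For knowledge questions with a single gold answer, low agreement usually flags ambiguous wording or a mislabeled key rather than genuine annotator disagreement.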
Found an issue or want to improve this design?
Open an Issue
Related Designs
BIG-Bench Task Evaluation
Evaluate language model responses on diverse reasoning tasks from the BIG-Bench benchmark. Annotators assess correctness, provide reasoning explanations, and rate confidence for model outputs across multiple task categories.
Bias Benchmark for QA (BBQ)
Annotate question-answering examples designed to probe social biases. Based on BBQ (Parrish et al., Findings of ACL 2022). Annotators select the correct answer given a context, assess the direction of bias in the question, categorize the type of bias, and explain their reasoning.
Code Generation Evaluation (HumanEval)
Evaluation of LLM-generated code based on the HumanEval benchmark. Annotators assess functional correctness, code quality, and efficiency of generated Python functions, and provide explanations of errors and improvement suggestions, supporting research in code generation and LLM evaluation.