BIG-Bench Task Evaluation
Evaluate language model responses on diverse reasoning tasks from the BIG-Bench benchmark. Annotators assess correctness, provide reasoning explanations, and rate confidence for model outputs across multiple task categories.
Configuration File: config.yaml
# BIG-Bench Task Evaluation
# Based on Srivastava et al., TMLR 2023
# Paper: https://arxiv.org/abs/2206.04615
# Dataset: https://github.com/google/BIG-bench
#
# Evaluate language model responses on diverse reasoning tasks from the
# BIG-Bench benchmark. Annotators assess correctness of model outputs,
# provide reasoning explanations, and rate their confidence.
#
# Correctness Levels:
# - Correct: The model response is fully accurate
# - Partially Correct: Some aspects are right but incomplete or imprecise
# - Incorrect: The model response is wrong
#
# Annotation Guidelines:
# 1. Read the task prompt and understand what is being asked
# 2. Review the model response carefully
# 3. Assess correctness based on the task requirements
# 4. Explain your reasoning for the correctness judgment
# 5. Rate your confidence in the assessment (0-100)
annotation_task_name: "BIG-Bench Task Evaluation"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  - annotation_type: radio
    name: correctness
    description: "Is the model response correct for this task?"
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
    keyboard_shortcuts:
      "Correct": "1"
      "Partially Correct": "2"
      "Incorrect": "3"
    tooltips:
      "Correct": "The model response is fully accurate and complete"
      "Partially Correct": "Some aspects are right but the response is incomplete or contains minor errors"
      "Incorrect": "The model response is wrong or fundamentally flawed"
  - annotation_type: text
    name: reasoning
    description: "Explain your reasoning for the correctness judgment. What did the model get right or wrong?"
  - annotation_type: number
    name: confidence
    description: "Confidence score (0-100)"
annotation_instructions: |
  You will evaluate language model responses on diverse reasoning tasks from the BIG-Bench benchmark.
  For each item:
  1. Read the task prompt carefully to understand what is being asked.
  2. Note the task category for context on expected reasoning.
  3. Review the model response and assess its correctness.
  4. Explain your reasoning -- what did the model get right or wrong?
  5. Rate your confidence in your assessment from 0 (no confidence) to 100 (completely certain).
  Task categories include logical reasoning, language understanding, mathematics, common sense, and more.
html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="background: #f5f3ff; border: 1px solid #c4b5fd; border-radius: 8px; padding: 12px; margin-bottom: 12px;">
      <strong style="color: #6d28d9;">Task Category:</strong>
      <span style="font-size: 14px; background: #ede9fe; padding: 2px 8px; border-radius: 4px;">{{task_category}}</span>
    </div>
    <div style="background: #fefce8; border: 1px solid #fde68a; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #a16207;">Task Prompt:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Model Response:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{model_response}}</p>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
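Before launching the annotation server, it can help to check that every data file actually contains the fields the config and the html_layout reference. The following is only a sketch (not part of the design), assuming PyYAML is available and that config.yaml and the data files sit in the current directory:

import json
import re
import yaml  # PyYAML, assumed to be installed alongside Potato

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Fields the design relies on: the id/text keys plus every {{placeholder}}
# used in the HTML layout (task_category, text, model_response).
required = {config["item_properties"]["id_key"], config["item_properties"]["text_key"]}
required |= set(re.findall(r"\{\{(\w+)\}\}", config.get("html_layout", "")))

for data_file in config["data_files"]:
    with open(data_file) as f:
        items = json.load(f)
    for item in items:
        missing = required - item.keys()
        if missing:
            print(f"{data_file}: item {item.get('id', '?')} is missing {sorted(missing)}")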
Sample Data: sample-data.json
[
  {
    "id": "bb_001",
    "text": "If all roses are flowers and some flowers fade quickly, can we conclude that some roses fade quickly?",
    "model_response": "Yes, we can conclude that some roses fade quickly because all roses are flowers and some flowers fade quickly.",
    "task_category": "logical_reasoning"
  },
  {
    "id": "bb_002",
    "text": "What is the next number in the sequence: 2, 6, 12, 20, 30, ?",
    "model_response": "The next number is 42. The differences between consecutive terms are 4, 6, 8, 10, so the next difference is 12, giving 30 + 12 = 42.",
    "task_category": "mathematical_reasoning"
  }
]
// ... and 8 more items
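Each item only needs the four fields shown above (id, text, model_response, task_category), so new evaluation sets can be assembled from any collection of prompts and model outputs. The sketch below is purely illustrative; the records in it are hypothetical placeholders, not the official BIG-Bench release format:

import json

# Hypothetical (prompt, model response, task category) records to convert.
records = [
    ("Is 'silent' an anagram of 'listen'?",
     "Yes. Both words contain exactly the letters e, i, l, n, s, t.",
     "language_understanding"),
]

items = [
    {
        "id": f"bb_{i + 1:03d}",
        "text": prompt,
        "model_response": response,
        "task_category": category,
    }
    for i, (prompt, response, category) in enumerate(records)
]

with open("sample-data.json", "w") as f:
    json.dump(items, f, indent=2)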
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/big-bench-task-eval
potato start config.yaml
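Completed annotations are written to annotation_output/ as JSON. The exact file layout depends on the Potato version, so the aggregation below is only a sketch under the assumption that each output file holds a list of per-item records carrying the instance id plus the correctness and confidence fields named after the schemes above:

import json
from collections import Counter, defaultdict
from pathlib import Path

# Assumed record shape (not guaranteed by Potato):
# {"id": "bb_001", "correctness": "Correct", "reasoning": "...", "confidence": 90}
by_item = defaultdict(list)
for path in Path("annotation_output").rglob("*.json"):
    for record in json.loads(path.read_text()):
        by_item[record["id"]].append(record)

for item_id, records in sorted(by_item.items()):
    labels = Counter(r["correctness"] for r in records)
    mean_conf = sum(float(r["confidence"]) for r in records) / len(records)
    unanimous = len(labels) == 1  # all annotators for the item chose the same label
    print(f"{item_id}: {dict(labels)} | mean confidence {mean_conf:.0f} | unanimous: {unanimous}")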
Found an issue or want to improve this design?
Open an Issue

Related Designs
GPQA - Graduate-Level Expert QA Evaluation
Expert-level question answering evaluation on graduate-level science questions from the GPQA benchmark (Rein et al., ICLR 2024). Questions span physics, chemistry, and biology, designed to be answerable only by domain experts.
HumanEval Code Correctness Evaluation
Evaluation of code generation correctness based on the HumanEval benchmark (Chen et al., arXiv 2021). Annotators assess whether AI-generated code solutions are correct, provide code review comments, and rate code quality on a numeric scale.
MMLU Knowledge Evaluation
Multiple-choice knowledge evaluation across diverse academic subjects, based on the Massive Multitask Language Understanding benchmark (Hendrycks et al., ICLR 2021). Annotators select the correct answer from four options and provide an explanation.