
BIG-Bench Task Evaluation

Evaluate language model responses on diverse reasoning tasks from the BIG-Bench benchmark. Annotators assess correctness, provide reasoning explanations, and rate confidence for model outputs across multiple task categories.


Configuration File (config.yaml)

# BIG-Bench Task Evaluation
# Based on Srivastava et al., TMLR 2023
# Paper: https://arxiv.org/abs/2206.04615
# Dataset: https://github.com/google/BIG-bench
#
# Evaluate language model responses on diverse reasoning tasks from the
# BIG-Bench benchmark. Annotators assess correctness of model outputs,
# provide reasoning explanations, and rate their confidence.
#
# Correctness Levels:
# - Correct: The model response is fully accurate
# - Partially Correct: Some aspects are right but incomplete or imprecise
# - Incorrect: The model response is wrong
#
# Annotation Guidelines:
# 1. Read the task prompt and understand what is being asked
# 2. Review the model response carefully
# 3. Assess correctness based on the task requirements
# 4. Explain your reasoning for the correctness judgment
# 5. Rate your confidence in the assessment (0-100)

annotation_task_name: "BIG-Bench Task Evaluation"
task_dir: "."

data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

port: 8000
server_name: localhost

annotation_schemes:
  - annotation_type: radio
    name: correctness
    description: "Is the model response correct for this task?"
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
    keyboard_shortcuts:
      "Correct": "1"
      "Partially Correct": "2"
      "Incorrect": "3"
    tooltips:
      "Correct": "The model response is fully accurate and complete"
      "Partially Correct": "Some aspects are right but the response is incomplete or contains minor errors"
      "Incorrect": "The model response is wrong or fundamentally flawed"

  - annotation_type: text
    name: reasoning
    description: "Explain your reasoning for the correctness judgment. What did the model get right or wrong?"

  - annotation_type: number
    name: confidence
    description: "Confidence score (0-100)"

annotation_instructions: |
  You will evaluate language model responses on diverse reasoning tasks from the BIG-Bench benchmark.

  For each item:
  1. Read the task prompt carefully to understand what is being asked.
  2. Note the task category for context on expected reasoning.
  3. Review the model response and assess its correctness.
  4. Explain your reasoning -- what did the model get right or wrong?
  5. Rate your confidence in your assessment from 0 (no confidence) to 100 (completely certain).

  Task categories include logical reasoning, language understanding, mathematics, common sense, and more.

html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="background: #f5f3ff; border: 1px solid #c4b5fd; border-radius: 8px; padding: 12px; margin-bottom: 12px;">
      <strong style="color: #6d28d9;">Task Category:</strong>
      <span style="font-size: 14px; background: #ede9fe; padding: 2px 8px; border-radius: 4px;">{{task_category}}</span>
    </div>
    <div style="background: #fefce8; border: 1px solid #fde68a; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #a16207;">Task Prompt:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Model Response:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{model_response}}</p>
    </div>
  </div>

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
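With annotation_per_instance set to 2, each item is labeled by two annotators, so a natural post-processing step is measuring how often they agree on the correctness label. A minimal sketch, assuming the exported JSON flattens to records with "id", "annotator", and "correctness" fields (an assumed shape, not Potato's documented schema; inspect the files in annotation_output/ for the actual layout):

```python
from collections import defaultdict

def pairwise_agreement(records):
    """Raw percent agreement on the 'correctness' label for items
    annotated by exactly two annotators.

    `records` is a list of dicts with 'id', 'annotator', and
    'correctness' keys -- an assumed record shape; adapt the key
    names to whatever your annotation_output/ files actually contain.
    """
    by_item = defaultdict(list)
    for r in records:
        by_item[r["id"]].append(r["correctness"])
    # Keep only items that received both annotations.
    pairs = [labels for labels in by_item.values() if len(labels) == 2]
    if not pairs:
        return 0.0
    agree = sum(1 for a, b in pairs if a == b)
    return agree / len(pairs)

# Hypothetical records for illustration only.
demo = [
    {"id": "bb_001", "annotator": "a1", "correctness": "Incorrect"},
    {"id": "bb_001", "annotator": "a2", "correctness": "Incorrect"},
    {"id": "bb_002", "annotator": "a1", "correctness": "Correct"},
    {"id": "bb_002", "annotator": "a2", "correctness": "Partially Correct"},
]
print(pairwise_agreement(demo))  # 0.5
```

Raw agreement is only a first look; for a publishable reliability number you would typically compute a chance-corrected statistic such as Cohen's kappa over the same pairs.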

Sample Data (sample-data.json)

[
  {
    "id": "bb_001",
    "text": "If all roses are flowers and some flowers fade quickly, can we conclude that some roses fade quickly?",
    "model_response": "Yes, we can conclude that some roses fade quickly because all roses are flowers and some flowers fade quickly.",
    "task_category": "logical_reasoning"
  },
  {
    "id": "bb_002",
    "text": "What is the next number in the sequence: 2, 6, 12, 20, 30, ?",
    "model_response": "The next number is 42. The differences between consecutive terms are 4, 6, 8, 10, so the next difference is 12, giving 30 + 12 = 42.",
    "task_category": "mathematical_reasoning"
  }
]

// ... and 8 more items
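The html_layout above renders three template placeholders ({{text}}, {{model_response}}, {{task_category}}) plus the id_key, so every item in sample-data.json needs all four keys or the rendered page will have gaps. A small sketch of a pre-flight check you could run before starting the server (the validate_items helper is hypothetical, not part of Potato):

```python
import json

# Keys referenced by item_properties and the {{...}} placeholders
# in html_layout.
REQUIRED_KEYS = {"id", "text", "model_response", "task_category"}

def validate_items(items):
    """Return a list of (item id, missing keys) pairs for items that
    lack any key the config or template references."""
    problems = []
    for item in items:
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append((item.get("id", "<no id>"), sorted(missing)))
    return problems

# Inline stand-in for json.load(open("sample-data.json")); the second
# item is deliberately malformed to show the output shape.
items = [
    {"id": "bb_001", "text": "If all roses are flowers ...",
     "model_response": "Yes, we can conclude ...",
     "task_category": "logical_reasoning"},
    {"id": "bb_003", "text": "What is the next number ..."},
]
print(validate_items(items))
# [('bb_003', ['model_response', 'task_category'])]
```

In practice you would load the real file with json.load and fail fast on a non-empty result.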

Get This Design

View on GitHub

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/big-bench-task-eval
potato start config.yaml

Details

Annotation Types

radio, text, number

Domain

NLP, AI Evaluation

Use Cases

Model Evaluation, Reasoning Assessment, Benchmark Annotation

Tags

big-bench, reasoning, evaluation, language-model, benchmark, tmlr2023

Found an issue or want to improve this design?

Open an Issue