BIG-Bench Task Evaluation
Evaluate language model responses on diverse reasoning tasks from the BIG-Bench benchmark. Annotators assess correctness, provide reasoning explanations, and rate confidence for model outputs across multiple task categories.
Configuration File: config.yaml
# BIG-Bench Task Evaluation
# Based on Srivastava et al., TMLR 2023
# Paper: https://arxiv.org/abs/2206.04615
# Dataset: https://github.com/google/BIG-bench
#
# Evaluate language model responses on diverse reasoning tasks from the
# BIG-Bench benchmark. Annotators assess correctness of model outputs,
# provide reasoning explanations, and rate their confidence.
#
# Correctness Levels:
# - Correct: The model response is fully accurate
# - Partially Correct: Some aspects are right but incomplete or imprecise
# - Incorrect: The model response is wrong
#
# Annotation Guidelines:
# 1. Read the task prompt and understand what is being asked
# 2. Review the model response carefully
# 3. Assess correctness based on the task requirements
# 4. Explain your reasoning for the correctness judgment
# 5. Rate your confidence in the assessment (0-100)
annotation_task_name: "BIG-Bench Task Evaluation"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  - annotation_type: radio
    name: correctness
    description: "Is the model response correct for this task?"
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
    keyboard_shortcuts:
      "Correct": "1"
      "Partially Correct": "2"
      "Incorrect": "3"
    tooltips:
      "Correct": "The model response is fully accurate and complete"
      "Partially Correct": "Some aspects are right but the response is incomplete or contains minor errors"
      "Incorrect": "The model response is wrong or fundamentally flawed"
  - annotation_type: text
    name: reasoning
    description: "Explain your reasoning for the correctness judgment. What did the model get right or wrong?"
  - annotation_type: number
    name: confidence
    description: "Confidence score (0-100)"
annotation_instructions: |
  You will evaluate language model responses on diverse reasoning tasks from the BIG-Bench benchmark.
  For each item:
  1. Read the task prompt carefully to understand what is being asked.
  2. Note the task category for context on expected reasoning.
  3. Review the model response and assess its correctness.
  4. Explain your reasoning -- what did the model get right or wrong?
  5. Rate your confidence in your assessment from 0 (no confidence) to 100 (completely certain).
  Task categories include logical reasoning, language understanding, mathematics, common sense, and more.
html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="background: #f5f3ff; border: 1px solid #c4b5fd; border-radius: 8px; padding: 12px; margin-bottom: 12px;">
      <strong style="color: #6d28d9;">Task Category:</strong>
      <span style="font-size: 14px; background: #ede9fe; padding: 2px 8px; border-radius: 4px;">{{task_category}}</span>
    </div>
    <div style="background: #fefce8; border: 1px solid #fde68a; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #a16207;">Task Prompt:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="background: #f0f9ff; border: 1px solid #bae6fd; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #0369a1;">Model Response:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{model_response}}</p>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
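Before launching the annotation server, it can help to check that every data file actually contains the fields the config and the html_layout reference. The following is only a sketch (not part of the design), assuming PyYAML is available and that config.yaml and the data files sit in the current directory:

import json
import re
import yaml  # PyYAML, assumed to be installed alongside Potato

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Fields the design relies on: the id/text keys plus every {{placeholder}}
# used in the HTML layout (task_category, text, model_response).
required = {config["item_properties"]["id_key"], config["item_properties"]["text_key"]}
required |= set(re.findall(r"\{\{(\w+)\}\}", config.get("html_layout", "")))

for data_file in config["data_files"]:
    with open(data_file) as f:
        items = json.load(f)
    for item in items:
        missing = required - item.keys()
        if missing:
            print(f"{data_file}: item {item.get('id', '?')} is missing {sorted(missing)}")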
Sample Data: sample-data.json
[
  {
    "id": "bb_001",
    "text": "If all roses are flowers and some flowers fade quickly, can we conclude that some roses fade quickly?",
    "model_response": "Yes, we can conclude that some roses fade quickly because all roses are flowers and some flowers fade quickly.",
    "task_category": "logical_reasoning"
  },
  {
    "id": "bb_002",
    "text": "What is the next number in the sequence: 2, 6, 12, 20, 30, ?",
    "model_response": "The next number is 42. The differences between consecutive terms are 4, 6, 8, 10, so the next difference is 12, giving 30 + 12 = 42.",
    "task_category": "mathematical_reasoning"
  }
]
// ... and 8 more items
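Each item only needs the four fields shown above (id, text, model_response, task_category), so new evaluation sets can be assembled from any collection of prompts and model outputs. The sketch below is purely illustrative; the records in it are hypothetical placeholders, not the official BIG-Bench release format:

import json

# Hypothetical (prompt, model response, task category) records to convert.
records = [
    ("Is 'silent' an anagram of 'listen'?",
     "Yes. Both words contain exactly the letters e, i, l, n, s, t.",
     "language_understanding"),
]

items = [
    {
        "id": f"bb_{i + 1:03d}",
        "text": prompt,
        "model_response": response,
        "task_category": category,
    }
    for i, (prompt, response, category) in enumerate(records)
]

with open("sample-data.json", "w") as f:
    json.dump(items, f, indent=2)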
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/big-bench-task-eval
potato start config.yaml
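Completed annotations are written to annotation_output/ as JSON. The exact file layout depends on the Potato version, so the aggregation below is only a sketch under the assumption that each output file holds a list of per-item records carrying the instance id plus the correctness and confidence fields named after the schemes above:

import json
from collections import Counter, defaultdict
from pathlib import Path

# Assumed record shape (not guaranteed by Potato):
# {"id": "bb_001", "correctness": "Correct", "reasoning": "...", "confidence": 90}
by_item = defaultdict(list)
for path in Path("annotation_output").rglob("*.json"):
    for record in json.loads(path.read_text()):
        by_item[record["id"]].append(record)

for item_id, records in sorted(by_item.items()):
    labels = Counter(r["correctness"] for r in records)
    mean_conf = sum(float(r["confidence"]) for r in records) / len(records)
    unanimous = len(labels) == 1  # all annotators for the item chose the same label
    print(f"{item_id}: {dict(labels)} | mean confidence {mean_conf:.0f} | unanimous: {unanimous}")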
Found an issue or want to improve this design?
Open an Issue

Related Designs
GPQA - Graduate-Level Expert QA Evaluation
Expert-level question answering evaluation on graduate-level science questions from the GPQA benchmark (Rein et al., ICLR 2024). Questions span physics, chemistry, and biology, designed to be answerable only by domain experts.
HumanEval Code Correctness Evaluation
Evaluation of code generation correctness based on the HumanEval benchmark (Chen et al., arXiv 2021). Annotators assess whether AI-generated code solutions are correct, provide code review comments, and rate code quality on a numeric scale.
MMLU Knowledge Evaluation
Multiple-choice knowledge evaluation across diverse academic subjects, based on the Massive Multitask Language Understanding benchmark (Hendrycks et al., ICLR 2021). Annotators select the correct answer from four options and provide an explanation.