
HELM - Model Card Display and Evaluation

Model performance summary display and evaluation based on the HELM benchmark (Liang et al., TMLR 2023). Annotators review model cards summarizing performance across multiple metrics and rate overall model quality.


Configuration file: config.yaml

# HELM - Model Card Display and Evaluation
# Based on Liang et al., TMLR 2023
# Paper: https://arxiv.org/abs/2211.09110
# Dataset: https://crfm.stanford.edu/helm/
#
# This task displays model performance summaries from the HELM benchmark
# and asks annotators to rate the overall quality of each model based on
# the presented metrics. The pure_display scheme shows the model card,
# while the likert scheme collects quality ratings.
#
# Annotation Guidelines:
# 1. Review the model description and performance summary carefully
# 2. Consider accuracy, robustness, fairness, and efficiency metrics
# 3. Rate the overall model quality on a 5-point scale

annotation_task_name: "HELM - Model Card Display and Evaluation"
task_dir: "."

data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

port: 8000
server_name: localhost

annotation_schemes:
  - annotation_type: pure_display
    name: model_card_display
    description: "Model performance summary display"

  - annotation_type: likert
    name: overall_quality
    description: "Rate the overall quality of this model based on the displayed metrics"
    min_label: "Poor"
    max_label: "Excellent"
    size: 5

annotation_instructions: |
  You will be shown a model card summarizing an AI model's performance across
  various HELM benchmark metrics. Your task is to:
  1. Review the model name, description, and metrics summary.
  2. Rate the overall quality of the model on a 5-point scale from Poor to Excellent.

html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="background: #eff6ff; border: 1px solid #bfdbfe; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <h3 style="margin: 0 0 8px 0; color: #1e40af;">{{model_name}}</h3>
      <p style="font-size: 15px; line-height: 1.6; color: #374151; margin: 0;">{{text}}</p>
    </div>
    <div style="background: #f0fdf4; border: 1px solid #bbf7d0; border-radius: 8px; padding: 16px;">
      <strong style="color: #166534;">Metrics Summary:</strong>
      <p style="font-size: 15px; line-height: 1.7; margin: 8px 0 0 0; color: #374151;">{{metrics_summary}}</p>
    </div>
  </div>

allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
allow_skip: true
skip_reason_required: false
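Because the `html_layout` above references `{{model_name}}` and `{{metrics_summary}}` in addition to the `id_key`/`text_key` fields declared under `item_properties`, every item in the data file must carry all four keys or the card will render with gaps. A minimal sanity-check sketch (the `validate_items` helper and key set are illustrative, not part of Potato):

```python
import json

# Fields this design expects on every item: id_key/text_key from
# item_properties, plus the two html_layout placeholders.
REQUIRED_KEYS = {"id", "text", "model_name", "metrics_summary"}

def load_items(path):
    """Load a JSON array of annotation items from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def validate_items(items):
    """Return (item_id, sorted missing keys) pairs for items lacking required fields."""
    problems = []
    for item in items:
        missing = sorted(REQUIRED_KEYS - item.keys())
        if missing:
            problems.append((item.get("id", "<no id>"), missing))
    return problems
```

Running `validate_items(load_items("sample-data.json"))` before `potato start` should return an empty list for a well-formed data file.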

Sample data: sample-data.json

[
  {
    "id": "helm_001",
    "text": "A 175B parameter autoregressive language model trained on a diverse web corpus. Demonstrates strong few-shot learning across a range of NLP tasks with competitive performance on question answering and text generation benchmarks.",
    "model_name": "GPT-3 (175B)",
    "metrics_summary": "Accuracy: 78.2% | Calibration: 0.72 | Robustness: 65.4% | Fairness: 0.81 | Efficiency: 3.2 tokens/sec | Toxicity: 0.12"
  },
  {
    "id": "helm_002",
    "text": "An instruction-tuned 7B parameter model fine-tuned with RLHF on human preference data. Excels at following complex instructions while maintaining safety alignment, though limited by its smaller parameter count.",
    "model_name": "LLaMA-2-Chat (7B)",
    "metrics_summary": "Accuracy: 64.8% | Calibration: 0.68 | Robustness: 58.1% | Fairness: 0.85 | Efficiency: 45.6 tokens/sec | Toxicity: 0.05"
  }
]
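The `metrics_summary` field packs six metrics into one pipe-delimited string, which is convenient for display but not for downstream analysis. A hypothetical parsing helper (an assumption for illustration, not part of the showcase) could recover a numeric dict:

```python
def parse_metrics(summary: str) -> dict:
    """Parse a 'Name: value | Name: value' metrics string into a dict of floats."""
    metrics = {}
    for part in summary.split("|"):
        name, _, value = part.partition(":")
        # Keep the leading number, stripping units such as '%' or 'tokens/sec'.
        number = value.strip().split()[0].rstrip("%")
        metrics[name.strip()] = float(number)
    return metrics
```

For example, `parse_metrics` applied to the GPT-3 entry above yields `{"Accuracy": 78.2, "Calibration": 0.72, ...}`. Note that percentage signs are dropped, so mixed units (percentages, ratios, tokens/sec) end up on different scales.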

// ... and 8 more items

Get this design

View on GitHub

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/helm-model-card-display
potato start config.yaml

Details

Annotation types

pure_display, likert

Domain

NLP, Evaluation, Benchmarking

Use cases

Model Evaluation, Benchmark Analysis, Model Comparison

Tags

helm, model-evaluation, benchmarking, language-models, tmlr2023

Found a problem or want to improve this design?

Open an issue