intermediate · evaluation

IFEval: Instruction-Following Evaluation for LLMs

Evaluate how well large language models follow verifiable instructions with specific constraints such as word-count limits, formatting requirements, keyword inclusion, and structural rules. Annotators mark overall pass/fail, select which constraint types were satisfied, rate overall response quality, and explain any failures.


Configuration File: config.yaml

# IFEval: Instruction-Following Evaluation for Large Language Models
# Based on "Instruction-Following Evaluation for Large Language Models" (Zhou et al., arXiv 2023)
# Task: Evaluate whether LLM responses satisfy verifiable instruction constraints

annotation_task_name: "IFEval Instruction-Following Evaluation"
task_dir: "."

# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "instruction"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

# Display layout showing instruction, constraints, and model response
html_layout: |
  <div class="ifeval-container" style="font-family: Arial, sans-serif; max-width: 900px; margin: 0 auto;">
    <div class="instruction-section" style="background: #e8f5e9; padding: 18px; border-radius: 8px; margin-bottom: 16px; border-left: 5px solid #4caf50;">
      <h3 style="margin-top: 0; color: #2e7d32;">Instruction / Prompt</h3>
      <div style="font-size: 15px; line-height: 1.6; white-space: pre-wrap;">{{instruction}}</div>
    </div>
    <div class="constraints-section" style="background: #fff3e0; padding: 18px; border-radius: 8px; margin-bottom: 16px; border-left: 5px solid #ff9800;">
      <h3 style="margin-top: 0; color: #e65100;">Verifiable Constraints</h3>
      <div style="font-size: 14px; line-height: 1.8; white-space: pre-wrap;">{{constraints}}</div>
    </div>
    <div class="response-section" style="background: #e3f2fd; padding: 18px; border-radius: 8px; margin-bottom: 16px; border-left: 5px solid #1976d2;">
      <h3 style="margin-top: 0; color: #1565c0;">Model Response <span style="font-size: 12px; font-weight: normal; color: #666;">({{model_name}})</span></h3>
      <div style="font-size: 14px; line-height: 1.6; white-space: pre-wrap;">{{model_response}}</div>
    </div>
  </div>

# Annotation schemes
annotation_schemes:
  # Overall pass/fail for instruction following
  - name: "overall_pass"
    description: "Does the model response satisfy ALL verifiable constraints in the instruction?"
    annotation_type: radio
    labels:
      - name: "pass"
        tooltip: "The response satisfies all listed constraints"
        key_value: "p"
      - name: "fail"
        tooltip: "The response fails to satisfy one or more constraints"
        key_value: "f"

  # Which specific constraints were satisfied
  - name: "constraints_satisfied"
    description: "Select ALL constraints that the model response successfully satisfies."
    annotation_type: multiselect
    labels:
      - name: "length-constraint"
        tooltip: "Word count, sentence count, paragraph count, or character limits"
      - name: "format-constraint"
        tooltip: "Output format such as JSON, bullet points, numbered lists, markdown"
      - name: "keyword-constraint"
        tooltip: "Required inclusion or exclusion of specific words or phrases"
      - name: "language-constraint"
        tooltip: "Language, register, or tone requirements"
      - name: "style-constraint"
        tooltip: "Writing style such as formal, casual, poetic, technical"
      - name: "structural-constraint"
        tooltip: "Structural requirements like sections, headings, or specific organization"
      - name: "content-constraint"
        tooltip: "Topical requirements, specific information to include or exclude"
      - name: "punctuation-constraint"
        tooltip: "Specific punctuation rules such as no commas, end with a period"

  # Overall response quality beyond just constraint satisfaction
  - name: "response_quality"
    description: "How would you rate the overall quality of the response, independent of constraint satisfaction?"
    annotation_type: radio
    labels:
      - name: "excellent"
        tooltip: "High quality, informative, well-written response"
        key_value: "1"
      - name: "good"
        tooltip: "Adequate quality, mostly well-written"
        key_value: "2"
      - name: "acceptable"
        tooltip: "Passable quality, gets the point across"
        key_value: "3"
      - name: "poor"
        tooltip: "Low quality, unclear, or unhelpful response"
        key_value: "4"

  # Free-text explanation of failures
  - name: "failure_explanation"
    description: "If the response fails any constraint, explain which constraints failed and why."
    annotation_type: text

# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 50
annotation_per_instance: 2

# Detailed annotation instructions
annotation_instructions: |
  ## IFEval: Instruction-Following Evaluation

  You are evaluating whether a large language model correctly followed verifiable
  instructions with specific constraints.

  ### Your Tasks:

  1. **Read the instruction** carefully, paying attention to all specific constraints.
  2. **Review the listed constraints** -- these are the verifiable requirements.
  3. **Read the model response** and check each constraint one by one.
  4. **Mark overall pass/fail**: "pass" only if ALL constraints are satisfied.
  5. **Select satisfied constraints**: Check each constraint type that was met.
  6. **Rate response quality**: Assess the overall quality beyond constraint adherence.
  7. **Explain failures**: If any constraint failed, describe what went wrong.

  ### Constraint Types:
  - **Length**: "Write exactly 3 paragraphs", "Use at most 50 words"
  - **Format**: "Respond in JSON format", "Use bullet points"
  - **Keyword**: "Include the word 'therefore' at least twice"
  - **Language**: "Write in formal English", "Avoid slang"
  - **Style**: "Write in a poetic style", "Use technical language"
  - **Structural**: "Include a title and three sections"
  - **Content**: "Do not mention any specific brand names"
  - **Punctuation**: "Do not use any commas", "End every sentence with an exclamation mark"

  ### Important Notes:
  - Be strict: if a constraint says "exactly 3 paragraphs" and there are 4, it fails.
  - Check constraints literally -- do not interpret them loosely.
  - A response can be high quality but still fail constraints (or vice versa).

Sample Data: sample-data.json

[
  {
    "id": "ifeval_001",
    "instruction": "Write a brief summary of photosynthesis. Your response must be exactly 3 paragraphs. Each paragraph must have at least 3 sentences. Include the word 'energy' at least 4 times.",
    "constraints": "1. Exactly 3 paragraphs\n2. Each paragraph has at least 3 sentences\n3. The word 'energy' appears at least 4 times",
    "model_response": "Photosynthesis is the process by which green plants and certain other organisms convert light energy into chemical energy. This remarkable biological mechanism takes place primarily in the chloroplasts of plant cells. It is one of the most important processes on Earth, as it provides the foundation for nearly all food chains.\n\nDuring photosynthesis, plants absorb sunlight energy through chlorophyll, the green pigment in their leaves. They use this energy along with carbon dioxide from the air and water from the soil to produce glucose. The chemical energy stored in glucose molecules serves as fuel for the plant's growth and metabolic activities.\n\nThe byproduct of photosynthesis is oxygen, which is essential for most life on Earth. Without photosynthesis, the energy that sustains ecosystems would not be available. Understanding this process is crucial for addressing challenges related to food production and renewable energy sources.",
    "model_name": "GPT-4"
  },
  {
    "id": "ifeval_002",
    "instruction": "Explain what a black hole is. Do not use the letter 'e' anywhere in your response. Keep your response under 50 words.",
    "constraints": "1. No occurrences of the letter 'e'\n2. Total word count under 50",
    "model_response": "A black gap in our cosmos is a spot with gravity so strong nothing can fly out, not light or any physical thing. Stars that fall apart and crush down into a tiny point can form such gaps. Such cosmic formations warp all around.",
    "model_name": "Claude-3"
  }
]

// ... and 8 more items
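The "verifiable" in IFEval means every constraint can be checked mechanically rather than by judgment alone. As a sketch, the constraints on the two sample items above could be verified like this (the helper functions are illustrative, not part of Potato or the official IFEval checker; the sentence splitter in particular is a rough heuristic):

```python
import re

def count_sentences(paragraph):
    # Rough heuristic: split on runs of ., !, ? followed by whitespace or end.
    return len([s for s in re.split(r"[.!?]+(?:\s+|$)", paragraph) if s.strip()])

def count_keyword(text, keyword):
    # Case-insensitive whole-word matches.
    return len(re.findall(rf"\b{re.escape(keyword)}\b", text, flags=re.IGNORECASE))

def check_ifeval_001(response):
    """Verify the three constraints of sample item ifeval_001."""
    paragraphs = [p for p in response.split("\n\n") if p.strip()]
    return {
        "exactly_3_paragraphs": len(paragraphs) == 3,
        "min_3_sentences_each": all(count_sentences(p) >= 3 for p in paragraphs),
        "energy_at_least_4": count_keyword(response, "energy") >= 4,
    }

def check_ifeval_002(response):
    """Verify the two constraints of sample item ifeval_002."""
    return {
        "no_letter_e": "e" not in response.lower(),
        "under_50_words": len(response.split()) < 50,
    }
```

Running `check_ifeval_001` on the GPT-4 response above returns all-True, matching a "pass" annotation; annotators then confirm the same checks by hand and additionally judge quality, which no automatic check captures.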

Get This Design

View on GitHub

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/evaluation/ifeval-instruction-following
potato start config.yaml
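Once annotation finishes, the JSON records in annotation_output/ can be aggregated, and because annotation_per_instance is 2, doubly-annotated items support a simple agreement check. A minimal sketch, assuming a simplified record shape of {"id": ..., "overall_pass": ...} -- Potato's actual output schema varies by version, so adapt the field access accordingly:

```python
from collections import Counter

def pass_rate(records):
    """Fraction of annotations labeled 'pass' on the overall_pass scheme."""
    labels = [r["overall_pass"] for r in records if r.get("overall_pass")]
    counts = Counter(labels)
    total = sum(counts.values())
    return counts["pass"] / total if total else 0.0

def raw_agreement(records):
    """Fraction of doubly-annotated items where both annotators gave
    the same overall_pass label (items with != 2 labels are skipped)."""
    by_item = {}
    for r in records:
        by_item.setdefault(r["id"], []).append(r["overall_pass"])
    pairs = [labels for labels in by_item.values() if len(labels) == 2]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)
```

Raw agreement is a quick sanity check; for reporting, a chance-corrected statistic such as Cohen's kappa over the paired labels is the more standard choice.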

Details

Annotation Types

radio · multiselect · text

Domain

AI Evaluation · Instruction Following · LLM Assessment

Use Cases

Instruction-Following Evaluation · LLM Benchmarking · Constraint Verification

Tags

instruction-following, ifeval, llm-evaluation, constraints, google-research

Found an issue or want to improve this design?

Open an Issue