MAST Failure Taxonomy
Annotate multi-agent system traces to identify failure modes from the MAST taxonomy, rate severity, pinpoint the first failure step, and describe the failure mechanism.
Configuration File: config.yaml
# MAST Failure Taxonomy
# Based on "Why Do Multi-Agent LLM Systems Fail?" (Cemri et al., arXiv 2025)
# Task: Classify failure modes in multi-agent LLM system traces using the MAST taxonomy
annotation_task_name: "MAST Failure Taxonomy"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="font-family: Arial, sans-serif; max-width: 1000px; margin: 0 auto;">
    <div style="background: #e8f4fd; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h3 style="margin: 0 0 8px 0; color: #1a5276;">Task</h3>
      <p style="margin: 0; font-size: 15px;">{{text}}</p>
      <span style="display: inline-block; margin-top: 8px; background: #8e44ad; color: #fff; padding: 3px 10px; border-radius: 12px; font-size: 12px;">{{agent_type}}</span>
    </div>
    <div style="background: #fafafa; border: 1px solid #ddd; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h3 style="margin: 0 0 10px 0; color: #2c3e50;">Multi-Agent Trace</h3>
      <div style="font-size: 14px; line-height: 1.8; white-space: pre-wrap;">{{trajectory}}</div>
    </div>
    <div style="background: #fdedec; border: 1px solid #e74c3c; padding: 14px; border-radius: 8px;">
      <h4 style="margin: 0 0 8px 0; color: #922b21;">Outcome</h4>
      <p style="margin: 0; font-size: 14px;">{{outcome}}</p>
    </div>
  </div>
annotation_schemes:
  - name: failure_modes
    description: "Select all failure modes from the MAST taxonomy that apply to this trace."
    annotation_type: multiselect
    labels:
      - "Specification Ambiguity"
      - "Task Decomposition Error"
      - "Resource Misallocation"
      - "Role Confusion"
      - "Miscommunication"
      - "Information Withholding"
      - "Conflicting Actions"
      - "Cascading Errors"
      - "Premature Termination"
      - "Infinite Loop"
      - "Hallucinated Action"
      - "Tool Misuse"
      - "Verification Failure"
      - "No Failure Detected"
  - name: severity
    description: "How severe is the overall failure in this trace?"
    annotation_type: radio
    labels:
      - "Critical"
      - "Major"
      - "Minor"
      - "None"
    keyboard_shortcuts:
      "Critical": "1"
      "Major": "2"
      "Minor": "3"
      "None": "4"
  - name: first_failure_step
    description: "At which step did the first failure occur?"
    annotation_type: radio
    labels:
      - "Step 1"
      - "Step 2"
      - "Step 3"
      - "Step 4"
      - "Step 5"
      - "Step 6"
      - "No Failure"
  - name: failure_description
    description: "Explain the failure: what went wrong, why, and how it affected the outcome."
    annotation_type: text
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
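Before launching the task, it can help to sanity-check the annotation schemes, e.g. that every scheme declares the fields this config relies on and that keyboard shortcuts map only to declared labels. A minimal stdlib sketch, using a hand-copied (and abbreviated) fragment of the schemes above so it runs standalone; the required-field set is an assumption based on how this config uses each scheme, not a statement of Potato's full schema:

```python
# Sanity-check annotation_schemes entries before launching Potato.
# REQUIRED is an assumption drawn from the fields this config uses.
REQUIRED = {"name", "description", "annotation_type", "labels"}

schemes = [
    {"name": "failure_modes",
     "description": "Select all failure modes from the MAST taxonomy that apply to this trace.",
     "annotation_type": "multiselect",
     "labels": ["Specification Ambiguity", "Task Decomposition Error", "No Failure Detected"]},
    {"name": "severity",
     "description": "How severe is the overall failure in this trace?",
     "annotation_type": "radio",
     "labels": ["Critical", "Major", "Minor", "None"],
     "keyboard_shortcuts": {"Critical": "1", "Major": "2", "Minor": "3", "None": "4"}},
]

def check_schemes(schemes):
    """Return a list of problems; an empty list means the schemes look consistent."""
    problems = []
    for s in schemes:
        missing = REQUIRED - s.keys()
        if missing:
            problems.append(f"{s.get('name', '?')}: missing {sorted(missing)}")
        shortcuts = s.get("keyboard_shortcuts", {})
        # Every shortcut must point at a declared label...
        for label in shortcuts:
            if label not in s.get("labels", []):
                problems.append(f"{s['name']}: shortcut for unknown label {label!r}")
        # ...and no two labels may share a key.
        if len(set(shortcuts.values())) != len(shortcuts):
            problems.append(f"{s['name']}: duplicate shortcut keys")
    return problems

print(check_schemes(schemes))  # [] -> schemes are internally consistent
```

The same check extends naturally to the full config once it is loaded from `config.yaml` (e.g. with PyYAML's `yaml.safe_load`).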
Sample Data: sample-data.json
[
  {
    "id": "mast-001",
    "text": "Write a Python web scraper that extracts product prices from three e-commerce sites and saves results to a CSV file.",
    "agent_type": "CrewAI",
    "trajectory": "[Planner Agent] Step 1: Decompose task into subtasks: (1) build scraper for site A, (2) build scraper for site B, (3) build scraper for site C, (4) merge results into CSV.\n\n[Coder Agent A] Step 2: Writing scraper for site A using BeautifulSoup... Code generated. Extracts prices from div.price-tag elements.\n\n[Coder Agent B] Step 3: Writing scraper for site B using Selenium... Code generated. Extracts prices from span.cost elements.\n\n[Coder Agent C] Step 4: Writing scraper for site C using BeautifulSoup... Code generated. Extracts prices from div.price-tag elements. (Note: Copied structure from Agent A but site C uses span.product-price)\n\n[Merger Agent] Step 5: Combining outputs into CSV... Agent A output: 45 prices. Agent B output: 38 prices. Agent C output: 0 prices (empty list).\n\n[Merger Agent] Step 6: CSV generated with data from sites A and B only. Reports task as complete.",
    "outcome": "PARTIAL FAILURE: CSV file created but missing all data from site C. Agent C copied incorrect selectors from Agent A without adapting to site C's HTML structure. Merger Agent did not flag the empty results."
  },
  {
    "id": "mast-002",
    "text": "Collaboratively write a research literature review on transformer architectures, covering attention mechanisms, efficiency improvements, and applications.",
    "agent_type": "AutoGen",
    "trajectory": "[Coordinator] Step 1: Assigning sections - Writer A: attention mechanisms, Writer B: efficiency improvements, Writer C: applications.\n\n[Writer A] Step 2: Drafting attention section... Covers self-attention, multi-head attention, cross-attention. 800 words generated.\n\n[Writer B] Step 3: Drafting efficiency section... Covers sparse attention, linear attention, FlashAttention. Also includes a paragraph on self-attention basics (duplicating Writer A's content).\n\n[Writer C] Step 4: Drafting applications section... Covers NLP, vision, and speech. References 'the efficient attention methods discussed above' but Writer B's section will appear after Writer C's in final document.\n\n[Editor Agent] Step 5: Merging sections in order: A, C, B. Does not notice the ordering creates a forward reference issue in section C.\n\n[Coordinator] Step 6: Final review - approves document. Total: 2,400 words.",
    "outcome": "FAILURE: Document has duplicated content between sections A and B, and section C references content that appears later in the document. Editor did not catch structural issues."
  }
]
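Note that the `html_layout` template renders fields beyond the two declared in `item_properties` ({{agent_type}}, {{trajectory}}, {{outcome}}), so every data item must carry them. A quick stdlib sketch that extracts the layout's placeholders and checks each item against them; the item here is an abbreviated copy of mast-001, and the layout string is reduced to its placeholders:

```python
import json
import re

# Abbreviated copy of one sample item; the real data lives in sample-data.json.
items = json.loads("""[
  {"id": "mast-001",
   "text": "Write a Python web scraper...",
   "agent_type": "CrewAI",
   "trajectory": "[Planner Agent] Step 1: ...",
   "outcome": "PARTIAL FAILURE: ..."}
]""")

# Placeholders the html_layout expects each item to supply.
layout = "{{text}} {{agent_type}} {{trajectory}} {{outcome}}"
placeholders = set(re.findall(r"\{\{(\w+)\}\}", layout))

# Collect any items missing a field the layout renders.
missing = [(item["id"], sorted(placeholders - item.keys()))
           for item in items if placeholders - item.keys()]
print(missing)  # [] -> every item carries every field the layout renders
```

Running this against the full eight-item file before `potato start` catches missing fields early, instead of rendering blank template slots mid-annotation.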
// ... and 6 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/mast-failure-taxonomy
potato start config.yaml
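Because `annotation_per_instance: 2` collects two labels per trace, downstream analysis needs an agreement pass before the labels are usable. A minimal sketch, assuming the output under `annotation_output/` has been flattened into `(annotator, item_id, severity)` records; this record shape is hypothetical, and Potato's actual JSON output schema may differ:

```python
from collections import defaultdict

# Hypothetical flattened records: (annotator, item_id, severity label).
records = [
    ("ann1", "mast-001", "Major"),
    ("ann2", "mast-001", "Major"),
    ("ann1", "mast-002", "Critical"),
    ("ann2", "mast-002", "Major"),
]

# Group severity labels by item.
by_item = defaultdict(list)
for annotator, item_id, severity in records:
    by_item[item_id].append(severity)

def raw_agreement(by_item):
    """Fraction of doubly-annotated items whose two severity labels match."""
    pairs = [labels for labels in by_item.values() if len(labels) == 2]
    agree = sum(1 for a, b in pairs if a == b)
    return agree / len(pairs)

print(raw_agreement(by_item))  # 0.5 -> mast-002 disagrees and needs adjudication
```

Items where the two annotators disagree (here, mast-002) are the natural candidates for a third adjudication pass; a chance-corrected statistic such as Cohen's kappa would be the next step for reporting.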
Related Designs
AgentBoard Progress Scoring
Assess multi-turn LLM agent progress by identifying achieved milestones, scoring overall progress, categorizing the agent environment, and noting partial progress observations.
tau-bench Agent Evaluation
Evaluate tool-agent-user interactions in customer service domains by judging task success, conversation quality, tool use correctness, and providing evaluation rationale.
DocBank Document Layout Detection
Document layout analysis benchmark (Li et al., COLING 2020). Detect and classify document elements including titles, abstracts, paragraphs, figures, tables, and captions.