MAST Failure Taxonomy
Annotate multi-agent system traces to identify failure modes from the MAST taxonomy, rate severity, pinpoint the first failure step, and describe the failure mechanism.
Configuration File: config.yaml
# MAST Failure Taxonomy
# Based on "Why Do Multi-Agent LLM Systems Fail?" (Cemri et al., arXiv 2025)
# Task: Classify failure modes in multi-agent LLM system traces using the MAST taxonomy
annotation_task_name: "MAST Failure Taxonomy"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
html_layout: |
  <div class="container" style="font-family: Arial, sans-serif; max-width: 1000px; margin: 0 auto;">
    <div style="background: #e8f4fd; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h3 style="margin: 0 0 8px 0; color: #1a5276;">Task</h3>
      <p style="margin: 0; font-size: 15px;">{{text}}</p>
      <span style="display: inline-block; margin-top: 8px; background: #8e44ad; color: #fff; padding: 3px 10px; border-radius: 12px; font-size: 12px;">{{agent_type}}</span>
    </div>
    <div style="background: #fafafa; border: 1px solid #ddd; padding: 14px; border-radius: 8px; margin-bottom: 14px;">
      <h3 style="margin: 0 0 10px 0; color: #2c3e50;">Multi-Agent Trace</h3>
      <div style="font-size: 14px; line-height: 1.8; white-space: pre-wrap;">{{trajectory}}</div>
    </div>
    <div style="background: #fdedec; border: 1px solid #e74c3c; padding: 14px; border-radius: 8px;">
      <h4 style="margin: 0 0 8px 0; color: #922b21;">Outcome</h4>
      <p style="margin: 0; font-size: 14px;">{{outcome}}</p>
    </div>
  </div>
annotation_schemes:
  - name: failure_modes
    description: "Select all failure modes from the MAST taxonomy that apply to this trace."
    annotation_type: multiselect
    labels:
      - "Specification Ambiguity"
      - "Task Decomposition Error"
      - "Resource Misallocation"
      - "Role Confusion"
      - "Miscommunication"
      - "Information Withholding"
      - "Conflicting Actions"
      - "Cascading Errors"
      - "Premature Termination"
      - "Infinite Loop"
      - "Hallucinated Action"
      - "Tool Misuse"
      - "Verification Failure"
      - "No Failure Detected"
  - name: severity
    description: "How severe is the overall failure in this trace?"
    annotation_type: radio
    labels:
      - "Critical"
      - "Major"
      - "Minor"
      - "None"
    keyboard_shortcuts:
      "Critical": "1"
      "Major": "2"
      "Minor": "3"
      "None": "4"
  - name: first_failure_step
    description: "At which step did the first failure occur?"
    annotation_type: radio
    labels:
      - "Step 1"
      - "Step 2"
      - "Step 3"
      - "Step 4"
      - "Step 5"
      - "Step 6"
      - "No Failure"
  - name: failure_description
    description: "Explain the failure: what went wrong, why, and how it affected the outcome."
    annotation_type: text
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 2
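Before launching the task, it can help to sanity-check the annotation schemes, e.g. that every scheme declares the fields this config relies on and that keyboard shortcuts map only to declared labels. A minimal stdlib sketch, using a hand-copied (and abbreviated) fragment of the schemes above so it runs standalone; the required-field set is an assumption based on how this config uses each scheme, not a statement of Potato's full schema:

```python
# Sanity-check annotation_schemes entries before launching Potato.
# REQUIRED is an assumption drawn from the fields this config uses.
REQUIRED = {"name", "description", "annotation_type", "labels"}

schemes = [
    {"name": "failure_modes",
     "description": "Select all failure modes from the MAST taxonomy that apply to this trace.",
     "annotation_type": "multiselect",
     "labels": ["Specification Ambiguity", "Task Decomposition Error", "No Failure Detected"]},
    {"name": "severity",
     "description": "How severe is the overall failure in this trace?",
     "annotation_type": "radio",
     "labels": ["Critical", "Major", "Minor", "None"],
     "keyboard_shortcuts": {"Critical": "1", "Major": "2", "Minor": "3", "None": "4"}},
]

def check_schemes(schemes):
    """Return a list of problems; an empty list means the schemes look consistent."""
    problems = []
    for s in schemes:
        missing = REQUIRED - s.keys()
        if missing:
            problems.append(f"{s.get('name', '?')}: missing {sorted(missing)}")
        shortcuts = s.get("keyboard_shortcuts", {})
        # Every shortcut must point at a declared label...
        for label in shortcuts:
            if label not in s.get("labels", []):
                problems.append(f"{s['name']}: shortcut for unknown label {label!r}")
        # ...and no two labels may share a key.
        if len(set(shortcuts.values())) != len(shortcuts):
            problems.append(f"{s['name']}: duplicate shortcut keys")
    return problems

print(check_schemes(schemes))  # [] -> schemes are internally consistent
```

The same check extends naturally to the full config once it is loaded from `config.yaml` (e.g. with PyYAML's `yaml.safe_load`).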
Sample Data: sample-data.json
[
  {
    "id": "mast-001",
    "text": "Write a Python web scraper that extracts product prices from three e-commerce sites and saves results to a CSV file.",
    "agent_type": "CrewAI",
    "trajectory": "[Planner Agent] Step 1: Decompose task into subtasks: (1) build scraper for site A, (2) build scraper for site B, (3) build scraper for site C, (4) merge results into CSV.\n\n[Coder Agent A] Step 2: Writing scraper for site A using BeautifulSoup... Code generated. Extracts prices from div.price-tag elements.\n\n[Coder Agent B] Step 3: Writing scraper for site B using Selenium... Code generated. Extracts prices from span.cost elements.\n\n[Coder Agent C] Step 4: Writing scraper for site C using BeautifulSoup... Code generated. Extracts prices from div.price-tag elements. (Note: Copied structure from Agent A but site C uses span.product-price)\n\n[Merger Agent] Step 5: Combining outputs into CSV... Agent A output: 45 prices. Agent B output: 38 prices. Agent C output: 0 prices (empty list).\n\n[Merger Agent] Step 6: CSV generated with data from sites A and B only. Reports task as complete.",
    "outcome": "PARTIAL FAILURE: CSV file created but missing all data from site C. Agent C copied incorrect selectors from Agent A without adapting to site C's HTML structure. Merger Agent did not flag the empty results."
  },
  {
    "id": "mast-002",
    "text": "Collaboratively write a research literature review on transformer architectures, covering attention mechanisms, efficiency improvements, and applications.",
    "agent_type": "AutoGen",
    "trajectory": "[Coordinator] Step 1: Assigning sections - Writer A: attention mechanisms, Writer B: efficiency improvements, Writer C: applications.\n\n[Writer A] Step 2: Drafting attention section... Covers self-attention, multi-head attention, cross-attention. 800 words generated.\n\n[Writer B] Step 3: Drafting efficiency section... Covers sparse attention, linear attention, FlashAttention. Also includes a paragraph on self-attention basics (duplicating Writer A's content).\n\n[Writer C] Step 4: Drafting applications section... Covers NLP, vision, and speech. References 'the efficient attention methods discussed above' but Writer B's section will appear after Writer C's in final document.\n\n[Editor Agent] Step 5: Merging sections in order: A, C, B. Does not notice the ordering creates a forward reference issue in section C.\n\n[Coordinator] Step 6: Final review - approves document. Total: 2,400 words.",
    "outcome": "FAILURE: Document has duplicated content between sections A and B, and section C references content that appears later in the document. Editor did not catch structural issues."
  }
]
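Note that the `html_layout` template renders fields beyond the two declared in `item_properties` ({{agent_type}}, {{trajectory}}, {{outcome}}), so every data item must carry them. A quick stdlib sketch that extracts the layout's placeholders and checks each item against them; the item here is an abbreviated copy of mast-001, and the layout string is reduced to its placeholders:

```python
import json
import re

# Abbreviated copy of one sample item; the real data lives in sample-data.json.
items = json.loads("""[
  {"id": "mast-001",
   "text": "Write a Python web scraper...",
   "agent_type": "CrewAI",
   "trajectory": "[Planner Agent] Step 1: ...",
   "outcome": "PARTIAL FAILURE: ..."}
]""")

# Placeholders the html_layout expects each item to supply.
layout = "{{text}} {{agent_type}} {{trajectory}} {{outcome}}"
placeholders = set(re.findall(r"\{\{(\w+)\}\}", layout))

# Collect any items missing a field the layout renders.
missing = [(item["id"], sorted(placeholders - item.keys()))
           for item in items if placeholders - item.keys()]
print(missing)  # [] -> every item carries every field the layout renders
```

Running this against the full eight-item file before `potato start` catches missing fields early, instead of rendering blank template slots mid-annotation.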
// ... and 6 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/mast-failure-taxonomy
potato start config.yaml
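Because `annotation_per_instance: 2` collects two labels per trace, downstream analysis needs an agreement pass before the labels are usable. A minimal sketch, assuming the output under `annotation_output/` has been flattened into `(annotator, item_id, severity)` records; this record shape is hypothetical, and Potato's actual JSON output schema may differ:

```python
from collections import defaultdict

# Hypothetical flattened records: (annotator, item_id, severity label).
records = [
    ("ann1", "mast-001", "Major"),
    ("ann2", "mast-001", "Major"),
    ("ann1", "mast-002", "Critical"),
    ("ann2", "mast-002", "Major"),
]

# Group severity labels by item.
by_item = defaultdict(list)
for annotator, item_id, severity in records:
    by_item[item_id].append(severity)

def raw_agreement(by_item):
    """Fraction of doubly-annotated items whose two severity labels match."""
    pairs = [labels for labels in by_item.values() if len(labels) == 2]
    agree = sum(1 for a, b in pairs if a == b)
    return agree / len(pairs)

print(raw_agreement(by_item))  # 0.5 -> mast-002 disagrees and needs adjudication
```

Items where the two annotators disagree (here, mast-002) are the natural candidates for a third adjudication pass; a chance-corrected statistic such as Cohen's kappa would be the next step for reporting.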
Related Designs
AgentBoard Progress Scoring
Assess multi-turn LLM agent progress by identifying achieved milestones, scoring overall progress, categorizing the agent environment, and noting partial progress observations.
tau-bench Agent Evaluation
Evaluate tool-agent-user interactions in customer service domains by judging task success, conversation quality, tool use correctness, and providing evaluation rationale.
DocBank Document Layout Detection
Document layout analysis benchmark (Li et al., COLING 2020). Detect and classify document elements including titles, abstracts, paragraphs, figures, tables, and captions.