
The Post-Mortem Prompt: What Happened, Why, Fixes, and Owner Next Steps



The term “Post-Mortem” in the technology world is not about finality; it’s about starting over better. It is the ritualistic, non-blaming process of dissecting a system failure—an outage, a security breach, or a major production bug—to understand what happened, why it occurred, and, most critically, how to prevent its recurrence.

But transforming the chaos of an incident—the stream of Slack messages, the fragmented log snippets, the panicked Jira updates—into a single, coherent, objective document is a massive, time-consuming effort. It often falls on the very engineers who just spent 12 hours fixing the problem. The result? Post-mortems that are rushed, blaming, or, worst of all, purely descriptive, lacking actionable preventative steps.

To solve this, we introduce The Post-Mortem Prompt (PMP): a single, structured LLM prompt architecture designed to ingest the raw, messy incident data and compile it, through four rigorous analytical passes, into a standardized, complete, and actionable post-mortem document. The PMP forces the AI to act not as a summarizer, but as an objective, non-blaming Root Cause Analysis Engine.


The Failure of the Human-Written Post-Mortem

Before designing the PMP, we must understand the three critical flaws in incident documentation written under pressure:

1. The Blame Trap

Human-written documents, particularly early drafts, often contain language that implies or outright assigns fault (“Bob missed the check,” “The team failed to update the config”). This is antithetical to a healthy engineering culture. The goal of a post-mortem is to identify systemic weaknesses, not human errors. The AI must be strictly constrained to non-blaming, passive, and systemic language.

2. The Narrative Fluff

Teams often include lengthy descriptions of the war room, internal politics, or tangential issues. While useful for context, this narrative fluff obscures the core timeline and root cause. The document must be ruthlessly filtered to focus only on measurable events and causal chains.

3. Conceptual Fixes

A common problem is the inclusion of vague action items: “We need to improve monitoring,” or “Increase system resilience.” These are not tickets; they are aspirations. A successful post-mortem requires fixes that are specific, measurable, achievable, relevant, and time-bound (SMART).

The PMP architecture is built on four analytical phases, corresponding directly to the mandated sections of a high-quality post-mortem report.


Phase 1: The Timeline and Objective Summary (What Happened)

The first and most critical section of any post-mortem is the objective timeline. It strips away all speculation and emotion, leaving only a sequence of facts. The PMP must instruct the AI to execute a Mandatory Log-Filtering Pass over the raw input data.

1. Enforcing a Standardized Structure

Raw incident data includes chat logs, monitoring alerts, and deployment records. The AI’s first task is to unify this into a single, clean table format.

Prompt Directive: “Process all input data (logs, chat history, Jira updates). Filter out all conversational text and political commentary. Create an objective, chronological table using the following mandatory column structure: UTC Timestamp | Event Description | Detected By | User-Facing Effect.”

The use of a mandatory table structure forces the LLM to discard unstructured prose and map every data point to a specific, cold, hard fact.
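For illustration, the opening rows of such a table might look like the following (all timestamps and events are invented):

| UTC Timestamp | Event Description | Detected By | User-Facing Effect |
|---|---|---|---|
| 2025-11-17 03:41 | Database connection pool reached 100% utilization | Monitoring alert | None yet |
| 2025-11-17 03:42 | AuthService began returning HTTP 500 on login requests | PagerDuty alert | Users unable to log in |
| 2025-11-17 03:55 | Offending batch job rolled back; connection pool recovered | On-call engineer | Login success rate restored |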

2. Defining Key Milestones

To summarize the incident, the AI must identify specific timestamps for key milestones. This ensures the high-level summary is accurate and measurable.

Prompt Directive: “From the generated timeline, extract and explicitly define the following four key milestones: Detection Time (when monitoring first fired), Incident Start Time (when user-facing impact began), Mitigation Time (when the immediate impact was stopped, but not fully resolved), and Resolution Time (when service was fully restored and stable). Use these four data points to generate the high-level summary.”

The objective summary that precedes the timeline must be constrained to a maximum of 150 words and use only data derived from these four milestones and the timeline table. This prevents the LLM from generating excessive introductory narrative.



Phase 2: The Root Cause Analysis (Why)

Moving from “What Happened” to “Why” requires the LLM to transition from a factual summarizer to a rigorous analyst. The most effective technique for achieving this, which must be mandated in the prompt, is the Five Whys framework.

1. Mandating the Five Whys Technique

The common human error is stopping at the first “Why.” (“Why did the service fail? Because the database crashed.”) The LLM must be forced to drill down until it identifies a systemic failure in process, architecture, or resource management.

Prompt Directive: “Apply the Five Whys analytical framework to the Incident Start Time event. Do not stop at the first or second ‘Why.’ Continue the sequence until the root cause identified is a systemic failure (e.g., lack of automated testing, insufficient architectural isolation, or outdated documentation) and not a direct human operational error.”

The AI’s output for this section should be a simple, numbered list of the five (or more) sequential “Why” questions and answers, clearly showing the progression from symptom to root cause.
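A hypothetical chain, using an invented database incident, shows what that progression should look like:

1. Why did logins fail? Because AuthService requests to the database timed out.
2. Why did the requests time out? Because the database connection pool was exhausted.
3. Why was the pool exhausted? Because a newly deployed batch job opened connections without releasing them.
4. Why did that defect reach production? Because connection usage is not exercised by the existing integration tests.
5. Why is it not exercised? Because the deployment pipeline has no load-testing stage (the systemic root cause).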

2. Differentiating Factors from Causes

A successful post-mortem separates contributing factors (things that made the incident worse) from the single, primary root cause (the initiator).

Prompt Directive: “Based on the Five Whys outcome, clearly define and separate the Primary Root Cause (the initiating systemic failure) and Contributing Factors (e.g., inadequate alerting threshold, lack of canary deployment, poor internal documentation) that amplified the incident’s duration or impact. List the factors separately in a bulleted list.”

3. The Non-Blaming Language Filter

This is a persona constraint that must be applied throughout the Root Cause section to uphold a learning culture.

Prompt Directive: “Maintain a strictly non-blaming, objective tone. Use passive voice when discussing actions (e.g., ‘A configuration change was deployed’ instead of ‘The engineer deployed a configuration change’). Never reference specific names or roles unless explicitly required to identify the owner of a system or service, not a person.”


Phase 3: Actionable, Preventative Fixes (Fixes)

The output of the root cause analysis must be directly transformed into measurable work items. This is where the PMP prevents conceptual fixes and enforces SMART goals.

1. Classification of Fixes

Not all fixes are created equal. Some are quick safety nets; others are major architectural investments. The LLM must classify them.

Prompt Directive: “Generate a list of mandatory actions derived directly from the Primary Root Cause and all Contributing Factors. Classify each action item into one of two categories: Short-Term Mitigation (immediate changes to prevent recurrence in the next 7 days, e.g., rollback or temporary cap) and Long-Term Prevention (architectural/process changes that eliminate the root cause, requiring planning and resources).”
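Continuing the invented database incident from above, the split might look like this:

- Short-Term Mitigation: roll back the offending batch job and cap its database connections until a corrected version ships.
- Long-Term Prevention: add a load-testing stage to the deployment pipeline that fails a release when connection pool usage exceeds a defined threshold.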

2. The SMART Fix Constraint

The LLM tends to write high-level recommendations. The PMP must force it to generate ticket-ready work items.

Prompt Directive: “Ensure every action item is SMART: specific, measurable, and tied to a concrete system change. Do not use phrases like ‘Improve logging’ or ‘Increase monitoring.’ Instead, use phrases like ‘Implement structured JSON logging for AuthService on error level,’ or ‘Adjust PagerDuty trigger threshold for Database Connection Pool usage from 85% to 70%.’”

3. The Ticket Naming Convention

To integrate seamlessly with project tracking tools, the fixes must adhere to a strict naming standard.

Prompt Directive: “Prefix all Long-Term Prevention actions with a standardized ticket title convention: [PM-YYYY-MM-DD-INCIDENT] Action Description (e.g., [PM-2025-11-17-DB_CRASH] Implement connection throttling middleware for database).”


Phase 4: Ownership and Next Steps

A post-mortem without accountability is just an expensive documentation exercise. The final phase of the PMP links every Long-Term Fix to an accountable party and defines the follow-through process.

1. Mandatory Owner Assignment

The LLM must be explicitly told to assign the work to a role or team, not an individual, to prevent single points of failure and burnout.

Prompt Directive: “For every Long-Term Prevention action item, you must append three mandatory fields: Owner Role/Team (e.g., SRE Team, API Gateway Team, Data Governance), Priority (P1-Critical, P2-High, P3-Medium), and Estimated Size (S, M, L—corresponding to 1 week, 1-3 weeks, or >1 month of effort). This forms a final, ready-to-execute action registry.”

The resulting Action Registry table serves as the primary deliverable for the team.
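A hypothetical registry entry, reusing the ticket convention from Phase 3 (owner, priority, and size are invented), might read:

| Action | Owner Role/Team | Priority | Estimated Size |
|---|---|---|---|
| [PM-2025-11-17-DB_CRASH] Implement connection throttling middleware for database | SRE Team | P1-Critical | M |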

2. The Follow-Up Mandate

Preventing a post-mortem from being filed and forgotten requires a mandated follow-up review.

Prompt Directive: “Specify a mandatory Review and Validation Meeting to be scheduled 30 days after the Resolution Time. The purpose of this meeting is to review the status of all P1 and P2 Long-Term Prevention fixes. The owner of the post-mortem document (the SRE Lead role) is responsible for scheduling this meeting.”

This simple instruction ensures a closed-loop system: the failure is documented, the fixes are created, and the follow-up is scheduled automatically.



The Unified Post-Mortem Prompt (PMP) Template

By combining these four phases, we create a powerful, self-correcting analytical prompt that transforms chaotic data into structured learning.


**[SYSTEM INSTRUCTION: POST MORTEM PROMPT (PMP) ENGINE 4.0]**

**1. OBJECTIVITY & TONE:** Act as a senior, non-blaming Site Reliability Engineer (SRE). Your output must be strictly objective, factual, and focused on systemic improvements. Use passive voice exclusively when describing failure mechanisms.

**2. PHASE 1: TIMELINE & SUMMARY (WHAT HAPPENED):**

– **Log-Filtering Pass:** Filter out all conversational text. Create a mandatory, chronological table structure: `UTC Timestamp | Event Description | Detected By | User-Facing Effect`.

– **Milestones:** Identify and define **Detection Time, Incident Start Time, Mitigation Time, and Resolution Time**.

– **Summary:** Generate a 150-word maximum summary using only data derived from these milestones.

**3. PHASE 2: ROOT CAUSE ANALYSIS (WHY):**

– **Five Whys Mandate:** Rigorously apply the **Five Whys** framework until the cause is systemic (architectural, process, or resource-related), not human error.

– **Causal Separation:** Clearly separate the **Primary Root Cause** (the initiating failure) from all **Contributing Factors** (items that worsened the impact). List factors separately.

**4. PHASE 3: ACTIONABLE FIXES (FIXES):**

– **Classification:** Categorize all generated actions as **Short-Term Mitigation** (immediate) or **Long-Term Prevention** (architectural/process).

– **SMART Constraint:** Ensure all actions are specific, measurable, and ticket-ready. Avoid conceptual phrases (e.g., replace “Improve resilience” with “Implement connection retry logic on Client V3”).

– **Ticket Convention:** Prefix all Long-Term Prevention actions with the required ticket title convention: `[PM-YYYY-MM-DD-INCIDENT] Action Description`.

**5. PHASE 4: OWNERSHIP & NEXT STEPS:**

– **Action Registry:** For every Long-Term Prevention action, mandate three appended fields: **Owner Role/Team**, **Priority** (P1/P2/P3), and **Estimated Size** (S/M/L).

– **Follow-up:** Mandate a **Review and Validation Meeting** 30 days post-resolution, to be owned by the SRE Lead.

**[SOURCE DATA INPUT START]**

[PASTE ALL LOGS, CHAT HISTORY, MONITORING ALERTS, AND TEAM NOTES HERE]

**[SOURCE DATA INPUT END]**
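To slot the template into an automated workflow, here is a minimal sketch that sends the PMP system instruction plus the raw incident data to an LLM using the OpenAI Python client. The file names, model choice, and client library are assumptions; substitute whatever your team actually runs.

```python
# pmp_run.py: minimal sketch of running the PMP against raw incident data.
# Assumes the OpenAI Python client (pip install openai) and an OPENAI_API_KEY
# in the environment; file names and the model name are placeholders.
from pathlib import Path

from openai import OpenAI

pmp_system_prompt = Path("pmp_template.md").read_text()   # the PMP Engine 4.0 template above
raw_incident_data = Path("incident_raw.txt").read_text()  # concatenated logs, chat, alerts, notes

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any capable model works
    temperature=0,   # keep the output factual and repeatable
    messages=[
        {"role": "system", "content": pmp_system_prompt},
        {
            "role": "user",
            "content": (
                "[SOURCE DATA INPUT START]\n"
                + raw_incident_data
                + "\n[SOURCE DATA INPUT END]"
            ),
        },
    ],
)

# The draft post-mortem arrives as a single Markdown document.
Path("postmortem_draft.md").write_text(response.choices[0].message.content)
```

The draft still deserves a human review pass before it circulates, but the timeline structure, milestones, and non-blaming tone are already enforced by the template.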


Conclusion: Turning Failure into a Database of Learning

The true power of The Post-Mortem Prompt is not merely in documenting the failure, but in transforming failure into a reliable, structured database of learning.

By automating the most difficult and subjective parts of incident response—the analytical extraction and the non-blaming narrative construction—teams save valuable engineering time and ensure consistency. Every incident, regardless of its severity, is dissected under the same high-standard lens.

The PMP allows an organization to scale its learning process, ensuring that the mistakes of yesterday are correctly and objectively translated into the architecture and processes of tomorrow. It moves the post-mortem from being a dreaded, administrative task to a high-value, automated component of a healthy SRE workflow.

