What is a Post-Mortem?
A post-mortem is a structured process where teams analyze a past incident to document what happened, why it happened, and what should change to prevent recurrence. The term comes from medical practice—literally “after death”—but in engineering, it’s about learning from failures rather than assigning blame.
The goal isn’t to point fingers. It’s to understand systemic issues, capture institutional knowledge, and implement concrete improvements.
When done well, post-mortems transform incidents from disruptions into opportunities. When done poorly, they become blame sessions that destroy psychological safety and prevent honest conversation about what really went wrong.
Why Post-Mortems Matter
Most teams know they should run post-mortems. Few understand why they’re critical.
Post-mortems build learning culture. Teams that consistently analyze failures develop pattern recognition. They spot early warning signs, anticipate cascading failures, and respond faster when similar issues arise. This knowledge compounds over time—but only if it’s documented and shared.
Post-mortems prevent repeat incidents. Without systematic analysis, teams fix symptoms instead of root causes. The database was slow, so you restarted it. But why was it slow? Without answering that question, the same issue returns next week with a different symptom.
Post-mortems improve team performance. Engineers learn faster from documented failures than from scattered tribal knowledge. New team members can read past post-mortems to understand system behavior, common failure modes, and proven response procedures.
The alternative to post-mortems isn’t “moving fast”—it’s repeating mistakes until they become normal.
When to Run a Post-Mortem
Not every issue needs a full post-mortem. Run them for:
- Customer-impacting incidents: Any outage, degradation, or error that affected users
- Near-misses: Issues that almost caused customer impact
- Pattern incidents: Recurring problems even if impact was minor
- Learning opportunities: Novel failures that reveal system gaps
Timing matters. Hold the meeting 24-72 hours after incident resolution. Too soon, and engineers are exhausted. Too late, and memories fade.
Block 60-90 minutes. Shorter meetings rush through analysis. Longer meetings lose focus.
Before the Meeting: Preparation
A post-mortem meeting is only as good as the preparation behind it. Before the meeting:
1. Reconstruct the Timeline
Build a detailed timeline of the incident from start to resolution. This should include:
- Detection: When was the issue first noticed? By whom?
- Actions taken: Every intervention, in chronological order
- Decision points: Why did responders choose specific actions?
- Communication: When were stakeholders notified?
- Resolution: What ultimately fixed the issue?
This is where thorough incident documentation pays off. Platforms like Upstat automatically capture activity timelines, participant actions, and threaded discussions during incidents, eliminating the need to manually piece together events from Slack threads and scattered notes. The richer your incident data, the faster you can reconstruct what actually happened.
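If you do end up stitching the timeline together by hand, it helps to normalize every event into one shape before sorting. Here is a minimal Python sketch, assuming you're merging entries pulled from chat exports and alert history; the field names are illustrative, not any particular tool's schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TimelineEvent:
    timestamp: datetime   # when it happened, not when it was written down
    source: str           # e.g. "monitoring", "chat", "pager", "deploy log"
    actor: str            # person or system that acted
    description: str      # plain statement of fact, no interpretation

def build_timeline(events: list[TimelineEvent]) -> list[TimelineEvent]:
    """Merge events from every source and order them chronologically."""
    return sorted(events, key=lambda e: e.timestamp)

# Two entries recovered from different sources, deliberately out of order
events = [
    TimelineEvent(datetime(2024, 3, 5, 14, 18), "pager", "on-call engineer",
                  "Acknowledged the elevated-latency alert"),
    TimelineEvent(datetime(2024, 3, 5, 14, 14), "monitoring", "alerting system",
                  "Alerted on elevated API latency"),
]
for event in build_timeline(events):
    print(f"{event.timestamp:%H:%M}  [{event.source}] {event.description}")
```

The design choice that matters is sorting on a single, authoritative timestamp; mixed time zones and "when it was typed into chat" timestamps are the usual source of confusion.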
2. Gather Context
Collect supporting information:
- Error logs and stack traces
- Monitoring graphs and alerts
- Code commits or configuration changes before the incident
- Customer reports or support tickets
3. Set the Agenda
Share a structured agenda beforehand:
- Timeline walkthrough (10 minutes)
- What went well? (10 minutes)
- What went poorly? (15 minutes)
- Root cause analysis (20 minutes)
- Action items (15 minutes)
- Wrap-up (5 minutes)
4. Assign a Facilitator
The facilitator isn't there to dominate the discussion; they're there to create psychological safety. Their job is to:
- Redirect blame-oriented language
- Ensure everyone contributes
- Keep discussion on track
- Document key points and action items
Critical rule: The facilitator should NOT be the incident lead or anyone directly involved in critical decisions. They need objectivity.
The Post-Mortem Meeting
Set the Tone
Start by explicitly stating the meeting is blameless. Say it out loud:
“This is a blameless post-mortem. We’re here to understand systemic failures, not assign fault. If we find process gaps or unclear documentation, that’s what we fix—not the person who encountered them.”
This isn’t corporate speak. Research from Google’s SRE teams shows that blameless culture correlates directly with long-term reliability improvements. Teams that blame individuals for failures stop reporting problems honestly.
Walk Through the Timeline
Present the timeline chronologically. Avoid jumping to conclusions. Just state facts:
- “At 2:14 PM, monitoring alerted on elevated API latency”
- “At 2:18 PM, the on-call engineer acknowledged the alert”
- “At 2:23 PM, the database team was paged”
Let participants fill in context:
- “Why did it take 9 minutes to page the database team?”
- “What information was available at 2:18 PM?”
Identify What Went Well
This sounds counterintuitive during an outage review, but it’s essential. What worked? What prevented worse impact?
Examples:
- “Monitoring caught the issue before customers reported it”
- “The rollback procedure worked correctly”
- “Communication to stakeholders was clear and timely”
Recognizing what worked reinforces good practices and balances the conversation.
Analyze What Went Poorly
Now examine failures. Focus on systems and processes, not people.
Instead of: “John deployed broken code”
Say: “The deployment didn’t trigger automated tests”

Instead of: “Sarah took too long to respond”
Say: “Our escalation policy didn’t account for after-hours pages”
Notice the difference? The first version assigns blame. The second version identifies systemic gaps.
Root Cause Analysis Techniques
The 5 Whys
Start with the symptom and ask “why” repeatedly until you reach the root cause:
Symptom: Database became unresponsive
- Why? Connection pool exhausted
- Why? API made too many concurrent queries
- Why? Rate limiting wasn’t enforced on the endpoint
- Why? Rate limiting configuration wasn’t documented
- Why? No process exists for documenting operational limits
Five whys later, you’ve moved from “database issue” to “documentation process gap.” That’s the real root cause.
Warning: Asking “Why did you…” sounds like blame. Ask “Why did the system…” instead.
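If you want the chain to outlive the whiteboard, capture it as structured data in the post-mortem record. A minimal sketch, reusing the example chain above; the class and method names are hypothetical, not part of any framework.

```python
from dataclasses import dataclass, field

@dataclass
class FiveWhys:
    symptom: str
    whys: list[str] = field(default_factory=list)

    def ask(self, answer: str) -> "FiveWhys":
        """Record the answer to the next 'why did the system...?' question."""
        self.whys.append(answer)
        return self

    @property
    def root_cause(self) -> str:
        # The deepest answer in the chain is the working root cause.
        return self.whys[-1] if self.whys else self.symptom

analysis = (
    FiveWhys(symptom="Database became unresponsive")
    .ask("Connection pool exhausted")
    .ask("API made too many concurrent queries")
    .ask("Rate limiting wasn't enforced on the endpoint")
    .ask("Rate limiting configuration wasn't documented")
    .ask("No process exists for documenting operational limits")
)
print(analysis.root_cause)  # prints the process gap, not "database issue"
```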
Contributing Factors
Root causes are rarely singular. Look for contributing factors:
- Technical: Configuration errors, software bugs, infrastructure limitations
- Process: Missing runbooks, unclear escalation policies, inadequate testing
- Communication: Delayed notifications, unclear responsibilities, poor handoffs
- External: Third-party outages, unexpected traffic spikes, coordinated attacks
Document all contributing factors, even if they’re uncomfortable. The goal is learning, not looking good.
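Tagging each factor with a category also makes patterns visible across post-mortems, not just within one. A small sketch, assuming the same incident as above; the categories mirror the list in this section and the specific factors are illustrative.

```python
from enum import Enum

class FactorCategory(Enum):
    TECHNICAL = "technical"          # configuration errors, bugs, infrastructure limits
    PROCESS = "process"              # missing runbooks, unclear escalation, weak testing
    COMMUNICATION = "communication"  # delayed notifications, unclear responsibilities
    EXTERNAL = "external"            # third-party outages, traffic spikes, attacks

# Contributing factors for the connection-pool incident, tagged for later analysis
contributing_factors = [
    (FactorCategory.TECHNICAL, "Rate limiting was not enforced on the endpoint"),
    (FactorCategory.PROCESS, "No process for documenting operational limits"),
    (FactorCategory.COMMUNICATION, "Database team was paged 9 minutes after the first alert"),
]
```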
Creating Action Items
Action items are the only thing that matters. Without them, the post-mortem was a waste of time.
Bad action item:
“Improve monitoring”
Good action item:
“Add database connection pool monitoring with alerts at 80% capacity. Owner: platform team lead. Deadline: 2 weeks.”
Every action item needs:
- Specific task: What exactly will be done?
- Owner: Who is responsible? (One person, not a team)
- Deadline: When will it be completed?
- Success criteria: How do we know it’s done?
Prioritize action items:
- Must fix: Prevents recurrence of this exact issue
- Should fix: Reduces likelihood or impact of similar issues
- Nice to have: General improvements tangentially related
Focus on “must fix” items. Don’t overcommit.
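One way to keep yourself honest is to refuse to record an action item that's missing any of the four fields, and to make its priority explicit. A minimal sketch of that structure in Python; the field and owner names are illustrative, not tied to any tracker.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class Priority(Enum):
    MUST_FIX = "must fix"          # prevents recurrence of this exact issue
    SHOULD_FIX = "should fix"      # reduces likelihood or impact of similar issues
    NICE_TO_HAVE = "nice to have"  # general improvements, tangentially related

@dataclass
class ActionItem:
    task: str              # specific task: what exactly will be done?
    owner: str             # one person, not a team
    deadline: date         # when will it be completed?
    success_criteria: str  # how do we know it's done?
    priority: Priority
    done: bool = False

item = ActionItem(
    task="Add database connection pool monitoring with alerts at 80% capacity",
    owner="platform team lead",
    deadline=date(2024, 3, 19),
    success_criteria="Alert fires in staging when pool utilization crosses 80%",
    priority=Priority.MUST_FIX,
)
```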
After the Meeting: Follow-Through
The meeting is over. Now comes the hard part: actually implementing changes.
1. Document the Post-Mortem
Write up the post-mortem within 24 hours, while the discussion is still fresh. Include:
- Incident summary (what, when, impact)
- Timeline of events
- Root cause analysis
- What went well / what went poorly
- Action items with owners and deadlines
Make this document easily discoverable. Store it where engineers naturally look—in your incident management system, wiki, or shared documentation.
2. Track Action Items
Action items mean nothing without accountability. Track corrective actions systematically, assign owners, and monitor completion status. Tools like Upstat help by maintaining incident timelines and coordinating participants, preserving the detailed documentation needed for effective follow-up.
Set up reminders for approaching deadlines. Escalate overdue items. Review completion during team meetings.
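That reminder-and-escalation loop can be as simple as a script run on a schedule, assuming your action items live somewhere you can query. A sketch along the lines of the structure above, with the data inlined here to keep it self-contained:

```python
from datetime import date

# Each entry mirrors the fields from the earlier sketch: task, owner, deadline, done.
action_items = [
    {"task": "Add connection pool monitoring with 80% alerts",
     "owner": "platform team lead", "deadline": date(2024, 3, 19), "done": False},
]

def review(items: list[dict], warn_within_days: int = 3) -> None:
    """Escalate overdue items and flag deadlines that are approaching."""
    today = date.today()
    for item in items:
        if item["done"]:
            continue
        days_left = (item["deadline"] - today).days
        if days_left < 0:
            print(f"OVERDUE by {-days_left}d: {item['task']} (owner: {item['owner']})")
        elif days_left <= warn_within_days:
            print(f"Due in {days_left}d: {item['task']} (owner: {item['owner']})")

review(action_items)
```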
3. Share Learnings
Don’t let post-mortem knowledge stay siloed. Share key learnings:
- Internally: Brief stakeholders on high-impact incidents
- Cross-team: If other teams could encounter similar issues
- Company-wide: For major outages or pattern failures
Some companies publish sanitized post-mortems externally. This builds trust with customers and contributes to industry knowledge.
Building Blameless Culture
Post-mortems only work in environments with psychological safety. Engineers need to feel safe admitting mistakes, asking questions, and surfacing problems without fear of punishment.
Red flags your culture isn’t blameless:
- Engineers say “I should have known” during post-mortems
- Managers ask “who was responsible?” before asking “what happened?”
- Action items target individuals (“Sarah needs training”) instead of systems (“improve documentation”)
- People stop volunteering information
How to build psychological safety:
- Model vulnerability: Leaders should openly discuss their own mistakes
- Reward honesty: Thank people for surfacing issues early
- Respond to language: When someone blames an individual, redirect to system gaps
- Separate performance reviews: Post-mortem participation should never factor into performance evaluations
Blameless doesn’t mean “no accountability.” It means holding people accountable for improving systems, not for the inevitable fact that complex systems fail.
Common Post-Mortem Mistakes
Mistake 1: Asking “Who?”
“Who deployed the change?” “Who approved this?” “Who didn’t catch this?”
These questions sound reasonable, but they derail analysis. They trigger defensiveness and shift focus from systems to individuals.
Ask “what” and “how” instead:
- “What process approved this change?”
- “How did this reach production?”
- “What testing would have caught this?”
Mistake 2: Stopping at the Obvious Cause
“The server crashed” isn’t a root cause. “The configuration was wrong” isn’t a root cause. Keep asking why until you reach process and system gaps.
Mistake 3: Too Many Action Items
Twenty action items means zero action items. You’ll complete two and forget the rest.
Limit to 3-5 critical items per incident. Fix the most impactful gaps first.
Mistake 4: No Follow-Up
Writing action items feels productive. Actually implementing them is hard.
Schedule follow-up reviews. Check completion rates. Escalate delays. Without follow-through, you’re just documenting problems without solving them.
Mistake 5: Waiting Too Long
Memory fades fast. Engineers forget critical details. Logs rotate. Context disappears.
Hold the post-mortem within 72 hours. If that’s impossible due to extended incidents, at least document the timeline while it’s fresh.
Conclusion: Learning as Competitive Advantage
The best engineering teams aren’t the ones that never fail—they’re the ones that learn fastest from failures.
Post-mortems are how you institutionalize that learning. They turn one engineer’s hard-earned knowledge into the whole team’s experience. They reveal patterns invisible in individual incidents. They force honest conversations about technical debt, process gaps, and organizational blind spots.
But only if you commit to blameless analysis, specific action items, and consistent follow-through.
The alternative is repeating mistakes until they define your organization. Teams that embrace post-mortems build resilience. Teams that skip them build fragility.
Your production systems will fail. The question is whether you’ll learn anything when they do.
Explore In Upstat
Track post-incident action items, maintain complete incident timelines, and build a searchable knowledge base of past failures that prevents recurrence.