What Makes an Incident Major?
Not every outage deserves the same level of analysis. Major incidents are different.
They affect significant user populations. They breach SLA commitments. They cause data loss or security exposure. They reveal fundamental architectural weaknesses or organizational gaps that smaller incidents merely hint at.
Major incidents demand structured review processes because their impact justifies the investment. A 5-minute API blip might warrant a quick retrospective. A 3-hour multi-region outage affecting thousands of customers requires systematic analysis, executive stakeholder involvement, and documented process changes.
The question is not whether to conduct reviews for major incidents—it is how to conduct them effectively enough to prevent recurrence.
Implementation Note: This guide covers industry best practices for conducting major incident reviews. While Upstat provides comprehensive incident data—including timelines, participant tracking, MTTR metrics, and activity logs—teams typically document postmortem findings in external collaboration tools like Google Docs, Confluence, or Notion. Upstat excels at capturing the incident lifecycle data that informs these reviews, allowing teams to use their preferred documentation platforms for postmortem reports while maintaining complete incident history in Upstat.
Severity Criteria That Trigger Major Incident Reviews
Organizations define major incidents differently based on business priorities, but common criteria include:
Customer Impact:
- Over 10% of users unable to access core functionality
- Complete service unavailability for any duration
- Data loss or corruption affecting customers
- Security breach or unauthorized data access
Business Impact:
- SLA breach triggering financial penalties
- Revenue loss exceeding defined thresholds
- Regulatory compliance violations
- Significant brand or reputation damage
Technical Impact:
- Multi-region or multi-service cascading failures
- Data center or availability zone complete failures
- Extended incident duration (over 2 hours for critical systems)
- Novel failure modes revealing architectural gaps
Organizational Impact:
- Incidents requiring executive escalation
- Cross-team coordination failures
- Communication breakdowns affecting customer trust
- Pattern failures revealing systemic process gaps
Not every criterion needs to be met. Any single factor can elevate an incident to major status and trigger formal review processes.
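How teams encode these thresholds varies widely. As a rough illustration only, the sketch below shows one way to express the "any single factor qualifies" rule in code; the field names and thresholds are hypothetical, not a standard.

```python
from dataclasses import dataclass

@dataclass
class IncidentFacts:
    """Hypothetical summary of an incident used to decide review depth."""
    pct_users_affected: float      # 0-100
    full_outage: bool
    data_loss: bool
    security_breach: bool
    sla_breached: bool
    duration_minutes: int
    critical_system: bool
    executive_escalation: bool

def is_major(incident: IncidentFacts) -> bool:
    """Return True if any single criterion elevates the incident to major."""
    criteria = [
        incident.pct_users_affected > 10,
        incident.full_outage,
        incident.data_loss,
        incident.security_breach,
        incident.sla_breached,
        incident.critical_system and incident.duration_minutes > 120,
        incident.executive_escalation,
    ]
    return any(criteria)

# Example: a 3-hour outage of a critical system is major even with low user impact.
print(is_major(IncidentFacts(2.0, False, False, False, False, 180, True, False)))  # True
```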
Preparation: Gathering the Evidence
Major incident reviews fail when preparation is inadequate. Before scheduling the review meeting, collect comprehensive incident data.
Reconstruct the Complete Timeline
Build a detailed chronology from first detection through final resolution. Include:
Detection and Escalation:
- When did monitoring first detect abnormal behavior?
- How long until the first alert fired?
- Who received initial notifications?
- What was the escalation path?
Response Actions:
- Every intervention attempted, in chronological order
- Why responders chose specific actions
- What information was available at each decision point
- Communication to stakeholders and customers
Resolution and Recovery:
- What ultimately resolved the issue?
- How was resolution verified?
- What recovery steps restored full functionality?
- When was the incident formally closed?
Platforms like Upstat automatically capture activity timelines, participant actions, and threaded discussions during incidents—eliminating the manual reconstruction work that often introduces errors or omissions. The richer your incident data, the more accurate your timeline and the better your analysis.
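If your tooling does not export a structured timeline, it helps to agree on a minimal record format before the review. Here is a sketch of one possible shape, with illustrative field names rather than a prescribed schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class TimelineEntry:
    """One event in the reconstructed incident chronology (illustrative fields)."""
    timestamp: datetime          # when the event occurred, ideally in UTC
    phase: str                   # "detection", "escalation", "response", "resolution"
    actor: str                   # person, team, or system that acted
    action: str                  # what happened
    rationale: str = ""          # why this action was chosen, if known

timeline = [
    TimelineEntry(datetime(2024, 3, 5, 14, 14, tzinfo=timezone.utc),
                  "detection", "monitoring", "Elevated API error rates detected"),
    TimelineEntry(datetime(2024, 3, 5, 14, 18, tzinfo=timezone.utc),
                  "escalation", "on-call engineer", "Alert acknowledged",
                  rationale="Paged via primary escalation policy"),
]

# Sorting by timestamp keeps the walkthrough strictly chronological.
timeline.sort(key=lambda entry: entry.timestamp)
```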
Collect Supporting Evidence
Gather technical artifacts that provide context:
System Data:
- Error logs and stack traces from affected services
- Monitoring graphs showing performance degradation
- Alert history and notification delivery logs
- Database query performance metrics
Change History:
- Recent code deployments or configuration changes
- Infrastructure modifications
- Dependency updates or third-party service changes
- Feature flags or experiment rollouts
Customer Impact:
- Support ticket volume and themes
- Customer communications and status page updates
- SLA breach calculations (see the availability sketch after this list)
- Revenue or conversion impact estimates
Response Coordination:
- Chat transcripts from incident response channels
- Decision rationale documented during response
- Escalation paths followed
- External vendor involvement
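For the SLA breach calculations mentioned above, a simple availability computation is usually enough to establish whether the incident crossed a contractual threshold. A minimal sketch, assuming a monthly 99.9% uptime commitment; adjust the target and window to your actual SLA:

```python
def availability_pct(window_minutes: int, downtime_minutes: float) -> float:
    """Availability over a window, as a percentage."""
    return 100.0 * (window_minutes - downtime_minutes) / window_minutes

# Assumption: a 30-day month and a 99.9% monthly uptime SLA.
WINDOW_MINUTES = 30 * 24 * 60          # 43,200 minutes
SLA_TARGET_PCT = 99.9                  # allows roughly 43.2 minutes of downtime per month

downtime = 3 * 60                      # a 3-hour outage
actual = availability_pct(WINDOW_MINUTES, downtime)

print(f"Availability: {actual:.3f}%")                  # 99.583%
print(f"SLA breached: {actual < SLA_TARGET_PCT}")      # True
```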
Calculate Key Metrics
Quantify the incident’s impact and response effectiveness using metrics that inform analysis:
Duration Metrics:
- Time to Detect (TTD): From issue start to first alert
- Time to Acknowledge (TTA): From alert to responder acknowledgment
- Time to Resolve (TTR): From detection to full resolution
- Total Customer Impact: From first user impact to verified recovery
Upstat provides built-in MTTR analytics and incident duration tracking across severity levels, making it easy to benchmark this incident against historical patterns and identify outliers.
Scope Metrics:
- Users affected (percentage and absolute numbers)
- Geographic regions impacted
- Services or features degraded
- Revenue or transaction volume lost
Response Metrics:
- Responders involved
- Communication touchpoints
- Rollback or mitigation attempts
- Escalations required
These metrics provide objective data points that prevent analysis from devolving into subjective arguments about what felt fast or slow.
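A minimal sketch of deriving the duration metrics from the reconstructed timeline, assuming you have the key timestamps available; the variable names and values are illustrative:

```python
from datetime import datetime, timezone

def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60

# Illustrative timestamps pulled from the reconstructed timeline (UTC).
issue_start   = datetime(2024, 3, 5, 14, 5, tzinfo=timezone.utc)    # first user impact
first_alert   = datetime(2024, 3, 5, 14, 14, tzinfo=timezone.utc)
acknowledged  = datetime(2024, 3, 5, 14, 18, tzinfo=timezone.utc)
resolved      = datetime(2024, 3, 5, 17, 10, tzinfo=timezone.utc)
recovery_done = datetime(2024, 3, 5, 17, 25, tzinfo=timezone.utc)   # verified recovery

ttd = minutes_between(issue_start, first_alert)      # Time to Detect
tta = minutes_between(first_alert, acknowledged)     # Time to Acknowledge
ttr = minutes_between(first_alert, resolved)         # Time to Resolve (from detection)
total_impact = minutes_between(issue_start, recovery_done)

print(f"TTD: {ttd:.0f} min, TTA: {tta:.0f} min, TTR: {ttr:.0f} min, "
      f"customer impact: {total_impact:.0f} min")
```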
Conducting the Review Meeting
Preparation creates the foundation. The review meeting extracts insights and drives decisions.
Set the Meeting Structure
Schedule 90-120 minutes for major incident reviews. Shorter meetings rush through critical analysis. Longer meetings lose focus and exhaust participants.
Recommended Agenda (for a full two-hour session):
- Incident summary and impact (10 minutes)
- Timeline walkthrough (20 minutes)
- What went well (15 minutes)
- What went poorly (20 minutes)
- Root cause analysis (25 minutes)
- Action items and ownership (20 minutes)
- Wrap-up and next steps (10 minutes)
Assign a facilitator who was not directly involved in incident response. They need objectivity to redirect unproductive discussions and maintain blameless culture.
Establish Blameless Culture
Major incidents often involve high-stakes decisions under pressure. Without explicitly blameless framing, reviews become defensive exercises in blame avoidance rather than learning opportunities.
Start by stating explicitly:
“This is a blameless review. We are analyzing systemic failures and organizational gaps, not evaluating individuals. Our goal is understanding what enabled this incident and what changes will prevent recurrence.”
Focus language on systems, not people:
Instead of: “The engineer deployed without testing”
Say: “The deployment process allowed untested code to reach production”
Instead of: “The on-call responder took too long”
Say: “Our escalation policy did not account for after-hours delays”
When participants use blame-oriented language, redirect immediately: “Let’s focus on what systemic changes would have prevented this outcome.”
Walk Through the Timeline
Present the incident timeline chronologically without jumping to conclusions. State facts, then let participants add context.
Facilitator presents: “At 2:14 PM, monitoring detected elevated API error rates. At 2:18 PM, the on-call engineer acknowledged the alert.”
Discussion questions:
- What information was visible in monitoring at 2:14 PM?
- Why did acknowledgment take 4 minutes?
- What actions were possible at that point?
This approach surfaces decision context. Engineers often made reasonable choices given available information and time pressure. Understanding their reasoning reveals the real gaps—missing monitoring, unclear runbooks, ambiguous escalation policies.
Identify What Went Well
This sounds counterintuitive during a postmortem, but recognizing effective responses is critical for two reasons:
First, it reinforces behaviors worth repeating. If communication to customers was clear and timely, document that process so future incidents follow the same pattern.
Second, it balances the conversation. Focusing only on failures creates learned helplessness. Acknowledging what worked provides psychological safety to discuss what did not.
Examples of what might have gone well:
- Monitoring detected the issue before customer reports
- Escalation brought the right expertise quickly
- Rollback procedures worked correctly
- Status page updates kept customers informed
- Cross-team coordination prevented cascading failures
Analyze What Went Poorly
Now examine failures with focus on systems and processes.
Major incidents typically reveal multiple gaps:
Detection Gaps:
- Why did monitoring miss early warning signs?
- Were thresholds too permissive?
- Were critical metrics not monitored at all?
Response Gaps:
- What delayed initial response?
- Were runbooks unavailable or inaccurate?
- Did responders lack necessary access or permissions?
- Were escalation paths unclear?
Communication Gaps:
- When were stakeholders notified?
- Were customer communications timely and accurate?
- Did internal coordination break down?
Technical Gaps:
- What architectural weaknesses enabled the failure?
- Were there missing safeguards or circuit breakers?
- Did dependencies fail in unexpected ways?
Document all contributing factors, even uncomfortable ones. Process gaps, unclear ownership, inadequate testing, technical debt shortcuts—major incidents expose organizational realities that normal operations hide.
Root Cause Analysis Techniques
Major incidents rarely have single root causes. They typically involve multiple contributing factors that combined in unexpected ways.
The 5 Whys Method
Start with the immediate symptom and ask “why” repeatedly until you reach systemic root causes:
Symptom: Database connection pool exhausted, causing API timeouts
- Why? Too many concurrent queries exceeded pool limits
- Why? A batch job started processing all users simultaneously
- Why? The job scheduling system did not enforce concurrency limits
- Why? Concurrency controls were not part of the batch job framework
- Why? No architectural review process exists for new batch processing patterns
Five whys later, you have moved from “database issue” to “missing architectural review process.” That is the systemic gap requiring organizational change, not just a technical fix.
Critical rule: Ask “Why did the system allow this?” not “Why did the engineer do this?” The first reveals process gaps. The second assigns blame and stops learning.
Contributing Factor Analysis
Map all factors that contributed to the incident:
Technical Factors:
- Code bugs or logic errors
- Infrastructure capacity limits
- Configuration mistakes
- Dependency failures
- Data corruption or inconsistency
Process Factors:
- Missing or inadequate runbooks
- Unclear escalation policies
- Insufficient testing procedures
- Deployment process gaps
- Change approval weaknesses
Organizational Factors:
- Communication breakdowns
- Unclear ownership or responsibilities
- Knowledge gaps or training needs
- Competing priorities delaying fixes
- Technical debt accumulation
External Factors:
- Third-party service outages
- Unexpected traffic patterns
- Security attacks
- Infrastructure provider issues
Document all contributing factors. If fixing any single factor would have prevented or significantly reduced impact, include it in root cause analysis.
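One way to apply that "would fixing this alone have prevented or significantly reduced impact" test is to record each factor with an explicit counterfactual judgment, then filter. A small sketch with illustrative categories and field names:

```python
from dataclasses import dataclass

@dataclass
class ContributingFactor:
    category: str        # "technical", "process", "organizational", "external"
    description: str
    fix_alone_prevents_or_reduces: bool   # the counterfactual judgment from the review

# Illustrative factors, loosely based on the batch-job example above.
factors = [
    ContributingFactor("technical", "Batch job issued unbounded concurrent queries", True),
    ContributingFactor("process", "No architectural review for new batch patterns", True),
    ContributingFactor("organizational", "Ownership of the job scheduler was unclear", False),
]

# Factors that pass the counterfactual test belong in the root cause analysis.
root_cause_candidates = [f for f in factors if f.fix_alone_prevents_or_reduces]
for factor in root_cause_candidates:
    print(f"[{factor.category}] {factor.description}")
```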
Systemic Patterns
Major incidents often reveal patterns invisible in smaller failures:
- Is this the third database incident this quarter?
- Have deployment-related outages increased?
- Are similar communication gaps appearing repeatedly?
- Do incidents consistently occur during specific timeframes?
Pattern recognition transforms isolated incidents into organizational learning opportunities. Three database incidents might each have different immediate causes but share a common root cause: inadequate capacity planning processes.
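If your incident history is exportable, even a crude grouping by category and quarter makes these patterns visible. A minimal sketch, assuming each historical incident carries a category tag and a start date; the data and field names are illustrative:

```python
from collections import Counter
from datetime import date

# Illustrative export of recent incidents: (start_date, category).
incidents = [
    (date(2024, 1, 12), "database"),
    (date(2024, 2, 3),  "deployment"),
    (date(2024, 2, 20), "database"),
    (date(2024, 3, 5),  "database"),
]

def quarter(d: date) -> str:
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"

counts = Counter((quarter(d), category) for d, category in incidents)

# Three database incidents in one quarter is a systemic signal, not a coincidence.
for (qtr, category), n in sorted(counts.items()):
    flag = "  <-- recurring theme" if n >= 3 else ""
    print(f"{qtr} {category}: {n}{flag}")
```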
Creating Actionable Outcomes
Analysis without action wastes everyone’s time. Major incident reviews must produce concrete changes.
Define Specific Action Items
Bad action item:
“Improve monitoring coverage”
Good action item:
“Add connection pool utilization monitoring for all database instances with alerts at 75% capacity. Owner: Platform team (Sarah). Deadline: 2 weeks. Success criteria: Alerts tested and verified in staging.”
Every action item requires the following (see the sketch after this list):
- Specific task: What exactly will be done?
- Single owner: Who is responsible? One person, not a team.
- Deadline: When will it be completed?
- Success criteria: How do we verify completion?
- Priority: Must-fix, should-fix, or nice-to-have?
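Whether you track action items in a ticketing system or a spreadsheet, the structure matters more than the tool. Here is the "good action item" above expressed as a structured record; a sketch with illustrative field names, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    task: str              # what exactly will be done
    owner: str             # one person, not a team
    deadline: date
    success_criteria: str  # how completion is verified
    priority: str          # "must-fix", "should-fix", "nice-to-have"
    done: bool = False

pool_monitoring = ActionItem(
    task="Add connection pool utilization monitoring for all database "
         "instances with alerts at 75% capacity",
    owner="Sarah (Platform team)",
    deadline=date(2024, 3, 22),           # illustrative: two weeks out
    success_criteria="Alerts tested and verified in staging",
    priority="must-fix",
)
```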
Prioritize Ruthlessly
Major incidents often generate dozens of potential improvements. Do not try to fix everything.
Must-fix items prevent recurrence of this exact issue:
- Missing monitoring that would have detected the problem earlier
- Runbook gaps that delayed response
- Architecture changes that eliminate the failure mode
Should-fix items reduce likelihood or impact of similar issues:
- Improved escalation policies
- Better testing procedures
- Enhanced communication templates
Nice-to-have items are general improvements only tangentially related to the incident:
- Refactoring old code
- Upgrading dependencies
- Process documentation updates
Focus on must-fix items. Commit to 3-5 critical changes rather than 20 vague improvements. Overcommitment guarantees nothing gets completed.
Assign Organizational-Level Changes
Major incidents often require changes beyond technical fixes:
Process Changes:
- New architectural review requirements
- Updated deployment approval workflows
- Enhanced testing mandates
- Revised escalation policies
Organizational Changes:
- Dedicated team ownership reassignments
- Cross-team coordination improvements
- Knowledge sharing mechanisms
- Training programs or skill development
Cultural Changes:
- Blameless culture reinforcement
- Psychological safety improvements
- Incident response practice exercises
- Executive involvement in reviews
These changes require leadership sponsorship. Major incident reviews should include stakeholders with authority to approve organizational changes, not just technical teams proposing them.
Documentation and Follow-Through
The review meeting ends. Now comes the harder part: implementing changes and documenting learnings.
Write the Post-Incident Report
Document the major incident review within 48 hours while discussion is fresh. External tools like Google Docs, Confluence, or Notion work well for collaborative postmortem documentation, while platforms like Upstat maintain the incident timeline, participant coordination, and MTTR data that informed the analysis.
Essential sections:
Executive Summary:
- What happened (one paragraph)
- Customer impact (metrics)
- Root cause (one sentence)
- Remediation status
Detailed Timeline:
- Chronological sequence of events
- Key decision points and rationale
- Communication touchpoints
- Resolution steps
Root Cause Analysis:
- Contributing technical factors
- Contributing process factors
- Contributing organizational factors
- Systemic patterns identified
What Went Well / Poorly:
- Effective response elements to repeat
- Failure points requiring improvement
- Unexpected positive or negative findings
Action Items:
- Must-fix items with owners and deadlines
- Should-fix items for backlog
- Organizational changes required
Lessons Learned:
- Key takeaways for other teams
- Process improvements identified
- Knowledge gaps discovered
Make this document easily discoverable. Store it in shared documentation systems where engineers naturally look for incident history.
Track Action Item Completion
Action items without accountability become wishful thinking.
Assign a single owner to track overall action item completion—typically the incident lead or a designated program manager for major incidents.
Weekly check-ins:
- Review action item status
- Identify blockers
- Escalate overdue items
- Adjust timelines if needed
Visibility mechanisms:
- Dashboard showing completion progress
- Regular updates in team meetings
- Executive status reports for major incidents
- Automated reminders for approaching deadlines
Completion verification:
- Test new monitoring in staging
- Verify runbook updates during incident simulations
- Confirm process changes through execution
- Validate architectural changes via code review
Do not close the major incident until critical action items are verified complete. Declaring victory while fixes remain unimplemented guarantees recurrence.
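The weekly check-in is easier when overdue and incomplete items surface automatically. A minimal sketch of such a check, run against whatever register you keep (spreadsheet export, ticket API, and so on); the data and names are illustrative:

```python
from datetime import date

# Illustrative action item register: (task, owner, deadline, priority, done).
action_items = [
    ("Add connection pool monitoring", "Sarah", date(2024, 3, 22), "must-fix", True),
    ("Update database runbook",        "Raj",   date(2024, 3, 15), "must-fix", False),
    ("Refactor batch job framework",   "Lee",   date(2024, 4, 30), "should-fix", False),
]

today = date(2024, 3, 20)

overdue = [item for item in action_items
           if not item[4] and item[2] < today]
must_fix_open = [item for item in action_items
                 if item[3] == "must-fix" and not item[4]]

completion_pct = 100 * sum(item[4] for item in action_items) / len(action_items)

print(f"Completion: {completion_pct:.0f}%")
for task, owner, deadline, _, _ in overdue:
    print(f"OVERDUE: {task} (owner: {owner}, due {deadline})")

# The incident stays open until no must-fix items remain and each fix is verified.
print(f"Open must-fix items: {len(must_fix_open)}")
```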
Share Learnings Broadly
Major incident knowledge should not stay siloed within the response team.
Internal sharing:
- Brief all engineering teams on findings
- Update runbooks and documentation
- Incorporate lessons into training programs
- Share relevant findings with leadership
Cross-team sharing:
- If other teams could encounter similar issues, proactively brief them
- Update shared infrastructure documentation
- Propose organization-wide process improvements
External sharing:
- Some companies publish sanitized postmortems publicly
- Public postmortems build customer trust through transparency
- They contribute to industry knowledge
- They demonstrate commitment to improvement
Building Continuous Improvement Culture
Major incident reviews are not isolated events. They are part of continuous organizational learning.
Quarterly pattern analysis:
- Review all major incidents from the quarter
- Identify recurring themes
- Spot emerging risks
- Track improvement metric trends
Incident review retrospectives:
- Periodically review the review process itself
- Are reviews producing meaningful changes?
- Are action items getting completed?
- Is blameless culture holding?
Metrics that measure learning:
- MTTR improvement over time
- Incident recurrence rates
- Action item completion percentages
- Time from incident to review completion
Culture indicators:
- Teams request reviews proactively
- Engineers volunteer honest context
- Action items focus on systems, not people
- Improvement velocity accelerates
The best organizations treat major incidents as tuition paid for expensive lessons. The worst waste those lessons by rushing back to normal operations without implementing meaningful changes.
Common Major Incident Review Mistakes
Mistake 1: Delaying the Review
Waiting weeks to conduct reviews guarantees information loss. Engineers forget context. Logs rotate. Details fade.
Hold major incident reviews within 2-5 business days. Earlier than 48 hours risks exhausted participants. Later than one week risks incomplete analysis.
Mistake 2: Stopping at Technical Root Causes
“The database query was inefficient” is not a complete root cause for a major incident.
Why was the inefficient query deployed? Why did testing not catch it? Why were safeguards insufficient? Why did monitoring not detect performance degradation earlier?
Technical fixes are necessary but not sufficient. Organizational and process gaps enabled the technical failure.
Mistake 3: Too Many Participants
Review meetings with 20 people become unfocused discussions where critical voices go unheard.
Limit core participants to 5-12 people: incident responders, subject matter experts, and key stakeholders. Use asynchronous documentation review for broader audiences.
Mistake 4: No Executive Accountability
Major incidents often reveal organizational gaps requiring executive decisions—budget for infrastructure improvements, headcount for critical teams, process changes affecting multiple departments.
Without executive participation or sponsorship, major incident reviews produce recommendations without authority to implement them.
Mistake 5: Treating Symptoms, Not Systems
Fixing the immediate technical issue without addressing systemic enablers guarantees recurrence with different symptoms.
If every quarterly major incident involves database capacity, the problem is not individual database configurations. The problem is inadequate capacity planning processes or insufficient investment in infrastructure.
Conclusion: Learning as Organizational Capability
Major incidents are expensive. Customer trust erodes. Revenue is lost. Engineering teams face stress and fatigue. The only way to recoup that cost is through learning that prevents future failures.
Effective major incident review processes transform high-impact outages into organizational capabilities. They force honest examination of technical debt, process gaps, and cultural weaknesses. They produce documented knowledge that compounds over time. They build muscle memory for responding to novel failures.
But only if teams commit to structured preparation, blameless analysis, specific action items, and rigorous follow-through.
Platforms like Upstat provide the incident management foundation—capturing timelines, tracking participants, calculating MTTR metrics, and maintaining searchable incident history. This detailed operational data informs thorough major incident reviews without requiring manual reconstruction of events.
The teams that master major incident reviews build competitive advantage. They learn faster than competitors. They prevent catastrophic repeat failures. They develop pattern recognition that makes novel incidents feel familiar.
The teams that skip reviews or execute them poorly repeat expensive mistakes until customers lose faith and engineering talent seeks employers that value learning.
Your systems will experience major incidents. The question is whether you will extract the maximum learning value from them.
Explore In Upstat
Capture complete incident timelines, participant actions, and MTTR metrics automatically—providing the detailed data foundation teams need for thorough major incident reviews.
