What Makes an Incident Major?
Not every outage deserves the same level of analysis. Major incidents are different.
They affect significant user populations. They breach SLA commitments. They cause data loss or security exposure. They reveal fundamental architectural weaknesses or organizational gaps that smaller incidents merely hint at.
Major incidents demand structured review processes because their impact justifies the investment. A 5-minute API blip might warrant a quick retrospective. A 3-hour multi-region outage affecting thousands of customers requires systematic analysis, executive stakeholder involvement, and documented process changes.
The question is not whether to conduct reviews for major incidents—it is how to conduct them effectively enough to prevent recurrence.
Implementation Note: This guide covers industry best practices for conducting major incident reviews. While Upstat provides comprehensive incident data—including timelines, participant tracking, MTTR metrics, and activity logs—teams typically document postmortem findings in external collaboration tools like Google Docs, Confluence, or Notion. Upstat excels at capturing the incident lifecycle data that informs these reviews, allowing teams to use their preferred documentation platforms for postmortem reports while maintaining complete incident history in Upstat.
Severity Criteria That Trigger Major Incident Reviews
Organizations define major incidents differently based on business priorities, but common criteria include:
Customer Impact:
- Over 10% of users unable to access core functionality
- Complete service unavailability for any duration
- Data loss or corruption affecting customers
- Security breach or unauthorized data access
Business Impact:
- SLA breach triggering financial penalties
- Revenue loss exceeding defined thresholds
- Regulatory compliance violations
- Significant brand or reputation damage
Technical Impact:
- Multi-region or multi-service cascading failures
- Data center or availability zone complete failures
- Extended incident duration (over 2 hours for critical systems)
- Novel failure modes revealing architectural gaps
Organizational Impact:
- Incidents requiring executive escalation
- Cross-team coordination failures
- Communication breakdowns affecting customer trust
- Pattern failures revealing systemic process gaps
Not every criterion needs to be met. Any single factor can elevate an incident to major status and trigger formal review processes.
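How teams encode these thresholds varies widely. As a rough illustration only, the sketch below shows one way to express the "any single factor qualifies" rule in code; the field names and thresholds are hypothetical, not a standard.

```python
from dataclasses import dataclass

@dataclass
class IncidentFacts:
    """Hypothetical summary of an incident used to decide review depth."""
    pct_users_affected: float      # 0-100
    full_outage: bool
    data_loss: bool
    security_breach: bool
    sla_breached: bool
    duration_minutes: int
    critical_system: bool
    executive_escalation: bool

def is_major(incident: IncidentFacts) -> bool:
    """Return True if any single criterion elevates the incident to major."""
    criteria = [
        incident.pct_users_affected > 10,
        incident.full_outage,
        incident.data_loss,
        incident.security_breach,
        incident.sla_breached,
        incident.critical_system and incident.duration_minutes > 120,
        incident.executive_escalation,
    ]
    return any(criteria)

# Example: a 3-hour outage of a critical system is major even with low user impact.
print(is_major(IncidentFacts(2.0, False, False, False, False, 180, True, False)))  # True
```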
Preparation: Gathering the Evidence
Major incident reviews fail when preparation is inadequate. Before scheduling the review meeting, collect comprehensive incident data.
Reconstruct the Complete Timeline
Build a detailed chronology from first detection through final resolution. Include:
Detection and Escalation:
- When did monitoring first detect abnormal behavior?
- How long until the first alert fired?
- Who received initial notifications?
- What was the escalation path?
Response Actions:
- Every intervention attempted, in chronological order
- Why responders chose specific actions
- What information was available at each decision point
- Communication to stakeholders and customers
Resolution and Recovery:
- What ultimately resolved the issue?
- How was resolution verified?
- What recovery steps restored full functionality?
- When was the incident formally closed?
Platforms like Upstat automatically capture activity timelines, participant actions, and threaded discussions during incidents—eliminating the manual reconstruction work that often introduces errors or omissions. The richer your incident data, the more accurate your timeline and the better your analysis.
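If your tooling does not export a structured timeline, it helps to agree on a minimal record format before the review. Here is a sketch of one possible shape, with illustrative field names rather than a prescribed schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class TimelineEntry:
    """One event in the reconstructed incident chronology (illustrative fields)."""
    timestamp: datetime          # when the event occurred, ideally in UTC
    phase: str                   # "detection", "escalation", "response", "resolution"
    actor: str                   # person, team, or system that acted
    action: str                  # what happened
    rationale: str = ""          # why this action was chosen, if known

timeline = [
    TimelineEntry(datetime(2024, 3, 5, 14, 14, tzinfo=timezone.utc),
                  "detection", "monitoring", "Elevated API error rates detected"),
    TimelineEntry(datetime(2024, 3, 5, 14, 18, tzinfo=timezone.utc),
                  "escalation", "on-call engineer", "Alert acknowledged",
                  rationale="Paged via primary escalation policy"),
]

# Sorting by timestamp keeps the walkthrough strictly chronological.
timeline.sort(key=lambda entry: entry.timestamp)
```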
Collect Supporting Evidence
Gather technical artifacts that provide context:
System Data:
- Error logs and stack traces from affected services
- Monitoring graphs showing performance degradation
- Alert history and notification delivery logs
- Database query performance metrics
Change History:
- Recent code deployments or configuration changes
- Infrastructure modifications
- Dependency updates or third-party service changes
- Feature flags or experiment rollouts
Customer Impact:
- Support ticket volume and themes
- Customer communications and status page updates
- SLA breach calculations (see the availability sketch after this list)
- Revenue or conversion impact estimates
Response Coordination:
- Chat transcripts from incident response channels
- Decision rationale documented during response
- Escalation paths followed
- External vendor involvement
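For the SLA breach calculations mentioned above, a simple availability computation is usually enough to establish whether the incident crossed a contractual threshold. A minimal sketch, assuming a monthly 99.9% uptime commitment; adjust the target and window to your actual SLA:

```python
def availability_pct(window_minutes: int, downtime_minutes: float) -> float:
    """Availability over a window, as a percentage."""
    return 100.0 * (window_minutes - downtime_minutes) / window_minutes

# Assumption: a 30-day month and a 99.9% monthly uptime SLA.
WINDOW_MINUTES = 30 * 24 * 60          # 43,200 minutes
SLA_TARGET_PCT = 99.9                  # allows roughly 43.2 minutes of downtime per month

downtime = 3 * 60                      # a 3-hour outage
actual = availability_pct(WINDOW_MINUTES, downtime)

print(f"Availability: {actual:.3f}%")                  # 99.583%
print(f"SLA breached: {actual < SLA_TARGET_PCT}")      # True
```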
Calculate Key Metrics
Quantify the incident’s impact and response effectiveness using metrics that inform analysis:
Duration Metrics:
- Time to Detect (TTD): From issue start to first alert
- Time to Acknowledge (TTA): From alert to responder acknowledgment
- Time to Resolve (TTR): From detection to full resolution
- Total Customer Impact: From first user impact to verified recovery
Upstat provides built-in MTTR analytics and incident duration tracking across severity levels, making it easy to benchmark this incident against historical patterns and identify outliers.
Scope Metrics:
- Users affected (percentage and absolute numbers)
- Geographic regions impacted
- Services or features degraded
- Revenue or transaction volume lost
Response Metrics:
- Responders involved
- Communication touchpoints
- Rollback or mitigation attempts
- Escalations required
These metrics provide objective data points that prevent analysis from devolving into subjective arguments about what felt fast or slow.
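A minimal sketch of deriving the duration metrics from the reconstructed timeline, assuming you have the key timestamps available; the variable names and values are illustrative:

```python
from datetime import datetime, timezone

def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60

# Illustrative timestamps pulled from the reconstructed timeline (UTC).
issue_start   = datetime(2024, 3, 5, 14, 5, tzinfo=timezone.utc)    # first user impact
first_alert   = datetime(2024, 3, 5, 14, 14, tzinfo=timezone.utc)
acknowledged  = datetime(2024, 3, 5, 14, 18, tzinfo=timezone.utc)
resolved      = datetime(2024, 3, 5, 17, 10, tzinfo=timezone.utc)
recovery_done = datetime(2024, 3, 5, 17, 25, tzinfo=timezone.utc)   # verified recovery

ttd = minutes_between(issue_start, first_alert)      # Time to Detect
tta = minutes_between(first_alert, acknowledged)     # Time to Acknowledge
ttr = minutes_between(first_alert, resolved)         # Time to Resolve (from detection)
total_impact = minutes_between(issue_start, recovery_done)

print(f"TTD: {ttd:.0f} min, TTA: {tta:.0f} min, TTR: {ttr:.0f} min, "
      f"customer impact: {total_impact:.0f} min")
```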
Conducting the Review Meeting
Preparation creates the foundation. The review meeting extracts insights and drives decisions.
Set the Meeting Structure
Schedule 90-120 minutes for major incident reviews. Shorter meetings rush through critical analysis. Longer meetings lose focus and exhaust participants.
Recommended Agenda (for a full two-hour session):
- Incident summary and impact (10 minutes)
- Timeline walkthrough (20 minutes)
- What went well (15 minutes)
- What went poorly (20 minutes)
- Root cause analysis (25 minutes)
- Action items and ownership (20 minutes)
- Wrap-up and next steps (10 minutes)
Assign a facilitator who was not directly involved in incident response. They need objectivity to redirect unproductive discussions and maintain blameless culture.
Establish Blameless Culture
Major incidents often involve high-stakes decisions under pressure. Without explicitly blameless framing, reviews become defensive exercises in blame avoidance rather than learning opportunities.
Start by stating explicitly:
“This is a blameless review. We are analyzing systemic failures and organizational gaps, not evaluating individuals. Our goal is understanding what enabled this incident and what changes will prevent recurrence.”
Focus language on systems, not people:
Instead of: “The engineer deployed without testing”
Say: “The deployment process allowed untested code to reach production”
Instead of: “The on-call responder took too long”
Say: “Our escalation policy did not account for after-hours delays”
When participants use blame-oriented language, redirect immediately: “Let’s focus on what systemic changes would have prevented this outcome.”
Walk Through the Timeline
Present the incident timeline chronologically without jumping to conclusions. State facts, then let participants add context.
Facilitator presents: “At 2:14 PM, monitoring detected elevated API error rates. At 2:18 PM, the on-call engineer acknowledged the alert.”
Discussion questions:
- What information was visible in monitoring at 2:14 PM?
- Why did acknowledgment take 4 minutes?
- What actions were possible at that point?
This approach surfaces decision context. Engineers often made reasonable choices given available information and time pressure. Understanding their reasoning reveals the real gaps—missing monitoring, unclear runbooks, ambiguous escalation policies.
Identify What Went Well
This sounds counterintuitive during a postmortem, but recognizing effective responses is critical for two reasons:
First, it reinforces behaviors worth repeating. If communication to customers was clear and timely, document that process so future incidents follow the same pattern.
Second, it balances the conversation. Focusing only on failures creates learned helplessness. Acknowledging what worked provides psychological safety to discuss what did not.
Examples of what might have gone well:
- Monitoring detected the issue before customer reports
- Escalation brought the right expertise quickly
- Rollback procedures worked correctly
- Status page updates kept customers informed
- Cross-team coordination prevented cascading failures
Analyze What Went Poorly
Now examine failures with focus on systems and processes.
Major incidents typically reveal multiple gaps:
Detection Gaps:
- Why did monitoring miss early warning signs?
- Were thresholds too permissive?
- Were critical metrics not monitored at all?
Response Gaps:
- What delayed initial response?
- Were runbooks unavailable or inaccurate?
- Did responders lack necessary access or permissions?
- Were escalation paths unclear?
Communication Gaps:
- When were stakeholders notified?
- Were customer communications timely and accurate?
- Did internal coordination break down?
Technical Gaps:
- What architectural weaknesses enabled the failure?
- Were there missing safeguards or circuit breakers?
- Did dependencies fail in unexpected ways?
Document all contributing factors, even uncomfortable ones. Process gaps, unclear ownership, inadequate testing, technical debt shortcuts—major incidents expose organizational realities that normal operations hide.
Root Cause Analysis Techniques
Major incidents rarely have single root causes. They typically involve multiple contributing factors that combined in unexpected ways.
The 5 Whys Method
Start with the immediate symptom and ask “why” repeatedly until you reach systemic root causes:
Symptom: Database connection pool exhausted, causing API timeouts
- Why? Too many concurrent queries exceeded pool limits
- Why? A batch job started processing all users simultaneously
- Why? The job scheduling system did not enforce concurrency limits
- Why? Concurrency controls were not part of the batch job framework
- Why? No architectural review process exists for new batch processing patterns
Five whys later, you have moved from “database issue” to “missing architectural review process.” That is the systemic gap requiring organizational change, not just a technical fix.
Critical rule: Ask “Why did the system allow this?” not “Why did the engineer do this?” The first reveals process gaps. The second assigns blame and stops learning.
Contributing Factor Analysis
Map all factors that contributed to the incident:
Technical Factors:
- Code bugs or logic errors
- Infrastructure capacity limits
- Configuration mistakes
- Dependency failures
- Data corruption or inconsistency
Process Factors:
- Missing or inadequate runbooks
- Unclear escalation policies
- Insufficient testing procedures
- Deployment process gaps
- Change approval weaknesses
Organizational Factors:
- Communication breakdowns
- Unclear ownership or responsibilities
- Knowledge gaps or training needs
- Competing priorities delaying fixes
- Technical debt accumulation
External Factors:
- Third-party service outages
- Unexpected traffic patterns
- Security attacks
- Infrastructure provider issues
Document all contributing factors. If fixing any single factor would have prevented or significantly reduced impact, include it in root cause analysis.
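One way to apply that "would fixing this alone have prevented or significantly reduced impact" test is to record each factor with an explicit counterfactual judgment, then filter. A small sketch with illustrative categories and field names:

```python
from dataclasses import dataclass

@dataclass
class ContributingFactor:
    category: str        # "technical", "process", "organizational", "external"
    description: str
    fix_alone_prevents_or_reduces: bool   # the counterfactual judgment from the review

# Illustrative factors, loosely based on the batch-job example above.
factors = [
    ContributingFactor("technical", "Batch job issued unbounded concurrent queries", True),
    ContributingFactor("process", "No architectural review for new batch patterns", True),
    ContributingFactor("organizational", "Ownership of the job scheduler was unclear", False),
]

# Factors that pass the counterfactual test belong in the root cause analysis.
root_cause_candidates = [f for f in factors if f.fix_alone_prevents_or_reduces]
for factor in root_cause_candidates:
    print(f"[{factor.category}] {factor.description}")
```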
Systemic Patterns
Major incidents often reveal patterns invisible in smaller failures:
- Is this the third database incident this quarter?
- Have deployment-related outages increased?
- Are similar communication gaps appearing repeatedly?
- Do incidents consistently occur during specific timeframes?
Pattern recognition transforms isolated incidents into organizational learning opportunities. Three database incidents might each have different immediate causes but share a common root cause: inadequate capacity planning processes.
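If your incident history is exportable, even a crude grouping by category and quarter makes these patterns visible. A minimal sketch, assuming each historical incident carries a category tag and a start date; the data and field names are illustrative:

```python
from collections import Counter
from datetime import date

# Illustrative export of recent incidents: (start_date, category).
incidents = [
    (date(2024, 1, 12), "database"),
    (date(2024, 2, 3),  "deployment"),
    (date(2024, 2, 20), "database"),
    (date(2024, 3, 5),  "database"),
]

def quarter(d: date) -> str:
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"

counts = Counter((quarter(d), category) for d, category in incidents)

# Three database incidents in one quarter is a systemic signal, not a coincidence.
for (qtr, category), n in sorted(counts.items()):
    flag = "  <-- recurring theme" if n >= 3 else ""
    print(f"{qtr} {category}: {n}{flag}")
```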
Creating Actionable Outcomes
Analysis without action wastes everyone’s time. Major incident reviews must produce concrete changes.
Define Specific Action Items
Bad action item:
“Improve monitoring coverage”
Good action item:
“Add connection pool utilization monitoring for all database instances with alerts at 75% capacity. Owner: Platform team (Sarah). Deadline: 2 weeks. Success criteria: Alerts tested and verified in staging.”
Every action item requires the following (see the sketch after this list):
- Specific task: What exactly will be done?
- Single owner: Who is responsible? One person, not a team.
- Deadline: When will it be completed?
- Success criteria: How do we verify completion?
- Priority: Must-fix, should-fix, or nice-to-have?
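Whether you track action items in a ticketing system or a spreadsheet, the structure matters more than the tool. Here is the "good action item" above expressed as a structured record; a sketch with illustrative field names, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    task: str              # what exactly will be done
    owner: str             # one person, not a team
    deadline: date
    success_criteria: str  # how completion is verified
    priority: str          # "must-fix", "should-fix", "nice-to-have"
    done: bool = False

pool_monitoring = ActionItem(
    task="Add connection pool utilization monitoring for all database "
         "instances with alerts at 75% capacity",
    owner="Sarah (Platform team)",
    deadline=date(2024, 3, 22),           # illustrative: two weeks out
    success_criteria="Alerts tested and verified in staging",
    priority="must-fix",
)
```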
Prioritize Ruthlessly
Major incidents often generate dozens of potential improvements. Do not try to fix everything.
Must-fix items prevent recurrence of this exact issue:
- Missing monitoring that would have detected the problem earlier
- Runbook gaps that delayed response
- Architecture changes that eliminate the failure mode
Should-fix items reduce likelihood or impact of similar issues:
- Improved escalation policies
- Better testing procedures
- Enhanced communication templates
Nice-to-have items are general improvements only tangentially related to the incident:
- Refactoring old code
- Upgrading dependencies
- Process documentation updates
Focus on must-fix items. Commit to 3-5 critical changes rather than 20 vague improvements. Overcommitment guarantees nothing gets completed.
Assign Organizational-Level Changes
Major incidents often require changes beyond technical fixes:
Process Changes:
- New architectural review requirements
- Updated deployment approval workflows
- Enhanced testing mandates
- Revised escalation policies
Organizational Changes:
- Dedicated team ownership reassignments
- Cross-team coordination improvements
- Knowledge sharing mechanisms
- Training programs or skill development
Cultural Changes:
- Blameless culture reinforcement
- Psychological safety improvements
- Incident response practice exercises
- Executive involvement in reviews
These changes require leadership sponsorship. Major incident reviews should include stakeholders with authority to approve organizational changes, not just technical teams proposing them.
Documentation and Follow-Through
The review meeting ends. Now comes the harder part: implementing changes and documenting learnings.
Write the Post-Incident Report
Document the major incident review within 48 hours while discussion is fresh. External tools like Google Docs, Confluence, or Notion work well for collaborative postmortem documentation, while platforms like Upstat maintain the incident timeline, participant coordination, and MTTR data that informed the analysis.
Essential sections:
Executive Summary:
- What happened (one paragraph)
- Customer impact (metrics)
- Root cause (one sentence)
- Remediation status
Detailed Timeline:
- Chronological sequence of events
- Key decision points and rationale
- Communication touchpoints
- Resolution steps
Root Cause Analysis:
- Contributing technical factors
- Contributing process factors
- Contributing organizational factors
- Systemic patterns identified
What Went Well / Poorly:
- Effective response elements to repeat
- Failure points requiring improvement
- Unexpected positive or negative findings
Action Items:
- Must-fix items with owners and deadlines
- Should-fix items for backlog
- Organizational changes required
Lessons Learned:
- Key takeaways for other teams
- Process improvements identified
- Knowledge gaps discovered
Make this document easily discoverable. Store it in shared documentation systems where engineers naturally look for incident history.
Track Action Item Completion
Action items without accountability become wishful thinking.
Assign a single owner to track overall action item completion—typically the incident lead or a designated program manager for major incidents.
Weekly check-ins:
- Review action item status
- Identify blockers
- Escalate overdue items
- Adjust timelines if needed
Visibility mechanisms:
- Dashboard showing completion progress
- Regular updates in team meetings
- Executive status reports for major incidents
- Automated reminders for approaching deadlines
Completion verification:
- Test new monitoring in staging
- Verify runbook updates during incident simulations
- Confirm process changes through execution
- Validate architectural changes via code review
Do not close the major incident until critical action items are verified complete. Declaring victory while fixes remain unimplemented guarantees recurrence.
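The weekly check-in is easier when overdue and incomplete items surface automatically. A minimal sketch of such a check, run against whatever register you keep (spreadsheet export, ticket API, and so on); the data and names are illustrative:

```python
from datetime import date

# Illustrative action item register: (task, owner, deadline, priority, done).
action_items = [
    ("Add connection pool monitoring", "Sarah", date(2024, 3, 22), "must-fix", True),
    ("Update database runbook",        "Raj",   date(2024, 3, 15), "must-fix", False),
    ("Refactor batch job framework",   "Lee",   date(2024, 4, 30), "should-fix", False),
]

today = date(2024, 3, 20)

overdue = [item for item in action_items
           if not item[4] and item[2] < today]
must_fix_open = [item for item in action_items
                 if item[3] == "must-fix" and not item[4]]

completion_pct = 100 * sum(item[4] for item in action_items) / len(action_items)

print(f"Completion: {completion_pct:.0f}%")
for task, owner, deadline, _, _ in overdue:
    print(f"OVERDUE: {task} (owner: {owner}, due {deadline})")

# The incident stays open until no must-fix items remain and each fix is verified.
print(f"Open must-fix items: {len(must_fix_open)}")
```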
Share Learnings Broadly
Major incident knowledge should not stay siloed within the response team.
Internal sharing:
- Brief all engineering teams on findings
- Update runbooks and documentation
- Incorporate lessons into training programs
- Share relevant findings with leadership
Cross-team sharing:
- If other teams could encounter similar issues, proactively brief them
- Update shared infrastructure documentation
- Propose organization-wide process improvements
External sharing:
- Some companies publish sanitized postmortems publicly
- Public postmortems build customer trust through transparency
- They contribute to industry knowledge
- They demonstrate commitment to improvement
Building Continuous Improvement Culture
Major incident reviews are not isolated events. They are part of continuous organizational learning.
Quarterly pattern analysis:
- Review all major incidents from the quarter
- Identify recurring themes
- Spot emerging risks
- Track improvement metric trends
Incident review retrospectives:
- Periodically review the review process itself
- Are reviews producing meaningful changes?
- Are action items getting completed?
- Is blameless culture holding?
Metrics that measure learning:
- MTTR improvement over time
- Incident recurrence rates
- Action item completion percentages
- Time from incident to review completion
Culture indicators:
- Teams request reviews proactively
- Engineers volunteer honest context
- Action items focus on systems, not people
- Improvement velocity accelerates
The best organizations treat major incidents as tuition paid for expensive lessons. The worst waste those lessons by rushing back to normal operations without implementing meaningful changes.
Common Major Incident Review Mistakes
Mistake 1: Delaying the Review
Waiting weeks to conduct reviews guarantees information loss. Engineers forget context. Logs rotate. Details fade.
Hold major incident reviews within 2-5 business days. Earlier than 48 hours risks exhausted participants. Later than one week risks incomplete analysis.
Mistake 2: Stopping at Technical Root Causes
“The database query was inefficient” is not a complete root cause for a major incident.
Why was the inefficient query deployed? Why did testing not catch it? Why were safeguards insufficient? Why did monitoring not detect performance degradation earlier?
Technical fixes are necessary but not sufficient. Organizational and process gaps enabled the technical failure.
Mistake 3: Too Many Participants
Review meetings with 20 people become unfocused discussions where critical voices go unheard.
Limit core participants to 5-12 people: incident responders, subject matter experts, and key stakeholders. Use asynchronous documentation review for broader audiences.
Mistake 4: No Executive Accountability
Major incidents often reveal organizational gaps requiring executive decisions—budget for infrastructure improvements, headcount for critical teams, process changes affecting multiple departments.
Without executive participation or sponsorship, major incident reviews produce recommendations without authority to implement them.
Mistake 5: Treating Symptoms, Not Systems
Fixing the immediate technical issue without addressing systemic enablers guarantees recurrence with different symptoms.
If every quarterly major incident involves database capacity, the problem is not individual database configurations. The problem is inadequate capacity planning processes or insufficient investment in infrastructure.
Conclusion: Learning as Organizational Capability
Major incidents are expensive. Customer trust erodes. Revenue is lost. Engineering teams face stress and fatigue. The only way to recoup that cost is through learning that prevents future failures.
Effective major incident review processes transform high-impact outages into organizational capabilities. They force honest examination of technical debt, process gaps, and cultural weaknesses. They produce documented knowledge that compounds over time. They build muscle memory for responding to novel failures.
But only if teams commit to structured preparation, blameless analysis, specific action items, and rigorous follow-through.
Platforms like Upstat provide the incident management foundation—capturing timelines, tracking participants, calculating MTTR metrics, and maintaining searchable incident history. This detailed operational data informs thorough major incident reviews without requiring manual reconstruction of events.
The teams that master major incident reviews build competitive advantage. They learn faster than competitors. They prevent catastrophic repeat failures. They develop pattern recognition that makes novel incidents feel familiar.
The teams that skip reviews or execute them poorly repeat expensive mistakes until customers lose faith and engineering talent seeks employers that value learning.
Your systems will experience major incidents. The question is whether you will extract the maximum learning value from them.
Explore In Upstat
Capture complete incident timelines, participant actions, and MTTR metrics automatically—providing the detailed data foundation teams need for thorough major incident reviews.
