When an alert fires at 3 AM, your on-call engineer needs to make a critical decision: Is this a major incident requiring immediate escalation, or a minor issue that can wait until morning? Getting this classification wrong can mean wasted resources on false alarms or delayed response to critical problems.
Incident classification isn’t just about labeling problems. It’s about creating a shared language that helps your team prioritize effectively, allocate resources appropriately, and respond with the right level of urgency.
Why Incident Classification Matters
Without clear classification criteria, every team member applies their own judgment. What one engineer considers critical, another might see as routine. This inconsistency leads to alert fatigue, miscommunication, and wasted effort.
A well-defined classification system provides several benefits:
Faster decision-making: Engineers can quickly assess severity without consulting multiple people or second-guessing their judgment.
Appropriate resource allocation: Major incidents get the full team’s attention, while minor issues are handled efficiently without over-escalation.
Clear communication: When you declare a “SEV-1 incident,” everyone understands the implications and their expected role.
Better metrics and learning: Consistent classification enables meaningful analysis of incident patterns and improvement opportunities.
Defining Major Incidents
Major incidents share common characteristics that distinguish them from routine issues:
Business impact: Services that directly affect customers or revenue are degraded or unavailable. This includes complete outages, significant performance degradation, or data integrity issues affecting multiple users.
Scope and scale: The problem affects a substantial portion of your user base or critical business operations. A bug affecting one user is typically minor; the same bug affecting thousands requires major incident response.
Urgency: The issue requires immediate attention and cannot wait for normal business hours. Major incidents often trigger paging and a response on nights and weekends.
Resolution complexity: Major incidents typically require coordination across multiple teams, external communication, and executive visibility.
Common examples include complete application outages, payment processing failures, data loss events, security breaches, and critical third-party service failures affecting your core functionality.
Defining Minor Incidents
Minor incidents are real problems that need fixing, but they don’t meet the threshold for major incident response:
Limited impact: Issues affect a small subset of users, non-critical features, or internal tools. The business continues operating normally.
Degraded but functional: The service works but with reduced performance or missing non-essential features. Users experience inconvenience rather than complete inability to work.
Deferrable: The problem can be addressed during business hours without significant business risk.
Single team resolution: One team can handle the investigation and fix without extensive coordination.
Examples include minor UI bugs, performance issues affecting a small user segment, non-critical integration failures, and isolated edge case problems.
Building Your Severity Level Framework
Most organizations use a three- to five-level severity system. Here’s a practical five-level framework:
Severity 1 (Critical): Complete outage or critical functionality unavailable. Major customer impact. Revenue at risk. Immediate response required regardless of time of day.
Severity 2 (High): Significant degradation affecting many users. Core features impacted but workarounds exist. Response required within hours.
Severity 3 (Medium): Noticeable issues affecting some users or non-critical features. Response within one business day.
Severity 4 (Low): Minor issues with minimal user impact. Resolution within a week.
Severity 5 (Trivial): Cosmetic issues, minor bugs, or enhancement requests. Addressed as resources allow.
The key is defining clear, measurable criteria for each level. Avoid subjective terms like “important” or “serious” without concrete definitions.
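One way to keep the criteria explicit is to store them as data rather than tribal knowledge. The sketch below encodes the five levels above in Python. The names `Severity`, `SeverityPolicy`, and `POLICIES` are hypothetical, and every duration and paging flag is a placeholder assumption: the framework only says "immediate," "within hours," "one business day," and "within a week," so substitute the numbers your team actually agrees on.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import IntEnum
from typing import Dict, Optional


class Severity(IntEnum):
    """The five levels from the framework; a lower number is more severe."""
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4
    TRIVIAL = 5


@dataclass(frozen=True)
class SeverityPolicy:
    criteria: str                 # short, concrete description of the level
    target: Optional[timedelta]   # None means "addressed as resources allow"
    pages_on_call: bool           # whether this level pages someone


# Durations and paging flags are illustrative assumptions, not recommendations.
POLICIES: Dict[Severity, SeverityPolicy] = {
    Severity.CRITICAL: SeverityPolicy(
        "Complete outage or critical functionality unavailable",
        timedelta(minutes=15), True),
    Severity.HIGH: SeverityPolicy(
        "Significant degradation affecting many users; workarounds exist",
        timedelta(hours=4), True),
    Severity.MEDIUM: SeverityPolicy(
        "Noticeable issues affecting some users or non-critical features",
        timedelta(days=1), False),
    Severity.LOW: SeverityPolicy(
        "Minor issues with minimal user impact",
        timedelta(weeks=1), False),
    Severity.TRIVIAL: SeverityPolicy(
        "Cosmetic issues, minor bugs, or enhancement requests",
        None, False),
}
```

Keeping the table in one place makes quarterly reviews simpler: changing a threshold is a one-line edit that everyone can see.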
Key Classification Factors
When assessing an incident, consider these dimensions:
Impact scope: How many users or systems are affected? Is it organization-wide, team-specific, or isolated to individuals?
Business criticality: Does this affect revenue, customer trust, regulatory compliance, or core business operations?
Urgency: How quickly must this be resolved? What happens if we wait until tomorrow?
Workarounds: Can users accomplish their goals through alternative methods, or are they completely blocked?
Trend and trajectory: Is the problem stable, improving, or getting worse? A small issue that’s spreading rapidly may warrant higher classification.
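These dimensions can be captured in a small record so that every responder answers the same questions. The following is a minimal sketch, not a prescribed model: `Assessment`, `Trend`, and `adjust_for_trend` are hypothetical names, and the rule of raising severity by exactly one level when an incident is worsening is an assumption used to illustrate the trajectory point.

```python
from dataclasses import dataclass
from enum import Enum


class Trend(Enum):
    IMPROVING = "improving"
    STABLE = "stable"
    WORSENING = "worsening"


@dataclass(frozen=True)
class Assessment:
    affected_user_fraction: float  # impact scope, from 0.0 to 1.0
    business_critical: bool        # revenue, trust, compliance, core operations
    can_wait_until_tomorrow: bool  # urgency
    workaround_available: bool     # workarounds
    trend: Trend                   # trend and trajectory


def adjust_for_trend(base_severity: int, assessment: Assessment) -> int:
    """Raise severity one level (a lower number) if the incident is worsening.

    The one-level bump is an illustrative assumption; severity never goes
    below 1, the most severe level.
    """
    if assessment.trend is Trend.WORSENING:
        return max(1, base_severity - 1)
    return base_severity
```

For example, an issue initially judged Severity 3 that is spreading would be treated as Severity 2 under this rule, while a stable one stays at 3.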
Making Classification Decisions
Start with a default classification based on initial information, but be ready to adjust. Some incidents begin as minor issues and escalate; others look major at first and turn out to be less severe than initially thought.
Create decision trees or flowcharts that help on-call engineers quickly assess severity. For example:
“Is the primary application unavailable?” → Yes → SEV-1
“Can users complete core workflows?” → No → SEV-1 or SEV-2
“Are workarounds available?” → Yes → Consider SEV-3
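The three questions above can be written as a short function so the logic is testable and unambiguous. This is a sketch under stated assumptions: the function name `candidate_severities` and its parameters are hypothetical, the questions are evaluated in the order listed, and the function returns candidate levels rather than a single answer because the tree itself leaves "SEV-1 or SEV-2" to the responder. An empty result means the tree does not decide and human judgment is needed.

```python
from typing import Tuple


def candidate_severities(
    primary_app_unavailable: bool,
    core_workflows_possible: bool,
    workaround_available: bool,
) -> Tuple[int, ...]:
    """Walk the example decision tree and return the candidate severity levels."""
    # "Is the primary application unavailable?" -> Yes -> SEV-1
    if primary_app_unavailable:
        return (1,)
    # "Can users complete core workflows?" -> No -> SEV-1 or SEV-2
    if not core_workflows_possible:
        return (1, 2)
    # "Are workarounds available?" -> Yes -> consider SEV-3
    if workaround_available:
        return (3,)
    # The example tree does not cover this case; leave it to the responder.
    return ()
```

When two candidates come back, picking the more severe one is consistent with the advice to escalate quickly and de-escalate later.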
Empower engineers to make classification decisions and adjust as new information emerges. It is better to escalate quickly and de-escalate later than to underestimate severity.
Common Classification Mistakes
Over-classification: Treating every issue as critical leads to alert fatigue and burnout. Save major incident response for truly major problems.
Under-classification: Minimizing genuine problems because they’re inconvenient or “not our fault” delays appropriate response.
Static classification: Failing to adjust severity as circumstances change. An incident that starts minor can escalate; major incidents can be downgraded once contained.
Ignoring business context: A technical issue that seems minor might have major business implications during a critical event or for a key customer.
Using UpStat for Incident Classification
UpStat helps teams implement consistent incident classification through a five-level severity system. When creating incidents, you assign severity levels from 1 (highest priority) through 5 (lowest priority) that drive notification routing and escalation policies.
Each incident’s severity automatically determines alert routing based on your configured escalation rules. Level 1 incidents can trigger immediate paging and escalation chains, while level 5 incidents route through standard channels. This structured approach ensures consistent response patterns across your team.
UpStat also supports custom status workflows and labels, allowing you to capture additional context beyond basic severity classification. This flexibility lets you adapt the system to your team’s specific needs while maintaining consistency across your incident response process.
Building a Classification Culture
Technology alone doesn’t solve classification challenges. Build a culture where:
Classification is revisited: Encourage teams to adjust severity as situations evolve. Make it easy to escalate or de-escalate.
Learning is valued: Review classification decisions during post-incident reviews. Did we classify correctly? What would we change?
Criteria are updated: As your business evolves, your classification criteria should too. Quarterly reviews help keep your framework relevant.
Context is shared: Document why specific incidents received their classification. This helps future responders make better decisions.
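Two of these habits, revisiting classification and documenting why, are easier to sustain when the incident record requires a reason for every severity change. The sketch below is one hypothetical way to model that; `Incident`, `SeverityChange`, and `set_severity` are illustrative names and do not describe any particular tool's data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class SeverityChange:
    severity: int
    reason: str
    changed_at: datetime


@dataclass
class Incident:
    title: str
    history: List[SeverityChange] = field(default_factory=list)

    def set_severity(self, severity: int, reason: str) -> None:
        """Record a classification or reclassification, with its rationale."""
        if not 1 <= severity <= 5:
            raise ValueError("severity must be between 1 and 5")
        if not reason.strip():
            raise ValueError("every classification needs a documented reason")
        self.history.append(
            SeverityChange(severity, reason, datetime.now(timezone.utc))
        )

    @property
    def severity(self) -> int:
        """Current severity is the most recent entry in the history."""
        if not self.history:
            raise ValueError("incident has not been classified yet")
        return self.history[-1].severity
```

The full history then doubles as input for post-incident reviews: it shows when the severity changed and what the responder knew at the time.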
Conclusion
Effective incident classification isn’t about perfection. It’s about creating a consistent, practical framework that helps your team make fast, appropriate decisions under pressure.
Start with clear definitions for major and minor incidents. Build a severity level system with objective criteria. Empower engineers to make classification decisions and adjust as needed. Review and refine your approach based on actual incident data.
The goal is simple: when something goes wrong, your team should spend their time fixing the problem, not debating how serious it is.