A critical alert fires at 2 AM. The on-call engineer receives a page. They acknowledge, open the monitoring dashboard, check which services are affected, determine the severity, notify stakeholders, update the status page, start the incident channel, page additional responders, and begin diagnosis.
That sequence can take 15-20 minutes before any actual troubleshooting begins. Most of those tasks follow predictable patterns, so they could be automated.
The question is not whether to automate incident management, but what to automate and what to leave for human judgment.
The Automation Principle
Automate tasks that are repetitive, time-sensitive, and rule-based. Keep humans focused on tasks requiring judgment, context, or creative problem-solving.
This principle separates incident management into two categories:
Good candidates for automation:
- Alert routing based on service ownership
- Initial responder notifications
- Status page state changes
- Escalation triggers after timeout periods
- Data gathering for responder context
- Communication channel creation
Tasks requiring human judgment:
- Root cause analysis
- Remediation decisions
- Complex triage choices
- Customer communication content
- Post-incident analysis
- Priority conflicts between incidents
The boundary is not fixed. As patterns emerge and rules become clearer, tasks shift from human-required to automatable.
Alert Routing Automation
The problem: Engineers receive alerts for services they do not own. Critical alerts get lost in noise. On-call rotations are not reflected in alert destinations.
What to automate:
- Route alerts based on service ownership in your service catalog
- Match alert severity to notification urgency
- Respect on-call schedules automatically
- Suppress alerts during maintenance windows
- Deduplicate related alerts into single notifications
Alert routing automation ensures the right person receives alerts without manual intervention. When a database monitor fails, automation checks which team owns that database, identifies the current on-call engineer, and routes the notification through the appropriate channel.
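The routing flow described above can be sketched in a few lines. This is a minimal illustration, not a specific product's API: the catalog, on-call schedule, maintenance windows, service names, and the severity cutoff for paging are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Alert:
    service: str
    severity: int  # 1 = most severe (assumed convention)
    fired_at: datetime

# Hypothetical service catalog: service name -> owning team.
SERVICE_OWNERS = {"orders-db": "data-platform", "checkout-api": "payments"}

# Hypothetical on-call schedule: team -> engineer currently on call.
ON_CALL = {"data-platform": "alice", "payments": "bob"}

# Hypothetical maintenance windows: service -> (start, end).
MAINTENANCE = {
    "checkout-api": (datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 4, 0)),
}

def channel_for(severity: int) -> str:
    """Match alert severity to notification urgency (cutoff is an assumption)."""
    return "page" if severity <= 2 else "chat"

def route(alert: Alert) -> Optional[dict]:
    """Return a routing decision, or None when the alert is suppressed."""
    window = MAINTENANCE.get(alert.service)
    if window and window[0] <= alert.fired_at < window[1]:
        return None  # suppressed during a maintenance window
    team = SERVICE_OWNERS.get(alert.service)
    if team is None or team not in ON_CALL:
        # Automatic routing failed: hand the alert to a human-monitored queue.
        return {"team": "unowned", "engineer": None, "channel": "triage-queue"}
    return {
        "team": team,
        "engineer": ON_CALL[team],
        "channel": channel_for(alert.severity),
    }
```

The fallback branch is deliberate: alerts with no recorded owner go to a queue that a person watches rather than being dropped.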
What stays manual: Defining routing rules, setting ownership, and handling edge cases where automatic routing fails.
Notification Automation
The problem: Responders waste time sending the same notifications repeatedly. Stakeholders complain about inconsistent updates. Communication lags while responders focus on diagnosis.
What to automate:
- Initial incident notification to the response team
- Stakeholder alerts based on incident severity
- Periodic reminder notifications for active incidents
- Resolution notifications when incidents close
- Escalation notifications when acknowledgment times out
Notification automation means responders do not manually page team members or email stakeholders. When an incident opens at severity 1, automation immediately notifies the on-call team, pages the engineering manager, and alerts customer success.
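One way to express this is a severity-to-audience table. In the sketch below, the severity 1 row mirrors the example above; the other rows, the audience names, and the message format are assumptions for illustration.

```python
from typing import Dict, List

# Audiences notified when an incident opens, keyed by severity.
# Only the severity 1 row comes from the example in the text; 2 and 3 are assumed.
NOTIFY_ON_OPEN: Dict[int, List[str]] = {
    1: ["on-call-team", "engineering-manager", "customer-success"],
    2: ["on-call-team", "engineering-manager"],
    3: ["on-call-team"],
}

def recipients_on_open(severity: int) -> List[str]:
    """Return audiences to notify; unknown severities fall back to on-call only."""
    return NOTIFY_ON_OPEN.get(severity, ["on-call-team"])

def initial_message(incident_id: str, title: str, severity: int) -> str:
    """Build the templated first notification (hypothetical format)."""
    return f"[SEV{severity}] {incident_id}: {title} (opened, awaiting acknowledgment)"
```

Keeping the table as data rather than code means the manual part, deciding who belongs in each row, can change without touching the notification logic.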
What stays manual: Crafting specific update messages, deciding who needs additional context, and handling unusual stakeholder requirements.
Status Page Automation
The problem: Status pages show “operational” while services are degraded. Engineers forget to update status during firefighting. Services recover before anyone updates the page.
What to automate:
- Status page state changes based on monitor health
- Automatic degraded/down status when monitors fail
- Automatic recovery when monitors return healthy
- Component-level status updates tied to specific monitors
- Scheduled maintenance window announcements
When monitors detect a problem, the status page updates immediately without waiting for an engineer to remember. When the service recovers, the status page reflects that recovery in seconds.
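A minimal sketch of the monitor-to-component mapping follows. The rule used here (all monitors failing means down, some failing means degraded) is an assumption chosen for illustration, and the names are hypothetical rather than any vendor's API.

```python
from enum import Enum
from typing import Dict

class Status(Enum):
    OPERATIONAL = "operational"
    DEGRADED = "degraded"
    DOWN = "down"

def component_status(monitor_healthy: Dict[str, bool]) -> Status:
    """Derive one component's status from the monitors tied to it.

    monitor_healthy maps a monitor name to True (healthy) or False (failing).
    """
    failing = sum(1 for healthy in monitor_healthy.values() if not healthy)
    if failing == 0:
        return Status.OPERATIONAL
    if failing == len(monitor_healthy):
        return Status.DOWN
    return Status.DEGRADED
```

Because the status is recomputed from current monitor state, recovery needs no separate code path: once every monitor reports healthy again, the function returns operational.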
Upstat provides catalog-driven status pages where components map directly to monitoring configuration. When a health check fails, the corresponding status page component updates automatically.
What stays manual: Root cause descriptions, estimated resolution times, and detailed customer-facing communications during major incidents.
Escalation Automation
The problem: Alerts go unacknowledged because the primary responder is unavailable. Secondary responders do not know they need to step in. Critical incidents sit without response while teams assume someone else is handling them.
What to automate:
- Time-based escalation to secondary on-call
- Severity-based escalation to management
- Multi-tier escalation chains
- Acknowledgment tracking and timeout triggers
- Escalation notifications with full context
If the primary on-call engineer does not acknowledge within five minutes, automation escalates to the secondary responder. If neither acknowledges within fifteen minutes, the team lead is notified. The escalation chain continues until someone responds.
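The five- and fifteen-minute chain above can be modeled as a list of thresholds. This sketch assumes time is measured in minutes since the alert fired; the tier names and function name are hypothetical.

```python
from typing import List, Optional, Tuple

# (minutes without acknowledgment, responder notified at that point).
ESCALATION_CHAIN: List[Tuple[int, str]] = [
    (0, "primary-on-call"),
    (5, "secondary-on-call"),
    (15, "team-lead"),
]

def notified(elapsed_min: float, acked_at_min: Optional[float] = None) -> List[str]:
    """Return everyone who has been notified so far.

    Escalation stops at the moment of acknowledgment, so tiers whose
    threshold falls after acked_at_min are never reached.
    """
    cutoff = elapsed_min if acked_at_min is None else min(elapsed_min, acked_at_min)
    return [who for threshold, who in ESCALATION_CHAIN if threshold <= cutoff]
```

For example, `notified(6)` includes the secondary responder, while `notified(20, acked_at_min=2)` returns only the primary because the acknowledgment arrived before the first timeout.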
What stays manual: Judgment calls about when escalation is premature, decisions to skip levels, and handling escalations that reach executive levels.
Context Gathering Automation
The problem: Responders spend the first 10 minutes gathering information that could be prepared in advance. Dashboard links, recent deployments, and related alerts require manual hunting.
What to automate:
- Include recent deployment information in incident context
- Attach relevant runbook links automatically
- Surface related alerts from the same time window
- Provide links to affected service dashboards
- Show recent similar incidents for reference
When an incident opens, automation gathers context responders will need: links to relevant dashboards, recent changes to affected services, associated runbooks, and related alerts. This information appears in the incident channel or page without manual effort.
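As a rough sketch of that gathering step, the function below filters deployments and alerts to a lookback window and attaches links. The record shapes, the one-hour window, and the dashboards.example.com URL are assumptions; a real setup would pull these from its own deployment and monitoring systems.

```python
from datetime import datetime, timedelta
from typing import Dict, List

def gather_context(
    service: str,
    opened_at: datetime,
    deployments: List[dict],   # each: {"service": str, "at": datetime, ...}
    alerts: List[dict],        # each: {"name": str, "at": datetime, ...}
    runbooks: Dict[str, str],  # service -> runbook link
    window: timedelta = timedelta(hours=1),
) -> dict:
    """Collect what a responder is likely to need when the incident opens."""
    start = opened_at - window
    return {
        # Placeholder URL pattern, not a real dashboard host.
        "dashboard": f"https://dashboards.example.com/{service}",
        "runbook": runbooks.get(service),
        "recent_deployments": [
            d for d in deployments
            if d["service"] == service and start <= d["at"] <= opened_at
        ],
        "related_alerts": [a for a in alerts if start <= a["at"] <= opened_at],
    }
```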
What stays manual: Interpreting the context, identifying which information matters, and recognizing patterns across gathered data.
Incident Channel Automation
The problem: Engineers manually create incident channels. Channel naming is inconsistent. Relevant people are not invited. Channel history is lost after resolution.
What to automate:
- Channel creation when incidents open at certain severities
- Consistent naming based on incident identifiers
- Automatic invitation of relevant responders
- Pinning of incident details and runbook links
- Channel archival after incident resolution
When a severity 1 incident opens, automation creates a dedicated response channel, invites the on-call team, posts incident context, and pins relevant links. Responders arrive in a prepared workspace rather than scrambling to set one up.
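Consistent naming is the easiest part of this to pin down in code. The sketch below derives a channel name from the incident identifier and title; the `inc-` prefix, the length cap, and the severity threshold are hypothetical conventions, not requirements of any chat tool.

```python
import re

def channel_name(incident_id: str, title: str, max_len: int = 60) -> str:
    """Build a predictable channel name such as inc-1042-checkout-api-5xx-spike."""
    # Lowercase the title and collapse anything that is not a letter or digit into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"inc-{incident_id.lower()}-{slug}"[:max_len].rstrip("-")

def should_create_channel(severity: int, threshold: int = 2) -> bool:
    """Create channels only for incidents at or above the assumed severity threshold."""
    return severity <= threshold
```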
What stays manual: Deciding who else to invite, managing conversation flow, and synthesizing discussion into action items.
What Not to Automate
Some incident management tasks actively resist automation because they require judgment that rules cannot capture.
Root cause analysis requires understanding system behavior, recognizing anomalies, and connecting symptoms to causes. Automation can gather data, but humans must interpret it.
Remediation decisions involve trade-offs between speed and safety, short-term fixes and long-term solutions. Automated remediation exists but carries significant risk for complex systems.
Customer communication content requires empathy, context awareness, and calibration of technical detail. Automated status updates work; automated apology emails do not.
Priority decisions between incidents require business context that automation cannot fully capture. Which incident matters more when both are severe?
Post-incident analysis demands critical thinking about process, culture, and system design. Automation can schedule the meeting but cannot conduct the retrospective.
Building Automation Incrementally
Start with the highest-value, lowest-risk automations.
First tier: Notification and routing. These automations have immediate impact and low risk. If routing is wrong, humans can correct it. If notifications fail, manual processes still work.
Second tier: Status page integration. Status page automation requires more confidence in monitoring accuracy. False positives become public when status pages update automatically.
Third tier: Escalation policies. Escalation automation requires accurate on-call schedules and appropriate timeout values. Misconfigured escalation creates either alert storms or missed incidents.
Fourth tier: Context gathering and channel automation. These automations improve efficiency but are not critical. They depend on integrations with deployment systems, documentation, and communication tools.
Each tier builds on the previous. Stable notification routing makes escalation automation reliable. Accurate monitoring makes status page automation trustworthy.
Event-Driven Automation Design
Modern incident automation operates on events rather than schedules. When something happens, automation responds.
Common trigger events:
- Monitor status changes (healthy to unhealthy)
- Incident state changes (open, acknowledged, resolved)
- Time-based thresholds (acknowledgment timeout)
- Heartbeat failures (expected signals missing)
Conditional logic determines actions:
- If severity is 1, notify the engineering manager
- If the service is customer-facing, update the status page
- If the incident is not acknowledged within 5 minutes, escalate
Event-driven design enables responsive automation without polling or scheduled checks. The system reacts to state changes in real time.
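The trigger, condition, and action structure can be sketched as a small rule table evaluated whenever an event arrives. The three rules mirror the conditional examples above; the event type names, field names, and action strings are hypothetical and do not represent any particular product's workflow format.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]

@dataclass
class Workflow:
    trigger: str                        # event type that activates the workflow
    condition: Callable[[Event], bool]  # extra check on the event payload
    action: str                         # action identifier to execute

WORKFLOWS: List[Workflow] = [
    Workflow("incident.opened",
             lambda e: e.get("severity") == 1,
             "notify:engineering-manager"),
    Workflow("monitor.unhealthy",
             lambda e: bool(e.get("customer_facing")),
             "update:status-page"),
    Workflow("ack.timeout",
             lambda e: e.get("minutes_unacknowledged", 0) >= 5,
             "escalate:next-tier"),
]

def actions_for(event: Event) -> List[str]:
    """Return the actions of every workflow whose trigger and condition match."""
    return [
        w.action
        for w in WORKFLOWS
        if w.trigger == event.get("type") and w.condition(event)
    ]
```

Nothing in this loop polls: `actions_for` runs only when an event is delivered, which is the property the event-driven approach relies on.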
Upstat implements event-driven automation through workflow definitions that specify triggers, conditions, and actions. When monitor status changes or incidents transition between states, matching workflows execute their defined actions.
Measuring Automation Effectiveness
Track whether automation improves incident response.
Time to first response should decrease as notifications reach responders faster.
Escalation frequency might increase initially as automation catches previously missed timeouts, then decrease as primary response improves.
Status page accuracy should improve as automatic updates replace forgotten manual changes.
Responder context time should decrease as automation provides relevant information upfront.
Automation that does not improve these metrics needs reevaluation. The goal is faster, more effective incident response, not automation for its own sake.
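As one example of tracking these numbers, time to first response can be computed from incident timestamps and compared before and after an automation goes live. The record fields below are assumed; the median is used so that one long outage does not dominate the figure.

```python
from datetime import datetime
from statistics import median
from typing import List, Optional

def median_minutes_to_ack(incidents: List[dict]) -> Optional[float]:
    """Median minutes from open to first acknowledgment.

    Each incident is a dict with "opened_at" and, if it was acknowledged,
    "acknowledged_at" (both datetimes). Unacknowledged incidents are skipped.
    """
    deltas = [
        (i["acknowledged_at"] - i["opened_at"]).total_seconds() / 60
        for i in incidents
        if i.get("acknowledged_at") is not None
    ]
    return median(deltas) if deltas else None
```

Running it on the incidents from the month before and the month after a change gives a direct comparison for the first metric in the list above.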
Getting Started
Begin with a single automation that addresses your most painful manual task.
If responders constantly page the wrong person, automate routing.
If status pages lag behind reality, automate status updates.
If escalations are inconsistent, automate escalation triggers.
Each successful automation builds confidence for the next. Over time, incident management becomes a hybrid of automated routine tasks and human judgment where it matters most.
The goal is not to remove humans from incident response. The goal is to free humans from repetitive tasks so they can focus on the complex decisions that require human judgment.
