Automation Examples & Templates
Learn from automation examples that you can copy and customize for your organization. Each example includes the configuration and explains the use case.
Incident Response Automations
Critical Service Down - Response
Automation for critical service failures.
Use Case: When a critical production service fails, automatically orchestrate the complete incident response.
Configuration:
- Trigger: Monitor Status Change (UP → DOWN)
- Conditions:
- Monitor has tag: tier-1 OR critical
- Monitor type: HTTP or TCP
- Time: Not during maintenance window
- Actions:
- Create P1 incident with title: “[CRITICAL][monitor.name] is DOWN”
- Link affected catalog entities
- Page primary on-call engineer
- Send Slack notification to #incidents with @here mention
- Create incident channel #inc-[incident.id]
- Execute “Service Recovery” runbook
- Start 5-minute escalation timer
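The conditions and incident title for this example can be sketched in Python. This is a minimal illustration, not the product's rule engine: the monitor dictionary shape (`tags`, `type`, `name`) and both function names are assumptions made for the example.

```python
# Hypothetical monitor shape: {"name": str, "type": str, "tags": list[str]}
CRITICAL_TAGS = {"tier-1", "critical"}
CRITICAL_TYPES = {"HTTP", "TCP"}

def matches_critical_conditions(monitor, in_maintenance):
    """True when all three conditions from the example hold."""
    has_tag = bool(CRITICAL_TAGS & set(monitor["tags"]))
    return has_tag and monitor["type"] in CRITICAL_TYPES and not in_maintenance

def incident_title(monitor):
    """Fill the [monitor.name] placeholder in the incident title."""
    return f"[CRITICAL][{monitor['name']}] is DOWN"

checkout = {"name": "checkout-api", "type": "HTTP", "tags": ["tier-1", "payments"]}
print(matches_critical_conditions(checkout, in_maintenance=False))
print(incident_title(checkout))
```

The tag check is an OR across the two tags, while the three conditions are combined with AND, which matches how the configuration reads.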
Flapping Service Detection
Detect and handle services that are unstable.
Use Case: Identify monitors that are flapping (going up/down repeatedly) and take appropriate action.
Configuration:
- Trigger: Monitor Status Change
- Conditions:
- Same monitor has changed status 5+ times in 10 minutes
- Current status: DOWN
- Actions:
- Create incident: “Flapping detected: [monitor.name]”
- Set incident priority: P3
- Disable monitor alerts for 30 minutes
- Notify responsible team via Slack
- Add comment with flapping history
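The "5+ status changes in 10 minutes" condition is a sliding-window count. The sketch below shows one way to evaluate it; the `FlapDetector` class and its method names are hypothetical, and the platform may compute this differently.

```python
from collections import deque
from datetime import datetime, timedelta

class FlapDetector:
    """Tracks status changes for one monitor within a sliding time window."""

    def __init__(self, threshold=5, window=timedelta(minutes=10)):
        self.threshold = threshold
        self.window = window
        self.changes = deque()  # timestamps of recent status changes

    def record_change(self, at, status):
        """Record a status change; return True if the flapping conditions hold."""
        self.changes.append(at)
        # Drop changes that are older than the window.
        while self.changes and at - self.changes[0] > self.window:
            self.changes.popleft()
        return len(self.changes) >= self.threshold and status == "DOWN"

detector = FlapDetector()
start = datetime(2024, 1, 1, 12, 0)
for i in range(5):
    status = "DOWN" if i % 2 == 0 else "UP"
    flapping = detector.record_change(start + timedelta(minutes=i), status)
print(flapping)
```

Keeping only the timestamps inside the window keeps memory bounded per monitor, and the retained timestamps can double as the "flapping history" added to the incident comment.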
Alert Management Automations
Alert Suppression
Reduce alerts during widespread outages.
Use Case: When a core service fails, suppress alerts from dependent services to reduce noise.
Configuration:
- Trigger: Incident Created
- Conditions:
- Incident severity: P1 or P2
- Affected entity type: “Core Service” or “Database”
- Actions:
- Query dependent services via catalog relationships
- Suppress alerts for dependent monitors for 1 hour
- Add note to incident: “Suppressed [count] dependent alerts”
- Send summary to incident channel
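Finding the monitors to suppress amounts to walking the dependency graph outward from the failed entity. The sketch below assumes the catalog can be exported as a mapping from each service to the services it depends on, and it follows dependencies transitively; both the data shape and the transitive behavior are assumptions, not documented product behavior.

```python
from collections import deque

def dependents_of(failed, depends_on):
    """Return every service that depends, directly or indirectly, on `failed`.

    depends_on maps a service name to the list of services it depends on.
    """
    # Invert the mapping so we can walk from a dependency to its dependents.
    reverse = {}
    for service, deps in depends_on.items():
        for dep in deps:
            reverse.setdefault(dep, []).append(service)

    found = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in reverse.get(current, []):
            if service not in found and service != failed:
                found.add(service)
                queue.append(service)
    return found

catalog = {"api": ["db"], "web": ["api"], "batch": ["db"], "cache": []}
suppressed = dependents_of("db", catalog)
print(f"Suppressed {len(suppressed)} dependent alerts")
```

The size of the returned set is what would fill the `[count]` placeholder in the incident note.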
Business Hours Routing
Route alerts differently based on business hours.
Use Case: Send non-critical alerts to Slack during business hours; after hours, page on-call only when the monitor is on the critical path.
Configuration:
- Trigger: Monitor Alert
- Conditions:
- Severity: P3 or P4
- Time: Monday-Friday 9 AM - 6 PM (team timezone)
- Actions (Business Hours):
- Send to team Slack channel
- Create incident with 2-hour SLA
- Assign to daily rotation
- Actions (After Hours):
- Check whether the monitor is on the critical path
- If yes: Page on-call
- If no: Queue for next business day
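The routing decision can be expressed as a small function. This sketch assumes the timestamp has already been converted to the team's timezone and treats 6 PM as the exclusive end of business hours; the function names and return values are illustrative only.

```python
from datetime import datetime, time

BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)

def is_business_hours(local_now):
    """Monday-Friday, 9 AM up to (not including) 6 PM, in team-local time."""
    return local_now.weekday() < 5 and BUSINESS_START <= local_now.time() < BUSINESS_END

def route_alert(local_now, on_critical_path):
    """Return the routing decision for a P3/P4 alert."""
    if is_business_hours(local_now):
        return "slack"
    return "page_on_call" if on_critical_path else "queue_next_business_day"

print(route_alert(datetime(2024, 1, 3, 10, 0), on_critical_path=False))  # a Wednesday morning
print(route_alert(datetime(2024, 1, 6, 10, 0), on_critical_path=True))   # a Saturday morning
```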
Maintenance Automations
Scheduled Maintenance Window
Automate recurring maintenance windows.
Use Case: Weekly database maintenance that requires alert suppression.
Configuration:
- Trigger: Schedule (Cron: 0 2 * * 0)
- Conditions:
- Day: Sunday
- No P1/P2 incidents active
- Actions:
- Create maintenance window (2 hours)
- Disable alerts for monitors tagged “database”
- Post to #engineering: “Weekly DB maintenance starting”
- Update status page: “Scheduled maintenance in progress”
- After 2 hours: Re-enable alerts, close maintenance
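In standard five-field cron syntax, `0 2 * * 0` means minute 0, hour 2, any day of month, any month, and day-of-week 0 (Sunday). The sketch below checks a timestamp against that specific expression and computes the 2-hour window; the function names are hypothetical, and the scheduler's timezone is assumed to match the timestamps passed in.

```python
from datetime import datetime, timedelta

def matches_weekly_cron(dt):
    """True when dt matches the cron expression '0 2 * * 0'."""
    # Python counts Monday as 0; cron counts Sunday as 0.
    cron_day_of_week = (dt.weekday() + 1) % 7
    return dt.minute == 0 and dt.hour == 2 and cron_day_of_week == 0

def maintenance_window(start, hours=2):
    """Return the (start, end) pair for the maintenance window."""
    return start, start + timedelta(hours=hours)

run_at = datetime(2024, 1, 7, 2, 0)  # a Sunday
if matches_weekly_cron(run_at):
    begins, ends = maintenance_window(run_at)
    print(f"Alerts disabled from {begins:%H:%M} to {ends:%H:%M}")
```

Because the cron expression already restricts the schedule to Sundays, the "Day: Sunday" condition acts as a second safeguard rather than an additional filter.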
Pre-Deployment Checks
Ensure safety before automated deployments.
Use Case: Before deploying, verify system health and create safety nets.
Configuration:
- Trigger: Webhook (from CI/CD pipeline)
- Conditions:
- All production monitors: UP
- No active P1/P2 incidents
- Not during business hours
- Actions:
- Create deployment tracking incident
- Snapshot current system metrics
- Enable enhanced monitoring mode
- Notify on-call: “Deployment starting for [service]”
- Set 30-minute auto-rollback timer
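The three conditions act as a deployment gate: if any fails, the automation should not proceed. A minimal sketch of such a gate follows, assuming monitor statuses and active incident severities are available as simple Python values; the function name and input shapes are invented for illustration.

```python
def deployment_blockers(monitor_statuses, active_incident_severities, in_business_hours):
    """Return the reasons a deployment should not start (empty list means go)."""
    blockers = []
    down = sorted(name for name, status in monitor_statuses.items() if status != "UP")
    if down:
        blockers.append("monitors not UP: " + ", ".join(down))
    if any(sev in {"P1", "P2"} for sev in active_incident_severities):
        blockers.append("active P1/P2 incident")
    if in_business_hours:
        blockers.append("inside business hours")
    return blockers

blockers = deployment_blockers({"api": "UP", "db": "UP"}, ["P3"], in_business_hours=False)
print("proceed" if not blockers else blockers)
```

Returning the list of reasons, rather than a bare boolean, gives the CI/CD pipeline something useful to report when the webhook is rejected.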
Escalation Automations
Tiered Escalation Chain
Progressive escalation based on acknowledgment.
Use Case: Ensure critical incidents get attention by escalating through the chain of command.
Configuration:
- Trigger: Incident Created
- Conditions:
- Severity: P1
- Status: Unacknowledged
- Actions (Immediate):
- Page primary on-call
- Send SMS to primary
- Start 5-minute timer
- Actions (After 5 min):
- Page secondary on-call
- Call primary on-call
- Start 5-minute timer
- Actions (After 10 min):
- Page engineering manager
- Page entire on-call team
- Send executive summary
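The chain above is effectively a table of delays and actions, evaluated for as long as the incident stays unacknowledged. A rough sketch, with the step table and function name as assumptions for illustration:

```python
# (minutes since creation, actions that become due at that point)
ESCALATION_STEPS = [
    (0, ["page primary on-call", "send SMS to primary"]),
    (5, ["page secondary on-call", "call primary on-call"]),
    (10, ["page engineering manager", "page entire on-call team", "send executive summary"]),
]

def due_actions(minutes_unacknowledged, acknowledged=False):
    """Return every action that should have fired by now for a P1 incident."""
    if acknowledged:
        return []  # acknowledgment stops the chain
    actions = []
    for delay, step_actions in ESCALATION_STEPS:
        if minutes_unacknowledged >= delay:
            actions.extend(step_actions)
    return actions

print(due_actions(7))
```

The two 5-minute timers in the configuration correspond to the 5-minute and 10-minute rows of the table.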
SLA Breach Prevention
Proactive escalation before SLA breach.
Use Case: Escalate incidents approaching their SLA deadline.
Configuration:
- Trigger: Time-based check (every 5 minutes)
- Conditions:
- Incident has SLA defined
- Time until breach: less than 30 minutes
- Status: Not resolved
- Actions:
- Update severity to P2 (if lower)
- Page incident owner
- Notify team lead: “SLA breach imminent”
- Add timeline comment
- Trigger status page update
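The periodic check reduces to comparing the time remaining against a threshold and raising severity only when it is currently lower than P2. The sketch below assumes each incident exposes an SLA deadline and a status string; the names are hypothetical. Note that an incident already past its deadline also matches, since its remaining time is negative.

```python
from datetime import datetime, timedelta

SEVERITY_RANK = {"P1": 1, "P2": 2, "P3": 3, "P4": 4}

def should_escalate(now, sla_deadline, status, threshold=timedelta(minutes=30)):
    """True when an unresolved incident is within `threshold` of its SLA deadline."""
    if sla_deadline is None or status == "resolved":
        return False
    return sla_deadline - now < threshold

def escalated_severity(current):
    """Raise to P2 only if the current severity is lower (P3 or P4)."""
    return "P2" if SEVERITY_RANK[current] > SEVERITY_RANK["P2"] else current

now = datetime(2024, 1, 1, 12, 0)
if should_escalate(now, now + timedelta(minutes=20), "open"):
    print(escalated_severity("P4"))
```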
Integration Automations
Bi-Directional Ticket Sync
Keep external ticketing systems in sync.
Use Case: Automatically create and sync tickets with Jira/ServiceNow.
Configuration:
- Trigger: Incident Created/Updated
- Conditions:
- Incident duration: greater than 15 minutes
- Not tagged “no-ticket”
- Actions (Create):
- Create Jira ticket via webhook
- Set ticket priority = incident severity
- Add ticket link to incident
- Tag incident with “jira-synced”
- Actions (Update):
- Update ticket status
- Sync comments bi-directionally
- Update resolution notes
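The create path needs two pieces of logic: deciding whether a ticket is due, and translating incident severity into a ticket priority. The sketch below is illustrative only. The priority names follow Jira's default priority scheme and may differ in your project, and skipping incidents already tagged “jira-synced” is an added assumption intended to avoid duplicate tickets.

```python
# Assumed mapping onto Jira's default priority names; adjust to your scheme.
PRIORITY_BY_SEVERITY = {"P1": "Highest", "P2": "High", "P3": "Medium", "P4": "Low"}

def should_create_ticket(duration_minutes, tags):
    """Apply the example's conditions, plus a guard against re-creating tickets."""
    return (
        duration_minutes > 15
        and "no-ticket" not in tags
        and "jira-synced" not in tags  # assumption: already synced means already created
    )

def ticket_priority(severity):
    return PRIORITY_BY_SEVERITY[severity]

if should_create_ticket(22, ["database"]):
    print(ticket_priority("P2"))
```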
ChatOps Commands
Enable chat-based incident management.
Use Case: Allow team to manage incidents directly from Slack.
Configuration:
- Trigger: Slack slash command (/incident)
- Conditions:
- User has incident responder role
- Command is valid
- Actions:
- Parse command (create/update/resolve)
- Execute requested action
- Update incident timeline
- Reply in Slack with confirmation
- Update incident channel topic
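The "Command is valid" condition and the parse step can be combined in one function that returns nothing for unrecognized input. The sketch assumes the text that follows `/incident` arrives as a single string; the function name and return shape are assumptions.

```python
VALID_ACTIONS = {"create", "update", "resolve"}

def parse_incident_command(text):
    """Split '/incident' command text into (action, argument), or None if invalid."""
    parts = text.strip().split(maxsplit=1)
    if not parts or parts[0].lower() not in VALID_ACTIONS:
        return None
    argument = parts[1] if len(parts) > 1 else ""
    return parts[0].lower(), argument

print(parse_incident_command("create API latency high"))
print(parse_incident_command("delete 42"))
```

Rejecting unknown actions before any role check or execution keeps the handler from acting on typos.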
Reporting Automations
Daily Incident Summary
Automated daily reports for leadership.
Use Case: Send daily incident summary to management.
Configuration:
- Trigger: Schedule (Daily 9 AM)
- Conditions:
- Weekday (Mon-Fri)
- Actions:
- Query incidents from last 24 hours
- Generate summary statistics
- Include MTTR by severity
- List ongoing P1/P2 incidents
- Email to leadership with trends
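MTTR by severity is the mean time to resolve, grouped by severity level. A small sketch of that calculation, assuming the incident query yields (severity, minutes to resolve) pairs for resolved incidents; the input shape is an assumption for the example.

```python
from collections import defaultdict

def mttr_by_severity(incidents):
    """incidents: iterable of (severity, minutes_to_resolve) for resolved incidents."""
    totals = defaultdict(lambda: [0, 0])  # severity -> [total minutes, incident count]
    for severity, minutes in incidents:
        totals[severity][0] += minutes
        totals[severity][1] += 1
    return {severity: total / count for severity, (total, count) in totals.items()}

last_24_hours = [("P1", 30), ("P1", 90), ("P3", 20)]
print(mttr_by_severity(last_24_hours))
```

Ongoing incidents have no resolution time yet, so they belong in the separate "ongoing P1/P2" list rather than in this average.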
Post-Incident Auto-Review
Schedule post-mortem reviews automatically.
Use Case: Ensure P1 incidents get proper post-mortem review.
Configuration:
- Trigger: Incident Resolved
- Conditions:
- Severity: P1
- Duration: greater than 30 minutes
- Has customer impact
- Actions:
- Wait 1 business day
- Create calendar event for post-mortem
- Invite incident responders
- Create post-mortem document template
- Set reminder for action items
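"Wait 1 business day" means skipping weekends when scheduling the review. The sketch below counts Monday through Friday as business days and ignores public holidays, which is a simplifying assumption; the function name is hypothetical.

```python
from datetime import date, timedelta

def add_business_days(start, days=1):
    """Return the date `days` business days after `start` (weekends skipped)."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            days -= 1
    return current

resolved_on = date(2024, 1, 5)  # a Friday
print(add_business_days(resolved_on))
```

An incident resolved on a Friday therefore gets its post-mortem scheduled for the following Monday rather than Saturday.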
Best Practices for Using These Examples
- Customize for Your Needs: These are starting points - modify for your environment
- Test First: Always test with non-production monitors first
- Start Small: Implement one automation at a time
- Monitor Results: Review automation logs regularly
- Iterate: Refine based on real-world performance
- Document Changes: Keep notes on why you modified examples
Creating Your Own
To implement any of these examples:
- Go to Automations → Create New
- Copy the configuration from the example
- Adjust triggers, conditions, and actions for your needs
- Test with a dry run
- Enable when confident