Automation Examples & Templates

The automation examples below can be copied and customized for your organization. Each one states its use case and lists the trigger, conditions, and actions to configure.

Incident Response Automations

Critical Service Down - Response

Automation for critical service failures.

Use Case: When a critical production service fails, automatically orchestrate the complete incident response.

Configuration:

  • Trigger: Monitor Status Change (UP → DOWN)
  • Conditions:
    • Monitor has tag: tier-1 OR critical
    • Monitor type: HTTP or TCP
    • Time: Not during maintenance window
  • Actions:
    1. Create P1 incident with title: “[CRITICAL][monitor.name] is DOWN”
    2. Link affected catalog entities
    3. Page primary on-call engineer
    4. Send Slack notification to #incidents with @here mention
    5. Create incident channel #inc-[incident.id]
    6. Execute “Service Recovery” runbook
    7. Start 5-minute escalation timer
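
The three conditions above combine into a single yes/no check. The following is a minimal Python sketch of that logic, not the product's configuration syntax; the `Monitor` class, its field names, and `should_open_p1` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Monitor:
    # Hypothetical shape; the field names are assumptions for this sketch.
    name: str
    type: str
    tags: set = field(default_factory=set)
    in_maintenance: bool = False


def should_open_p1(monitor: Monitor, old_status: str, new_status: str) -> bool:
    """Return True when an UP -> DOWN change meets all three conditions."""
    went_down = old_status == "UP" and new_status == "DOWN"
    is_critical = bool(monitor.tags & {"tier-1", "critical"})  # either tag matches
    is_supported_type = monitor.type in {"HTTP", "TCP"}
    return went_down and is_critical and is_supported_type and not monitor.in_maintenance
```

Keeping the check side-effect free makes it easy to test against non-production monitors before any paging action is attached.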

Flapping Service Detection

Detect and handle services that are unstable.

Use Case: Identify monitors that are flapping (going up/down repeatedly) and take appropriate action.

Configuration:

  • Trigger: Monitor Status Change
  • Conditions:
    • Same monitor has changed status 5+ times in 10 minutes
    • Current status: DOWN
  • Actions:
    1. Create incident: “Flapping detected: [monitor.name]”
    2. Set incident priority: P3
    3. Disable monitor alerts for 30 minutes
    4. Notify responsible team via Slack
    5. Add comment with flapping history
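
The "5+ changes in 10 minutes" condition is a sliding-window count. Here is one way it could be sketched in Python; the `FlapDetector` class is hypothetical, and the platform may evaluate this condition differently.

```python
from collections import deque
from datetime import datetime, timedelta


class FlapDetector:
    """Tracks status changes for one monitor in a sliding time window."""

    def __init__(self, threshold: int = 5, window: timedelta = timedelta(minutes=10)):
        self.threshold = threshold
        self.window = window
        self.changes = deque()  # timestamps of recent status changes, oldest first

    def record_change(self, at: datetime, new_status: str) -> bool:
        """Record a change; return True if the monitor is flapping and currently DOWN."""
        self.changes.append(at)
        # Drop changes that fall outside the window ending at `at`.
        while self.changes and at - self.changes[0] > self.window:
            self.changes.popleft()
        return len(self.changes) >= self.threshold and new_status == "DOWN"
```

Timestamps are assumed to arrive in chronological order, which lets the deque discard old entries from the left without sorting.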

Alert Management Automations

Alert Suppression

Reduce alerts during widespread outages.

Use Case: When a core service fails, suppress alerts from dependent services to reduce noise.

Configuration:

  • Trigger: Incident Created
  • Conditions:
    • Incident severity: P1 or P2
    • Affected entity type: “Core Service” or “Database”
  • Actions:
    1. Query dependent services via catalog relationships
    2. Suppress alerts for dependent monitors for 1 hour
    3. Add note to incident: “Suppressed [count] dependent alerts”
    4. Send summary to incident channel
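
Step 1 amounts to walking the dependency graph outward from the failed service. The sketch below assumes the catalog relationships can be exported as a plain mapping from each service to the services it depends on, and that transitive dependents should also be suppressed; both are assumptions, and `find_dependents` is a hypothetical helper.

```python
from collections import deque


def find_dependents(depends_on: dict, failed: str) -> set:
    """Return every service that directly or transitively depends on `failed`.

    `depends_on` maps a service name to the list of services it depends on.
    """
    # Invert the edges: service -> services that depend on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first search outward from the failed service.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for svc in dependents.get(current, ()):
            if svc not in seen and svc != failed:
                seen.add(svc)
                queue.append(svc)
    return seen
```

The size of the returned set is the `[count]` value for the incident note in step 3.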

Business Hours Routing

Route alerts differently based on business hours.

Use Case: Send non-critical alerts to Slack during business hours. After hours, page on-call only for monitors on the critical path and queue everything else for the next business day.

Configuration:

  • Trigger: Monitor Alert
  • Conditions:
    • Severity: P3 or P4
    • Business hours: Monday-Friday 9 AM - 6 PM (team timezone); this check selects which action list below runs
  • Actions (Business Hours):
    1. Send to team Slack channel
    2. Create incident with 2-hour SLA
    3. Assign to daily rotation
  • Actions (After Hours):
    1. Check if the monitor is on the critical path
    2. If yes: Page on-call
    3. If no: Queue for next business day
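
The two action branches can be sketched as one routing function. This Python example is illustrative only: `route_alert` and the returned labels are hypothetical, and the timestamp is assumed to be already converted to the team's timezone.

```python
from datetime import datetime, time

BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)


def is_business_hours(local_time: datetime) -> bool:
    """`local_time` is assumed to already be in the team's timezone."""
    is_weekday = local_time.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and BUSINESS_START <= local_time.time() < BUSINESS_END


def route_alert(severity: str, local_time: datetime, on_critical_path: bool) -> str:
    """Return a routing decision for a P3/P4 alert; other severities are out of scope."""
    if severity not in {"P3", "P4"}:
        return "not-handled"
    if is_business_hours(local_time):
        return "slack"
    return "page-on-call" if on_critical_path else "queue-next-business-day"
```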

Maintenance Automations

Scheduled Maintenance Window

Automate recurring maintenance windows.

Use Case: Weekly database maintenance that requires alert suppression.

Configuration:

  • Trigger: Schedule (Cron: 0 2 * * 0, which fires every Sunday at 02:00)
  • Conditions:
    • Day: Sunday
    • No P1/P2 incidents active
  • Actions:
    1. Create maintenance window (2 hours)
    2. Disable alerts for monitors tagged “database”
    3. Post to #engineering: “Weekly DB maintenance starting”
    4. Update status page: “Scheduled maintenance in progress”
    5. After 2 hours: Re-enable alerts, close maintenance
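
To sanity-check the schedule, it can help to compute when the cron expression `0 2 * * 0` next fires and when the 2-hour window closes. The Python sketch below hard-codes that one expression rather than parsing cron in general; both function names are hypothetical.

```python
from datetime import datetime, timedelta


def next_sunday_2am(now: datetime) -> datetime:
    """Next firing time of cron `0 2 * * 0` strictly after `now`."""
    # Python numbers weekdays Monday=0 ... Sunday=6; cron uses Sunday=0.
    days_ahead = (6 - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=2, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        candidate += timedelta(days=7)  # this week's slot has already passed
    return candidate


def maintenance_window(start: datetime, hours: int = 2) -> tuple:
    """Return (start, end) of the window; alerts are re-enabled at `end`."""
    return start, start + timedelta(hours=hours)
```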

Pre-Deployment Checks

Ensure safety before automated deployments.

Use Case: Before deploying, verify system health and create safety nets.

Configuration:

  • Trigger: Webhook (from CI/CD pipeline)
  • Conditions:
    • All production monitors: UP
    • No active P1/P2 incidents
    • Not during business hours
  • Actions:
    1. Create deployment tracking incident
    2. Snapshot current system metrics
    3. Enable enhanced monitoring mode
    4. Notify on-call: “Deployment starting for [service]”
    5. Set 30-minute auto-rollback timer
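
A CI/CD pipeline usually needs to know why a deployment was refused, not only that it was. The following Python sketch evaluates the three conditions and returns the reasons; `deployment_blockers` and its inputs are hypothetical and do not represent the webhook's actual payload.

```python
def deployment_blockers(monitor_statuses: dict, active_severities: list,
                        in_business_hours: bool) -> list:
    """Return human-readable reasons to block a deployment; an empty list means go."""
    blockers = []
    down = sorted(name for name, status in monitor_statuses.items() if status != "UP")
    if down:
        blockers.append("monitors not UP: " + ", ".join(down))
    if any(sev in {"P1", "P2"} for sev in active_severities):
        blockers.append("active P1/P2 incident")
    if in_business_hours:
        blockers.append("inside business hours")
    return blockers
```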

Escalation Automations

Tiered Escalation Chain

Progressive escalation based on acknowledgment.

Use Case: Ensure critical incidents get attention by escalating through the chain of command.

Configuration:

  • Trigger: Incident Created
  • Conditions:
    • Severity: P1
    • Status: Unacknowledged
  • Actions (Immediate):
    1. Page primary on-call
    2. Send SMS to primary
    3. Start 5-minute timer
  • Actions (After 5 min):
    1. Page secondary on-call
    2. Call primary on-call
    3. Start 5-minute timer
  • Actions (After 10 min):
    1. Page engineering manager
    2. Page entire on-call team
    3. Send executive summary

SLA Breach Prevention

Proactive escalation before SLA breach.

Use Case: Escalate incidents approaching their SLA deadline.

Configuration:

  • Trigger: Time-based check (every 5 minutes)
  • Conditions:
    • Incident has SLA defined
    • Time until breach: less than 30 minutes
    • Status: Not resolved
  • Actions:
    1. Update severity to P2 (if lower)
    2. Page incident owner
    3. Notify team lead: “SLA breach imminent”
    4. Add timeline comment
    5. Trigger status page update
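
The periodic check and the "raise to P2 if lower" rule can be sketched as two small Python functions. The names and the status string `"resolved"` are assumptions for illustration. In this sketch an incident that has already breached (negative time remaining) also counts as needing escalation.

```python
from datetime import datetime, timedelta
from typing import Optional

SEVERITY_RANK = {"P1": 1, "P2": 2, "P3": 3, "P4": 4}


def needs_sla_escalation(sla_deadline: Optional[datetime], status: str, now: datetime,
                         threshold: timedelta = timedelta(minutes=30)) -> bool:
    """True when an unresolved incident with an SLA is within `threshold` of breach."""
    if sla_deadline is None or status == "resolved":
        return False
    return sla_deadline - now < threshold


def escalated_severity(current: str) -> str:
    """Raise to P2 only if the incident is currently lower (P3 or P4)."""
    return "P2" if SEVERITY_RANK[current] > SEVERITY_RANK["P2"] else current
```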

Integration Automations

Bi-Directional Ticket Sync

Keep external ticketing systems in sync.

Use Case: Automatically create and sync tickets with Jira/ServiceNow.

Configuration:

  • Trigger: Incident Created/Updated
  • Conditions:
    • Incident duration: greater than 15 minutes
    • Not tagged “no-ticket”
  • Actions (Create):
    1. Create Jira ticket via webhook
    2. Set ticket priority = incident severity
    3. Add ticket link to incident
    4. Tag incident with “jira-synced”
  • Actions (Update):
    1. Update ticket status
    2. Sync comments bi-directionally
    3. Update resolution notes
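
Because one trigger covers both creation and updates, the automation has to decide which action list applies to a given event. The Python sketch below shows one possible decision rule that uses the "jira-synced" tag as the marker. The severity-to-priority mapping is an assumption, since priority names depend on how the Jira project is configured, and `ticket_action` is a hypothetical helper.

```python
from datetime import timedelta

# Assumed mapping; adjust to the priority scheme of your Jira project.
SEVERITY_TO_PRIORITY = {"P1": "Highest", "P2": "High", "P3": "Medium", "P4": "Low"}


def ticket_action(duration: timedelta, tags: set) -> str:
    """Return "create", "update", or "skip" for an incident event."""
    if "no-ticket" in tags or duration <= timedelta(minutes=15):
        return "skip"
    # An incident already tagged as synced has a ticket, so update it instead.
    return "update" if "jira-synced" in tags else "create"
```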

ChatOps Commands

Enable chat-based incident management.

Use Case: Allow team to manage incidents directly from Slack.

Configuration:

  • Trigger: Slack slash command (/incident)
  • Conditions:
    • User has incident responder role
    • Command is valid
  • Actions:
    1. Parse command (create/update/resolve)
    2. Execute requested action
    3. Update incident timeline
    4. Reply in Slack with confirmation
    5. Update incident channel topic
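
Steps 1 and 2 depend on turning the text typed after `/incident` into an action and an argument. A minimal Python parser might look like this; `parse_incident_command` is hypothetical, and the role check from the conditions is assumed to happen before it is called.

```python
VALID_ACTIONS = {"create", "update", "resolve"}


def parse_incident_command(text: str) -> tuple:
    """Split the text after `/incident` into (action, argument).

    Raises ValueError for an empty or unknown action.
    """
    parts = text.strip().split(maxsplit=1)
    if not parts or parts[0].lower() not in VALID_ACTIONS:
        raise ValueError(f"expected one of {sorted(VALID_ACTIONS)}")
    argument = parts[1] if len(parts) > 1 else ""
    return parts[0].lower(), argument
```

For example, `parse_incident_command("create Checkout API is down")` yields the action `"create"` with the rest of the text as the incident title. Raising on unknown actions gives the Slack reply in step 4 a clear error to show.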

Reporting Automations

Daily Incident Summary

Automated daily reports for leadership.

Use Case: Send daily incident summary to management.

Configuration:

  • Trigger: Schedule (Daily 9 AM)
  • Conditions:
    • Weekday (Mon-Fri)
  • Actions:
    1. Query incidents from last 24 hours
    2. Generate summary statistics
    3. Include MTTR by severity
    4. List ongoing P1/P2 incidents
    5. Email to leadership with trends
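
MTTR by severity (step 3) is the mean of resolution time minus creation time, grouped by severity. The Python sketch below assumes each incident is a dict with `severity`, `created_at`, and `resolved_at` keys; that shape is hypothetical and does not represent the platform's export format.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def mttr_by_severity(incidents: list) -> dict:
    """Mean time to resolve per severity, over resolved incidents only.

    Each incident is a dict with "severity", "created_at", and "resolved_at"
    (None while the incident is still open).
    """
    durations = defaultdict(list)
    for inc in incidents:
        if inc["resolved_at"] is not None:
            durations[inc["severity"]].append(inc["resolved_at"] - inc["created_at"])
    # timedelta supports addition and division by an int, so the mean is direct.
    return {sev: sum(ds, timedelta()) / len(ds) for sev, ds in durations.items()}
```

Open incidents are excluded from the mean and belong in the separate "ongoing P1/P2" list from step 4.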

Post-Incident Auto-Review

Schedule post-mortem reviews automatically.

Use Case: Ensure P1 incidents get proper post-mortem review.

Configuration:

  • Trigger: Incident Resolved
  • Conditions:
    • Severity: P1
    • Duration: greater than 30 minutes
    • Has customer impact
  • Actions:
    1. Wait 1 business day
    2. Create calendar event for post-mortem
    3. Invite incident responders
    4. Create post-mortem document template
    5. Set reminder for action items
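
"Wait 1 business day" in step 1 means skipping weekends when picking the post-mortem date. A minimal Python sketch follows, with the hypothetical helper `add_business_days`; public holidays are not considered.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int = 1) -> date:
    """Advance `days` weekdays from `start`, skipping Saturday and Sunday."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current
```

An incident resolved on a Friday therefore gets its review scheduled for the following Monday.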

Best Practices for Using These Examples

  1. Customize for Your Needs: These are starting points - modify for your environment
  2. Test First: Always test with non-production monitors first
  3. Start Small: Implement one automation at a time
  4. Monitor Results: Review automation logs regularly
  5. Iterate: Refine based on real-world performance
  6. Document Changes: Keep notes on why you modified examples

Creating Your Own

To implement any of these examples:

  1. Go to Automations → Create New
  2. Copy the configuration from the example
  3. Adjust triggers, conditions, and actions for your needs
  4. Test with a dry run
  5. Enable when confident

Need Help?