Runbook Examples

Use these examples as starting points for creating your own runbooks. Adapt them to match your specific environment and needs.

Service Restart Runbook

A basic template for restarting services safely.

Title: Generic Service Restart

Description: Standard procedure for restarting a service with proper checks.

Steps:

Step 1: Check Current Status
- SSH to server
- Run: systemctl status [service-name]
- Note the current state

Step 2: Notify Team
- Post in Slack: "Restarting [service] on [server]"
- Wait for acknowledgment if needed

Step 3: Stop Service
- Run: sudo systemctl stop [service-name]
- Verify stopped: systemctl status [service-name]

Step 4: Wait for Connections to Clear
- Wait 30 seconds for existing connections to close
- Check for remaining processes: pgrep -af [service-name]

Step 5: Start Service
- Run: sudo systemctl start [service-name]
- Check status: systemctl status [service-name]

Step 6: Is Service Running?
- Check the status output
- If "active (running)" - Continue to Step 7
- If failed - Go to Step 10

Step 7: Verify Functionality
- Test main endpoint
- Check logs for errors
- Confirm the service is working properly

Step 8: Notify Complete
- Update Slack: "Service restart complete"
- Close any related tickets

Step 10: Troubleshooting Failed Start
- Check logs: journalctl -u [service-name] -n 50
- Look for error messages
- Try starting in debug mode
- If still failing, escalate to senior engineer
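
Steps 3 through 6 above can be sketched as one shell function. This is a minimal sketch, not a finished script: restart_and_check and WAIT_SECONDS are hypothetical names chosen for illustration, and it assumes the operator may run systemctl through sudo. It relies on systemctl is-active --quiet, which exits 0 only when the unit is active.

```shell
#!/usr/bin/env bash
# Sketch of Steps 3-6: stop, wait, start, then branch on the result.
# restart_and_check and WAIT_SECONDS are hypothetical names, not part of systemd.

restart_and_check() {
    local service="${1:-}"
    local wait_seconds="${WAIT_SECONDS:-30}"   # Step 4 default: 30 seconds

    if [ -z "$service" ]; then
        echo "usage: restart_and_check <service-name>" >&2
        return 2
    fi

    sudo systemctl stop "$service" || return 2   # Step 3
    sleep "$wait_seconds"                        # Step 4
    sudo systemctl start "$service"              # Step 5 (a failed start is caught below)

    # Step 6: is-active exits 0 only when the unit is active.
    if systemctl is-active --quiet "$service"; then
        echo "running: continue to Step 7"
    else
        echo "failed: go to Step 10"
        return 1
    fi
}
```

Calling restart_and_check with your service name prints which step to take next, and its exit status can drive further automation.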

Database Maintenance Runbook

For routine database maintenance tasks.

Title: PostgreSQL Maintenance

Description: Regular maintenance for PostgreSQL databases including vacuum and reindex.

Steps:

Step 1: Check Database Metrics
- Connect to database
- Check table sizes: SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_stat_user_tables ORDER BY pg_total_relation_size(relid) DESC;
- Note any large tables

Step 2: Check Current Activity
- Run: SELECT * FROM pg_stat_activity WHERE state != 'idle';
- Ensure no long-running queries
- If busy - Go to Step 10 (Reschedule)
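
The busy check in Step 2 could be scripted roughly as follows. This is a sketch under assumptions: db_is_busy and THRESHOLD are hypothetical names, the 5-minute cutoff for "long-running" is an example value the runbook does not specify, and psql is expected to read its connection settings from the usual PGHOST, PGDATABASE, and PGUSER environment variables.

```shell
#!/usr/bin/env bash
# Sketch of the Step 2 decision. db_is_busy and THRESHOLD are hypothetical.
# The 5-minute default is an assumed cutoff, not a value from the runbook.

db_is_busy() {
    local threshold="${THRESHOLD:-5 minutes}"
    local count

    # -A (unaligned) and -t (tuples only) make psql print just the number.
    count="$(psql -At -c "
        SELECT count(*)
        FROM pg_stat_activity
        WHERE state != 'idle'
          AND pid <> pg_backend_pid()
          AND now() - query_start > interval '${threshold}';")" || return 2

    # Exit 0 (busy) when at least one long-running query exists.
    [ "$count" -gt 0 ]
}

# Example use:
#   if db_is_busy; then echo "Go to Step 10 (Reschedule)"; else echo "Continue to Step 3"; fi
```

An exit status of 2 means the query itself failed, which is worth treating separately from "busy" and "not busy".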

Step 3: Start Maintenance Mode
- Update status page
- Notify team in Slack

Step 4: Run Vacuum
- Execute: VACUUM ANALYZE;
- Monitor progress
- Note completion time

Step 5: Check Index Bloat
- Run bloat check query
- Identify indexes over 50% bloated

Step 6: Need Reindex?
- If indexes bloated - Continue to Step 7
- If not - Skip to Step 8

Step 7: Reindex Tables
- For each bloated index, run: REINDEX INDEX [index_name];
- Track completion

Step 8: Verify Performance
- Run sample queries
- Check execution times
- Compare to baseline

Step 9: Exit Maintenance
- Update status page
- Notify team that maintenance is complete
- Document any issues

Step 10: Reschedule Procedure
- Use this step when the database is too busy for maintenance
- Schedule for off-hours
- Notify team of delay

Deployment Rollback Runbook

Quick rollback procedure when deployments go wrong.

Title: Emergency Deployment Rollback

Description: Roll back to the previous version when the current deployment has issues.

Steps:

Step 1: Confirm Rollback Needed
- Verify the issue is deployment-related
- Get approval if needed
- Note the problem for post-mortem

Step 2: Identify Previous Version
- Check deployment history
- Find last known good version
- Note version number

Step 3: Stop Current Version
- Disable traffic to affected servers
- Stop application services
- Wait for requests to complete

Step 4: Deploy Previous Version
- Run deployment script with old version
- Example: ./deploy.sh --version [previous-version]
- Monitor deployment progress

Step 5: Deployment Successful?
- Check deployment logs
- If success - Continue to Step 6
- If failed - Go to Step 10
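
Steps 4 and 5 might be wrapped like this. The sketch assumes that ./deploy.sh exits with a non-zero status on failure, which the runbook does not state, so check how your own script reports errors. rollback_to, DEPLOY_SCRIPT, and DEPLOY_LOG are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch of Steps 4-5. Assumes the deployment script exits non-zero on failure.
# rollback_to, DEPLOY_SCRIPT, and DEPLOY_LOG are hypothetical names.

rollback_to() {
    local version="${1:-}"
    local script="${DEPLOY_SCRIPT:-./deploy.sh}"
    local log="${DEPLOY_LOG:-rollback-${version}.log}"

    if [ -z "$version" ]; then
        echo "usage: rollback_to <previous-version>" >&2
        return 2
    fi

    # Keep the output so it can be attached to the incident ticket.
    if "$script" --version "$version" >"$log" 2>&1; then
        echo "deployed ${version}: continue to Step 6"
    else
        echo "deploy of ${version} failed (see ${log}): go to Step 10"
        return 1
    fi
}
```

Saving the deployment output to a log file also gives you the error details that Step 15 asks you to provide when escalating.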

Step 6: Start Services
- Start application services
- Enable traffic flow
- Monitor startup logs

Step 7: Verify Functionality
- Test critical endpoints
- Check error rates
- Monitor for 5 minutes
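
The 5-minute monitoring window in Step 7 can be approximated with a polling loop. This is only a sketch: watch_endpoint is a hypothetical name, the 15-second interval is an assumed value, and the health-check URL is whatever your application exposes. It checks HTTP status only, so error rates still need to be read from your monitoring tool.

```shell
#!/usr/bin/env bash
# Sketch of Step 7. watch_endpoint and the 15-second interval are assumptions.

watch_endpoint() {
    local url="$1"
    local duration="${2:-300}"   # Step 7: monitor for 5 minutes
    local interval="${3:-15}"    # assumed polling interval in seconds
    local elapsed=0 failures=0 checks=0

    if [ "$interval" -le 0 ]; then
        echo "interval must be positive" >&2
        return 2
    fi

    while [ "$elapsed" -lt "$duration" ]; do
        checks=$((checks + 1))
        # -f makes curl exit non-zero on HTTP 4xx/5xx responses.
        if ! curl -fsS -o /dev/null --max-time 10 "$url"; then
            failures=$((failures + 1))
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done

    echo "${failures} of ${checks} checks failed"
    [ "$failures" -eq 0 ]
}
```

A zero exit status corresponds to "yes" in Step 8, and any failure corresponds to "no".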

Step 8: Is Everything Stable?
- If yes - Continue to Step 9
- If no - Go to Step 15 (escalate)

Step 9: Document and Notify
- Update incident ticket
- Notify team of rollback
- Schedule post-mortem

Step 10: Deployment Failed
- Check disk space
- Verify permissions
- Try manual deployment
- If still failing - Go to Step 15

Step 15: Escalate to Senior Staff
- Page on-call senior engineer
- Provide all error details
- Stand by to assist

SSL Certificate Renewal

Don’t let certificates expire!

Title: SSL Certificate Renewal

Description: Process for renewing SSL certificates before expiration.

Steps:

Step 1: Check Expiration Dates
- Run: openssl x509 -enddate -noout -in /path/to/cert.pem
- Note expiration date
- Verify this is the correct cert
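
To turn the expiration check into a yes-or-no answer, one option is the -checkend option of openssl x509, which tests whether a certificate will still be valid a given number of seconds from now. The sketch below uses a hypothetical function name, cert_expires_within, and an assumed 30-day renewal window.

```shell
#!/usr/bin/env bash
# Sketch for Step 1. cert_expires_within and the 30-day default are assumptions.

cert_expires_within() {
    local cert="$1"
    local days="${2:-30}"
    local seconds=$((days * 86400))

    if [ ! -r "$cert" ]; then
        echo "cannot read certificate: ${cert}" >&2
        return 2
    fi

    # -checkend exits 0 if the certificate is still valid after $seconds,
    # and 1 if it will have expired by then.
    if openssl x509 -checkend "$seconds" -noout -in "$cert" >/dev/null; then
        echo "ok: valid for more than ${days} days"
        return 1
    else
        echo "renew: expires within ${days} days"
        return 0
    fi
}
```

For example, cert_expires_within /path/to/cert.pem 30 exits 0 when renewal is due, which makes it easy to call from a scheduled job.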

Step 2: Generate CSR
- Create new private key if needed
- Generate CSR: openssl req -new -key private.key -out renewal.csr
- Verify CSR details

Step 3: Submit to Certificate Authority
- Log into CA portal
- Submit CSR
- Select validation method

Step 4: Complete Validation
- Follow CA's validation process
- This varies by provider
- Wait for approval email

Step 5: Download New Certificate
- Download from CA portal
- Save certificate files
- Verify certificate details

Step 6: Install Certificate
- Back up the current certificate
- Copy new cert to server
- Update configuration files

Step 7: Restart Services
- Restart web server
- Check service status
- Monitor error logs

Step 8: Verify Installation
- Test with: openssl s_client -connect domain.com:443
- Check browser shows correct cert
- Verify no security warnings
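
The s_client check in Step 8 prints a lot of handshake detail. A possible way to reduce it to the fields that matter is to pipe the served certificate into openssl x509. In this sketch, remote_cert_info is a hypothetical name, and -servername is added so that servers hosting several sites return the certificate for the requested host.

```shell
#!/usr/bin/env bash
# Sketch for Step 8. remote_cert_info is a hypothetical helper name.

remote_cert_info() {
    local host="$1"
    local port="${2:-443}"

    # </dev/null closes stdin so s_client exits once the handshake is done.
    openssl s_client -connect "${host}:${port}" -servername "$host" \
        </dev/null 2>/dev/null \
        | openssl x509 -noout -subject -issuer -enddate
}
```

Running remote_cert_info domain.com should show the new expiration date on the notAfter line. If it still shows the old date, the web server is probably serving the previous certificate.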

Step 9: Cleanup
- Remove old certificate files
- Update documentation
- Set renewal reminder

Using These Templates

Customization Steps

  1. Replace placeholders

    • [service-name] with your actual service name
    • [server] with your server name
    • Update commands for your environment
  2. Add specific details

    • Your actual commands
    • Your server addresses
    • Your notification channels
  3. Adjust for your needs

    • Add or remove steps
    • Change decision points
    • Include your tools
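
If you keep the templates as text files, the placeholder replacement can be done with sed. This is a minimal sketch: fill_template is a hypothetical name, it reads the template on standard input, and it assumes the replacement values contain no / or & characters, which sed would treat specially.

```shell
#!/usr/bin/env bash
# Sketch: fill the bracketed placeholders in a runbook template read from stdin.
# fill_template is a hypothetical helper; values must not contain / or &.

fill_template() {
    local service="$1"
    local server="$2"

    # \[ matches a literal opening bracket; ] is literal outside a bracket expression.
    sed -e "s/\[service-name]/${service}/g" \
        -e "s/\[service]/${service}/g" \
        -e "s/\[server]/${server}/g"
}

# Example use:
#   fill_template nginx web-01 < restart-template.txt > restart-nginx.txt
```

Review the generated file before use, since commands may still need changes for your environment.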

Best Practices

  • Test templates in non-production first
  • Keep templates updated
  • Share successful runbooks with team
  • Build a library over time

Creating Your Own

  1. Start with a real incident or task
  2. Document what you did
  3. Add decision points where needed
  4. Test with a colleague
  5. Refine based on feedback
