Runbook Examples
Use these examples as starting points for creating your own runbooks. Adapt them to match your specific environment and needs.
Service Restart Runbook
A basic template for restarting services safely.
Title: Generic Service Restart
Description: Standard procedure for restarting a service with proper checks.
Steps:
Step 1: Check Current Status
- SSH to server
- Run: systemctl status [service-name]
- Note the current state
Step 2: Notify Team
- Post in Slack: "Restarting [service] on [server]"
- Wait for acknowledgment if needed
Step 3: Stop Service
- Run: sudo systemctl stop [service-name]
- Verify stopped: systemctl status [service-name]
Step 4: Wait for Connections to Clear
- Wait 30 seconds for existing connections to close
- Check for remaining processes: pgrep -fa [service-name]
Step 5: Start Service
- Run: sudo systemctl start [service-name]
- Check status: systemctl status [service-name]
Step 6: Is Service Running?
- Check the status output
- If "active (running)" - Continue to Step 7
- If failed - Go to Step 10
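Steps 3 through 6 can be sketched as a single shell function. This is a minimal illustration, not a drop-in script: the function name restart_service and the SETTLE_SECONDS variable are assumptions made for the example, and it relies on systemctl is-active, which exits with status 0 only when the unit is active.

```shell
# Sketch of Steps 3-6: stop, wait, start, then branch on the result.
# SETTLE_SECONDS is a hypothetical knob; 30 matches the wait in Step 4.
SETTLE_SECONDS="${SETTLE_SECONDS:-30}"

restart_service() {
    svc="$1"

    # Step 3: stop the unit and confirm it is no longer active.
    sudo systemctl stop "$svc" || return 1
    if systemctl is-active --quiet "$svc"; then
        echo "$svc is still active after stop" >&2
        return 1
    fi

    # Step 4: give existing connections time to close.
    sleep "$SETTLE_SECONDS"

    # Step 5: start the unit again.
    sudo systemctl start "$svc"

    # Step 6: decide which step comes next.
    if systemctl is-active --quiet "$svc"; then
        echo "running: continue to Step 7"
    else
        echo "failed: go to Step 10"
        return 1
    fi
}
```

Calling restart_service [service-name] prints which step to follow next and returns a non-zero status when the service did not come back.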
Step 7: Verify Functionality
- Test main endpoint
- Check logs for errors
- Confirm working properly
Step 8: Notify Complete
- Update Slack: "Service restart complete"
- Close any related tickets
Step 10: Troubleshooting Failed Start
- Check logs: journalctl -u [service-name] -n 50
- Look for error messages
- Try starting in debug mode
- If still failing, escalate to senior engineer
Database Maintenance Runbook
For routine database maintenance tasks.
Title: PostgreSQL Maintenance
Description: Regular maintenance for PostgreSQL databases including vacuum and reindex.
Steps:
Step 1: Check Database Metrics
- Connect to database
- Check table sizes: SELECT table_schema, table_name, pg_size_pretty(pg_total_relation_size(format('%I.%I', table_schema, table_name)::regclass)) FROM information_schema.tables WHERE table_type = 'BASE TABLE'
- Note any large tables
Step 2: Check Current Activity
- Run: SELECT * FROM pg_stat_activity WHERE state != 'idle'
- Ensure no long-running queries
- If busy - Go to Step 10 (Reschedule)
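The "is the database busy?" decision in Step 2 can be made less subjective by counting long-running queries. The sketch below assumes psql is installed and that connection settings come from the standard PG* environment variables; the function names and the MAX_QUERY_AGE threshold are hypothetical and should be tuned to your workload.

```shell
# Count non-idle sessions whose current query has run longer than a threshold.
# MAX_QUERY_AGE is a hypothetical threshold; connection settings are taken
# from the standard PG* environment variables (PGHOST, PGDATABASE, ...).
MAX_QUERY_AGE="${MAX_QUERY_AGE:-5 minutes}"

count_long_running() {
    # -X skips psqlrc, -A -t print the bare value with no table decoration.
    psql -X -A -t -v ON_ERROR_STOP=1 -c "
        SELECT count(*)
        FROM pg_stat_activity
        WHERE state <> 'idle'
          AND pid <> pg_backend_pid()
          AND now() - query_start > interval '$MAX_QUERY_AGE';"
}

check_activity() {
    # Status 2 means the check itself failed (for example, no connection).
    busy="$(count_long_running)" || return 2
    if [ "$busy" -gt 0 ]; then
        echo "busy ($busy long-running): go to Step 10 (Reschedule)"
        return 1
    fi
    echo "quiet: continue to Step 3"
}
```

Running check_activity before maintenance gives a clear go or no-go message that can be pasted into the team channel.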
Step 3: Start Maintenance Mode
- Update status page
- Notify team in Slack
Step 4: Run Vacuum
- Execute: VACUUM ANALYZE;
- Monitor progress
- Note completion time
Step 5: Check Index Bloat
- Run bloat check query
- Identify indexes over 50% bloated
Step 6: Need Reindex?
- If indexes bloated - Continue to Step 7
- If not - Skip to Step 8
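Step 7: Reindex Bloated Indexes
- For each bloated index:
- Run: REINDEX INDEX [index_name];
- On PostgreSQL 12 or later, REINDEX INDEX CONCURRENTLY [index_name]; avoids blocking writes during the rebuild
- Track completion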
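When several indexes need rebuilding, the per-index loop can be scripted by generating one REINDEX INDEX statement per name. This is a sketch under assumptions: reindex_statements is a hypothetical helper, and it expects unqualified index names (no schema prefix), one per line, that are visible on the search path.

```shell
# Read index names (one per line) on stdin and print one REINDEX statement each.
reindex_statements() {
    while IFS= read -r idx; do
        # Skip blank lines.
        [ -n "$idx" ] || continue
        # Double-quote the identifier, doubling any embedded quotes.
        quoted="$(printf '%s' "$idx" | sed 's/"/""/g')"
        printf 'REINDEX INDEX "%s";\n' "$quoted"
    done
}
```

Review the generated statements before running them, for example by writing them to a file and then feeding that file to psql with -v ON_ERROR_STOP=1 so the run stops at the first failure.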
Step 8: Verify Performance
- Run sample queries
- Check execution times
- Compare to baseline
Step 9: Exit Maintenance
- Update status page
- Notify team that maintenance is complete
- Document any issues
Step 10: Reschedule Procedure
- Database too busy
- Schedule for off-hours
- Notify team of delay
Deployment Rollback Runbook
Quick rollback procedure when deployments go wrong.
Title: Emergency Deployment Rollback
Description: Rollback to previous version when current deployment has issues.
Steps:
Step 1: Confirm Rollback Needed
- Verify the issue is deployment-related
- Get approval if needed
- Note the problem for post-mortem
Step 2: Identify Previous Version
- Check deployment history
- Find last known good version
- Note version number
Step 3: Stop Current Version
- Disable traffic to affected servers
- Stop application services
- Wait for requests to complete
Step 4: Deploy Previous Version
- Run deployment script with old version
- Example: ./deploy.sh --version [previous-version]
- Monitor deployment progress
Step 5: Deployment Successful?
- Check deployment logs
- If success - Continue to Step 6
- If failed - Go to Step 10
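Steps 4 and 5 can be combined so that the decision is driven by the deployment script's exit status. The sketch assumes that ./deploy.sh (the example script from Step 4) exits non-zero on failure; rollback_to, DEPLOY_SCRIPT, and DEPLOY_LOG are hypothetical names introduced for illustration.

```shell
# Run the rollback deployment and branch on its exit status.
# DEPLOY_SCRIPT and DEPLOY_LOG are illustrative overrides; ./deploy.sh and
# its --version flag are the example from Step 4.
rollback_to() {
    version="$1"
    log="${DEPLOY_LOG:-rollback.log}"

    # Capture all output so the log can be attached to the incident ticket.
    if "${DEPLOY_SCRIPT:-./deploy.sh}" --version "$version" >"$log" 2>&1; then
        echo "success: continue to Step 6"
    else
        echo "failed (see $log): go to Step 10"
        return 1
    fi
}
```

If your deployment tool reports failure some other way (for example, only in its log output), replace the exit-status check with a check that matches it.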
Step 6: Start Services
- Start application services
- Enable traffic flow
- Monitor startup logs
Step 7: Verify Functionality
- Test critical endpoints
- Check error rates
- Monitor for 5 minutes
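The five-minute monitoring window in Step 7 can be approximated with a polling loop. This is a rough sketch that assumes curl is available and that the endpoint returns an HTTP error status when unhealthy; monitor_endpoint and its defaults are hypothetical, and real error-rate monitoring belongs in your metrics system.

```shell
# Poll an endpoint for a fixed window and count failed checks.
monitor_endpoint() {
    url="$1"
    checks="${2:-30}"      # 30 checks ...
    interval="${3:-10}"    # ... 10 seconds apart is roughly 5 minutes
    failures=0
    i=0
    while [ "$i" -lt "$checks" ]; do
        # -f makes curl return non-zero on HTTP 4xx/5xx responses.
        curl -fsS --max-time 5 -o /dev/null "$url" || failures=$((failures + 1))
        i=$((i + 1))
        sleep "$interval"
    done
    echo "$failures of $checks checks failed"
    # Succeed only if every check passed.
    [ "$failures" -eq 0 ]
}
```

A non-zero return status maps to the "If no" branch of Step 8.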
Step 8: Is Everything Stable?
- If yes - Continue to Step 9
- If no - Go to Step 15 (escalate)
Step 9: Document and Notify
- Update incident ticket
- Notify team of rollback
- Schedule post-mortem
Step 10: Deployment Failed
- Check disk space
- Verify permissions
- Try manual deployment
- If still failing - Go to Step 15
Step 15: Escalate to Senior Staff
- Page on-call senior engineer
- Provide all error details
- Stand by to assist
SSL Certificate Renewal
Don’t let certificates expire!
Title: SSL Certificate Renewal
Description: Process for renewing SSL certificates before expiration.
Steps:
Step 1: Check Expiration Dates
- Run: openssl x509 -enddate -noout -in /path/to/cert.pem
- Note expiration date
- Verify this is the correct cert
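Reading the date by eye is easy to get wrong, so the expiration check can also be automated with the -checkend option of openssl x509, which exits non-zero if the certificate expires within the given number of seconds. The function name and the 30-day default below are assumptions for illustration.

```shell
# Return 0 if the certificate is still valid the given number of days from now.
cert_valid_for_days() {
    cert="$1"
    days="${2:-30}"
    # -checkend takes seconds; 86400 seconds per day.
    openssl x509 -checkend "$((days * 86400))" -noout -in "$cert" >/dev/null
}
```

For example, cert_valid_for_days /path/to/cert.pem 30 || echo "renew soon" could run from a scheduled job.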
Step 2: Generate CSR
- Create new private key if needed
- Generate CSR: openssl req -new -key private.key -out renewal.csr
- Verify CSR details
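Step 2 can be sketched as one function that creates the key only when it is missing, generates the CSR, and then prints the subject for review. The make_csr name, the 2048-bit RSA key, and the example subject are assumptions; your CA or security policy may require a different key type or additional fields.

```shell
# Create a key if needed, generate a CSR, and print its subject for review.
make_csr() {
    key="$1"
    csr="$2"
    subject="$3"   # for example "/CN=example.com"

    # Create a new private key only if one does not already exist.
    [ -f "$key" ] || openssl genrsa -out "$key" 2048 || return 1

    # Generate the CSR non-interactively with the given subject.
    openssl req -new -key "$key" -out "$csr" -subj "$subject" || return 1

    # Verify CSR details: check the self-signature and show the subject.
    openssl req -in "$csr" -noout -verify -subject
}
```

Example call: make_csr private.key renewal.csr "/CN=example.com".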
Step 3: Submit to Certificate Authority
- Log into CA portal
- Submit CSR
- Select validation method
Step 4: Complete Validation
- Follow CA's validation process
- This varies by provider
- Wait for approval email
Step 5: Download New Certificate
- Download from CA portal
- Save certificate files
- Verify certificate details
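One detail worth verifying before installation is that the new certificate matches the private key used for the CSR. A common approach, sketched here with a hypothetical cert_matches_key helper, is to compare the public key embedded in the certificate with the one derived from the private key.

```shell
# Return 0 if the certificate and private key share the same public key.
cert_matches_key() {
    # Public key embedded in the certificate (PEM).
    cert_pub="$(openssl x509 -in "$1" -noout -pubkey)" || return 2
    # Public key derived from the private key (PEM).
    key_pub="$(openssl pkey -in "$2" -pubout)" || return 2
    [ "$cert_pub" = "$key_pub" ]
}
```

A mismatch usually means the certificate was issued for a different CSR, and the web server would refuse to start with that pair.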
Step 6: Install Certificate
- Back up current certificate
- Copy new cert to server
- Update configuration files
Step 7: Restart Services
- Restart web server
- Check service status
- Monitor error logs
Step 8: Verify Installation
- Test with: openssl s_client -connect domain.com:443 -servername domain.com
- Check browser shows correct cert
- Verify no security warnings
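To confirm that the server is actually presenting the renewed certificate, the s_client output can be piped into openssl x509 to print the expiration date of the served certificate. The helper name served_cert_enddate is an assumption; the output is a notAfter= line that should show the new date.

```shell
# Print the expiration date of the certificate a server presents.
served_cert_enddate() {
    host="$1"
    port="${2:-443}"
    # Empty stdin makes s_client exit after the handshake; -servername sets SNI.
    openssl s_client -connect "$host:$port" -servername "$host" </dev/null 2>/dev/null \
        | openssl x509 -noout -enddate
}
```

If the date shown is still the old one, the web server has most likely not reloaded its configuration yet.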
Step 9: Cleanup
- Remove old certificate files
- Update documentation
- Set renewal reminder
Using These Templates
Customization Steps
Replace placeholders
- [service-name] with your actual service
- [server] with your server names
- Update commands for your environment
Add specific details
- Your actual commands
- Your server addresses
- Your notification channels
Adjust for your needs
- Add or remove steps
- Change decision points
- Include your tools
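If you keep the templates as text files, the placeholder replacement can be scripted with sed. This is a small sketch: fill_placeholders is a hypothetical helper, it handles only the [service-name], [service], and [server] placeholders, and it assumes the replacement values contain no "/" or "&" characters, which sed would treat specially.

```shell
# Fill service and server placeholders in a runbook template read from stdin.
fill_placeholders() {
    service="$1"
    server="$2"
    sed -e "s/\[service-name]/$service/g" \
        -e "s/\[service]/$service/g" \
        -e "s/\[server]/$server/g"
}
```

For example, fill_placeholders nginx web-01 < restart-template.txt prints the template with both values filled in, ready for review.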
Best Practices
- Test templates in non-production first
- Keep templates updated
- Share successful runbooks with team
- Build a library over time
Creating Your Own
- Start with a real incident or task
- Document what you did
- Add decision points where needed
- Test with a colleague
- Refine based on feedback