The Problem with Starting from Scratch
It’s 3 AM. An alert fires for the payment service. Your on-call engineer acknowledges the page and opens their terminal. They know the service needs investigation, but where do they start? Check the database? Review recent deployments? Examine error logs? Restart the service?
About Runbook Structure: This guide teaches runbook structure best practices that teams implement in their own documentation systems (Git repositories, wikis, or documentation tools). These are team-maintained organizational patterns, not built-in platform features. Runbook management platforms like Upstat focus on execution tracking and decision branching—teams maintain their own structure standards for how procedures are written. Whether you copy template files in Git or follow consistent formats when creating runbooks, these principles ensure comprehensive coverage without constraining scenario-specific adaptation.
Without a runbook, the engineer improvises. They might solve the problem quickly—or they might waste an hour exploring dead ends. Even worse, the next engineer who faces this same issue will improvise differently, rediscovering the same troubleshooting path from scratch.
This is the cost of missing runbooks: inconsistent response, wasted time, and knowledge that stays locked in individual minds instead of captured in shared documentation.
Consistent runbook structure solves this by ensuring comprehensive coverage. Instead of creating procedures from scratch every time, teams follow proven frameworks that include all essential components. Structure doesn’t constrain creativity—it eliminates the need to reinvent basic organization, letting engineers focus on scenario-specific details.
Why Structure Matters for Operational Consistency
Consistent structure creates standardization without rigidity. It ensures every runbook contains the essentials while allowing customization for specific scenarios.
Speed: Starting with consistent structure is faster than blank-page documentation. Whether copying a template file or following a standard format, the framework exists—you fill in scenario-specific details.
Completeness: Structure prompts you to document everything responders need. Without it, teams forget critical sections like rollback procedures or escalation criteria.
Consistency: When all runbooks follow the same structure, engineers know exactly where to find specific information during incidents. Consistent formatting reduces cognitive load during high-pressure situations.
Quality: Well-designed templates incorporate best practices automatically. New team members create better runbooks because the template guides them toward comprehensive documentation.
Maintenance: Standardized structure makes updates easier. When every runbook follows the same format, improving one section’s approach can cascade across all procedures.
The goal isn’t bureaucratic uniformity. The goal is making runbooks immediately useful to anyone who needs them—especially during incidents when clarity matters most.
Essential Template Components
Effective runbook templates include these core sections. Not every runbook needs every section, but these form the foundation from which you customize.
Title and Metadata
Start with clear identification:
- Title: Descriptive name explaining what this runbook addresses
- Owner: Team or individual responsible for maintaining accuracy
- Last Updated: When content was last modified
- Last Validated: When procedure was last tested or confirmed to work
These metadata fields help teams track runbook health. If a runbook hasn’t been validated in six months, it needs review before the next incident relies on it.
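As a minimal sketch of how a team might track this, assuming runbooks live as markdown files under a hypothetical docs/runbooks/ directory with a "Last Validated: YYYY-MM-DD" line, a small script can flag anything not validated in roughly six months:
```bash
# Assumed layout: markdown runbooks in docs/runbooks/ with a "Last Validated: YYYY-MM-DD" line.
cutoff=$(date -d '180 days ago' +%F)   # GNU date; on macOS use: date -v-180d +%F

for file in docs/runbooks/*.md; do
  # Pull the first date that appears on the "Last Validated" line, if any
  validated=$(grep -m1 'Last Validated' "$file" | grep -o '[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}' || true)
  if [[ -n "$validated" && "$validated" < "$cutoff" ]]; then
    echo "NEEDS REVIEW: $file (last validated $validated)"
  fi
done
```
Run something like this from CI or a scheduled job and stale runbooks surface before an incident does.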
Purpose and Scope
Explain what this runbook addresses and when to use it. Clear scope prevents confusion about applicability.
Example:
Purpose: Diagnose and resolve high API latency affecting the payment service.
Scope: Use this runbook when payment API P95 latency exceeds 2000ms for over 5 minutes. Does not cover database failover scenarios—see Database Failover Runbook for those procedures.
Explicit scope statements help engineers select the right runbook quickly.
Prerequisites
Document what must be true before executing this runbook:
- Required access permissions
- Tools or credentials needed
- System state expectations
- Prior procedures that should complete first
Example:
Prerequisites:
- SSH access to production payment servers
- Datadog account with payment-service dashboard access
- On-call rotation membership (required for production changes)
- Payment database credentials (stored in 1Password vault)
Clear prerequisites prevent engineers from starting procedures they can’t complete.
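Prerequisites can also be made checkable. Here is a minimal pre-flight sketch for the example above; the hostnames and variable names are placeholders for your environment, not required conventions:
```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check mirroring the prerequisites above; adjust names for your environment.
set -euo pipefail

# Can we reach a production payment server over SSH without prompting?
ssh -o BatchMode=yes -o ConnectTimeout=5 payments-prod-01 'true' \
  || { echo "Missing SSH access to payment servers"; exit 1; }

# Are the database credentials available in this shell?
[[ -n "${PAYMENT_DB_PASSWORD:-}" ]] \
  || { echo "PAYMENT_DB_PASSWORD not set; fetch it from the 1Password vault"; exit 1; }

echo "Prerequisites satisfied."
```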
Step-by-Step Instructions
This is the heart of the runbook. Each step should be specific, actionable, and verifiable.
Poor step: “Check if the database is slow”
Better step:
Step 3: Check Database Query Performance
Run this query to identify slow queries in the last 15 minutes:
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 1000
ORDER BY mean_exec_time DESC
LIMIT 10;
Expected result: No rows returned (healthy queries average under 500ms)
If execution times exceed 1000ms: Proceed to Step 7 (Database Performance Investigation)
If execution times are normal: Continue to Step 4
Good steps include:
- Clear action to take
- Expected outcome
- Conditional logic for different results
- What to do next based on findings
Decision Points and Branching Logic
Diagnostic runbooks especially need branching logic based on what investigation reveals.
Example decision tree:
Step 5: Determine Root Cause
Based on previous investigation:
If CPU usage over 80%: Proceed to Section A (Scale Capacity)
If error logs show authentication failures: Proceed to Section B (Auth Service Recovery)
If database queries are slow: Proceed to Section C (Database Investigation)
If external API calls are timing out: Proceed to Section D (Third-Party Dependencies)
Decision points guide engineers through diagnostic workflows without prescribing a single linear path.
Verification Steps
After executing procedures, how do you confirm success?
Example:
Verification:
- Check Datadog dashboard: Payment API P95 latency should be under 500ms
- Review error rate: Should be under 0.1%
- Confirm customer reports: No recent payment failure complaints
- Monitor for 15 minutes: Metrics should remain stable
Explicit verification prevents declaring success prematurely.
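If you want the monitoring window to be explicit rather than left to memory, a small polling loop works; the endpoint URL, status code, and 15-minute window below are placeholders, not part of the example runbook above:
```bash
# Poll the service health endpoint once a minute for 15 minutes; stop early on failure.
for i in $(seq 1 15); do
  code=$(curl -s -o /dev/null -w '%{http_code}' https://payments.example.com/healthz)
  echo "check $i/15: HTTP $code"
  [[ "$code" == "200" ]] || { echo "Health check failed; do not declare success yet."; exit 1; }
  sleep 60
done
echo "Stable for 15 minutes."
```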
Rollback Procedures
If the fix makes things worse, how do you undo it?
Example:
Rollback: If latency increases after scaling:
- Scale payment service back to 3 replicas:
kubectl scale deployment payment-api --replicas=3
- Verify rollback: Check replica count matches target
- Monitor impact: Latency should return to pre-scaling levels within 2 minutes
Rollback procedures provide safety nets for risky changes.
Escalation Criteria
When should you stop troubleshooting and get help?
Example:
Escalation: Escalate to database team if:
- Investigation exceeds 30 minutes without identifying root cause
- Database query performance degradation confirmed
- Multiple services affected simultaneously
Contact: Page #database-oncall via PagerDuty
Clear escalation criteria prevent engineers from struggling alone when they need specialized expertise.
Template Examples by Runbook Type
Different scenarios require different template structures. Here are practical examples for common runbook types.
Diagnostic Runbook Template
Use for investigating problems where root cause is unknown.
Title: [Service/Component] Performance Degradation
Purpose: Diagnose performance issues in [service name]
Symptoms:
- [Alert that fires]
- [Observable behavior]
- [User-facing impact]
Investigation Steps:
1. Check recent deployments
2. Review error logs
3. Examine resource utilization
4. Analyze database query performance
5. Verify external dependency health
For each step:
- Command to run
- Expected vs. concerning results
- Next action based on findings
Root Cause Determination:
[Decision tree mapping symptoms to causes]
Resolution:
[Link to appropriate recovery runbook]
Recovery Runbook Template
Use for fixing known problems with established solutions.
Title: [Service Name] Service Restart
Purpose: Safely restart [service] to recover from [specific issue]
When to Use:
- [Triggering conditions]
- [Symptom descriptions]
Impact:
- [Expected downtime]
- [Affected functionality]
- [User experience during recovery]
Prerequisites:
- [Required access]
- [Necessary tools]
Recovery Procedure:
1. [Pre-restart verification]
2. [Backup/safety steps]
3. [Restart command with exact syntax]
4. [Post-restart verification]
5. [Monitoring period]
Verification:
- [Health check procedures]
- [Success criteria]
If Restart Fails:
- [Alternative approach]
- [Escalation path]
Maintenance Runbook Template
Use for planned operational procedures.
Title: [System/Process] Maintenance Procedure
Purpose: [What maintenance accomplishes]
Schedule: [When this runs]
Duration: [Expected time]
Prerequisites:
- [Required notifications]
- [Backup requirements]
- [Access needed]
Maintenance Steps:
1. [Pre-maintenance verification]
2. [User communication]
3. [Backup procedures]
4. [Maintenance actions]
5. [Post-maintenance validation]
6. [Service restoration]
Rollback Plan:
[How to undo if problems occur]
Post-Maintenance:
- [Monitoring requirements]
- [Follow-up communication]
Real-World Example: Complete Diagnostic Runbook
Here’s a full example showing how template components work together:
Title: Payment API High Latency Diagnostic
Owner: Platform Team
Last Updated: 2025-09-15
Last Validated: 2025-09-22
Purpose: Diagnose and identify root cause of elevated payment API latency
Scope: Use when payment API P95 latency exceeds 2000ms for over 5 minutes. Covers diagnostic investigation only—recovery procedures are in separate runbooks.
Prerequisites:
- Datadog access with payment-service dashboard permissions
- Kubernetes cluster access (view and describe permissions)
- Database read-only access
Step 1: Verify Alert Accuracy
Check Datadog payment-service dashboard to confirm latency metrics.
Expected: P95 latency over 2000ms, P50 over 800ms
If latency is normal: False alarm, investigate alert configuration
If latency is elevated: Continue to Step 2
Step 2: Check Recent Deployments
Review deployment history for the last 2 hours:
kubectl rollout history deployment/payment-api
If a deployment occurred within the issue timeframe: Likely deployment-related, proceed to the Deployment Investigation runbook
If no recent deployments: Continue to Step 3
Step 3: Examine Error Logs
Search for errors in the last 15 minutes:
kubectl logs -l app=payment-api --tail=1000 | grep ERROR
If timeout errors appear frequently: External dependency issue, proceed to Step 6
If authentication errors appear: Auth service problem, escalate to #auth-team
If no significant errors: Continue to Step 4
Step 4: Check Resource Utilization
View pod CPU and memory:
kubectl top pods -l app=payment-api
If CPU over 80%: Capacity issue, proceed to the Scaling Investigation runbook
If memory over 90%: Possible memory leak, proceed to the Memory Investigation runbook
If resources normal: Continue to Step 5
Step 5: Analyze Database Performance
Connect to the payment database and run query performance check:
kubectl exec -it deployment/payment-api -- psql $DATABASE_URL
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 1000
ORDER BY mean_exec_time DESC
LIMIT 10;
If queries over 1000ms: Database performance issue, escalate to #database-oncall
If queries normal: Continue to Step 6
Step 6: Verify External Dependencies
Check status pages for:
- Stripe API: https://status.stripe.com
- SendGrid API: https://status.sendgrid.com
- Auth0: https://status.auth0.com
If external service degraded: External dependency problem. Document affected service and expected recovery time. Consider enabling degraded mode if available.
If all external services healthy: Continue to Step 7
Step 7: Investigation Complete
If all previous steps show normal operation, latency cause is unclear.
Next Actions:
- Take thread dump for analysis:
POD=$(kubectl get pods -l app=payment-api -o jsonpath='{.items[0].metadata.name}')
kubectl exec $POD -- jstack 1 > threaddump.txt
- Escalate to #platform-oncall with investigation findings
- Document symptoms and investigation results in incident timeline
Escalation Criteria:
- Investigation exceeds 30 minutes without identifying root cause
- Multiple potential causes identified requiring specialized expertise
- Customer impact continues increasing
Contact: Page #platform-oncall via PagerDuty
This example demonstrates how templates provide structure while remaining flexible for different investigation paths.
Copy-Paste Ready Runbook: Database Performance Degradation
Here’s a complete, production-ready runbook you can adapt for your PostgreSQL database.
Title: PostgreSQL Performance Degradation Recovery
Owner: Database Team
Last Updated: 2025-10-01
Last Validated: 2025-10-28
Purpose: Diagnose and resolve PostgreSQL database performance issues affecting application response times
Scope: Use when database query latency exceeds 500ms average or connection pool exhaustion occurs. Covers PostgreSQL 12+.
Prerequisites:
- PostgreSQL database admin access
- Kubernetes cluster access if using containerized database
- Datadog or Grafana access for database metrics
- PgAdmin or psql CLI tool installed
Step 1: Verify Database Health
Check current database connections and activity:
SELECT count(*) as total_connections,
sum(CASE WHEN state = 'active' THEN 1 ELSE 0 END) as active,
sum(CASE WHEN state = 'idle' THEN 1 ELSE 0 END) as idle,
sum(CASE WHEN state = 'idle in transaction' THEN 1 ELSE 0 END) as idle_in_transaction
FROM pg_stat_activity
WHERE datname = current_database();
Expected: Active connections under 50, idle in transaction under 5
If idle in transaction over 10: Connection leak detected, proceed to Step 6
If active over 80: Connection pool exhaustion, proceed to Step 7
If values normal: Continue to Step 2
Step 2: Identify Slow Queries
Find queries with high execution time:
SELECT substring(query, 1, 100) AS short_query,
round(mean_exec_time::numeric, 2) AS avg_ms,
calls,
round(total_exec_time::numeric, 2) AS total_ms
FROM pg_stat_statements
WHERE mean_exec_time > 500
ORDER BY mean_exec_time DESC
LIMIT 10;
If no slow queries found: Performance issue may be infrastructure-related, proceed to Step 8
If slow queries identified: Document the queries and continue to Step 3
Step 3: Check for Missing Indexes
Identify tables with sequential scans that should have indexes:
SELECT schemaname, tablename,
seq_scan, seq_tup_read,
idx_scan, idx_tup_fetch,
seq_tup_read / seq_scan as avg_seq_read
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 10;
If avg_seq_read over 10000: Missing index on frequently scanned table
Action: Document tables needing indexes. Coordinate with dev team before creating indexes in production.
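When the team agrees an index is needed, one option is a concurrent build so writes are not blocked; the index, table, and column names below are placeholders, not a prescription:
```bash
# CREATE INDEX CONCURRENTLY builds the index without locking out writes,
# but cannot run inside a transaction block, so issue it as a single statement.
psql "$DATABASE_URL" -c \
  "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_payments_created_at ON payments (created_at);"
```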
Step 4: Check for Lock Contention
Find blocking queries:
SELECT blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
If blocking queries found: Lock contention detected
Action: Terminate blocking query if safe (coordinate with application team first):
SELECT pg_terminate_backend(blocking_pid);
Step 5: Check Cache Hit Ratio
Verify buffer cache effectiveness:
SELECT sum(heap_blks_read) as heap_read,
sum(heap_blks_hit) as heap_hit,
sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
FROM pg_statio_user_tables;
Expected: Cache hit ratio over 0.99 (99%)
If ratio under 0.95: Insufficient shared_buffers, requires configuration tuning
Action: Schedule maintenance window to increase shared_buffers parameter
Step 6: Clean Up Idle Transactions
If idle in transaction connections detected in Step 1:
SELECT pid, usename, application_name, state, query_start, state_change
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND state_change < now() - interval '5 minutes';
Action: Terminate long-running idle transactions:
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND state_change < now() - interval '5 minutes';
Verification: Re-run Step 1 query. Idle in transaction count should be under 5.
Step 7: Scale Connection Pool
If connection pool exhaustion detected:
For Kubernetes deployments:
kubectl scale deployment postgres-proxy --replicas=5
For application-side pooling, update connection pool settings:
- Increase max_connections in postgresql.conf
- Update application pool size (e.g., JDBC pool maxPoolSize)
- Restart application pods after configuration change
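As an illustrative sketch only (the values and property names are placeholders to size for your own workload), the server-side change can be applied with ALTER SYSTEM, which writes postgresql.auto.conf rather than requiring a hand-edit of postgresql.conf, with the application pool raised alongside it:
```bash
# Server side: raise the connection limit; this change requires a PostgreSQL restart to take effect.
psql "$DATABASE_URL" -c "ALTER SYSTEM SET max_connections = 200;"

# Application side (property name shown for a HikariCP/Spring Boot pool; adjust for your framework):
#   spring.datasource.hikari.maximum-pool-size: 40
```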
Verification: Monitor active connections for 5 minutes. Should stabilize under 80.
Step 8: Check Infrastructure Resources
If database performance degraded without query issues:
Check CPU and memory:
kubectl top pod -l app=postgres
Check disk I/O:
kubectl exec -it postgres-pod -- iostat -x 1 5
If CPU over 80%: CPU bottleneck, consider vertical scaling
If iowait over 20%: Disk I/O bottleneck, consider faster storage or read replicas
If memory pressure: Increase instance size or optimize queries
Verification Steps:
- Re-run Step 2 query. Mean execution time should be under 500ms.
- Check application latency metrics. P95 should be under 1000ms.
- Monitor for 15 minutes to ensure stability.
Rollback Procedures:
- If connection pool scaling caused issues:
kubectl scale deployment postgres-proxy --replicas=3
- If terminated queries caused application errors: Application should automatically retry
- If configuration changes caused instability: Restore previous postgresql.conf and restart database
Escalation Criteria:
- Performance degradation continues after all diagnostic steps
- Database CPU over 90% sustained for over 10 minutes
- Replication lag exceeds 60 seconds
- Customer-facing errors increasing
Contact: Page #database-oncall via PagerDuty or Slack
Post-Recovery Actions:
- Document root cause in incident post-mortem
- Update query optimization backlog if slow queries identified
- Create indexes identified in Step 3 during next maintenance window
- Review connection pool configuration if exhaustion occurred
Copy-Paste Ready Runbook: Kubernetes Deployment Rollback
Here’s a complete runbook for rolling back problematic deployments.
Title: Kubernetes Deployment Rollback Procedure
Owner: Platform Team
Last Updated: 2025-10-01
Last Validated: 2025-10-30
Purpose: Safely rollback a Kubernetes deployment to previous stable version
Scope: Use when recent deployment causes service degradation, elevated errors, or failed health checks. Covers standard Kubernetes deployments using rolling update strategy.
Prerequisites:
- Kubernetes cluster access with deployment permissions
- kubectl CLI configured
- Service metrics dashboard access (Datadog, Grafana, etc.)
- Incident tracking system access
Step 1: Confirm Deployment Issue
Verify the issue is deployment-related:
kubectl get deployment/service-name -o yaml | grep -A 5 "conditions:"
Check recent deployment events:
kubectl describe deployment/service-name | grep -A 10 "Events:"
If deployment status shows failure: Proceed with rollback
If deployment successful but metrics degraded: Verify timing correlation, then proceed
If issue started before deployment: Not deployment-related, investigate other causes
Step 2: Check Deployment History
View available rollback versions:
kubectl rollout history deployment/service-name
Identify the previous stable revision. Typically the revision before the current one.
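If you are unsure which revision to target, you can inspect a candidate revision's pod template before rolling back; the revision number here is only an example:
```bash
# Show the pod template (image, environment, etc.) recorded for a specific revision
kubectl rollout history deployment/service-name --revision=4
```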
Step 3: Notify Team
Post in incident channel before rollback:
Initiating rollback of service-name deployment
Current revision: X
Target revision: Y (previous stable version)
Reason: [elevated error rate / failed health checks / performance degradation]
ETA: 2-3 minutes for rollback completion
Step 4: Execute Rollback
Rollback to previous revision:
kubectl rollout undo deployment/service-name
Or rollback to a specific revision:
kubectl rollout undo deployment/service-name --to-revision=5
Step 5: Monitor Rollback Progress
Watch rollback status:
kubectl rollout status deployment/service-name
Expected output:
Waiting for deployment "service-name" rollout to finish: 2 out of 3 new replicas have been updated...
Waiting for deployment "service-name" rollout to finish: 2 of 3 updated replicas are available...
deployment "service-name" successfully rolled out If rollback stalls: Check pod events:
kubectl get pods -l app=service-name
kubectl describe pod <failing-pod-name>
Step 6: Verify Service Health
Check pod status:
kubectl get pods -l app=service-name
All pods should show “Running” status with “1/1” ready containers.
Check application logs:
kubectl logs -l app=service-name --tail=50
Verify no startup errors or exceptions.
Step 7: Validate Metrics Recovery
Check service metrics for 5 minutes:
- Error rate: Should return to baseline (under 1%)
- Response latency: P95 should be under normal thresholds
- Request rate: Should match expected traffic levels
- Health check success rate: Should be 100%
Verification Checklist:
- All pods running and ready
- Error rate returned to baseline
- Latency metrics recovered
- Health checks passing
- No exceptions in recent logs
- External monitoring confirms recovery
Step 8: Update Incident Timeline
Document rollback in incident tracker:
[Timestamp] Deployment rollback initiated
[Timestamp] Rollback completed, all pods running
[Timestamp] Metrics confirmed recovered to baseline
[Timestamp] Service restored to full functionality
Post-Rollback Actions:
- Identify root cause of failed deployment
- Fix issue in code or configuration
- Test thoroughly in staging environment
- Schedule new deployment with fixes
- Update deployment checklist if process gaps identified
If Rollback Fails:
Scenario A: Pods won’t start after rollback
kubectl describe pod <pod-name>
Check for ImagePullBackOff or CrashLoopBackOff. May indicate infrastructure issue rather than code issue.
Scenario B: Rollback completes but metrics don’t recover
Issue may not be deployment-related. Escalate to platform team and investigate:
- Database performance
- External dependencies
- Infrastructure capacity
- Network issues
Scenario C: Cannot rollback due to breaking database migration
kubectl scale deployment/service-name --replicas=0
Engage database team to rollback migrations first, then retry deployment rollback.
Escalation Criteria:
- Rollback fails to complete after 10 minutes
- Pods crash immediately after rollback
- Metrics don’t recover 10 minutes after rollback completion
- Database migration prevents safe rollback
Contact: Escalate to #platform-oncall via PagerDuty
Rollback Safety Notes:
- Always verify previous revision is stable before rollback
- Coordinate with database team if migrations were included
- Rollback window narrows after breaking database changes
- Consider blue-green or canary deployments to reduce rollback need
Common Template Mistakes
Mistake 1: Too Much Detail
Some templates try to explain every concept and edge case, creating 20-page documents nobody reads during incidents.
Symptom: Engineers skip runbooks because they’re too long
Fix: Link to detailed documentation instead of embedding it. Keep runbook steps concise and action-focused.
Mistake 2: Not Enough Detail
Other templates assume too much knowledge, leaving out critical information.
Symptom: Engineers can’t complete procedures because commands are vague or missing
Fix: Include copy-paste-ready commands with placeholders for variables. Specify exact file paths and tool locations.
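One lightweight convention, shown here as a hypothetical illustration rather than a required format, is to declare scenario-specific values once at the top of the runbook and reuse them in every command:
```bash
# Placeholders the responder fills in before running anything below
SERVICE_NAME="payment-api"   # the affected service
NAMESPACE="production"       # the target namespace

kubectl -n "$NAMESPACE" rollout status deployment/"$SERVICE_NAME"
kubectl -n "$NAMESPACE" logs -l app="$SERVICE_NAME" --tail=100
```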
Mistake 3: Rigid Structure
Templates that force every runbook into identical format regardless of scenario create unusable documentation.
Symptom: Teams work around the template instead of using it
Fix: Provide a recommended structure, not a mandatory form. Let teams omit irrelevant sections.
Mistake 4: No Ownership
Templates without owners become stale as systems evolve.
Symptom: Runbooks reference deprecated systems or obsolete procedures
Fix: Assign ownership per runbook. Template metadata should include owner and last-validated date.
Mistake 5: Static Forever
Teams create templates once and never refine them based on actual usage.
Symptom: Same runbook problems repeat across all procedures
Fix: Review templates quarterly. Update based on post-incident feedback and new best practices.
Making Structure Stick
Consistent structure only helps if teams actually follow it.
Make structure the default. When someone proposes creating a runbook, provide your standard format. Whether that’s a template file in Git, a starting format in your runbook tool, or a documented pattern to follow—make using the structure easier than starting blank.
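For Git-based runbooks, even a tiny scaffolding script lowers the barrier. A minimal sketch, assuming a team template at a hypothetical docs/runbooks/_template.md with YYYY-MM-DD placeholders in its metadata:
```bash
#!/usr/bin/env bash
set -euo pipefail

name="$1"                                # e.g. payment-api-high-latency
target="docs/runbooks/${name}.md"

cp docs/runbooks/_template.md "$target"
sed -i "s/YYYY-MM-DD/$(date +%F)/g" "$target"   # GNU sed; use `sed -i ''` on macOS
echo "Created $target: fill in purpose, scope, and steps."
```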
Demonstrate value immediately. Standard formats should save time from day one. If your first attempt is overly complex, nobody will adopt it.
Integrate with workflow. If runbooks are in a wiki, create page templates. If they’re in version control, provide template files. If you use a runbook management platform, establish standard sections teams always include.
Train once, reinforce constantly. Show new team members the standard format during onboarding. Reference it in runbook reviews.
Celebrate good examples. When someone creates an excellent runbook following your structure, share it as a reference. Recognition reinforces adoption.
Structure That Improves Through Usage
The best runbook structures aren’t static formats—they evolve based on how procedures perform in practice.
Traditional approaches treat runbooks as documentation to follow exactly. But real incident response rarely matches perfectly. Engineers skip steps that don’t apply, add missing steps, or modify procedures based on actual conditions.
This gap between planned procedure and actual execution contains valuable information. Which steps get skipped consistently? Which procedures need additional context? What do engineers add that your standard structure misses?
Execution Tracking Reveals What Works: Platforms like Upstat track runbook execution step-by-step during incidents. Engineers record which steps they followed, what they discovered at each stage, and how they adapted procedures for the specific situation. This execution history reveals which structures work well and which need refinement—regardless of whether you manage runbooks in Git, wikis, or specialized tools.
After incidents, teams review execution data to improve their standard formats. If engineers consistently skip a step, maybe it’s unnecessary. If they repeatedly add the same context, that should become part of your template. If certain investigations reveal nothing useful, simplify that section.
This creates a continuous improvement loop: structure guides initial response, execution tracking captures actual usage, and feedback drives refinement. Over time, your standards evolve to match how teams actually work instead of how someone imagined they would work.
Start Simple, Iterate Based on Reality
Don’t try to create the perfect format immediately. Start with a basic structure covering title, purpose, steps, and verification. Create runbooks following that structure. Gather feedback. Refine.
After teams use your standard format for a month, ask:
- What information is consistently missing?
- Which sections feel like busywork?
- What do engineers keep adding manually?
- Which structures work best for which scenarios?
Update your approach based on answers. Structure should serve the team, not constrain them.
The goal isn’t documentation perfection. The goal is procedures that work—runbooks that help engineers resolve incidents faster, maintain services reliably, and capture knowledge effectively.
Good structure makes that goal achievable by providing proven organization without prescribing rigid process. It standardizes what should be consistent while leaving room for scenario-specific adaptation.
Start with consistent structure. Use it. Learn from actual execution. Improve continuously. That cycle turns adequate runbooks into excellent operational procedures that teams trust and actually follow.
Ready-to-Use Markdown Templates
The examples above provide complete, production-ready runbooks you can adapt immediately. For teams starting from scratch, here are three foundational templates covering common scenarios. Copy these into your documentation system and customize for your environment.
Basic Runbook Template
# Runbook Title
**Owner**: [Team Name]
**Last Updated**: YYYY-MM-DD
**Last Validated**: YYYY-MM-DD
## Purpose
[Brief description of what this runbook addresses and when to use it]
## Scope
- Use when: [triggering conditions]
- Covers: [what scenarios this handles]
- Does not cover: [what requires different runbooks]
## Prerequisites
- [ ] [Required access or permissions]
- [ ] [Tools needed]
- [ ] [Credentials or configurations]
## Investigation Steps
### Step 1: [Initial Verification]
[Description of what to check first]
**Command**:
```bash
[command to run]
```
**Expected result**: [what normal looks like]
If [condition A]: [next action]
If [condition B]: [alternative action]
### Step 2: [Next Diagnostic Step]
[Repeat structure for each step]
## Verification
After completing recovery steps:
- [Health check 1]
- [Health check 2]
- [Monitoring check]
- [User-facing validation]
## Rollback Procedures
If recovery steps make things worse:
- [Undo command 1]
- [Verification after undo]
- [Next steps if rollback fails]
## Escalation Criteria
Escalate if:
- [Time threshold exceeded]
- [Impact threshold exceeded]
- [Specific failure condition]
Contact: [How to reach on-call team]
## Post-Recovery Actions
- Document root cause in incident tracker
- Update runbook if procedure gaps identified
- Schedule follow-up work if needed
Quick Diagnostic Template
# [Service] Performance Diagnostic
**When to use**: [Triggering alert or symptom]
## Quick Checks (5 minutes)
1. Recent deployments: `[command]`
2. Error logs: `[command]`
3. Resource utilization: `[command]`
4. External dependencies: [status page URLs]
## Deep Investigation
### If [symptom A detected]:
- Root cause: [likely cause]
- Resolution: See [recovery runbook name]
### If [symptom B detected]:
- Root cause: [likely cause]
- Resolution: See [recovery runbook name]
### If all checks normal:
- Escalate to [team]
- Gather: [diagnostic data to collect]
Service Recovery Template
# [Service] Recovery Procedure
**Purpose**: Recover [service] from [specific problem]
**Impact**: [Expected downtime and user experience]
## Pre-Recovery Checks
- [ ] Confirm issue matches expected symptoms
- [ ] Verify no ongoing maintenance
- [ ] Notify team in #incident channel
## Recovery Steps
1. **Backup current state**
```bash
[backup command]
```
2. **Execute recovery**
```bash
[recovery command]
```
3. **Verify recovery**
```bash
[verification command]
```
Expected: [success criteria]
4. **Monitor for [duration]**
- Metric 1: [threshold]
- Metric 2: [threshold]
## If Recovery Fails
Scenario A: [Specific failure mode]
- Action: [what to do]
Scenario B: [Different failure mode]
- Action: [alternative approach]
If all recovery attempts fail:
- Escalate to [team/person]
- Provide: [diagnostic information to include]
These templates work in any documentation system: Git repositories, wikis, runbook management tools, or simple markdown files. Adapt the structure for your infrastructure, tools, and team workflows.
The key is starting with consistent structure, then refining based on what works in practice. Copy these templates, customize for your environment, and improve them through real incident experience.
Explore In Upstat
Create structured runbooks with step-by-step execution tracking that reveals which procedures actually work in practice. Track execution patterns to continuously improve your operational procedures.
