
Runbook Template and Examples

Runbook templates provide consistent structure for operational procedures, ensuring teams respond to incidents the same effective way every time. This guide explains essential template components, provides practical examples for different scenarios, and shows how to create templates that improve through real-world usage.


The Problem with Starting from Scratch

It’s 3 AM. An alert fires for the payment service. Your on-call engineer acknowledges the page and opens their terminal. They know the service needs investigation, but where do they start? Check the database? Review recent deployments? Examine error logs? Restart the service?

About Runbook Structure: This guide teaches runbook structure best practices that teams implement in their own documentation systems (Git repositories, wikis, or documentation tools). These are team-maintained organizational patterns, not built-in platform features. Runbook management platforms like Upstat focus on execution tracking and decision branching—teams maintain their own structure standards for how procedures are written. Whether you copy template files in Git or follow consistent formats when creating runbooks, these principles ensure comprehensive coverage without constraining scenario-specific adaptation.

Without a runbook, the engineer improvises. They might solve the problem quickly—or they might waste an hour exploring dead ends. Even worse, the next engineer who faces this same issue will improvise differently, rediscovering the same troubleshooting path from scratch.

This is the cost of missing runbooks: inconsistent response, wasted time, and knowledge that stays locked in individual minds instead of captured in shared documentation.

Consistent runbook structure solves this by ensuring comprehensive coverage. Instead of creating procedures from scratch every time, teams follow proven frameworks that include all essential components. Structure doesn’t constrain creativity—it eliminates the need to reinvent basic organization, letting engineers focus on scenario-specific details.

Why Structure Matters for Operational Consistency

Consistent structure creates standardization without rigidity. It ensures every runbook contains the essentials while allowing customization for specific scenarios.

Speed: Starting with consistent structure is faster than blank-page documentation. Whether copying a template file or following a standard format, the framework exists—you fill in scenario-specific details.

Completeness: Structure prompts you to document everything responders need. Without it, teams forget critical sections like rollback procedures or escalation criteria.

Consistency: When all runbooks follow the same structure, engineers know exactly where to find specific information during incidents. Consistent formatting reduces cognitive load during high-pressure situations.

Quality: Well-designed templates incorporate best practices automatically. New team members create better runbooks because the template guides them toward comprehensive documentation.

Maintenance: Standardized structure makes updates easier. When every runbook follows the same format, improving one section’s approach can cascade across all procedures.

The goal isn’t bureaucratic uniformity. The goal is making runbooks immediately useful to anyone who needs them—especially during incidents when clarity matters most.

Essential Template Components

Effective runbook templates include these core sections. Not every runbook needs every section, but these form the foundation from which you customize.

Title and Metadata

Start with clear identification:

  • Title: Descriptive name explaining what this runbook addresses
  • Owner: Team or individual responsible for maintaining accuracy
  • Last Updated: When content was last modified
  • Last Validated: When procedure was last tested or confirmed to work

These metadata fields help teams track runbook health. If a runbook hasn’t been validated in six months, it needs review before the next incident relies on it.
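A lightweight way to track this is a scheduled staleness check. Here is a minimal sketch in bash, assuming runbooks are markdown files containing a `**Last Validated**: YYYY-MM-DD` line (the directory and the 180-day threshold are illustrative):

```bash
#!/usr/bin/env bash
# Flag runbooks whose "Last Validated" date is older than 180 days.
cutoff=$(date -d '180 days ago' +%Y-%m-%d)   # GNU date; on macOS use: date -v-180d +%Y-%m-%d
for file in runbooks/*.md; do
  validated=$(grep -oP '(?<=\*\*Last Validated\*\*: )\d{4}-\d{2}-\d{2}' "$file" | head -1)
  # ISO dates compare correctly as plain strings
  if [[ -n "$validated" && "$validated" < "$cutoff" ]]; then
    echo "STALE: $file (last validated $validated)"
  fi
done
```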

Purpose and Scope

Explain what this runbook addresses and when to use it. Clear scope prevents confusion about applicability.

Example:

Purpose: Diagnose and resolve high API latency affecting the payment service.

Scope: Use this runbook when payment API P95 latency exceeds 2000ms for over 5 minutes. Does not cover database failover scenarios—see Database Failover Runbook for those procedures.

Explicit scope statements help engineers select the right runbook quickly.

Prerequisites

Document what must be true before executing this runbook:

  • Required access permissions
  • Tools or credentials needed
  • System state expectations
  • Prior procedures that should complete first

Example:

Prerequisites:

  • SSH access to production payment servers
  • Datadog account with payment-service dashboard access
  • On-call rotation membership (required for production changes)
  • Payment database credentials (stored in 1Password vault)

Clear prerequisites prevent engineers from starting procedures they can’t complete.
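Prerequisite checks can even be scripted, so responders discover missing access before they are mid-procedure. A minimal preflight sketch, where the hostname and URL are placeholders for your environment:

```bash
#!/usr/bin/env bash
# Preflight: verify access before starting the runbook (names are placeholders).
set -e
ssh -o ConnectTimeout=5 payment-prod-01 true && echo "OK: SSH to payment servers"
command -v kubectl >/dev/null && echo "OK: kubectl installed"
curl -sf -o /dev/null https://app.datadoghq.com && echo "OK: Datadog reachable"
```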

Step-by-Step Instructions

This is the heart of the runbook. Each step should be specific, actionable, and verifiable.

Poor step: “Check if the database is slow”

Better step:

Step 3: Check Database Query Performance

Run this query to identify the slowest queries (note that pg_stat_statements reports cumulative statistics since the last reset, not a rolling window):

SELECT query, calls, mean_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 1000
ORDER BY mean_exec_time DESC
LIMIT 10;

Expected result: Query execution times under 500ms

If execution times exceed 1000ms: Proceed to Step 7 (Database Performance Investigation)

If execution times are normal: Continue to Step 4

Good steps include:

  • Clear action to take
  • Expected outcome
  • Conditional logic for different results
  • What to do next based on findings

Decision Points and Branching Logic

Diagnostic runbooks especially need branching logic based on what investigation reveals.

Example decision tree:

Step 5: Determine Root Cause

Based on previous investigation:

If CPU usage over 80%: Proceed to Section A (Scale Capacity)
If error logs show authentication failures: Proceed to Section B (Auth Service Recovery)
If database queries are slow: Proceed to Section C (Database Investigation)
If external API calls are timing out: Proceed to Section D (Third-Party Dependencies)

Decision points guide engineers through diagnostic workflows without prescribing a single linear path.
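When runbooks live in plain files rather than a platform with built-in branching, decision points can still be made concrete. A small sketch that encodes the decision tree above as a shell prompt (the section names come from the example; the prompt format is illustrative):

```bash
#!/usr/bin/env bash
# Encode a decision point: the responder records their finding and is
# routed to the matching section of the runbook.
read -rp "Finding (cpu|auth|db|external): " finding
case "$finding" in
  cpu)      echo "Go to Section A: Scale Capacity" ;;
  auth)     echo "Go to Section B: Auth Service Recovery" ;;
  db)       echo "Go to Section C: Database Investigation" ;;
  external) echo "Go to Section D: Third-Party Dependencies" ;;
  *)        echo "Unrecognized finding; follow the escalation criteria" ;;
esac
```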

Verification Steps

After executing procedures, how do you confirm success?

Example:

Verification:

  1. Check Datadog dashboard: Payment API P95 latency should be under 500ms
  2. Review error rate: Should be under 0.1%
  3. Confirm customer reports: No recent payment failure complaints
  4. Monitor for 15 minutes: Metrics should remain stable

Explicit verification prevents declaring success prematurely.
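Where possible, script the monitoring window so "monitor for 15 minutes" actually happens. A minimal sketch in bash, polling a hypothetical internal health endpoint every 30 seconds:

```bash
# Poll the health endpoint every 30s for 15 minutes; exit loudly on failure.
# The URL is a placeholder for your service's real health check.
for i in $(seq 1 30); do
  curl -sf https://payments.internal.example.com/healthz >/dev/null \
    || { echo "Health check failed at $(date)"; exit 1; }
  sleep 30
done
echo "Stable for 15 minutes"
```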

Rollback Procedures

If the fix makes things worse, how do you undo it?

Example:

Rollback: If latency increases after scaling:

  1. Scale payment service back to 3 replicas: kubectl scale deployment payment-api --replicas=3
  2. Verify rollback: Check replica count matches target
  3. Monitor impact: Latency should return to pre-scaling levels within 2 minutes

Rollback procedures provide safety nets for risky changes.
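The "verify rollback" step is stronger with an exact command. A sketch using kubectl's jsonpath output against the payment-api deployment from the example; both numbers should read 3 after the rollback:

```bash
# Print desired and ready replica counts for the deployment.
kubectl get deployment payment-api \
  -o jsonpath='{.spec.replicas} {.status.readyReplicas}{"\n"}'
```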

Escalation Criteria

When should you stop troubleshooting and get help?

Example:

Escalation: Escalate to database team if:

  • Investigation exceeds 30 minutes without identifying root cause
  • Database query performance degradation confirmed
  • Multiple services affected simultaneously

Contact: Page #database-oncall via PagerDuty

Clear escalation criteria prevent engineers from struggling alone when they need specialized expertise.

Template Examples by Runbook Type

Different scenarios require different template structures. Here are practical examples for common runbook types.

Diagnostic Runbook Template

Use for investigating problems where root cause is unknown.

Title: [Service/Component] Performance Degradation

Purpose: Diagnose performance issues in [service name]

Symptoms:
- [Alert that fires]
- [Observable behavior]
- [User-facing impact]

Investigation Steps:
1. Check recent deployments
2. Review error logs
3. Examine resource utilization
4. Analyze database query performance
5. Verify external dependency health

For each step:
- Command to run
- Expected vs. concerning results
- Next action based on findings

Root Cause Determination:
[Decision tree mapping symptoms to causes]

Resolution:
[Link to appropriate recovery runbook]

Recovery Runbook Template

Use for fixing known problems with established solutions.

Title: [Service Name] Service Restart

Purpose: Safely restart [service] to recover from [specific issue]

When to Use:
- [Triggering conditions]
- [Symptom descriptions]

Impact:
- [Expected downtime]
- [Affected functionality]
- [User experience during recovery]

Prerequisites:
- [Required access]
- [Necessary tools]

Recovery Procedure:
1. [Pre-restart verification]
2. [Backup/safety steps]
3. [Restart command with exact syntax]
4. [Post-restart verification]
5. [Monitoring period]

Verification:
- [Health check procedures]
- [Success criteria]

If Restart Fails:
- [Alternative approach]
- [Escalation path]

Maintenance Runbook Template

Use for planned operational procedures.

Title: [System/Process] Maintenance Procedure

Purpose: [What maintenance accomplishes]

Schedule: [When this runs]

Duration: [Expected time]

Prerequisites:
- [Required notifications]
- [Backup requirements]
- [Access needed]

Maintenance Steps:
1. [Pre-maintenance verification]
2. [User communication]
3. [Backup procedures]
4. [Maintenance actions]
5. [Post-maintenance validation]
6. [Service restoration]

Rollback Plan:
[How to undo if problems occur]

Post-Maintenance:
- [Monitoring requirements]
- [Follow-up communication]

Real-World Example: Complete Diagnostic Runbook

Here’s a full example showing how template components work together:

Title: Payment API High Latency Diagnostic

Owner: Platform Team

Last Updated: 2025-09-15
Last Validated: 2025-09-22

Purpose: Diagnose and identify root cause of elevated payment API latency

Scope: Use when payment API P95 latency exceeds 2000ms for over 5 minutes. Covers diagnostic investigation only—recovery procedures are in separate runbooks.

Prerequisites:

  • Datadog access with payment-service dashboard permissions
  • Kubernetes cluster access (view and describe permissions)
  • Database read-only access

Step 1: Verify Alert Accuracy

Check Datadog payment-service dashboard to confirm latency metrics.

Expected: P95 latency over 2000ms, P50 over 800ms
If latency is normal: False alarm, investigate alert configuration
If latency is elevated: Continue to Step 2

Step 2: Check Recent Deployments

Review deployment history for the last 2 hours:

kubectl rollout history deployment/payment-api

If deployment occurred within issue timeframe: Likely deployment-related, proceed to Step 8 (Deployment Investigation)
If no recent deployments: Continue to Step 3

Step 3: Examine Error Logs

Search for errors in the last 15 minutes:

kubectl logs -l app=payment-api --since=15m --tail=1000 | grep ERROR

If timeout errors appear frequently: External dependency issue, proceed to Step 6
If authentication errors appear: Auth service problem, escalate to #auth-team
If no significant errors: Continue to Step 4

Step 4: Check Resource Utilization

View pod CPU and memory:

kubectl top pods -l app=payment-api

If CPU over 80%: Capacity issue, proceed to Step 7 (Scaling Investigation)
If memory over 90%: Possible memory leak, proceed to Step 9 (Memory Investigation)
If resources normal: Continue to Step 5

Step 5: Analyze Database Performance

Connect to the payment database and run query performance check:

kubectl exec -it deployment/payment-api -- psql $DATABASE_URL
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 1000
ORDER BY mean_exec_time DESC
LIMIT 10;

If queries over 1000ms: Database performance issue, escalate to #database-oncall
If queries normal: Continue to Step 6

Step 6: Verify External Dependencies

Check the status pages of the payment service’s external dependencies (payment processors and any other third-party APIs it calls).

If external service degraded: External dependency problem. Document the affected service and expected recovery time. Consider enabling degraded mode if available.
If all external services healthy: Continue to Step 7
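If a provider hosts its status page on a Statuspage-style service, this check can be scripted. A quick sketch, where the URL is a placeholder for the provider’s real status endpoint:

```bash
# Query a Statuspage-style status API and print the overall status.
curl -s https://status.example-payments.com/api/v2/status.json \
  | jq -r '.status.description'
# Healthy output is typically "All Systems Operational"
```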

Step 7: Investigation Complete

If all previous steps show normal operation, the cause of the latency remains unclear.

Next Actions:

  1. Take a thread dump for analysis:

POD=$(kubectl get pods -l app=payment-api -o jsonpath='{.items[0].metadata.name}')
kubectl exec $POD -- jstack 1 > threaddump.txt

  2. Escalate to #platform-oncall with investigation findings
  3. Document symptoms and investigation results in incident timeline

Escalation Criteria:

  • Investigation exceeds 30 minutes without identifying root cause
  • Multiple potential causes identified requiring specialized expertise
  • Customer impact continues increasing

Contact: Page #platform-oncall via PagerDuty


This example demonstrates how templates provide structure while remaining flexible for different investigation paths.

Copy-Paste Ready Runbook: Database Performance Degradation

Here’s a complete, production-ready runbook you can adapt for your PostgreSQL database.

Title: PostgreSQL Performance Degradation Recovery

Owner: Database Team

Last Updated: 2025-10-01
Last Validated: 2025-10-28

Purpose: Diagnose and resolve PostgreSQL database performance issues affecting application response times

Scope: Use when database query latency exceeds 500ms average or connection pool exhaustion occurs. Covers PostgreSQL 12+.

Prerequisites:

  • PostgreSQL database admin access
  • Kubernetes cluster access if using containerized database
  • Datadog or Grafana access for database metrics
  • PgAdmin or psql CLI tool installed

Step 1: Verify Database Health

Check current database connections and activity:

SELECT count(*) as total_connections,
       sum(CASE WHEN state = 'active' THEN 1 ELSE 0 END) as active,
       sum(CASE WHEN state = 'idle' THEN 1 ELSE 0 END) as idle,
       sum(CASE WHEN state = 'idle in transaction' THEN 1 ELSE 0 END) as idle_in_transaction
FROM pg_stat_activity
WHERE datname = current_database();

Expected: Active connections under 50, idle in transaction under 5
If idle in transaction over 10: Connection leak detected, proceed to Step 6
If active over 80: Connection pool exhaustion, proceed to Step 7
If values normal: Continue to Step 2

Step 2: Identify Slow Queries

Find queries with high execution time:

SELECT substring(query, 1, 100) AS short_query,
       round(mean_exec_time::numeric, 2) AS avg_ms,
       calls,
       round(total_exec_time::numeric, 2) AS total_ms
FROM pg_stat_statements
WHERE mean_exec_time > 500
ORDER BY mean_exec_time DESC
LIMIT 10;

If no slow queries found: Performance issue may be infrastructure-related, proceed to Step 8
If slow queries identified: Document the queries and continue to Step 3

Step 3: Check for Missing Indexes

Identify tables with sequential scans that should have indexes:

SELECT schemaname, tablename,
       seq_scan, seq_tup_read,
       idx_scan, idx_tup_fetch,
       seq_tup_read / seq_scan as avg_seq_read
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 10;

If avg_seq_read over 10000: Missing index on a frequently scanned table
Action: Document tables needing indexes. Coordinate with the dev team before creating indexes in production.

Step 4: Check for Lock Contention

Find blocking queries:

SELECT blocked_locks.pid AS blocked_pid,
       blocked_activity.usename AS blocked_user,
       blocking_locks.pid AS blocking_pid,
       blocking_activity.usename AS blocking_user,
       blocked_activity.query AS blocked_statement,
       blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
    AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
    AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
    AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
    AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
    AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
    AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

If blocking queries found: Lock contention detected
Action: Terminate the blocking query if safe (coordinate with the application team first):

SELECT pg_terminate_backend(blocking_pid);

Step 5: Check Cache Hit Ratio

Verify buffer cache effectiveness:

SELECT sum(heap_blks_read) as heap_read,
       sum(heap_blks_hit) as heap_hit,
       sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
FROM pg_statio_user_tables;

Expected: Cache hit ratio over 0.99 (99%)
If ratio under 0.95: Insufficient shared_buffers, requires configuration tuning
Action: Schedule a maintenance window to increase the shared_buffers parameter
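For reference, the change itself might look like the sketch below. The target size is illustrative (a common starting point is roughly 25% of instance memory), and shared_buffers only takes effect after a full restart:

```bash
# Persist the new setting (written to postgresql.auto.conf).
psql "$DATABASE_URL" -c "ALTER SYSTEM SET shared_buffers = '4GB';"
# shared_buffers is not reloadable; restart the database during the
# maintenance window, e.g. for a containerized instance:
kubectl rollout restart statefulset/postgres
```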

Step 6: Clean Up Idle Transactions

If idle in transaction connections detected in Step 1:

SELECT pid, usename, application_name, state, query_start, state_change
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND state_change < now() - interval '5 minutes';

Action: Terminate long-running idle transactions:

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND state_change < now() - interval '5 minutes';

Verification: Re-run Step 1 query. Idle in transaction count should be under 5.

Step 7: Scale Connection Pool

If connection pool exhaustion detected:

For Kubernetes deployments:

kubectl scale deployment postgres-proxy --replicas=5

For application-side pooling, update connection pool settings:

  • Increase max_connections in postgresql.conf
  • Update application pool size (e.g., JDBC pool maxPoolSize)
  • Restart application pods after configuration change

Verification: Monitor active connections for 5 minutes. Should stabilize under 80.
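A simple way to run that five-minute watch, assuming the connection string is available in `$DATABASE_URL`:

```bash
# Sample active connection counts every 30s for 5 minutes.
for i in $(seq 1 10); do
  psql "$DATABASE_URL" -tAc \
    "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
  sleep 30
done
```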

Step 8: Check Infrastructure Resources

If database performance degraded without query issues:

Check CPU and memory:

kubectl top pod -l app=postgres

Check disk I/O:

kubectl exec -it postgres-pod -- iostat -x 1 5

If CPU over 80%: CPU bottleneck, consider vertical scaling
If iowait over 20%: Disk I/O bottleneck, consider faster storage or read replicas
If memory pressure: Increase instance size or optimize queries

Verification Steps:

  1. Re-run Step 2 query. Mean execution time should be under 500ms.
  2. Check application latency metrics. P95 should be under 1000ms.
  3. Monitor for 15 minutes to ensure stability.

Rollback Procedures:

  • If connection pool scaling caused issues:
    kubectl scale deployment postgres-proxy --replicas=3
  • If terminated queries caused application errors: Application should automatically retry
  • If configuration changes caused instability: Restore previous postgresql.conf and restart database

Escalation Criteria:

  • Performance degradation continues after all diagnostic steps
  • Database CPU over 90% sustained for over 10 minutes
  • Replication lag exceeds 60 seconds
  • Customer-facing errors increasing

Contact: Page #database-oncall via PagerDuty or Slack

Post-Recovery Actions:

  1. Document root cause in incident post-mortem
  2. Update query optimization backlog if slow queries identified
  3. Create indexes identified in Step 3 during next maintenance window
  4. Review connection pool configuration if exhaustion occurred

Copy-Paste Ready Runbook: Kubernetes Deployment Rollback

Here’s a complete runbook for rolling back problematic deployments.

Title: Kubernetes Deployment Rollback Procedure

Owner: Platform Team

Last Updated: 2025-10-01
Last Validated: 2025-10-30

Purpose: Safely rollback a Kubernetes deployment to previous stable version

Scope: Use when recent deployment causes service degradation, elevated errors, or failed health checks. Covers standard Kubernetes deployments using rolling update strategy.

Prerequisites:

  • Kubernetes cluster access with deployment permissions
  • kubectl CLI configured
  • Service metrics dashboard access (Datadog, Grafana, etc.)
  • Incident tracking system access

Step 1: Confirm Deployment Issue

Verify the issue is deployment-related:

kubectl get deployment/service-name -o yaml | grep -A 5 "conditions:"

Check recent deployment events:

kubectl describe deployment/service-name | grep -A 10 "Events:"

If deployment status shows failure: Proceed with rollback
If deployment succeeded but metrics degraded: Verify timing correlation, then proceed
If issue started before deployment: Not deployment-related, investigate other causes

Step 2: Check Deployment History

View available rollback versions:

kubectl rollout history deployment/service-name

Identify the previous stable revision. Typically the revision before the current one.
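To confirm what a candidate revision actually deployed before rolling back to it, kubectl can show that revision’s details (the revision number here is illustrative):

```bash
kubectl rollout history deployment/service-name --revision=4
```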

Step 3: Notify Team

Post in incident channel before rollback:

Initiating rollback of service-name deployment
Current revision: X
Target revision: Y (previous stable version)
Reason: [elevated error rate / failed health checks / performance degradation]
ETA: 2-3 minutes for rollback completion

Step 4: Execute Rollback

Rollback to previous revision:

kubectl rollout undo deployment/service-name

Or rollback to specific revision:

kubectl rollout undo deployment/service-name --to-revision=5

Step 5: Monitor Rollback Progress

Watch rollback status:

kubectl rollout status deployment/service-name

Expected output:

Waiting for deployment "service-name" rollout to finish: 2 out of 3 new replicas have been updated...
Waiting for deployment "service-name" rollout to finish: 2 of 3 updated replicas are available...
deployment "service-name" successfully rolled out

If rollback stalls: Check pod events:

kubectl get pods -l app=service-name
kubectl describe pod <failing-pod-name>

Step 6: Verify Service Health

Check pod status:

kubectl get pods -l app=service-name

All pods should show “Running” status with “1/1” ready containers.

Check application logs:

kubectl logs -l app=service-name --tail=50

Verify no startup errors or exceptions.

Step 7: Validate Metrics Recovery

Check service metrics for 5 minutes:

  • Error rate: Should return to baseline (under 1%)
  • Response latency: P95 should be under normal thresholds
  • Request rate: Should match expected traffic levels
  • Health check success rate: Should be 100%

Verification Checklist:

  • All pods running and ready
  • Error rate returned to baseline
  • Latency metrics recovered
  • Health checks passing
  • No exceptions in recent logs
  • External monitoring confirms recovery

Step 8: Update Incident Timeline

Document rollback in incident tracker:

[Timestamp] Deployment rollback initiated
[Timestamp] Rollback completed, all pods running
[Timestamp] Metrics confirmed recovered to baseline
[Timestamp] Service restored to full functionality

Post-Rollback Actions:

  1. Identify root cause of failed deployment
  2. Fix issue in code or configuration
  3. Test thoroughly in staging environment
  4. Schedule new deployment with fixes
  5. Update deployment checklist if process gaps identified

If Rollback Fails:

Scenario A: Pods won’t start after rollback

kubectl describe pod <pod-name>

Check for ImagePullBackOff or CrashLoopBackOff. May indicate infrastructure issue rather than code issue.

Scenario B: Rollback completes but metrics don’t recover

The issue may not be deployment-related. Escalate to the platform team and investigate:

  • Database performance
  • External dependencies
  • Infrastructure capacity
  • Network issues

Scenario C: Cannot rollback due to breaking database migration

kubectl scale deployment/service-name --replicas=0

Engage database team to rollback migrations first, then retry deployment rollback.

Escalation Criteria:

  • Rollback fails to complete after 10 minutes
  • Pods crash immediately after rollback
  • Metrics don’t recover 10 minutes after rollback completion
  • Database migration prevents safe rollback

Contact: Escalate to #platform-oncall via PagerDuty

Rollback Safety Notes:

  • Always verify previous revision is stable before rollback
  • Coordinate with database team if migrations were included
  • Rollback window narrows after breaking database changes
  • Consider blue-green or canary deployments to reduce rollback need

Common Template Mistakes

Mistake 1: Too Much Detail

Some templates try to explain every concept and edge case, creating 20-page documents nobody reads during incidents.

Symptom: Engineers skip runbooks because they’re too long
Fix: Link to detailed documentation instead of embedding it. Keep runbook steps concise and action-focused.

Mistake 2: Not Enough Detail

Other templates assume too much knowledge, leaving out critical information.

Symptom: Engineers can’t complete procedures because commands are vague or missing
Fix: Include copy-paste-ready commands with placeholders for variables. Specify exact file paths and tool locations.

Mistake 3: Rigid Structure

Templates that force every runbook into identical format regardless of scenario create unusable documentation.

Symptom: Teams work around the template instead of using it
Fix: Provide a recommended structure, not a mandatory form. Let teams omit irrelevant sections.

Mistake 4: No Ownership

Templates without owners become stale as systems evolve.

Symptom: Runbooks reference deprecated systems or obsolete procedures
Fix: Assign ownership per runbook. Template metadata should include owner and last-validated date.

Mistake 5: Static Forever

Teams create templates once and never refine them based on actual usage.

Symptom: Same runbook problems repeat across all procedures
Fix: Review templates quarterly. Update based on post-incident feedback and new best practices.

Making Structure Stick

Consistent structure only helps if teams actually follow it.

Make structure the default. When someone proposes creating a runbook, provide your standard format. Whether that’s a template file in Git, a starting format in your runbook tool, or a documented pattern to follow—make using the structure easier than starting blank.

Demonstrate value immediately. Standard formats should save time from day one. If your first attempt is overly complex, nobody will adopt it.

Integrate with workflow. If runbooks are in a wiki, create page templates. If they’re in version control, provide template files. If you use a runbook management platform, establish standard sections teams always include.
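For the version-control case, bootstrapping a new runbook can be as simple as copying the template file (paths are illustrative):

```bash
cp templates/diagnostic-runbook.md runbooks/payment-api-latency.md
git add runbooks/payment-api-latency.md
git commit -m "Add payment API latency diagnostic runbook from template"
```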

Train once, reinforce constantly. Show new team members the standard format during onboarding. Reference it in runbook reviews.

Celebrate good examples. When someone creates an excellent runbook following your structure, share it as a reference. Recognition reinforces adoption.

Structure That Improves Through Usage

The best runbook structures aren’t static formats—they evolve based on how procedures perform in practice.

Traditional approaches treat runbooks as documentation to follow exactly. But real incident response rarely matches perfectly. Engineers skip steps that don’t apply, add missing steps, or modify procedures based on actual conditions.

This gap between planned procedure and actual execution contains valuable information. Which steps get skipped consistently? Which procedures need additional context? What do engineers add that your standard structure misses?

Execution Tracking Reveals What Works: Platforms like Upstat track runbook execution step-by-step during incidents. Engineers record which steps they followed, what they discovered at each stage, and how they adapted procedures for the specific situation. This execution history reveals which structures work well and which need refinement—regardless of whether you manage runbooks in Git, wikis, or specialized tools.

After incidents, teams review execution data to improve their standard formats. If engineers consistently skip a step, maybe it’s unnecessary. If they repeatedly add the same context, that should become part of your template. If certain investigations reveal nothing useful, simplify that section.

This creates a continuous improvement loop: structure guides initial response, execution tracking captures actual usage, and feedback drives refinement. Over time, your standards evolve to match how teams actually work instead of how someone imagined they would work.

Start Simple, Iterate Based on Reality

Don’t try to create the perfect format immediately. Start with a basic structure covering title, purpose, steps, and verification. Create runbooks following that structure. Gather feedback. Refine.

After teams use your standard format for a month, ask:

  • What information is consistently missing?
  • Which sections feel like busywork?
  • What do engineers keep adding manually?
  • Which structures work best for which scenarios?

Update your approach based on answers. Structure should serve the team, not constrain them.

The goal isn’t documentation perfection. The goal is procedures that work—runbooks that help engineers resolve incidents faster, maintain services reliably, and capture knowledge effectively.

Good structure makes that goal achievable by providing proven organization without prescribing rigid process. It standardizes what should be consistent while leaving room for scenario-specific adaptation.

Start with consistent structure. Use it. Learn from actual execution. Improve continuously. That cycle turns adequate runbooks into excellent operational procedures that teams trust and actually follow.

Ready-to-Use Markdown Templates

The examples above provide complete, production-ready runbooks you can adapt immediately. For teams starting from scratch, here are three foundational templates covering common scenarios. Copy these into your documentation system and customize for your environment.

Basic Runbook Template

````markdown
# Runbook Title

**Owner**: [Team Name]

**Last Updated**: YYYY-MM-DD
**Last Validated**: YYYY-MM-DD

## Purpose

[Brief description of what this runbook addresses and when to use it]

## Scope

- Use when: [triggering conditions]
- Covers: [what scenarios this handles]
- Does not cover: [what requires different runbooks]

## Prerequisites

- [ ] [Required access or permissions]
- [ ] [Tools needed]
- [ ] [Credentials or configurations]

## Investigation Steps

### Step 1: [Initial Verification]

[Description of what to check first]

**Command**:
```bash
[command to run]
```

**Expected result**: [what normal looks like]
**If [condition A]**: [next action]
**If [condition B]**: [alternative action]

### Step 2: [Next Diagnostic Step]

[Repeat structure for each step]

## Verification

After completing recovery steps:

1. [Health check 1]
2. [Health check 2]
3. [Monitoring check]
4. [User-facing validation]

## Rollback Procedures

If recovery steps make things worse:

1. [Undo command 1]
2. [Verification after undo]
3. [Next steps if rollback fails]

## Escalation Criteria

Escalate if:

- [Time threshold exceeded]
- [Impact threshold exceeded]
- [Specific failure condition]

**Contact**: [How to reach on-call team]

## Post-Recovery Actions

1. Document root cause in incident tracker
2. Update runbook if procedure gaps identified
3. Schedule follow-up work if needed
````

Quick Diagnostic Template

```markdown
# [Service] Performance Diagnostic

**When to use**: [Triggering alert or symptom]

## Quick Checks (5 minutes)

1. Recent deployments: `[command]`
2. Error logs: `[command]`
3. Resource utilization: `[command]`
4. External dependencies: [status page URLs]

## Deep Investigation

### If [symptom A detected]:
- Root cause: [likely cause]
- Resolution: See [recovery runbook name]

### If [symptom B detected]:
- Root cause: [likely cause]
- Resolution: See [recovery runbook name]

### If all checks normal:
- Escalate to [team]
- Gather: [diagnostic data to collect]
```

Service Recovery Template

````markdown
# [Service] Recovery Procedure

**Purpose**: Recover [service] from [specific problem]

**Impact**: [Expected downtime and user experience]

## Pre-Recovery Checks

- [ ] Confirm issue matches expected symptoms
- [ ] Verify no ongoing maintenance
- [ ] Notify team in #incident channel

## Recovery Steps

1. **Backup current state**
   ```bash
   [backup command]
   ```
2. **Execute recovery**
   ```bash
   [recovery command]
   ```
3. **Verify recovery**
   ```bash
   [verification command]
   ```
   Expected: [success criteria]
4. **Monitor for [duration]**
   - Metric 1: [threshold]
   - Metric 2: [threshold]

## If Recovery Fails

**Scenario A**: [Specific failure mode]
- Action: [what to do]

**Scenario B**: [Different failure mode]
- Action: [alternative approach]

If all recovery attempts fail:
- Escalate to [team/person]
- Provide: [diagnostic information to include]
````

These templates work in any documentation system: Git repositories, wikis, runbook management tools, or simple markdown files. Adapt the structure for your infrastructure, tools, and team workflows.

The key is starting with consistent structure, then refining based on what works in practice. Copy these templates, customize for your environment, and improve them through real incident experience.

Explore In Upstat

Create structured runbooks with step-by-step execution tracking that reveals which procedures actually work in practice. Track execution patterns to continuously improve your operational procedures.