Runbook Template and Examples

Runbook templates provide consistent structure for operational procedures, ensuring teams respond to incidents in the same effective way every time. This guide explains the essential template components, provides practical examples for different scenarios, and shows how to create templates that improve through real-world usage.

September 23, 2025

The Problem with Starting from Scratch

It’s 3 AM. An alert fires for the payment service. Your on-call engineer acknowledges the page and opens their terminal. They know the service needs investigation, but where do they start? Check the database? Review recent deployments? Examine error logs? Restart the service?

Without a runbook, the engineer improvises. They might solve the problem quickly—or they might waste an hour exploring dead ends. Even worse, the next engineer who faces this same issue will improvise differently, rediscovering the same troubleshooting path from scratch.

This is the cost of missing runbooks: inconsistent response, wasted time, and knowledge that stays locked in individual minds instead of captured in shared documentation.

Runbook templates solve this by providing consistent structure. Instead of creating procedures from scratch every time, teams start with proven frameworks that ensure comprehensive coverage. Templates don’t constrain creativity—they eliminate the need to reinvent basic structure, letting engineers focus on scenario-specific details.

Why Templates Matter for Operational Consistency

Templates create standardization without rigidity. They ensure every runbook contains the essentials while allowing customization for specific scenarios.

Speed: Starting with a template is faster than blank-page documentation. The framework exists; you fill in the details for your specific procedure.

Completeness: Templates prompt you to document everything responders need. Without templates, teams forget critical sections like rollback procedures or escalation criteria.

Consistency: When all runbooks follow the same structure, engineers know exactly where to find specific information during incidents. Consistent formatting reduces cognitive load during high-pressure situations.

Quality: Well-designed templates incorporate best practices automatically. New team members create better runbooks because the template guides them toward comprehensive documentation.

Maintenance: Standardized structure makes updates easier. When every runbook follows the same format, improving one section’s approach can cascade across all procedures.

The goal isn’t bureaucratic uniformity. The goal is making runbooks immediately useful to anyone who needs them—especially during incidents when clarity matters most.

Essential Template Components

Effective runbook templates include these core sections. Not every runbook needs every section, but these form the foundation from which you customize.

Title and Metadata

Start with clear identification:

  • Title: Descriptive name explaining what this runbook addresses
  • Owner: Team or individual responsible for maintaining accuracy
  • Last Updated: When content was last modified
  • Last Validated: When procedure was last tested or confirmed to work

These metadata fields help teams track runbook health. If a runbook hasn’t been validated in six months, it needs review before the next incident relies on it.
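
The six-month rule above can be checked mechanically. The following is a minimal sketch, assuming runbook metadata is available as plain dates; the RunbookMeta class, its field names, and the 180-day threshold are illustrative assumptions rather than part of any particular tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RunbookMeta:
    """Hypothetical metadata record mirroring the fields listed above."""
    title: str
    owner: str
    last_updated: date
    last_validated: date


def needs_review(meta: RunbookMeta, today: date, max_age_days: int = 180) -> bool:
    """Return True when the runbook was last validated more than max_age_days ago."""
    return today - meta.last_validated > timedelta(days=max_age_days)


meta = RunbookMeta(
    title="Payment API High Latency Diagnostic",
    owner="Platform Team",
    last_updated=date(2025, 9, 15),
    last_validated=date(2025, 9, 22),
)
print(needs_review(meta, today=date(2025, 10, 1)))  # False: validated 9 days earlier
print(needs_review(meta, today=date(2026, 6, 1)))   # True: more than 180 days have passed
```

A check like this could run on a schedule and list the runbooks that are due for revalidation.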

Purpose and Scope

Explain what this runbook addresses and when to use it. Clear scope prevents confusion about applicability.

Example:

Purpose: Diagnose and resolve high API latency affecting the payment service.

Scope: Use this runbook when payment API P95 latency exceeds 2000ms for over 5 minutes. Does not cover database failover scenarios—see Database Failover Runbook for those procedures.

Explicit scope statements help engineers select the right runbook quickly.
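
A scope trigger such as "P95 latency exceeds 2000ms for over 5 minutes" is precise enough to express in code. This sketch assumes one P95 reading per minute, ordered oldest first; the function name and the sampling interval are assumptions made for illustration.

```python
from typing import Sequence


def scope_applies(
    p95_samples_ms: Sequence[float],
    threshold_ms: float = 2000,
    sustained_samples: int = 5,
) -> bool:
    """Return True when the most recent samples all breach the threshold.

    Assumes one reading per minute, so five consecutive breaching samples
    approximate a breach lasting five minutes.
    """
    if len(p95_samples_ms) < sustained_samples:
        return False  # not enough history to call the breach sustained
    recent = p95_samples_ms[-sustained_samples:]
    return all(sample > threshold_ms for sample in recent)


print(scope_applies([900, 2100, 2300, 2250, 2400, 2600]))  # True
print(scope_applies([2100, 2300, 800, 2400, 2600, 2500]))  # False: dipped below the threshold
```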

Prerequisites

Document what must be true before executing this runbook:

  • Required access permissions
  • Tools or credentials needed
  • System state expectations
  • Prior procedures that should complete first

Example:

Prerequisites:

  • SSH access to production payment servers
  • Datadog account with payment-service dashboard access
  • On-call rotation membership (required for production changes)
  • Payment database credentials (stored in 1Password vault)

Clear prerequisites prevent engineers from starting procedures they can’t complete.

Step-by-Step Instructions

This is the heart of the runbook. Each step should be specific, actionable, and verifiable.

Poor step: “Check if the database is slow”

Better step:

Step 3: Check Database Query Performance

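Run this query to list statements whose mean execution time exceeds 1000ms. Note that pg_stat_statements accumulates statistics since its last reset, so the results are not limited to the last 15 minutes: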

SELECT query, calls, mean_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 1000
ORDER BY mean_exec_time DESC
LIMIT 10;

Expected result: No rows returned, meaning no statement averages more than 1000ms

If any rows are returned: Proceed to Step 7 (Database Performance Investigation)

If no rows are returned: Continue to Step 4

Good steps include:

  • Clear action to take
  • Expected outcome
  • Conditional logic for different results
  • What to do next based on findings
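
The checklist above can double as a lint for draft steps. This is a rough sketch, assuming steps are stored as dictionaries; the key names (action, expected, branches) are hypothetical, and the branches mapping stands in for both the conditional logic and the next action.

```python
from typing import Dict, List

# Elements the list above asks every step to carry. The key names are
# illustrative assumptions, not a standard schema.
REQUIRED_STEP_FIELDS = ("action", "expected", "branches")


def missing_step_fields(step: Dict[str, object]) -> List[str]:
    """Return the required fields that are absent or empty in a step."""
    return [field for field in REQUIRED_STEP_FIELDS if not step.get(field)]


poor_step = {"action": "Check if the database is slow"}

better_step = {
    "action": "Run the pg_stat_statements slow-query check",
    "expected": "No statement with a mean execution time over 1000ms",
    # Conditional logic and the next action for each outcome.
    "branches": {
        "slow statements found": "Proceed to Step 7 (Database Performance Investigation)",
        "execution times normal": "Continue to Step 4",
    },
}

print(missing_step_fields(poor_step))    # ['expected', 'branches']
print(missing_step_fields(better_step))  # []
```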

Decision Points and Branching Logic

Diagnostic runbooks in particular need branching logic that depends on what the investigation reveals.

Example decision tree:

Step 5: Determine Root Cause

Based on previous investigation:

  • If CPU usage > 80%: Proceed to Section A (Scale Capacity)
  • If error logs show authentication failures: Proceed to Section B (Auth Service Recovery)
  • If database queries are slow: Proceed to Section C (Database Investigation)
  • If external API calls are timing out: Proceed to Section D (Third-Party Dependencies)

Decision points guide engineers through diagnostic workflows without prescribing a single linear path.
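
One way to keep a decision tree like this unambiguous is to write it as an ordered rule table. The sketch below assumes investigation findings are collected into a dictionary; the key names and the first-match-wins ordering are assumptions, since the example does not say what to do when several conditions hold at once.

```python
from typing import Any, Callable, Dict, List, Tuple

Findings = Dict[str, Any]
Rule = Tuple[Callable[[Findings], bool], str]

# Ordered rules: the first condition that matches decides the next section.
RULES: List[Rule] = [
    (lambda f: f.get("cpu_percent", 0) > 80, "Section A (Scale Capacity)"),
    (lambda f: f.get("auth_failures_in_logs", False), "Section B (Auth Service Recovery)"),
    (lambda f: f.get("slow_db_queries", False), "Section C (Database Investigation)"),
    (lambda f: f.get("external_api_timeouts", False), "Section D (Third-Party Dependencies)"),
]


def next_section(findings: Findings) -> str:
    """Return the section to follow for the given investigation findings."""
    for condition, section in RULES:
        if condition(findings):
            return section
    return "No match: escalate with findings"


print(next_section({"cpu_percent": 45, "slow_db_queries": True}))
# Section C (Database Investigation)
print(next_section({"cpu_percent": 92, "slow_db_queries": True}))
# Section A (Scale Capacity)
```

Writing the rules in priority order forces the runbook author to decide which cause to chase first when symptoms overlap.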

Verification Steps

After executing procedures, how do you confirm success?

Example:

Verification:

  1. Check Datadog dashboard: Payment API P95 latency should be under 500ms
  2. Review error rate: Should be under 0.1%
  3. Confirm customer reports: No recent payment failure complaints
  4. Monitor for 15 minutes: Metrics should remain stable

Explicit verification prevents declaring success prematurely.
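
The verification list above requires every check to pass before the incident is declared resolved. A minimal sketch of that all-must-pass logic follows; the function and parameter names are hypothetical, and the values would come from your monitoring and support tools.

```python
from typing import List


def failed_verifications(
    p95_latency_ms: float,
    error_rate_percent: float,
    new_customer_complaints: int,
    stable_minutes: int,
) -> List[str]:
    """Return the verification checks that have not passed yet."""
    failures = []
    if p95_latency_ms >= 500:
        failures.append("P95 latency is not under 500ms")
    if error_rate_percent >= 0.1:
        failures.append("Error rate is not under 0.1%")
    if new_customer_complaints > 0:
        failures.append("Recent payment failure complaints exist")
    if stable_minutes < 15:
        failures.append("Metrics have not been stable for 15 minutes")
    return failures


print(failed_verifications(420.0, 0.05, 0, 6))
# ['Metrics have not been stable for 15 minutes']
print(failed_verifications(420.0, 0.05, 0, 15))
# []
```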

Rollback Procedures

If the fix makes things worse, how do you undo it?

Example:

Rollback: If latency increases after scaling:

  1. Scale payment service back to 3 replicas: kubectl scale deployment payment-api --replicas=3
  2. Verify rollback: Confirm the replica count matches the target with kubectl get deployment payment-api
  3. Monitor impact: Latency should return to pre-scaling levels within 2 minutes

Rollback procedures provide safety nets for risky changes.

Escalation Criteria

When should you stop troubleshooting and get help?

Example:

Escalation: Escalate to database team if:

  • Investigation exceeds 30 minutes without identifying root cause
  • Database query performance degradation confirmed
  • Multiple services affected simultaneously

Contact: Page #database-oncall via PagerDuty

Clear escalation criteria prevent engineers from struggling alone when they need specialized expertise.
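
Unlike verification, escalation criteria are any-of conditions: a single match is enough to page. This sketch encodes the three example criteria, with hypothetical parameter names.

```python
from typing import List


def escalation_reasons(
    minutes_elapsed: int,
    root_cause_identified: bool,
    db_degradation_confirmed: bool,
    affected_services: int,
) -> List[str]:
    """Return every escalation criterion that currently applies."""
    reasons = []
    if minutes_elapsed > 30 and not root_cause_identified:
        reasons.append("Investigation exceeded 30 minutes without a root cause")
    if db_degradation_confirmed:
        reasons.append("Database query performance degradation confirmed")
    if affected_services > 1:
        reasons.append("Multiple services affected simultaneously")
    return reasons


reasons = escalation_reasons(35, False, False, 1)
if reasons:
    print("Page #database-oncall:", "; ".join(reasons))
```

Returning the reasons rather than a bare yes or no gives the responder the text to include in the page.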

Template Examples by Runbook Type

Different scenarios require different template structures. Here are practical examples for common runbook types.

Diagnostic Runbook Template

Use for investigating problems where root cause is unknown.

Title: [Service/Component] Performance Degradation

Purpose: Diagnose performance issues in [service name]

Symptoms:
- [Alert that fires]
- [Observable behavior]
- [User-facing impact]

Investigation Steps:
1. Check recent deployments
2. Review error logs
3. Examine resource utilization
4. Analyze database query performance
5. Verify external dependency health

For each step:
- Command to run
- Expected vs. concerning results
- Next action based on findings

Root Cause Determination:
[Decision tree mapping symptoms to causes]

Resolution:
[Link to appropriate recovery runbook]

Recovery Runbook Template

Use for fixing known problems with established solutions.

Title: [Service Name] Service Restart

Purpose: Safely restart [service] to recover from [specific issue]

When to Use:
- [Triggering conditions]
- [Symptom descriptions]

Impact:
- [Expected downtime]
- [Affected functionality]
- [User experience during recovery]

Prerequisites:
- [Required access]
- [Necessary tools]

Recovery Procedure:
1. [Pre-restart verification]
2. [Backup/safety steps]
3. [Restart command with exact syntax]
4. [Post-restart verification]
5. [Monitoring period]

Verification:
- [Health check procedures]
- [Success criteria]

If Restart Fails:
- [Alternative approach]
- [Escalation path]

Maintenance Runbook Template

Use for planned operational procedures.

Title: [System/Process] Maintenance Procedure

Purpose: [What maintenance accomplishes]

Schedule: [When this runs]

Duration: [Expected time]

Prerequisites:
- [Required notifications]
- [Backup requirements]
- [Access needed]

Maintenance Steps:
1. [Pre-maintenance verification]
2. [User communication]
3. [Backup procedures]
4. [Maintenance actions]
5. [Post-maintenance validation]
6. [Service restoration]

Rollback Plan:
[How to undo if problems occur]

Post-Maintenance:
- [Monitoring requirements]
- [Follow-up communication]
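
Templates like the three above can be paired with a completeness check. The sketch below assumes runbooks are plain-text files that use the same "Section:" headings as the templates; the function name and the sample draft are hypothetical.

```python
from typing import Dict, List

# Section headings taken from the three templates above.
REQUIRED_SECTIONS: Dict[str, List[str]] = {
    "diagnostic": ["Title", "Purpose", "Symptoms", "Investigation Steps",
                   "Root Cause Determination", "Resolution"],
    "recovery": ["Title", "Purpose", "When to Use", "Impact", "Prerequisites",
                 "Recovery Procedure", "Verification", "If Restart Fails"],
    "maintenance": ["Title", "Purpose", "Schedule", "Duration", "Prerequisites",
                    "Maintenance Steps", "Rollback Plan", "Post-Maintenance"],
}


def missing_sections(runbook_text: str, runbook_type: str) -> List[str]:
    """Return template sections that do not appear as headings in the text."""
    present = {
        line.split(":", 1)[0].strip()
        for line in runbook_text.splitlines()
        if ":" in line
    }
    return [s for s in REQUIRED_SECTIONS[runbook_type] if s not in present]


# Hypothetical, incomplete maintenance runbook draft.
draft = """Title: Nightly Index Rebuild Maintenance Procedure
Purpose: Rebuild search indexes
Schedule: Sundays 02:00 UTC
Prerequisites:
- Backup completed
Maintenance Steps:
1. Pause indexing workers
"""
print(missing_sections(draft, "maintenance"))
# ['Duration', 'Rollback Plan', 'Post-Maintenance']
```

In line with the advice to avoid rigid structure, a check like this is better used as a prompt during review than as a hard gate.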

Real-World Example: Complete Diagnostic Runbook

Here’s a full example showing how template components work together:

Title: Payment API High Latency Diagnostic

Owner: Platform Team

Last Updated: 2025-09-15

Last Validated: 2025-09-22

Purpose: Diagnose and identify root cause of elevated payment API latency

Scope: Use when payment API P95 latency exceeds 2000ms for over 5 minutes. Covers diagnostic investigation only—recovery procedures are in separate runbooks.

Prerequisites:

  • Datadog access with payment-service dashboard permissions
  • Kubernetes cluster access (view and describe permissions)
  • Database read-only access

Step 1: Verify Alert Accuracy

Check Datadog payment-service dashboard to confirm latency metrics.

Expected: P95 latency over 2000ms, P50 over 800ms

If latency is normal: False alarm, investigate alert configuration

If latency is elevated: Continue to Step 2

Step 2: Check Recent Deployments

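Review recent rollouts. This command lists revisions without timestamps, so confirm whether any fall within the last 2 hours using your deployment tooling: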

kubectl rollout history deployment/payment-api

If a deployment occurred within the issue timeframe: Likely deployment-related, see the Deployment Investigation runbook

If no recent deployments: Continue to Step 3

Step 3: Examine Error Logs

Search for errors in the last 15 minutes:

kubectl logs -l app=payment-api --since=15m --tail=1000 | grep ERROR

If timeout errors appear frequently: External dependency issue, proceed to Step 6

If authentication errors appear: Auth service problem, escalate to #auth-team

If no significant errors: Continue to Step 4

Step 4: Check Resource Utilization

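View pod CPU and memory. kubectl top reports absolute usage, so compare the values against the pods' configured requests and limits: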

kubectl top pods -l app=payment-api

If CPU over 80%: Capacity issue, see the Scaling Investigation runbook

If memory over 90%: Possible memory leak, see the Memory Investigation runbook

If resources normal: Continue to Step 5

Step 5: Analyze Database Performance

Run query performance check (see payment-db-access runbook for connection details):

SELECT query, calls, mean_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 1000
ORDER BY mean_exec_time DESC
LIMIT 10;

If queries over 1000ms: Database performance issue, escalate to #database-oncall

If queries normal: Continue to Step 6

Step 6: Verify External Dependencies

Check status pages for:

  • Stripe API status
  • SendGrid API status
  • Auth0 status

If external service degraded: External dependency problem, see External-API-Degradation runbook

If all external services healthy: Continue to Step 7

Step 7: Investigation Complete

If all previous steps show normal operation, latency cause is unclear.

Next Actions:

  1. Take a thread dump for analysis and save it locally (assumes the JVM runs as PID 1 in the container): kubectl exec payment-api-pod-name -- jstack 1 > threaddump.txt
  2. Escalate to #platform-oncall with investigation findings
  3. Document symptoms and investigation results in incident timeline

Escalation Criteria:

  • Investigation exceeds 30 minutes without identifying root cause
  • Multiple potential causes identified requiring specialized expertise
  • Customer impact continues increasing

Contact: Page #platform-oncall via PagerDuty


This example demonstrates how templates provide structure while remaining flexible for different investigation paths.

Common Template Mistakes

Mistake 1: Too Much Detail

Some templates try to explain every concept and edge case, creating 20-page documents nobody reads during incidents.

Symptom: Engineers skip runbooks because they’re too long

Fix: Link to detailed documentation instead of embedding it. Keep runbook steps concise and action-focused.

Mistake 2: Not Enough Detail

Other templates assume too much knowledge, leaving out critical information.

Symptom: Engineers can’t complete procedures because commands are vague or missing

Fix: Include copy-paste-ready commands with placeholders for variables. Specify exact file paths and tool locations.

Mistake 3: Rigid Structure

Templates that force every runbook into identical format regardless of scenario create unusable documentation.

Symptom: Teams work around the template instead of using it

Fix: Provide a recommended structure, not a mandatory form. Let teams omit irrelevant sections.

Mistake 4: No Ownership

Templates without owners become stale as systems evolve.

Symptom: Runbooks reference deprecated systems or obsolete procedures

Fix: Assign ownership per runbook. Template metadata should include owner and last-validated date.

Mistake 5: Static Forever

Teams create templates once and never refine them based on actual usage.

Symptom: Same runbook problems repeat across all procedures

Fix: Review templates quarterly. Update based on post-incident feedback and new best practices.

Making Templates Stick

Templates only help if teams actually use them.

Make template use the default. When someone proposes creating a runbook, provide the template automatically. Make using the template easier than starting blank.

Demonstrate value immediately. New templates should save time from day one. If your first template is overly complex, nobody will adopt it.

Integrate with workflow. Store templates where runbooks live. If runbooks are in a wiki, the template should be a wiki page template. If they’re in version control, provide a template file.

Train once, reinforce constantly. Show new team members how to use templates during onboarding. Reference the template in runbook reviews.

Celebrate good examples. When someone creates an excellent runbook using the template, share it as a reference. Recognition reinforces adoption.

Templates That Improve Through Usage

The best runbook templates aren’t static documents—they evolve based on how procedures perform in practice.

Traditional templates treat runbooks as documentation to follow exactly. But real incident response rarely matches the written procedure perfectly. Engineers skip steps that don’t apply, add missing steps, or modify procedures based on actual conditions.

This gap between planned procedure and actual execution contains valuable information. Which steps get skipped consistently? Which procedures need additional context? What do engineers add that the template misses?

Platforms like Upstat track runbook execution step-by-step during incidents. Engineers record which steps they followed, what they discovered at each stage, and how they adapted procedures for the specific situation. This execution history reveals which templates work well and which need refinement.

After incidents, teams review execution data to improve templates. If engineers consistently skip a step, maybe it’s unnecessary. If they repeatedly add the same context, that should become part of the template. If certain investigations reveal nothing useful, simplify that section.

This creates a continuous improvement loop: templates guide initial response, execution tracking captures actual usage, and feedback drives template refinement. Over time, templates evolve to match how teams actually work instead of how someone imagined they would work.
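
The review step in this loop can start as a simple tally. The sketch below assumes execution records are exported as dictionaries mapping step names to a status; that format, the status values, and the 75% cutoff are all hypothetical and do not represent any platform's actual data model.

```python
from collections import Counter
from typing import Dict, List


def skip_rates(executions: List[Dict[str, str]]) -> Dict[str, float]:
    """Map each step name to the fraction of executions in which it was skipped."""
    skipped: Counter = Counter()
    seen: Counter = Counter()
    for execution in executions:
        for step, status in execution.items():
            seen[step] += 1
            if status == "skipped":
                skipped[step] += 1
    return {step: skipped[step] / seen[step] for step in seen}


executions = [
    {"Verify Alert Accuracy": "done", "Check Recent Deployments": "skipped"},
    {"Verify Alert Accuracy": "done", "Check Recent Deployments": "skipped"},
    {"Verify Alert Accuracy": "done", "Check Recent Deployments": "done"},
    {"Verify Alert Accuracy": "skipped", "Check Recent Deployments": "skipped"},
]
rates = skip_rates(executions)
candidates = [step for step, rate in rates.items() if rate >= 0.75]
print(candidates)  # ['Check Recent Deployments']
```

A high skip rate is a prompt for discussion, not an automatic deletion: the step may be unnecessary, or it may be unclear enough that engineers avoid it.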

Start Simple, Iterate Based on Reality

Don’t try to create the perfect template immediately. Start with a basic structure covering title, purpose, steps, and verification. Create runbooks using that template. Gather feedback. Refine.

After teams use the template for a month, ask:

  • What information is consistently missing?
  • Which sections feel like busywork?
  • What do engineers keep adding manually?
  • Which templates work best for which scenarios?

Update your template based on answers. Templates should serve the team, not constrain them.

The goal isn’t documentation perfection. The goal is procedures that work—runbooks that help engineers resolve incidents faster, maintain services reliably, and capture knowledge effectively.

Good templates make that goal achievable by providing proven structure without prescribing rigid process. They standardize what should be consistent while leaving room for scenario-specific adaptation.

Start with a template. Use it. Learn from actual execution. Improve continuously. That cycle turns adequate runbooks into excellent operational procedures that teams trust and actually follow.
