The Untested Runbook Problem
Your database failover runbook looks comprehensive. Twenty-seven steps covering every contingency, complete with commands and validation checks. You’ve never actually executed it. When disaster strikes at 2 AM, step 14 references a script that was moved six months ago. Step 19 assumes database credentials are stored in a location that changed during last quarter’s security audit. By step 22, you’re improvising instead of following procedures.
Untested runbooks create false confidence. Teams believe they have solid procedures until production incidents prove otherwise. The problem isn’t bad documentation—it’s that systems change while runbooks sit static. The only way to know if procedures work is testing them under realistic conditions before incidents require them.
Testing reveals what documentation review cannot: broken commands, missing prerequisites, incorrect assumptions, and gaps in logic that only surface during actual execution.
Why Testing Matters More Than Writing
Creating runbooks takes effort. Testing them takes more. Many teams write procedures but never validate they work, assuming well-documented steps will suffice during incidents. This assumption fails repeatedly.
Real systems differ from documentation: Deployment patterns change. Infrastructure evolves. Service dependencies shift. What worked when you documented it may not work now. Testing catches drift before incidents do.
Complexity creates failure points: Multi-step procedures have numerous places where assumptions can be wrong. A single incorrect command early in a twenty-step runbook renders everything afterward useless. Testing identifies these failure points in low-pressure situations.
Time pressure amplifies problems: During incidents, engineers follow procedures quickly under stress. Steps that seem clear when reading become ambiguous during execution. Testing reveals which instructions need clarification before people need them desperately.
Context varies between executions: Runbooks that work for senior engineers who wrote them often confuse junior engineers following them later. Testing with different team members exposes knowledge gaps that documentation doesn’t capture adequately.
Game Days: Deliberate Practice
Game days simulate production incidents in controlled environments, allowing teams to test runbooks without risking actual services. These exercises provide the highest-fidelity testing available.
Schedule regular exercises: Quarterly game days for critical runbooks ensure procedures stay current. Less critical runbooks might be tested annually or when significant system changes occur. Consistency matters more than frequency—regular testing builds confidence and reveals trends.
Simulate realistic scenarios: Game day exercises should mirror actual incident conditions as closely as possible. Use production-like environments with realistic data volumes and service dependencies. Simple test environments miss problems that production complexity reveals.
Include the on-call rotation: Have actual on-call engineers execute runbooks during game days, not just the people who wrote them. This validates that procedures work for their intended audience rather than just subject matter experts who already understand the systems deeply.
Measure execution time: Track how long each runbook execution takes during game days. If a supposedly routine procedure takes two hours instead of the documented thirty minutes, that reveals problems with either the procedure or the time estimate. Both require fixing.
Document everything that goes wrong: The point of game days is finding problems safely. Keep detailed notes about which steps failed, what assumptions were incorrect, what prerequisites were missing, and where instructions confused responders. These notes drive runbook improvements.
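One lightweight way to capture both the timing and the findings from a game day is a simple execution log kept alongside the exercise. The sketch below is a minimal example, not a prescribed format: the runbook name, step labels, and output path are all hypothetical, and any note-taking tool that records step, duration, and outcome serves the same purpose.

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class StepResult:
    step: str                  # step label as written in the runbook
    succeeded: bool
    duration_seconds: float
    notes: str = ""            # what failed, what was unclear, what was missing

@dataclass
class GameDayRecord:
    runbook: str
    executed_by: str
    results: list = field(default_factory=list)

    def run_step(self, step: str, action) -> None:
        """Time a single runbook step and record what happened."""
        start = time.monotonic()
        try:
            action()
            ok, notes = True, ""
        except Exception as exc:
            ok, notes = False, str(exc)
        self.results.append(StepResult(step, ok, time.monotonic() - start, notes))

    def save(self, path: str) -> None:
        with open(path, "w") as fh:
            json.dump(asdict(self), fh, indent=2)

# Hypothetical usage during an exercise:
record = GameDayRecord(runbook="db-failover", executed_by="on-call engineer")
record.run_step("1. Confirm replica lag", lambda: None)
record.save("gameday-db-failover.json")
```

Summing the recorded durations gives the measured execution time to compare against the runbook's estimate, and the notes field becomes the raw material for the improvement work described above.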
Maintenance Windows: Real-System Testing
Scheduled maintenance provides opportunities to test runbooks against actual production systems with controlled risk. Unlike game days, maintenance window testing validates procedures on the real infrastructure that matters.
Execute documented procedures exactly: During maintenance, follow runbooks step-by-step rather than relying on expert knowledge. This verifies documentation accuracy under real conditions. Resist the temptation to skip steps or take shortcuts—the goal is validating procedures, not just completing maintenance.
Test rollback procedures: Maintenance windows offer chances to test rollback and recovery procedures. If applying an update, execute the documented rollback steps to verify they work. Many teams test forward procedures but never validate they can undo changes when necessary.
Validate prerequisites: Runbooks often assume specific system states or available resources. Maintenance windows reveal whether those assumptions hold true. If a runbook assumes certain credentials are available but the engineer performing maintenance lacks access, that gap gets exposed and fixed.
Time the execution: Real-world timing often differs from estimates. Maintenance window execution reveals actual duration, including wait times for systems to restart or databases to complete operations. Update runbooks with realistic time estimates based on measured results.
Record deviations: When maintenance forces you to deviate from documented procedures, note why. Frequent deviations indicate runbooks don’t match operational reality. Either systems changed or documentation was always slightly wrong—either way, testing reveals the mismatch.
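To make the rollback point concrete, here is a minimal sketch of validating an undo path during a maintenance window. The commands and the state check are placeholders (plain `echo` calls) standing in for whatever your own runbook documents; the structure, not the specific commands, is the point.

```python
import subprocess

def run(cmd: list[str]) -> None:
    """Run a documented runbook command, failing loudly if it errors."""
    subprocess.run(cmd, check=True)

def capture_state() -> str:
    # Placeholder: replace with the health or state check your runbook documents,
    # e.g. a version string, replica count, or configuration checksum.
    return subprocess.run(
        ["echo", "state"], check=True, capture_output=True, text=True
    ).stdout

baseline = capture_state()

# Forward procedure: the documented change (hypothetical command).
run(["echo", "apply-change"])

# Rollback procedure: the documented undo steps (hypothetical command).
run(["echo", "rollback-change"])

# The rollback only counts as tested if the system returns to its baseline state.
assert capture_state() == baseline, "rollback did not restore the original state"
```

The assertion at the end is what most teams skip: confirming that the system after rollback actually matches the system before the change, rather than merely confirming the rollback commands ran without errors.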
Testing After System Changes
Significant system changes often invalidate existing runbooks. Proactive testing after changes prevents discovering broken procedures during the next incident.
Infrastructure migrations: When migrating databases, changing deployment systems, or moving to new infrastructure, test all affected runbooks immediately. Don’t wait for scheduled game days—validate procedures work with new infrastructure before relying on them.
Dependency updates: Changing service dependencies, updating frameworks, or modifying configuration management systems can break runbook procedures. Test runbooks that interact with changed components to verify they still work correctly.
Tool changes: Switching monitoring tools, changing deployment platforms, or adopting new command-line utilities requires runbook testing. Commands and workflows that reference old tools need updates and validation.
Team structure changes: When team membership changes significantly or responsibilities shift between groups, test runbooks to ensure new team members can execute procedures successfully. Knowledge that seemed obvious to previous maintainers might not be to new ones.
What to Validate During Testing
Effective testing goes beyond just executing steps—it validates that procedures achieve intended outcomes and handle edge cases appropriately.
Command accuracy: Every command in a runbook must execute successfully with correct syntax. Test each one. Verify commands produce expected output and don’t require undocumented flags or arguments.
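A useful first pass on command accuracy can be automated: confirm that every binary a runbook references actually exists on the machines responders will use. The sketch below assumes a hand-maintained list of commands; extracting them from your own runbook format is left open.

```python
import shutil

# Hypothetical list of binaries referenced by a runbook's commands.
documented_commands = ["pg_dump", "kubectl", "aws", "systemctl"]

missing = [cmd for cmd in documented_commands if shutil.which(cmd) is None]

if missing:
    print(f"Runbook references commands not found on this host: {missing}")
else:
    print("All referenced commands are present.")
```

This only catches missing tools, not wrong syntax or unexpected output, so it complements rather than replaces executing each step end to end.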
Prerequisites and permissions: Confirm responders have necessary access, credentials, and tools before starting procedures. Missing prerequisites are common failure points that testing exposes.
Timing assumptions: Validate how long operations actually take. If a runbook says “wait 30 seconds for service restart” but restarts actually take 2 minutes, the procedure fails. Testing reveals accurate timing.
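Fixed sleeps are the most common timing assumption to break. A poll-with-timeout loop, sketched below, both tolerates variation and measures how long the wait actually took so the runbook's estimate can be corrected. The health check here is a stand-in for whatever your procedure documents.

```python
import time

def wait_until_healthy(check, timeout: float = 300.0, interval: float = 5.0) -> float:
    """Poll check() until it returns True; return the measured wait in seconds."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if check():
            return time.monotonic() - start
        time.sleep(interval)
    raise TimeoutError(f"service not healthy after {timeout} seconds")

# Hypothetical health check; replace with an HTTP probe, database ping,
# or systemctl query depending on what the runbook is waiting for.
elapsed = wait_until_healthy(lambda: True)
print(f"Service became healthy after {elapsed:.1f}s - update the runbook estimate.")
```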
Decision points: For runbooks with branching logic, test multiple paths through decision trees. Verify that each branch leads to appropriate next steps and that decision criteria are clear enough to answer reliably.
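For branching runbooks, it also helps to check that testing has exercised every branch rather than just the common one. A minimal way to do that is to represent the decision point as plain data and compare it against the branches taken in past tests; the branch labels below are hypothetical.

```python
# Hypothetical decision point from a database-recovery runbook.
branches = {
    "replica healthy": "promote replica",
    "replica lagging": "restore from backup",
    "both unavailable": "escalate to DBA on-call",
}

# Branch labels actually taken across past test runs, however you record them.
exercised = {"replica healthy", "replica lagging"}

untested = set(branches) - exercised
if untested:
    print(f"Decision branches never exercised in testing: {sorted(untested)}")
```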
Rollback procedures: Test not just forward procedures but also rollback and recovery steps. Verify you can undo changes if things go wrong. Many procedures document happy paths but leave recovery undefined.
Edge cases and variations: Test procedures under different conditions. What if the database is under load? What if certain services are already down? What if multiple issues exist simultaneously? Realistic testing includes complications that clean test environments omit.
Learning from Test Results
Testing generates valuable data that improves runbooks and operational practices. The goal isn’t passing tests—it’s learning what needs fixing.
Track failure patterns: If the same steps fail repeatedly across multiple tests or multiple runbooks, that indicates systemic issues. Perhaps access controls are configured inconsistently, or a common tool has changed. Identifying patterns guides improvements beyond individual runbook fixes.
Measure improvement over time: Compare test results for the same runbook across multiple quarters. Are execution times decreasing? Are failure rates dropping? Are fewer manual corrections needed? These trends reveal whether testing drives actual improvement or just produces reports.
Share learnings across teams: When testing reveals problems, check if other teams’ runbooks have similar issues. If one database runbook has incorrect connection strings, others probably do too. Systematic fixes prevent rediscovering the same problems repeatedly.
Update based on findings immediately: Don’t batch runbook fixes. Update procedures right after test failures while details are fresh. Waiting weeks to fix known problems wastes the value testing provides.
Test the fixes: After updating runbooks based on test results, retest those procedures to verify fixes work. Changes intended to improve procedures sometimes introduce new problems. Validation completes the improvement cycle.
Chaos Engineering and Runbook Testing
Chaos engineering deliberately introduces failures to test system resilience. This approach validates runbooks under realistic failure conditions that scheduled testing cannot reproduce safely.
Fire drills test real responses: Deliberately creating production incidents in controlled ways tests whether runbooks handle unexpected failures. If automated chaos experiments delete random pods, can on-call engineers follow documented recovery procedures successfully?
Validate detection first: Before runbooks can help, teams must detect problems. Chaos experiments verify that monitoring alerts correctly, pages reach the right people, and initial diagnostic runbooks identify failure causes accurately.
Test degraded conditions: Chaos engineering reveals how runbooks perform when multiple things fail simultaneously or when systems operate in degraded states. Clean maintenance windows can’t replicate the messy reality of cascading failures.
Automate regular testing: Continuous chaos experiments running between deployments provide ongoing runbook validation. Rather than quarterly game days, automated testing catches runbook drift quickly as systems evolve.
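As a rough illustration of what an automated experiment can look like, the sketch below deletes a random pod and then checks that the deployment recovers within the window the runbook claims. The namespace, deployment name, and label selector are placeholders, and this assumes kubectl access to a non-critical environment you are comfortable disrupting.

```python
import random
import subprocess

NAMESPACE = "staging"          # hypothetical namespace
DEPLOYMENT = "web-frontend"    # hypothetical deployment

# List pods belonging to the deployment (label selector assumed here).
pods = subprocess.run(
    ["kubectl", "get", "pods", "-n", NAMESPACE, "-l", f"app={DEPLOYMENT}",
     "-o", "jsonpath={.items[*].metadata.name}"],
    check=True, capture_output=True, text=True,
).stdout.split()

if not pods:
    raise SystemExit("no pods found - check the namespace and label selector")

# Introduce the failure: delete one pod at random.
victim = random.choice(pods)
subprocess.run(["kubectl", "delete", "pod", victim, "-n", NAMESPACE], check=True)

# Verify recovery within the window the runbook documents (here, 5 minutes).
subprocess.run(
    ["kubectl", "rollout", "status", f"deployment/{DEPLOYMENT}",
     "-n", NAMESPACE, "--timeout=300s"],
    check=True,
)
print(f"Deleted {victim}; deployment recovered within the documented window.")
```

Run on a schedule between deployments, a check like this turns runbook drift into a failed job rather than a surprise during the next incident.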
Execution Tracking as Testing Mechanism
Every real runbook execution during incidents or maintenance provides testing data, even when tests aren’t the primary goal. Tracking execution reveals which procedures work in practice versus theory.
Capture execution history: Recording which steps responders followed, which they skipped, which they modified, and how long each took provides evidence about runbook effectiveness. Pattern recognition across multiple executions shows which procedures need improvement.
Decision tracking reveals paths: Runbooks with decision trees create different paths depending on conditions. Tracking which branches get used frequently versus rarely helps prioritize testing and improvement efforts on high-traffic procedures.
Failure data drives updates: When runbook executions don’t resolve incidents, that’s testing data. Execution history showing frequent abandonment or repeated failures for specific steps indicates procedures needing fixes.
Success patterns validate procedures: Executions that resolve incidents quickly with minimal deviations validate that procedures work. Success data helps identify your best runbooks to use as templates for improving weaker ones.
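Whatever tool records executions, the analysis itself can stay simple: count how often each step is skipped or modified relative to how often it is reached. The record format below is illustrative only; the idea is that steps deviated from in most executions are the ones to rewrite first.

```python
from collections import Counter

# Illustrative execution records: one dict per real execution, with the
# recorded outcome for each step ("followed", "skipped", or "modified").
executions = [
    {"step 3": "followed", "step 4": "skipped",  "step 5": "followed"},
    {"step 3": "followed", "step 4": "skipped",  "step 5": "modified"},
    {"step 3": "followed", "step 4": "followed", "step 5": "modified"},
]

reached = Counter()
deviated = Counter()
for record in executions:
    for step, outcome in record.items():
        reached[step] += 1
        if outcome != "followed":
            deviated[step] += 1

# Flag steps that responders deviated from in half or more of executions.
for step in sorted(reached):
    rate = deviated[step] / reached[step]
    if rate >= 0.5:
        print(f"{step}: deviated in {rate:.0%} of executions - review this step")
```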
Platforms like Upstat track runbook execution during incidents and maintenance, recording exactly which steps responders followed, what decisions they made at branching points, and how long procedures took. This execution tracking acts as continuous testing—every real-world use reveals whether procedures work as intended. Pattern analysis across executions identifies steps that responders consistently skip or modify, signaling procedures needing updates.
Common Testing Antipatterns
Several approaches to testing sound reasonable but consistently fail to provide real value.
Testing only happy paths: Procedures that work perfectly under ideal conditions often fail when things go wrong—precisely when you need them most. Test edge cases and failure scenarios, not just clean executions.
Testing with experts only: If only the engineers who wrote runbooks test them, you miss knowledge gaps that junior responders will hit during real incidents. Test with diverse team members at different experience levels.
Infrequent testing cycles: Annual testing catches problems slowly. By the time you discover runbooks don’t work, they’ve been broken for months. More frequent testing provides faster feedback.
Testing without updating: Running tests but never fixing identified problems wastes effort. Testing’s value comes from improvement, not just measurement. Update procedures based on findings immediately.
Perfection before deployment: Waiting until runbooks are perfect before using them ensures they never get tested under real conditions. Deploy procedures early, test them, and improve through usage data.
Getting Started with Testing
Building a testing practice doesn’t require elaborate infrastructure or dedicated teams. Start small with high-value procedures and expand as testing demonstrates benefits.
Identify your critical runbooks: Which procedures would cause the most problems if they failed during incidents? Those are your testing priorities. Start with runbooks for your most important services and most frequent incident types.
Schedule the first game day: Pick one critical runbook and schedule a two-hour game day exercise within the next month. Invite relevant team members, simulate a realistic scenario, and execute the procedure step-by-step. Document everything that goes wrong.
Test during next maintenance: When you next perform scheduled maintenance, commit to following documented runbooks exactly rather than relying on expert knowledge. Note every deviation and unclear instruction.
Track execution during incidents: Start recording which runbooks get used during real incidents, which steps work, and which cause problems. This execution data guides testing priorities and improvement efforts.
Make testing routine: Once initial testing proves valuable, establish regular cadences. Quarterly game days for critical runbooks, testing after every significant system change, and maintenance window validation become standard practices.
Final Thoughts
Testing runbooks transforms documentation from hopeful guidance into validated procedures teams trust during critical moments. Untested runbooks create dangerous illusions of preparedness that crumble when real incidents strike.
Effective testing takes multiple forms: game day exercises that simulate realistic scenarios, maintenance window validation on real systems, testing after changes that might break procedures, and learning from execution tracking during actual incidents. No single approach suffices—comprehensive testing combines multiple strategies.
The goal isn’t achieving perfect procedures—it’s discovering problems safely so you can fix them before they matter. Every test that reveals broken steps or unclear instructions is a success because it prevents discovering those problems during high-pressure incidents.
Start testing your most critical runbooks this quarter. Schedule game days, commit to following procedures during maintenance, track execution during incidents, and update based on findings immediately. Your future incident responders will thank you when documented procedures actually work under pressure.
Explore In Upstat
Track runbook execution during incidents and maintenance to identify which steps work in practice and which need refinement based on real usage patterns.
