
How an API Provider Cut Resolution Time 40% with Integrated Runbooks

A B2B API platform providing payment processing infrastructure eliminated scattered runbook documentation and reduced mean time to resolution from 120 minutes to 72 minutes by implementing integrated runbooks, service catalog linking, and step-by-step execution tracking that automated procedure access during incidents.

November 8, 2025

Overview

A B2B API platform providing payment processing infrastructure to enterprise developers faced a critical operational challenge: their incident response process was slow and inconsistent. With a 28-person engineering team managing 15 microservices that processed over 50 million API calls daily for 800 business customers, operational procedures were scattered across Google Docs, Confluence pages, and tribal knowledge held by senior engineers.

During incidents, engineers wasted an average of 18 minutes searching for the correct diagnostic procedure. Different team members followed different troubleshooting steps for the same problems, leading to inconsistent response quality and extended resolution times. Junior engineers hesitated to respond to incidents without senior guidance because they could not quickly locate the right procedures.

The engineering leadership recognized that this documentation fragmentation was directly impacting their 120-minute mean time to resolution. They needed a way to centralize runbooks, link procedures to specific services, and make diagnostic steps immediately accessible during high-pressure incidents.

Results at a Glance

| Metric | Before Upstat | After Upstat | Improvement |
|---|---|---|---|
| Mean Time to Resolution | 120 minutes | 72 minutes | 40% reduction |
| Time Finding Procedures | 18 minutes | 2 minutes | 89% reduction |
| Runbook Usage Rate | 45% | 92% | 104% increase |
| Execution Documentation | 15% | 95% | 533% increase |

The Challenge

Scattered Documentation Across Multiple Tools

The platform team had accumulated roughly 35 operational procedures over three years, but they were scattered across multiple systems. Approximately 35 percent existed as Google Docs with varying levels of access control. Another 40 percent lived in Confluence pages organized inconsistently across different team spaces. The remaining 25 percent existed only as tribal knowledge in senior engineers' heads and had never been formally documented.

When an alert fired for a database connection pool saturation issue at 2 AM, the on-call engineer faced an immediate problem: finding the correct diagnostic procedure. Was it in the database team’s Confluence space? Had someone written a Google Doc during the last incident? Should they wake the database architect who knew the steps by heart?

This search consumed an average of 18 minutes per incident before troubleshooting even began. For a team targeting sub-60-minute resolution times, losing nearly a third of that window to documentation search was unacceptable.

Inconsistent Response Procedures

Without centralized, authoritative runbooks, different engineers developed their own diagnostic approaches. Three different engineers responding to the same API gateway timeout issue might investigate in three different orders: one checking upstream service health first, another examining network connectivity, a third reviewing request queues.

This variation meant that incident response quality depended heavily on which engineer was on-call. An experienced engineer might resolve an issue in 45 minutes by following an optimal diagnostic sequence. A less experienced engineer working the same issue might take 140 minutes, pursuing less effective troubleshooting paths.

Post-incident reviews revealed this inconsistency but offered no solution. The team had no standardized procedures that everyone agreed to follow, and no mechanism to enforce consistency even if they had documented standard approaches.

No Connection Between Services and Procedures

The platform’s 15 microservices each had specific failure modes requiring different diagnostic approaches. The payment authorization service failed differently than the webhook delivery service. The rate limiting service had unique troubleshooting steps distinct from the API gateway.

Engineers needed to remember which runbook applied to which service. When the webhook delivery service alert fired, was the relevant runbook named webhook-troubleshooting, message-delivery-issues, or async-processing-failures? The naming was inconsistent, and there was no systematic way to find service-specific procedures.

Senior engineers had this mapping memorized through repeated exposure. Junior engineers frequently asked in Slack: what runbook should I use for this service? This created dependencies on specific team members and delayed response when those experts were unavailable.

Missing Execution History

Even when engineers found and followed runbooks, there was no record of which steps they actually completed. Post-incident reviews asked questions the team could not answer: Did we try restarting the cache layer? When did we check the database connection pool? Who validated the API key rotation?

Engineers sometimes documented their actions in Slack threads or incident tickets, but this was inconsistent and often incomplete. The lack of execution tracking meant repeat incidents required rediscovering the same diagnostic steps rather than referencing what worked previously.

The Impact on Resolution Time

These documentation problems compounded to create a mean time to resolution of 120 minutes. Breaking down the incident lifecycle:

  • Detection: 5 minutes (automated monitoring worked well)
  • Finding Procedures: 18 minutes (searching scattered documentation)
  • Diagnosis: 52 minutes (inconsistent troubleshooting approaches)
  • Remediation: 35 minutes (executing fixes)
  • Validation: 10 minutes (confirming resolution)

The team recognized that nearly 60 percent of their resolution time was spent on activities that could be improved through better documentation practices: finding procedures and inconsistent diagnosis approaches.

How Did Integrated Runbooks Reduce Resolution Time?

The platform team evaluated several approaches to solving their runbook problem: hiring technical writers to better organize existing documentation, implementing a documentation-first culture through process changes, or adopting tooling that integrated runbooks directly into incident response workflows.

They chose Upstat because it addressed the root cause of their problems through three core capabilities: centralized runbook storage with structured management, service catalog integration that linked procedures to specific services and made them immediately accessible during incidents, and step-by-step execution tracking that ensured consistent responses while building an audit trail.

Centralized Runbook Storage with Structure

The team migrated their 12 most critical runbooks from scattered Google Docs and Confluence pages into Upstat’s centralized runbook system. Each runbook followed a consistent structure: overview describing the failure mode, prerequisite checks before beginning diagnosis, step-by-step diagnostic procedures, remediation actions based on findings, and validation steps to confirm resolution.

Runbooks supported decision-driven branching for conditional scenarios. The database connection pool saturation runbook included branching logic: if connection count exceeds 90 percent of the pool limit, follow the scale-up procedure; if connection count is normal but latency is high, follow the query optimization procedure; if both metrics are normal, escalate to the database architect.
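
This kind of branching maps naturally onto a simple data structure. The sketch below is a hypothetical illustration only, not Upstat's actual runbook schema: the type names (Runbook, Step, Branch) and field layout are assumptions, while the pool-saturation decision criteria come from the example above.

```typescript
// Hypothetical sketch of a runbook with decision-driven branching.
// Not Upstat's actual schema; type and field names are illustrative.

interface Step {
  id: string;
  instruction: string;       // what the responder should do
  expectedOutput?: string;   // what a healthy or unhealthy result looks like
}

interface Branch {
  condition: string;         // human-readable decision criterion
  nextSection: string;       // section of the runbook to follow next
}

interface Runbook {
  name: string;
  overview: string;
  prerequisites: Step[];
  diagnostics: Step[];
  branches: Branch[];        // evaluated after the diagnostic steps complete
}

// The database connection pool saturation runbook from this case study,
// expressed in the hypothetical structure above.
const poolSaturationRunbook: Runbook = {
  name: "database-connection-pool-saturation",
  overview: "Connection pool nearing exhaustion on the primary database cluster.",
  prerequisites: [
    { id: "p1", instruction: "Confirm the alert is not a monitoring false positive." },
  ],
  diagnostics: [
    { id: "d1", instruction: "Check connection count against the configured pool limit." },
    { id: "d2", instruction: "Check query latency on the primary." },
  ],
  branches: [
    { condition: "Connection count > 90% of pool limit", nextSection: "scale-up-procedure" },
    { condition: "Connection count normal, latency high", nextSection: "query-optimization-procedure" },
    { condition: "Both metrics normal", nextSection: "escalate-to-database-architect" },
  ],
};
```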

The centralized system provided a single authoritative location for all operational procedures. When an engineer updated the API gateway timeout procedure based on a recent incident, the current runbook reflected the latest best practices. This eliminated the documentation drift and version conflicts that had plagued their scattered Google Docs and Confluence approach.

Service Catalog Integration for Automatic Procedure Access

The team created catalog entities for their 15 core microservices, documenting each service’s purpose, dependencies, and operational characteristics. They then linked runbooks to the specific services they addressed: the payment authorization runbook linked to the payment service entity, the webhook delivery runbook linked to the webhook service entity.

This linking made relevant runbooks immediately accessible during incidents. When a monitor detected the webhook delivery service was failing and automatically created an incident with the affected service identified, the webhook delivery troubleshooting runbook appeared in the incident view because of its link to the webhook service catalog entity. Engineers no longer needed to remember which runbook applied to which service or search through documentation folders.

The service catalog also mapped dependencies between services. When the payment authorization service incident occurred, engineers could immediately see that it depended on the database cluster, the API gateway, and the encryption service. If the runbook’s initial diagnostic steps ruled out payment service code issues, the dependency map showed the next systems to investigate.
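
To make the linking concrete, here is a minimal sketch of how catalog entities, dependencies, and runbook links could fit together. It is not Upstat's data model: the type names, the webhook service's dependency list, and the suggestForIncident helper are assumptions for illustration.

```typescript
// Hypothetical sketch of service catalog entities with runbook links and
// dependency mapping. Illustrative only; not Upstat's actual data model.

interface CatalogEntity {
  name: string;
  owningTeam: string;
  dependsOn: string[];   // names of other catalog entities
  runbooks: string[];    // runbook names linked to this service
}

const catalog: CatalogEntity[] = [
  {
    name: "payment-authorization-service",
    owningTeam: "payments",
    dependsOn: ["database-cluster", "api-gateway", "encryption-service"],
    runbooks: ["payment-authorization-troubleshooting"],
  },
  {
    name: "webhook-delivery-service",
    owningTeam: "integrations",
    dependsOn: ["api-gateway"],   // assumed dependency for illustration
    runbooks: ["webhook-delivery-troubleshooting"],
  },
];

// Given the affected service on an incident, surface its linked runbooks and
// the dependencies to investigate next if the service itself is ruled out.
function suggestForIncident(affectedService: string) {
  const entity = catalog.find((e) => e.name === affectedService);
  if (!entity) return { runbooks: [], dependencies: [] };
  return { runbooks: entity.runbooks, dependencies: entity.dependsOn };
}

console.log(suggestForIncident("payment-authorization-service"));
// => runbooks: ["payment-authorization-troubleshooting"],
//    dependencies: ["database-cluster", "api-gateway", "encryption-service"]
```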

Step-by-Step Execution Tracking

Each runbook consisted of individual steps that engineers could check off as they completed them during incident response. The payment authorization runbook's diagnostic section included steps such as checking database connection pool utilization, verifying API gateway health, examining recent deployment history, and reviewing error logs for exception patterns.

As the on-call engineer worked through these steps, they marked each one complete in real time. This created an execution audit trail visible to everyone monitoring the incident. A second engineer joining the investigation 20 minutes later could immediately see which diagnostic steps had already been completed and what the findings were, preventing duplicate work.

The execution history also supported learning and improvement. After resolving an incident, the team could review exactly which steps were followed, in what order, and how long each took. This data informed runbook refinements: steps that consistently revealed root causes moved earlier in the procedure; steps that rarely provided useful information were removed or made optional.
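
A rough sketch of what such execution tracking could look like is shown below. It is a hypothetical illustration under assumed names (RunbookExecution, completeStep, stepDurationsMinutes), not Upstat's actual API; the duration calculation assumes steps are executed sequentially.

```typescript
// Hypothetical sketch of step-by-step execution tracking during an incident.
// Illustrative only; class and field names are assumptions.

interface StepExecution {
  stepId: string;
  completedBy: string;
  completedAt: Date;
  finding?: string;   // what the responder observed at this step
}

class RunbookExecution {
  private record: StepExecution[] = [];

  completeStep(stepId: string, engineer: string, finding?: string): void {
    this.record.push({ stepId, completedBy: engineer, completedAt: new Date(), finding });
  }

  // A second responder joining mid-incident can see what is already done.
  completedSteps(): string[] {
    return this.record.map((e) => e.stepId);
  }

  // Post-incident review: how long each step took, measured from the previous
  // completion (a rough proxy when steps run one after another).
  stepDurationsMinutes(startedAt: Date): { stepId: string; minutes: number }[] {
    let prev = startedAt;
    return this.record.map((e) => {
      const minutes = (e.completedAt.getTime() - prev.getTime()) / 60000;
      prev = e.completedAt;
      return { stepId: e.stepId, minutes: Math.round(minutes) };
    });
  }
}

// Example usage with hypothetical step IDs.
const exec = new RunbookExecution();
exec.completeStep("d1", "oncall-engineer", "Connection pool at 94% of limit");
```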

Consistent Response Across the Team

With centralized runbooks automatically suggested based on the failing service and execution tracking ensuring complete follow-through, response procedures became consistent regardless of which engineer was on-call. The experienced engineer who had responded to webhook delivery issues a dozen times followed the same steps as the junior engineer encountering that failure mode for the first time.

This consistency meant incident resolution time became more predictable. The variance in resolution time decreased significantly because everyone followed proven diagnostic procedures rather than improvising based on personal experience.

Junior engineers gained confidence responding to incidents independently. The runbooks provided the structured guidance they needed, and execution tracking showed them exactly what to do next. The dependency on senior engineers for every incident decreased, distributing the on-call load more evenly across the team.

Integration with Incident Management

Runbooks appeared directly within the incident timeline alongside communication, status updates, and actions taken. Engineers did not need to context-switch between tools or lose their place when moving between the incident coordination view and the runbook execution view.

When multiple engineers collaborated on an incident, they could all see the runbook execution status in real time. As one engineer completed diagnostic step three, the others immediately saw that progress. Coordination improved because everyone had shared visibility into which procedures were being followed.

What Was the Implementation Process?

The platform team completed runbook integration in six weeks through a phased deployment that maintained continuous incident response capability throughout the migration.

Weeks 1-2: Runbook Migration and Standardization

The team identified their 12 most critical runbooks based on incident frequency data from the previous six months. These covered 75 percent of all incidents by volume and included procedures for database connection pool issues, API gateway timeouts, webhook delivery failures, rate limiting problems, authentication service outages, and cache invalidation scenarios.

For each runbook, they conducted a standardization exercise. The team brought together the engineers who typically responded to that incident type and synthesized their different approaches into a single authoritative procedure. The database connection pool runbook combined the approaches of three different engineers who each had slightly different diagnostic sequences.

Runbooks were structured consistently: an overview section describing symptoms and impact, prerequisite checks to validate before beginning diagnosis, diagnostic steps with expected outputs and decision branches, remediation procedures based on diagnosis results, and validation steps to confirm the issue was resolved.

Decision-driven branching proved particularly valuable for complex scenarios. The API gateway timeout runbook included branching based on error rate patterns: if timeout rate exceeds 10 percent, follow the upstream service investigation procedure; if timeout rate is under 5 percent with specific user patterns, follow the API key quota procedure; if timeouts correlate with deployment times, follow the rollback procedure.

Week 3: Service Catalog Configuration

The team created catalog entities for their 15 core microservices. Each entity documented the service’s purpose, owning team, deployment frequency, dependencies, and operational characteristics. The payment authorization service entity linked to the database cluster entity it depended on, the API gateway entity it communicated through, and the encryption service entity it used for sensitive data.

Runbooks were then linked to the catalog entities they addressed. The payment authorization troubleshooting runbook linked to the payment authorization service entity. The database connection pool runbook linked to the database cluster entity. This linking enabled automatic runbook suggestion when incidents occurred.

The service catalog also provided business context during incidents. The webhook delivery service entity documented which customer accounts relied most heavily on webhook notifications. When that service failed, engineers immediately knew the business impact without needing to ask product managers or customer support.

Week 4: Incident Integration and Automation

The team configured their existing monitors to automatically create incidents in Upstat when failures were detected. A monitor checking payment authorization service health created a severity 1 incident when the failure rate exceeded 5 percent. The webhook delivery health check created a severity 2 incident when the delivery queue depth exceeded its threshold.

These automatically created incidents included the affected service from the service catalog, which triggered automatic runbook suggestion. An engineer arriving at a payment authorization service incident immediately saw the payment authorization troubleshooting runbook suggested in the incident view, ready to execute.
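
The automation described above amounts to a small set of threshold rules. The sketch below is hypothetical and not Upstat's configuration format: the rule shape, the evaluate function, and the webhook queue-depth threshold value are assumptions, while the 5 percent failure-rate trigger and the severities come from this case study.

```typescript
// Hypothetical sketch of monitor-to-incident automation. Illustrative only;
// not Upstat's actual configuration format.

interface MonitorRule {
  service: string;       // catalog entity the monitor watches
  metric: string;
  threshold: number;
  severity: 1 | 2 | 3;
}

const rules: MonitorRule[] = [
  { service: "payment-authorization-service", metric: "failure_rate_pct", threshold: 5, severity: 1 },
  { service: "webhook-delivery-service", metric: "queue_depth", threshold: 10000, severity: 2 },  // threshold value assumed
];

// When a reading breaches its rule, open an incident tagged with the affected
// service so the linked runbook can be suggested automatically.
function evaluate(service: string, metric: string, value: number) {
  const rule = rules.find((r) => r.service === service && r.metric === metric);
  if (rule && value > rule.threshold) {
    return {
      createIncident: true,
      severity: rule.severity,
      affectedService: service,   // drives runbook suggestion via the catalog link
    };
  }
  return { createIncident: false };
}
```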

Incident timelines became the central coordination point. Engineers documented their runbook execution progress directly in the timeline, creating a unified view of both what was being tried and what people were discussing. The previous fragmentation between Slack discussions and separate runbook execution disappeared.

Weeks 5-6: Training and Parallel Operation

All 28 engineers received training on the new runbook system through hands-on practice incidents. The team simulated common failure scenarios in a staging environment and had engineers practice executing runbooks with step-by-step tracking.

For two weeks, the team ran parallel operation, maintaining their old documentation approach while also using the new integrated runbooks. This validated that the runbooks covered the necessary scenarios and that automatic suggestion was surfacing the correct procedures for each incident type.

Engineers provided feedback on runbook quality, which led to immediate refinements. The API gateway timeout runbook added a step about checking recent DNS changes after an engineer discovered that was a common root cause. The webhook delivery runbook clarified the decision criteria for when to restart services versus scaling workers.

By week six, the team had fully transitioned to the integrated runbook approach, decommissioning the scattered Google Docs and Confluence pages in favor of the centralized, versioned system.

What Results Did the Team Achieve?

The platform team tracked detailed metrics before and after runbook integration to validate operational impact. The improvements appeared immediately and were sustained through four months of measurement.

40 Percent Reduction in Mean Time to Resolution

Mean time to resolution decreased from 120 minutes to 72 minutes, a 40 percent improvement. Breaking down where the time savings came from:

Procedure Access Time: Dropped from 18 minutes to 2 minutes (89 percent reduction). Automatic runbook suggestion based on the affected service eliminated the documentation search. Engineers clicked directly into the relevant runbook from the incident view.

Diagnostic Time: Decreased from 52 minutes to 35 minutes (33 percent reduction). Structured diagnostic procedures with decision branching guided engineers to root causes faster. The consistent approach meant effective troubleshooting sequences were followed every time rather than relying on individual engineer experience.

Remediation Time: Remained approximately the same at 35 minutes. This represented the actual work of fixing issues, which runbooks could guide but not directly accelerate.

Validation Time: Dropped from 10 minutes to 6 minutes (40 percent reduction). Runbooks included specific validation steps with expected results, making it clear when an incident was truly resolved versus still degraded.

The compounding effect of these improvements across the incident lifecycle produced the 40 percent overall reduction in resolution time. Incidents that had previously taken two hours now resolved in 72 minutes, significantly reducing customer impact.

89 Percent Faster Procedure Location

Time spent finding the correct runbook during an incident dropped from 18 minutes to 2 minutes. The 2 remaining minutes represented the time to read the runbook overview and confirm it matched the current failure symptoms, not time spent searching for documentation.

Service catalog linking eliminated the search problem entirely. When the webhook delivery service monitor triggered an incident, the system identified the affected service from the catalog. Runbooks linked to the webhook service appeared immediately in the incident view. No searching through folders, no asking in Slack which runbook to use, no trying to remember file names.

This improvement had a particularly strong impact on junior engineers and on-call shifts during off-hours. The senior engineer with three years of experience no longer had a significant advantage in procedure location over the engineer who had joined three months earlier. Both saw the same automatic runbook suggestion and could begin troubleshooting immediately.

104 Percent Increase in Runbook Usage

Runbook usage rate increased from 45 percent to 92 percent of incidents. Before integration, engineers frequently skipped runbooks because the friction of finding them was too high. During a high-pressure incident at 3 AM, spending 18 minutes searching for documentation felt inefficient, so engineers improvised based on experience.

With catalog linking reducing access time to under 2 minutes, runbooks became the default approach rather than an optional resource. Engineers followed structured procedures for the vast majority of incidents because the procedures were immediately available and execution tracking made progress visible.

The 8 percent of incidents where runbooks were not used typically represented novel failure modes that had never occurred before. These incidents drove runbook creation: after resolving a new issue, the team documented the diagnostic approach as a new runbook to handle future occurrences.

533 Percent Increase in Execution Documentation

The percentage of incidents with complete procedure execution documentation increased from 15 percent to 95 percent. Step-by-step execution tracking made documentation automatic rather than manual. As engineers worked through runbook steps, checking them off created the execution record without additional effort.

This execution history proved valuable for multiple purposes. Post-incident reviews could see exactly what was tried and in what order, identifying opportunities to improve runbooks or incident response. Engineers debugging repeat incidents could reference previous executions to understand what had worked before. New team members studying historical incidents could learn proven troubleshooting approaches.

The 5 percent of incidents without complete execution documentation typically involved manual escalations to specialized experts who resolved issues outside the standard runbook procedures. Even in these cases, partial execution documentation existed for the diagnostic steps that were followed before escalation.

Improved Cross-Team Collaboration

Incidents requiring coordination between multiple teams became smoother with integrated runbooks. When a payment authorization issue required investigation by both the API platform team and the database team, both teams could see the same runbook execution status in real time.

The database team could observe that the API platform team had already ruled out application-level issues and validated network connectivity, allowing them to focus their investigation on database-specific causes. This visibility eliminated repeated questions and duplicated work between teams.

The centralized execution tracking also improved handoffs during timezone transitions. When the US West Coast engineer ended their on-call shift and handed an ongoing incident to the US East Coast engineer, the incoming engineer could immediately see which runbook steps had been completed and what findings had been discovered.

Accelerated Onboarding for New Engineers

New engineers joining the team previously required 4-6 weeks before taking on-call responsibility, largely due to the undocumented tribal knowledge required for effective incident response. With integrated runbooks providing structured guidance for common scenarios, that onboarding time decreased to 2-3 weeks.

New engineers could shadow experienced responders by watching runbook execution in real incidents, then gradually take on incident response with the confidence that runbooks would guide them through proven procedures. The step-by-step structure made it clear what to do next, reducing dependency on senior engineers for every decision.

Historical execution records also served as training material. New engineers studied how previous incidents were diagnosed and resolved, learning the troubleshooting approaches the team had refined over time.

Key Takeaways

Service catalog linking eliminated the 18 minutes engineers previously spent searching for documentation by making runbooks for affected services immediately accessible in the incident view, enabling troubleshooting to begin within 2 minutes.

Step-by-step execution tracking ensured consistent response procedures across all engineers regardless of experience level, reducing diagnostic time by 33 percent through proven troubleshooting sequences.

Decision-driven branching in runbooks guided engineers through conditional scenarios, preventing the wasted time that occurred when engineers pursued incorrect troubleshooting paths for complex failures.

Centralized storage in a single authoritative location prevented runbook documentation drift and ensured all engineers worked from current, accurate procedures rather than outdated documents scattered across Google Docs.

The 40 percent reduction in mean time to resolution came from compounding improvements across the incident lifecycle: faster procedure access, more effective diagnosis, and clearer validation criteria. The result demonstrates that runbook integration accelerates response through systematic documentation rather than individual heroics.

Ready to Accelerate Incident Resolution?

See how Upstat's integrated runbooks eliminate procedure search time and ensure consistent response across your team.