
How a Remote-First Team Improved Incident Coordination Across 5 Countries

A distributed engineering team spanning San Francisco to Singapore transformed their incident response by implementing follow-the-sun on-call coverage, reducing mean time to resolution from 52 to 18 minutes while cutting off-hours wake-ups by 85%.


A B2B SaaS company with 35 engineers distributed across 5 countries faced a critical operations challenge: their incident response process was optimized for a single timezone, creating coordination chaos as their team scaled globally. Engineers in Singapore were getting woken at 3 AM for US timezone incidents. Critical context disappeared during handoffs between European and Asia-Pacific teams. Finding the right expert during off-hours took an average of 45 minutes.

The engineering team knew they needed a better approach but were constrained by traditional on-call tools designed for co-located teams. After evaluating options, they implemented a follow-the-sun incident management model using multi-timezone on-call scheduling, centralized incident coordination, and automated handoff protocols.

The results were immediate and measurable: mean time to resolution decreased from 52 minutes to 18 minutes, time to locate experts dropped from 45 minutes to 8 minutes, and off-hours wake-ups reduced by 85 percent. More importantly, the team eliminated the coordination breakdowns that had plagued their distributed incident response.

Results at a Glance

| Metric | Before Implementation | After Implementation | Improvement |
| --- | --- | --- | --- |
| Mean Time to Resolution (MTTR) | 52 minutes | 18 minutes | 65% reduction |
| Time to Find Right Expert | 45 minutes | 8 minutes | 82% reduction |
| Cross-Timezone Handoff Time | 30 minutes | 5 minutes | 83% reduction |
| Context Loss Incidents | 12 per month | 2 per month | 83% reduction |
| Off-Hours Wake-Ups | 8-10 per week | 1-2 per week | 85% reduction |
| Incident Communication Channels | 5+ scattered channels | 1 centralized timeline | 100% consolidation |

What Problem Were They Trying to Solve?

TechFlow Solutions, a cloud-based project management platform serving 2,500 business customers, had grown from a 12-person US-based team to 43 people across five countries in two years. Their infrastructure consisted of 40+ microservices running on Kubernetes, generating hundreds of alerts each week.

The distributed team structure created operational advantages: diverse perspectives, round-the-clock presence, and access to global talent. However, their incident response process had not evolved to match their geographic distribution.

Scattered Communication During Critical Incidents

When incidents occurred, coordination happened across five or more Slack channels simultaneously. The US team used #incident-response, the UK team preferred #ops-alerts, and the Singapore engineer posted updates in #general. Critical information got buried in chat history. Engineers joining an investigation hours later spent significant time piecing together what had already been tried.

During one particularly chaotic database outage, three engineers in different timezones independently investigated the same symptoms, unaware their colleagues had already diagnosed the root cause. The incident took 94 minutes to resolve - 40 minutes of which was duplicate investigation work.

Broken Handoffs Between Timezones

The company attempted 24/7 on-call coverage using weekly rotations, meaning whoever was on-call handled incidents regardless of their timezone. Engineers in Singapore routinely got woken at 2 or 3 AM when US-based systems failed during peak traffic hours. The UK team often worked alone for eight hours before US team members came online to help.

Context loss during regional handoffs was particularly problematic. When the UK team finished their workday and handed an ongoing incident to the US team, critical details disappeared. Had they tried restarting the cache layer? Which customer accounts were affected? The handoff conversation happened over a Slack DM or a quick Zoom call, with no written record.

Time-Consuming Expert Location

Finding the right person to diagnose specific technical issues during off-hours consumed an average of 45 minutes. The person on-call might not have expertise in the failing system. They would ping multiple Slack channels, wait for responses from sleeping colleagues, and eventually escalate through management chains.

For one payment processing incident at 7 AM London time, the UK engineer on-call spent 52 minutes trying to reach someone familiar with the payment gateway integration. The expert was in California, asleep. By the time they connected and diagnosed the issue, customers had experienced 90 minutes of degraded checkout functionality.

Holiday Management Complexity

Different countries observed different holidays. The German engineer was off for Day of German Unity. The UK team had bank holidays the US team did not. Singapore celebrated different public holidays than everyone else. Coordinating on-call coverage around these holidays required manual spreadsheet tracking and frequent last-minute scrambles to fill gaps.

During one UK bank holiday weekend, an incident occurred and the designated backup on-call engineer turned out to be traveling as well. It took 38 minutes to find someone available to respond - purely because holiday coverage had not been properly coordinated across timezones.

Lack of Cross-Timezone Visibility

No one had clear visibility into which region was currently handling incidents. Executives asking “who is working on this?” received vague responses. Customer support could not confidently tell affected customers when they could expect resolution.

The head of operations described the situation: “We had talented engineers around the world, but during incidents it felt like they were working in isolation rather than as a coordinated team. The tools we were using were designed for everyone to be in one building.”

How Did Follow-the-Sun Incident Management Solve This?

The engineering leadership team evaluated several approaches: hiring dedicated follow-the-sun incident managers, implementing more rigorous handoff processes in their existing tools, or adopting a platform designed for distributed incident coordination.

They chose to implement a follow-the-sun incident management model with three complementary on-call rosters aligned to regional business hours, centralized incident coordination, and automated holiday management.

Multi-Timezone On-Call Rosters with IANA Timezone Support

Rather than one global rotation, they established three regional rosters:

Americas Roster: Covered 8 AM to 8 PM Pacific Time, including US West Coast and US East Coast engineers (12 team members total). Used the FairDistribution rotation algorithm to balance workload across both coasts.

EMEA Roster: Covered 8 AM to 8 PM GMT, including London and Berlin teams (12 team members total). Configured with IANA timezones Europe/London and Europe/Berlin to handle daylight saving transitions correctly.

APAC Roster: Covered 8 AM to 8 PM Singapore Time, including Singapore engineers plus two contractors in Sydney (5 team members total). Used Asia/Singapore and Australia/Sydney timezone identifiers.

Each roster was configured with the specific IANA timezone identifier for accurate local time display. When an engineer in Berlin logged in, they saw the schedule in Central European Time. When the San Francisco team viewed the same roster, they saw it converted to Pacific Time.
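
To make the mechanics concrete, here is a minimal sketch of how regional rosters with IANA timezone identifiers could be modeled and how an alert could be routed to whichever region is inside its business hours. The Roster dataclass, the active_rosters helper, and the member names are illustrative assumptions, not TechFlow's actual configuration.

```python
from dataclasses import dataclass
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo  # IANA timezone database (Python 3.9+)

@dataclass
class Roster:
    name: str
    tz: str            # IANA identifier, e.g. "Europe/London"
    start: time        # local start of coverage
    end: time          # local end of coverage
    members: list[str]

ROSTERS = [
    Roster("Americas", "America/Los_Angeles", time(8), time(20), ["us-eng-1", "us-eng-2"]),
    Roster("EMEA", "Europe/London", time(8), time(20), ["uk-sre-1", "de-eng-1"]),
    Roster("APAC", "Asia/Singapore", time(8), time(20), ["sg-eng-1", "au-eng-1"]),
]

def active_rosters(now_utc: datetime) -> list[Roster]:
    """Return every roster whose local business hours contain this UTC instant."""
    active = []
    for roster in ROSTERS:
        local_time = now_utc.astimezone(ZoneInfo(roster.tz)).time()
        if roster.start <= local_time < roster.end:
            active.append(roster)
    return active

# An alert at 03:00 UTC lands inside Singapore business hours (11:00 local), so APAC handles it.
print([r.name for r in active_rosters(datetime(2025, 3, 15, 3, 0, tzinfo=timezone.utc))])
```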

The FairDistribution rotation algorithm automatically balanced on-call duty across team members, accounting for recent incident volume and past on-call hours. This prevented the same people from being repeatedly assigned during high-incident periods.
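
The internals of the FairDistribution algorithm are not documented here, but a weighted load score over recent incident volume and accumulated on-call hours, roughly as sketched below, captures the described behavior. The weights, field names, and pick_next_on_call helper are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class MemberLoad:
    name: str
    recent_incidents: int    # incidents handled in the last rotation window
    on_call_hours_90d: int   # on-call hours accumulated over the last 90 days

def pick_next_on_call(members: list[MemberLoad],
                      incident_weight: float = 2.0,
                      hours_weight: float = 1.0) -> MemberLoad:
    """Pick the member with the lowest weighted load; weights are illustrative."""
    def score(m: MemberLoad) -> float:
        # Normalize hours to 12-hour shifts so both terms are on a similar scale.
        return incident_weight * m.recent_incidents + hours_weight * (m.on_call_hours_90d / 12)
    return min(members, key=score)

team = [
    MemberLoad("alice", recent_incidents=5, on_call_hours_90d=180),
    MemberLoad("bob",   recent_incidents=1, on_call_hours_90d=240),
    MemberLoad("carol", recent_incidents=2, on_call_hours_90d=120),
]
print(pick_next_on_call(team).name)  # carol has the lowest combined load under these weights
```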

Automated Holiday Calendar Integration

The system integrated with Google Calendar holiday feeds for each country: US federal holidays, UK bank holidays, German public holidays, and Singapore public holidays. When generating the on-call schedule, the algorithm automatically avoided assigning someone to on-call duty on their regional holidays.

The Berlin engineer no longer needed to manually request coverage for Day of German Unity. The UK team did not need spreadsheet tracking for bank holidays. Holiday coverage became automatic and reliable.

When a scheduled on-call person had a holiday, the system automatically assigned the next eligible person in the rotation. The override system allowed for manual coverage requests when someone needed to swap shifts for vacation or other reasons.
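
A simplified sketch of holiday-aware assignment: skip anyone whose regional calendar marks the shift date as a holiday. The hard-coded holiday dates, member names, and eligible_members helper are hypothetical; in practice the dates came from the Google Calendar feeds described above.

```python
from datetime import date

# Hypothetical per-region holiday sets; in practice these were populated from
# the Google Calendar holiday feeds rather than hard-coded.
HOLIDAYS = {
    "Europe/Berlin": {date(2025, 10, 3)},   # Day of German Unity
    "Europe/London": {date(2025, 8, 25)},   # UK summer bank holiday
}

def eligible_members(rotation: list[tuple[str, str]], shift_date: date) -> list[str]:
    """Return members whose region does not observe a holiday on the shift date."""
    return [name for name, tz in rotation if shift_date not in HOLIDAYS.get(tz, set())]

rotation = [("berlin-eng", "Europe/Berlin"), ("london-sre", "Europe/London")]
print(eligible_members(rotation, date(2025, 10, 3)))  # ['london-sre']
```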

Centralized Incident Timeline as Single Source of Truth

All incident communication moved to a centralized incident timeline visible to every team member regardless of location or timezone. When an alert fired and created an incident, a dedicated timeline started capturing every update, investigation step, and status change.

The UK engineer investigating at 9 AM London time documented their findings in the timeline. When the Singapore team came online later that day and the incident was still ongoing, they saw exactly what had been tried, which services were affected, and what hypotheses had been ruled out. No more piecing together information from scattered Slack channels.

The timeline showed real-time presence tracking - who was currently viewing or actively working on the incident. When the UK engineer joined an incident created by the US team, their presence indicator appeared. Other team members knew immediately that someone was investigating, preventing duplicate work.

For the database outage scenario that previously took 94 minutes with duplicate investigation, the centralized timeline made investigation history immediately visible. The second engineer to join could see the first engineer had already checked connection pooling and was investigating query performance. No duplicate work. Resolution time dropped to 21 minutes.
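
Conceptually, the centralized timeline behaves like an append-only event log with a presence set attached to each incident. The sketch below is a toy data model, not the platform's actual schema; the Incident and TimelineEvent names are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TimelineEvent:
    author: str
    kind: str        # "status", "investigation", "handoff", ...
    message: str
    at: datetime

@dataclass
class Incident:
    title: str
    events: list[TimelineEvent] = field(default_factory=list)
    viewers: set[str] = field(default_factory=set)  # presence tracking

    def log(self, author: str, kind: str, message: str) -> None:
        """Append an event; every region reads the same ordered history."""
        self.events.append(TimelineEvent(author, kind, message, datetime.now(timezone.utc)))

incident = Incident("Database connection pool exhaustion")
incident.viewers.add("uk-engineer")
incident.log("uk-engineer", "investigation", "Checked connection pooling; not the cause.")
incident.log("uk-engineer", "investigation", "Investigating slow queries on the orders table.")

# An engineer joining hours later replays the timeline instead of scrolling Slack history.
for event in incident.events:
    print(f"{event.at:%H:%M} UTC  {event.author}: {event.message}")
```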

Structured Handoff Protocols with Overlap

The team established formal handoff protocols between regions with 30-minute overlap periods. For the EMEA-to-Americas handoff at 5:30 PM GMT, both the outgoing London engineer and the incoming US engineer were available simultaneously.

If an incident was still ongoing during handoff, they had a brief synchronous conversation in the incident timeline to transfer context. The written timeline record meant nothing was lost even if the conversation was rushed.

For incidents that resolved before handoff, the incoming on-call engineer started their shift with full visibility into what happened during the previous region’s coverage through the incident timeline.

The Asia-Pacific to EMEA handoff at 5:30 PM Singapore time followed the same pattern. The 30-minute overlap gave the Singapore team time to brief the UK team on overnight activity.
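
The overlap arithmetic is simple but easy to get wrong across daylight saving boundaries, so a sketch helps. The handoff_window helper below is hypothetical: it interprets the handoff time in the outgoing region's zone and renders the 30-minute window in the incoming region's local time.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

def handoff_window(handoff_local: datetime, outgoing_tz: str, incoming_tz: str,
                   overlap_minutes: int = 30) -> tuple[datetime, datetime]:
    """Interpret a naive handoff time in the outgoing zone and return the
    overlap window expressed in the incoming region's local time."""
    start = handoff_local.replace(tzinfo=ZoneInfo(outgoing_tz))
    end = start + timedelta(minutes=overlap_minutes)
    incoming = ZoneInfo(incoming_tz)
    return start.astimezone(incoming), end.astimezone(incoming)

# EMEA -> Americas handoff at 17:30 London time, shown to the incoming Pacific engineer.
start, end = handoff_window(datetime(2025, 11, 7, 17, 30), "Europe/London", "America/Los_Angeles")
print(start.strftime("%H:%M %Z"), "-", end.strftime("%H:%M %Z"))  # 09:30 PST - 10:00 PST
```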

Service Catalog for Impact Context

The team built a service catalog documenting their 40+ microservices with dependency relationships. When an incident affected the payment processing service, the catalog immediately showed which upstream services depended on it and which downstream services it required.

This dependency visibility eliminated the 15-20 minutes previously spent during incidents asking “what else is affected?” The impact was immediately clear. Customer-facing services depending on the failed service were automatically identified.

The catalog also mapped customer accounts to service dependencies. When a critical database serving EU customers experienced issues, the support team instantly knew which customers to proactively notify rather than waiting for complaints.
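
Determining blast radius from the catalog amounts to walking the dependency graph upward from the failed service. The catalog entries, service names, and impacted_services helper below are illustrative, not TechFlow's real topology.

```python
# Hypothetical catalog entries; service names and relationships are illustrative.
CATALOG = {
    "payments-api": {"depends_on": ["postgres-eu", "redis-cache"], "customer_facing": True},
    "checkout-web": {"depends_on": ["payments-api"], "customer_facing": True},
    "postgres-eu":  {"depends_on": [], "customer_facing": False},
    "redis-cache":  {"depends_on": [], "customer_facing": False},
}

def impacted_services(failed: str) -> set[str]:
    """Walk the dependency graph upward to find everything that relies on the failed service."""
    impacted: set[str] = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for name, meta in CATALOG.items():
            if current in meta["depends_on"] and name not in impacted:
                impacted.add(name)
                frontier.append(name)
    return impacted

affected = impacted_services("postgres-eu")
print(sorted(affected))                                              # ['checkout-web', 'payments-api']
print(sorted(s for s in affected if CATALOG[s]["customer_facing"]))  # customer-facing blast radius
```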

Consistent Runbook Procedures

The team documented their top 15 incident types in runbooks accessible to all regions. Rather than each timezone developing slightly different response procedures, everyone followed the same step-by-step guides.

When a Redis cache failure occurred at 2 AM Pacific Time (handled by APAC roster), the Singapore engineer followed the documented runbook procedure: check connection count, verify memory utilization, restart replica if needed, validate read performance. The same procedure the US team would follow.

The runbooks included decision-driven branching: “If memory utilization exceeds 90 percent, follow scale-up procedure. Otherwise, follow restart procedure.” This consistency eliminated the variation in response quality across timezones.
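
Expressed as code, that branching logic is just a threshold check that selects one of two procedures. The redis_cache_runbook helper and its threshold values below mirror the example above but are illustrative rather than the team's actual runbook.

```python
def redis_cache_runbook(memory_utilization_pct: float, connection_count: int) -> list[str]:
    """Return the ordered response steps; the 90 percent threshold mirrors the
    branching described above, but values and steps are illustrative."""
    steps = [f"Verify connection count (currently {connection_count})"]
    if memory_utilization_pct > 90:
        steps += [
            "Follow scale-up procedure: add replica capacity",
            "Re-check memory utilization after scale-up",
        ]
    else:
        steps += [
            "Restart the affected replica",
            "Validate read performance against baseline",
        ]
    return steps

for step in redis_cache_runbook(memory_utilization_pct=94.0, connection_count=1800):
    print("-", step)
```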

Each runbook was associated with specific services in the catalog, so when an incident was created for a monitored service, the relevant runbook was automatically linked in the incident timeline.

Mobile-Responsive Access

Engineers could access the incident timeline, on-call schedule, and runbooks from mobile devices. The Singapore engineer checking the on-call schedule while commuting saw the same information as the Berlin engineer at their desk.

During a critical incident while the UK lead engineer was traveling, they participated in incident response from their phone during a train journey. The mobile interface showed presence tracking, allowed timeline updates, and provided access to the service catalog for impact analysis.

What Were the Measurable Results?

The engineering team tracked detailed metrics before and after implementation to validate the operational impact. The improvements appeared within the first month and sustained through six months of measurement.

65 Percent Reduction in Mean Time to Resolution

MTTR decreased from 52 minutes to 18 minutes - a 65 percent improvement. The centralized incident timeline eliminated time spent searching for context. Service catalog dependency analysis immediately showed impact scope. Consistent runbook procedures reduced diagnostic time.

For P1 critical incidents affecting customer-facing functionality, MTTR improved even more dramatically: from 73 minutes to 22 minutes (70 percent reduction). These high-severity incidents benefited most from faster expert location and immediate impact understanding.

82 Percent Faster Expert Location

Time to locate the correct expert during off-hours dropped from 45 minutes to 8 minutes. The follow-the-sun model meant someone with relevant expertise was always available during their business hours rather than asleep.

The service catalog responsibility tags clearly identified which team or engineer handled each system. When the payment gateway failed, the system automatically notified the engineers tagged for payment processing - no manual searching required.

The remaining 8 minutes typically represented the time it took the expert to acknowledge the alert and join the incident coordination, not time spent searching for them.

83 Percent Reduction in Context Loss

Monthly incidents involving significant context loss during timezone handoffs decreased from 12 to 2 (83 percent reduction). The centralized timeline captured all investigation steps, eliminating information loss during handoffs.

The two remaining context-loss incidents occurred during the first month of implementation when team members occasionally forgot to document their investigation steps in the timeline. After reinforcing the documentation practice, context loss effectively reached zero.

85 Percent Reduction in Off-Hours Wake-Ups

Off-hours wake-ups for engineers outside their on-call shift decreased from 8-10 per week to 1-2 per week (85 percent reduction). The follow-the-sun model meant the APAC roster handled incidents during APAC business hours rather than waking US engineers at 3 AM.

The 1-2 remaining off-hours wake-ups per week were genuinely critical situations requiring specific expertise not available in the current on-call region - for example, a database schema issue that needed the architect who designed the system, even though they were not on-call.

The UK team no longer worked incidents alone for 8 hours before US colleagues came online. The handoff system ensured seamless transitions between regions.

Complete Communication Consolidation

Incident-related communication moved from 5+ scattered Slack channels to a single centralized timeline for each incident. This eliminated the time spent searching for information and ensured consistent visibility regardless of timezone.

The Singapore engineer investigating an incident created while the US team was online could see the complete communication history. Nothing was buried in a Slack channel they had not joined.

Engineers reported significantly reduced cognitive load during incidents. Instead of monitoring multiple channels and piecing together information, they had one place to track everything happening.

Improved Cross-Timezone Collaboration

The team tracked collaboration quality through incident retrospectives. Engineers reported feeling like “one coordinated team” rather than isolated regional groups. Real-time presence tracking meant everyone knew who was working on what.

The UK engineering manager observed: “Before, our team felt like we were handling incidents alone until the US woke up. Now, if we need help, we hand off to APAC or wait a few hours for Americas roster. The follow-the-sun coverage actually makes us more collaborative because we know someone is always available.”

Customer-facing incident communication also improved. Support could confidently tell customers when to expect resolution because the timeline showed active investigation with clear ownership.

What Made This Implementation Successful?

The distributed team’s incident coordination transformation succeeded because of several key factors beyond just the technology implementation.

Executive Sponsorship from VP Engineering

The VP Engineering championed the follow-the-sun model as a strategic initiative rather than an operational detail. They presented the business case to the executive team: customer incidents were taking too long to resolve due to coordination problems, impacting customer satisfaction and renewal rates.

With executive buy-in secured, the project received dedicated time allocation. Engineers were given the bandwidth to properly configure rosters, document runbooks, and build the service catalog rather than treating it as something to squeeze in between feature work.

The VP Engineering also set clear expectations: “We are a distributed team, and our incident response process must match that reality. Everyone participates in making this work.”

Phased 6-Week Implementation

Rather than switching overnight, the team implemented over six weeks in three phases:

Weeks 1-2: Focus on on-call roster configuration. Each region designed their roster with input from engineers about preferred rotation patterns. Holiday calendars were integrated and tested.

Weeks 3-4: Service catalog and incident workflow creation. Engineers documented their services, mapped dependencies, and wrote runbooks for common incident types. The team ran practice incidents to validate the workflows.

Weeks 5-6: Training and rollout. All 43 team members received training on the new follow-the-sun model, handoff protocols, and incident timeline usage. The team ran their first production incidents using the new system with close monitoring.

The phased approach prevented overwhelming the team and allowed time to adjust processes based on early feedback.

Regional Champions in Each Timezone

The engineering team identified one advocate in each timezone to serve as the regional champion: a senior engineer in San Francisco, an SRE lead in London, and the tech lead in Singapore.

These champions helped their regional colleagues understand the new processes, answered questions about the incident timeline, and provided feedback to improve the implementation. Having a local point of contact in each timezone accelerated adoption.

The champions also coordinated handoff protocols between their regions, ensuring smooth transitions and resolving any confusion about overlap periods or communication expectations.

Structured Handoff Protocols with 30-Minute Overlap

The team established clear handoff expectations between regions. During the 30-minute overlap period, the outgoing on-call engineer and incoming on-call engineer both monitored for new incidents.

If an incident was ongoing during handoff:

  1. Outgoing engineer posted handoff summary in incident timeline
  2. Brief synchronous conversation (5-10 minutes) to transfer context
  3. Incoming engineer confirmed understanding
  4. Outgoing engineer remained available for 15 minutes after official handoff for questions

If no incidents were active during handoff:

  1. Outgoing engineer posted shift summary in team channel
  2. Incoming engineer reviewed any incidents that occurred during previous shift
  3. Brief acknowledgment that handoff was complete

This structured approach eliminated the ambiguity that previously caused context loss during timezone transitions.
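
A lightweight way to keep handoff summaries consistent is a fixed template posted as a timeline event at the start of the overlap. The fields and example values below are hypothetical, not the team's actual template.

```python
HANDOFF_TEMPLATE = """\
Handoff: {outgoing} ({outgoing_region}) -> {incoming} ({incoming_region})
Ongoing incident: {incident_title} [{severity}]
Tried so far: {attempted}
Ruled out: {ruled_out}
Affected customers: {customers}
Suggested next step: {next_step}
"""

summary = HANDOFF_TEMPLATE.format(
    outgoing="london-sre", outgoing_region="EMEA",
    incoming="sf-eng", incoming_region="Americas",
    incident_title="Elevated checkout latency", severity="P2",
    attempted="Restarted cache layer; no improvement",
    ruled_out="Connection pooling, recent deploys",
    customers="EU enterprise accounts on shard 3",
    next_step="Compare query plans before and after yesterday's index change",
)
print(summary)  # posted as a timeline event at the start of the 30-minute overlap
```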

Monthly Retrospectives for Continuous Improvement

The team held monthly retrospectives to review incident response effectiveness and identify improvement opportunities. Topics included:

What is working well? Engineers shared positive experiences with the centralized timeline, faster expert location, and reduced off-hours wake-ups.

What needs improvement? Early retrospectives identified gaps in runbook documentation and confusion about escalation paths. The team addressed these issues iteratively.

How can we optimize handoff protocols? The team refined overlap timing based on actual incident patterns. They found that 30 minutes was optimal - long enough for meaningful context transfer but not so long that it felt inefficient.

This continuous improvement mindset prevented the new processes from becoming rigid or outdated as the team and infrastructure evolved.

Integration with Existing Tools

The incident coordination system integrated with the team’s existing tools rather than requiring complete replacement:

Slack: Alert notifications still came through Slack, but now directed people to the centralized incident timeline rather than scattered channel discussions.

Monitoring: Existing monitoring tools (Datadog for infrastructure metrics, custom application monitoring) continued generating alerts. The integration automatically created incidents in the coordination system; a sketch of this alert-to-incident flow follows after this list.

Runbook storage: Engineers could keep runbooks in Confluence or Notion and link them from the incident system, or store them directly in the platform. This flexibility reduced migration friction.
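
As a rough illustration of the monitoring integration mentioned above, a small webhook receiver can turn an alert payload into an incident record. The payload fields and the in-memory incident shape below are hypothetical; map them to whatever your monitoring tool actually sends and to your incident store's real API.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

class AlertWebhook(BaseHTTPRequestHandler):
    """Turn an incoming monitoring alert into an incident record.

    The payload fields ("monitor", "service", "severity") are hypothetical;
    map them to whatever your monitoring tool actually sends.
    """

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        alert = json.loads(self.rfile.read(length) or b"{}")
        incident = {
            "title": alert.get("monitor", "Unknown alert"),
            "service": alert.get("service"),
            "severity": alert.get("severity", "P3"),
            "timeline": [{"kind": "alert", "message": alert.get("message", "")}],
        }
        print("Created incident:", incident["title"])  # replace with a call to your incident store
        self.send_response(202)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AlertWebhook).serve_forever()
```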

The integration approach meant the team could adopt follow-the-sun coordination without discarding their existing tooling investments.

What Challenges Did They Face?

The implementation was not without obstacles. The team encountered several challenges during the rollout.

Initial Resistance to Documentation Overhead

Some engineers initially resisted documenting their investigation steps in the incident timeline, viewing it as overhead that slowed response. They were accustomed to jumping on a quick Slack call to resolve issues.

The regional champions addressed this by demonstrating concrete examples of how documentation saved time during handoffs. After engineers personally experienced the benefit when joining an incident hours after it started, adoption increased rapidly.

The VP Engineering also reinforced the expectation during team meetings: “Documentation is not overhead - it is coordination. We cannot be a high-functioning distributed team without it.”

Calibrating Rotation Algorithms

The initial FairDistribution rotation algorithm settings resulted in uneven workload distribution. Some engineers received on-call duty multiple weeks in a row while others had long gaps.

The engineering manager worked with the APAC champion to adjust the algorithm parameters, increasing the weight given to recent on-call history. After recalibration, the distribution became noticeably more fair, and the team reported higher satisfaction with rotation balance.

Time Zone Display Confusion

Early in the implementation, some engineers were confused by timezone display, particularly around daylight saving time transitions. An engineer in Berlin would see a schedule time that did not match what their San Francisco colleague saw, causing concerns about miscommunication.

The team addressed this by explicitly showing the timezone for every displayed time: “Next on-call shift: Mar 15, 2:00 PM CET (6:00 AM PDT)”. This made it clear that everyone was looking at the same moment in time, just displayed in their local timezone.
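
Rendering one UTC instant in each viewer's zone, with the zone abbreviation always attached, takes only a few lines with IANA zone data. The local_label helper below is a sketch; note how the same mid-March instant appears as CET for Berlin and PDT for San Francisco.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def local_label(instant_utc: datetime, tz: str) -> str:
    """Format one UTC instant for a viewer, always attaching the zone abbreviation."""
    local = instant_utc.astimezone(ZoneInfo(tz))
    return local.strftime("%b %d, %I:%M %p %Z")

shift_start = datetime(2025, 3, 15, 13, 0, tzinfo=timezone.utc)  # one moment in time
print(local_label(shift_start, "Europe/Berlin"))         # Mar 15, 02:00 PM CET
print(local_label(shift_start, "America/Los_Angeles"))   # Mar 15, 06:00 AM PDT
```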

The engineering team also held a brief training session on how IANA timezone identifiers work and why they are necessary for accurate distributed scheduling.

Service Catalog Initial Investment

Building the complete service catalog with 40+ services and their dependencies took significant time during weeks 3-4 of implementation. Engineers needed to document dependencies they had never formally mapped.

The team addressed this by making the catalog iterative. They started with their top 15 most critical services and rough dependency relationships, knowing they could refine later. This “good enough to start” approach prevented perfection from blocking the rollout.

Over the following three months, engineers gradually filled in missing services and corrected dependency relationships as they learned from production incidents.

What Would They Do Differently?

Six months after implementation, the engineering leadership reflected on lessons learned and what they would adjust for future distributed team initiatives.

Start with Service Catalog Earlier

The team wished they had invested in the service catalog before incidents forced them to. Understanding service dependencies improved more than just incident coordination - it helped with architectural decisions, capacity planning, and onboarding new engineers.

“If I could do it over, I would build the service catalog first, then layer incident coordination on top of it,” the VP Engineering reflected. “The dependency visibility is too valuable to treat as an incident response feature.”

More Explicit Escalation Paths

While the follow-the-sun model worked well for standard incidents, the team realized they needed clearer escalation paths for scenarios requiring specific expertise outside the current on-call region.

They retroactively documented escalation procedures: when to pull in the database architect even if they are not on-call, how to escalate customer-impacting incidents to management, and which scenarios require waking someone in another timezone.
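
Documented escalation paths can be as simple as a small lookup table that records who to contact and whether the situation justifies waking them. The triggers, contacts, and escalate helper below are illustrative, not the team's documented list.

```python
# Hypothetical escalation paths; triggers, contacts, and wake rules are illustrative.
ESCALATION_PATHS = {
    "database-architecture": {"contact": "database architect", "wake_if_off_hours": True},
    "customer-impact-30min": {"contact": "engineering manager", "wake_if_off_hours": False},
    "security":              {"contact": "security lead", "wake_if_off_hours": True},
}

def escalate(trigger: str, currently_business_hours: bool) -> str:
    """Decide whether to page the contact now or queue the request for their next shift."""
    path = ESCALATION_PATHS[trigger]
    if currently_business_hours or path["wake_if_off_hours"]:
        return f"Page {path['contact']} now"
    return f"Queue a request for {path['contact']}'s next business-hours shift"

print(escalate("database-architecture", currently_business_hours=False))  # Page database architect now
print(escalate("customer-impact-30min", currently_business_hours=False))  # Queue a request ...
```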

Better Customer Communication Integration

The team initially focused on internal incident coordination but later realized they needed tighter integration with customer communication. Support team members needed visibility into incident timelines to provide accurate status updates.

They expanded access permissions to give support staff read-only visibility into customer-impacting incidents and built automated status page updates based on incident status changes. This improved customer communication consistency.

Earlier Investment in Runbook Quality

The initial runbooks created during weeks 3-4 were basic checklists. After several months of using them during real incidents, the team realized they needed more detail: specific thresholds to check, example commands to run, and decision trees for diagnosis.

“We should have invested more time in high-quality runbooks from the start,” the SRE lead commented. “The time we saved during incidents would have more than paid for the upfront documentation effort.”

The team subsequently ran a runbook improvement initiative, bringing engineers together to enhance the top 10 most-used runbooks with detailed procedures and decision logic.

How Does This Apply to Your Distributed Team?

The follow-the-sun incident management approach works for distributed teams facing coordination challenges across timezones. The implementation patterns are adaptable to teams of various sizes and geographic distributions.

Teams with 2-3 Timezones

Even teams with just two timezones (such as US and Europe) benefit from regional rosters and centralized coordination. The principles remain the same: each region provides coverage during their business hours, incidents hand off between regions with overlap periods, and a centralized timeline prevents context loss.

A smaller distributed team might start with two rosters (Americas and EMEA) with 4-6 engineers each, eliminating overnight on-call while maintaining coverage.

Teams Scaling to Global Distribution

Teams expanding from one timezone to multiple can implement follow-the-sun coordination proactively before coordination problems emerge. Starting with the service catalog and runbooks establishes good practices that scale as the team grows.

The phased implementation model works particularly well for scaling teams: establish regional rosters as you hire in new timezones, add service catalog documentation as you build or acquire systems, and layer in more sophisticated handoff protocols as coordination complexity increases.

Teams with Uneven Regional Distribution

The model adapts when regions have different team sizes. A team with 20 engineers in the US and 3 in Asia-Pacific can still implement follow-the-sun coordination with asymmetric rosters.

The APAC roster might provide primary coverage during Asian business hours with escalation paths to US engineers for issues requiring specific expertise. This still eliminates most off-hours wake-ups while acknowledging different regional capacities.

Teams with Remote-Only or Hybrid Models

The coordination approach works for fully remote teams (individuals distributed across timezones) and hybrid teams (some co-located offices plus remote individuals). The centralized incident timeline benefits everyone regardless of whether they are working from home or an office.

For hybrid teams, the service catalog and dependency mapping prevent the knowledge silos that often form when some information only exists in hallway conversations at co-located offices.

Teams Using Different Tool Ecosystems

The follow-the-sun principles apply regardless of specific tools. The key elements are:

  1. Regional on-call rosters with timezone-aware scheduling
  2. Centralized incident coordination accessible to all timezones
  3. Service catalog with dependency mapping
  4. Consistent runbook procedures
  5. Structured handoff protocols

These can be implemented in various platforms as long as they support multi-timezone scheduling, centralized communication, and mobile access for distributed team members.

What Should You Focus on First?

If your distributed team is experiencing incident coordination challenges similar to TechFlow Solutions, prioritize these implementation steps based on their impact and difficulty:

Immediate (Week 1): Establish Centralized Incident Communication

Move incident coordination out of scattered Slack channels into a centralized location. This is the highest-impact change and requires minimal upfront investment.

Create a standard practice: when an alert fires or someone reports an issue, create an incident in the centralized system and communicate there. Everyone should be able to see ongoing incidents and join the coordination.

This single change eliminates the communication scattering problem and provides immediate value.

Near-Term (Weeks 2-4): Configure Follow-the-Sun Rosters

Map your existing team members to regional rosters aligned with their business hours. If you have US and Europe coverage, create two rosters. If you span more timezones, create additional regional rosters.

Configure each roster with the appropriate IANA timezone identifiers and integrate holiday calendars for automatic holiday avoidance.

Test the handoff protocols with practice scenarios before transitioning production incidents to the new model.

Medium-Term (Weeks 5-8): Build Service Catalog

Document your most critical 10-15 services with their dependencies. You do not need complete documentation of every service to start seeing value.

Focus on services that frequently have incidents or that are critical to customer-facing functionality. Map the basic upstream and downstream dependencies.

As you respond to incidents using the new coordination model, continue expanding the catalog when you discover gaps.

Ongoing: Refine Runbooks and Procedures

Create basic runbooks for your top incident types and improve them iteratively based on real incident experiences.

After each significant incident, ask: “Could a runbook have helped us diagnose this faster?” If yes, document the diagnostic steps for next time.

Hold monthly retrospectives to review what is working and what needs improvement in your coordination processes.

The key is starting with high-impact changes rather than trying to build the perfect system before using it. Each incremental improvement compounds over time.

Ready to Coordinate Incidents Across Timezones?

See how Upstat enables follow-the-sun incident management for distributed teams with multi-timezone on-call scheduling, centralized coordination, and automated handoffs.