On July 19, 2024, a faulty CrowdStrike update crashed approximately 8.5 million Windows machines worldwide, grounding flights, disrupting hospitals, and causing over $5 billion in direct losses to Fortune 500 companies alone. Just over a year later, in October 2025, a race condition in AWS’s US-East-1 region brought down thousands of websites and applications for 15 hours, disrupting everything from hospital networks to mobile banking. For many organizations, these were the worst outages they had ever experienced. Yet both incidents were preventable.
These outages join a growing list of major tech failures that offer invaluable lessons for engineering teams everywhere. When we study these high-profile incidents, we gain insights that would take years to accumulate through our own experiences alone. More importantly, we can avoid repeating the same costly mistakes.
Why Learning from Outages Matters
Every major outage represents millions of dollars in lost revenue, damaged customer trust, and countless engineering hours spent on recovery. But these incidents also generate something valuable: hard-won knowledge about how complex systems fail under real-world conditions.
The question is not whether your systems will fail, but when. By studying how other companies’ systems have failed, you can identify vulnerabilities in your own infrastructure before they cause production incidents. This proactive approach to reliability engineering separates teams that merely react to problems from those that systematically prevent them.
Industry giants like AWS, Google, and Microsoft publish detailed postmortem reports precisely because they understand this value. These documents are not just exercises in transparency; they are educational resources that help the entire industry improve.
Case Study: CrowdStrike’s Global Windows Crash
In July 2024, CrowdStrike pushed a sensor configuration update to its Falcon platform that contained a logic error. Because the update was deployed globally without gradual rollout safeguards, millions of Windows systems received the faulty configuration simultaneously and entered boot loops.
Key Lessons:
Gradual rollouts are non-negotiable for critical updates. CrowdStrike’s update went to all systems at once, eliminating any opportunity to catch the problem before it reached catastrophic scale. Modern deployment practices use canary deployments and progressive rollouts precisely to prevent this scenario.
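For illustration, a staged rollout gate can be expressed in a few lines. The stage sizes, error budget, and telemetry hooks below are assumptions made for this sketch, not any vendor’s actual pipeline:

```python
import random
import time

# A minimal sketch of a staged rollout gate. Stage sizes, error budget, and the
# telemetry source are illustrative assumptions, not a real vendor pipeline.
STAGES = [0.01, 0.05, 0.25, 1.00]   # fraction of the fleet exposed at each stage
ERROR_BUDGET = 0.001                # halt if the observed error rate exceeds this
SOAK_SECONDS = 1                    # observation window per stage (shortened for the sketch)

def deploy_to_fraction(update_id: str, fraction: float) -> None:
    print(f"deploying {update_id} to {fraction:.0%} of the fleet")

def observed_error_rate(fraction: float) -> float:
    # Stand-in for real telemetry (crash reports, health pings) from the exposed cohort.
    return random.uniform(0.0, 0.002)

def rollback(update_id: str) -> None:
    print(f"halting rollout and reverting {update_id}")

def staged_rollout(update_id: str) -> bool:
    for fraction in STAGES:
        deploy_to_fraction(update_id, fraction)
        time.sleep(SOAK_SECONDS)                 # let telemetry accumulate
        if observed_error_rate(fraction) > ERROR_BUDGET:
            rollback(update_id)                  # stop before the blast radius grows
            return False
    return True

if __name__ == "__main__":
    staged_rollout("sensor-config-update")
```

Even a crude gate like this limits exposure to the first small cohort instead of the entire install base.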
Testing in pre-production environments cannot catch everything. The faulty configuration passed all pre-release testing but failed catastrophically in production. This highlights the importance of gradual production rollouts as an additional safety layer beyond testing.
Rapid rollback mechanisms are essential. The incident’s impact was magnified because affected systems could not boot normally, making automated rollback impossible. Critical systems need safe mode recovery paths and the ability to revert problematic changes without requiring manual intervention on every affected machine.
Dependencies create blast radius. Organizations that relied on CrowdStrike for endpoint security suddenly lost access to critical systems. This demonstrates how third-party dependencies can become single points of failure that amplify incident impact far beyond your direct control.
Case Study: Cloudflare’s BGP Hijacking Incident
On June 27, 2024, Cloudflare’s 1.1.1.1 DNS resolver experienced an outage caused by BGP hijacking. A third-party network (AS267613) incorrectly announced Cloudflare’s IP space, diverting traffic destined for the resolver to the wrong network.
Key Lessons:
BGP routing is vulnerable to external actors. BGP relies on trust between networks. When a network incorrectly announces routes it does not own, traffic can be misdirected globally. This incident affected 300 networks across 70 countries despite originating from a single source.
RPKI provides essential protection. Cloudflare’s internal use of Resource Public Key Infrastructure (RPKI) prevented the invalid routes from affecting their own network routing. However, not all networks implement RPKI, which allowed the hijack to propagate externally.
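Route-origin validation reduces to a prefix-containment and origin-ASN check against published ROAs. The sketch below hard-codes a single illustrative ROA (including an assumed max prefix length); a real validator consumes signed ROA data from the RPKI repositories:

```python
from ipaddress import ip_network

# Illustrative ROA set: (authorized prefix, max prefix length, authorized origin ASN).
# Real deployments pull signed ROAs from RPKI repositories; the max length here is assumed.
ROAS = [
    (ip_network("1.1.1.0/24"), 24, 13335),   # Cloudflare's 1.1.1.0/24, origin AS13335
]

def rpki_validate(prefix: str, origin_asn: int) -> str:
    """Classify a BGP announcement as valid, invalid, or not-found against the ROA set."""
    announced = ip_network(prefix)
    covered = False
    for roa_prefix, max_len, roa_asn in ROAS:
        if announced.subnet_of(roa_prefix):
            covered = True
            if origin_asn == roa_asn and announced.prefixlen <= max_len:
                return "valid"
    return "invalid" if covered else "not-found"

print(rpki_validate("1.1.1.0/24", 13335))   # valid
print(rpki_validate("1.1.1.1/32", 267613))  # invalid: wrong origin and overly specific
```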
Rapid response requires monitoring external routing. Cloudflare detected the hijack around 20:00 UTC and resolved it approximately two hours later by disabling peering sessions with problematic networks. Monitoring BGP announcements globally is essential for detecting routing anomalies.
Most-specific route preference creates vulnerability. The hijacker’s /32 announcement was more specific than Cloudflare’s /24, causing networks to prefer the malicious route. This fundamental BGP behavior makes prefix hijacking technically straightforward despite being illegitimate.
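That selection rule is easy to demonstrate: given both announcements, longest-prefix matching picks the /32 every time. The toy route table below exists only to illustrate the behavior:

```python
from ipaddress import ip_address, ip_network

# Toy routing table: the legitimate /24 and a more-specific hijacked /32.
routes = {
    ip_network("1.1.1.0/24"): "AS13335 (legitimate origin)",
    ip_network("1.1.1.1/32"): "AS267613 (hijacked announcement)",
}

def best_route(destination: str) -> str:
    """Longest-prefix match: the most specific covering route wins."""
    dest = ip_address(destination)
    matches = [net for net in routes if dest in net]
    return routes[max(matches, key=lambda net: net.prefixlen)]

print(best_route("1.1.1.1"))  # AS267613 wins purely because /32 is more specific
```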
Case Study: AWS US-East-1 Outages
AWS’s US-East-1 region has experienced several high-profile outages over the years, revealing patterns about dependency management and cascading failures in distributed systems. The October 2025 incident stands out as particularly instructive.
October 2025: Race Condition Cascade
On October 20, 2025, AWS experienced one of its most severe outages when a race condition in US-East-1 triggered a 15-hour cascading failure. Two automated DNS-management workflows attempted to update the same records simultaneously, leaving the DNS entry for DynamoDB’s regional endpoint corrupted and empty. That failure then cascaded across dependent AWS services including EC2 and Network Load Balancer.
Key Lessons:
Automated systems need proper synchronization. The race condition occurred because two automation systems lacked proper locking mechanisms when updating shared state. Concurrent modifications to critical data without synchronization safeguards can corrupt systems at scale.
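One common safeguard is optimistic concurrency: each writer names the version it read, and the store rejects the write if that version has since changed. The sketch below is a generic illustration of the pattern, not AWS’s actual mechanism:

```python
import threading

class VersionedStore:
    """Shared record guarded by a compare-and-swap style version check."""

    def __init__(self, value):
        self._lock = threading.Lock()
        self.value = value
        self.version = 0

    def read(self):
        with self._lock:
            return self.value, self.version

    def conditional_write(self, new_value, expected_version) -> bool:
        """Succeed only if no other writer has changed the record since we read it."""
        with self._lock:
            if self.version != expected_version:
                return False          # stale writer: must re-read and retry, not overwrite
            self.value = new_value
            self.version += 1
            return True

store = VersionedStore({"plan": "plan-1"})
value, version = store.read()
store.conditional_write({"plan": "plan-2"}, version)          # another writer lands first
print(store.conditional_write({"plan": "plan-3"}, version))   # False: the stale write is rejected
```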
Corrupted shared state cascades unpredictably. The corrupted DNS state that made DynamoDB’s endpoint unreachable rippled across every dependent service. When foundational data becomes corrupted, recovery requires not just fixing the corruption but also identifying and repairing all downstream impacts.
Testing concurrent operations is essential. AWS added a new test suite specifically for concurrent operations after this incident. Systems that handle high concurrency need dedicated test scenarios that simulate simultaneous updates, not just sequential test cases.
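A minimal concurrency test drives the same update path from many threads at once and asserts that no writes are lost, which is exactly what a sequential test can never show. The worker counts and update functions below are illustrative:

```python
import threading

def run_concurrent_updates(update, workers: int = 8, updates_each: int = 10_000) -> int:
    """Drive the same update path from many threads at once and return the final count."""
    state = {"count": 0}

    def worker():
        for _ in range(updates_each):
            update(state)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"]

def naive_update(state):
    state["count"] = state["count"] + 1       # unsynchronized read-modify-write

_lock = threading.Lock()

def locked_update(state):
    with _lock:                               # same logic, protected by a lock
        state["count"] = state["count"] + 1

def test_no_lost_updates():
    expected = 8 * 10_000
    assert run_concurrent_updates(locked_update) == expected
    # The naive version typically loses updates under contention, which is
    # exactly the failure a purely sequential test can never reveal:
    # assert run_concurrent_updates(naive_update) == expected

if __name__ == "__main__":
    test_no_lost_updates()
    print("no lost updates with synchronized writes")
```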
Recovery time compounds with dependency depth. The 15-hour recovery period was magnified by the number of services that depended on the corrupted data. Each dependent service required validation and potential repair, extending the total downtime far beyond the initial fix.
Historical US-East-1 Patterns
Earlier US-East-1 outages in 2017, 2021, and 2023 revealed additional systemic vulnerabilities:
Internal service dependencies create cascading failures. In multiple incidents, problems with core services like identity management or networking caused failures to cascade across otherwise healthy services. When foundational services fail, everything built on them becomes unavailable regardless of its own health.
Monitoring must be independent of the systems being monitored. During some outages, status pages and monitoring systems were themselves affected, leaving customers without visibility into the incident. Critical observability infrastructure should run independently from production systems.
Regional concentration creates risk. US-East-1 hosts a disproportionate number of services because it was AWS’s first region and offers the broadest service availability. This concentration means outages there have outsized impact compared to newer regions with fewer customer workloads.
Multi-region architecture requires active-active design. Simply deploying resources in multiple regions is insufficient. True resilience requires active-active architectures where workloads can shift between regions without manual intervention when problems occur.
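The routing decision at the heart of an active-active design is simple to sketch: steer each request to whichever regions currently pass health checks, with no operator in the loop. The region names and health probe below are placeholders:

```python
import random

# Placeholder regions and health probe; a real system would use health checks,
# latency measurements, and weighted DNS or a global load balancer.
REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]

def is_healthy(region: str) -> bool:
    # Stand-in for a real probe against the region's endpoints.
    return region != "us-east-1"   # simulate an unhealthy primary region

def pick_region() -> str:
    healthy = [r for r in REGIONS if is_healthy(r)]
    if not healthy:
        raise RuntimeError("no healthy region available")
    # Spread load across healthy regions; no manual failover when one drops out.
    return random.choice(healthy)

print(pick_region())  # never lands on the simulated unhealthy region
```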
Common Patterns Across Major Outages
When you analyze dozens of major tech outages, clear patterns emerge:
Configuration errors cause disproportionate damage. Whether it is BGP routing, deployment configurations, or security policies, misconfigurations consistently appear as root causes in major incidents. These errors often bypass normal testing because the configuration syntax is correct even when the semantics are dangerous.
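Guarding against this means layering semantic checks on top of schema validation, rejecting changes that parse cleanly but violate operational policy. The specific rules below are invented examples of such checks:

```python
# A minimal sketch of semantic validation for a change that already parses cleanly.
# The specific rules are illustrative assumptions, not a standard.
def validate_semantics(config: dict) -> list[str]:
    problems = []
    if config.get("rollout_percent", 0) > 25 and not config.get("canary_passed", False):
        problems.append("wide rollout requested before canary verification")
    if config.get("target") == "all-regions" and config.get("change_type") == "routing":
        problems.append("routing changes should be applied one region at a time")
    if not config.get("rollback_plan"):
        problems.append("no rollback plan attached to the change")
    return problems

change = {"rollout_percent": 100, "change_type": "routing", "target": "all-regions"}
for issue in validate_semantics(change):
    print("BLOCKED:", issue)   # syntactically valid, semantically dangerous
```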
Cascading failures amplify impact. A problem in one component triggers failures in dependent components, which trigger failures in their dependents, and so on. The original issue might affect 5% of capacity, but cascading effects can bring down entire systems.
Communication breakdowns compound technical problems. Many incidents become worse because teams cannot effectively coordinate during the crisis. Poor communication leads to duplicated efforts, conflicting actions, and delayed resolutions as responders work with incomplete information.
Insufficient rollback capabilities extend downtime. Systems that cannot quickly and safely revert problematic changes force teams into high-pressure forward fixes. This extends incident duration and increases the risk of additional errors introduced during recovery.
Building a Learning Culture
The value of studying outages depends entirely on what you do with the lessons. Engineering teams that systematically learn from incidents, both internal and external, build more reliable systems over time.
Regular outage review sessions bring teams together to discuss major industry incidents and extract applicable lessons. These sessions should focus on system patterns rather than blame, asking “could this happen here?” rather than “who made this mistake?”
Track incident patterns to identify recurring themes in your own operations. Platforms that provide incident tracking with severity classification make it easier to analyze your historical incidents and identify trends. When you see similar root causes appearing repeatedly, those become priorities for systemic fixes rather than one-off repairs.
Measure and improve MTTR through detailed analytics. Understanding not just that incidents occur but how long they take to resolve reveals opportunities for improvement. Teams using tools with built-in MTTR tracking can spot when resolution times are increasing and address the underlying causes.
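MTTR itself is just the mean of resolved-minus-detected time across incidents; computing it per period is what makes upward drift visible. The incident records below are made up for the example:

```python
from datetime import datetime
from statistics import mean

# Made-up incident records; a real system would pull these from incident tracking data.
incidents = [
    {"detected": "2025-09-03T10:15", "resolved": "2025-09-03T11:05"},
    {"detected": "2025-09-17T22:40", "resolved": "2025-09-18T01:10"},
    {"detected": "2025-10-20T07:11", "resolved": "2025-10-20T22:03"},
]

def mttr_minutes(records) -> float:
    durations = [
        (datetime.fromisoformat(r["resolved"]) - datetime.fromisoformat(r["detected"])).total_seconds() / 60
        for r in records
    ]
    return mean(durations)

print(f"MTTR: {mttr_minutes(incidents):.0f} minutes")
```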
Document lessons in runbooks so future responders benefit from past incidents. Historical incident data becomes invaluable when similar problems occur, helping new team members get up to speed quickly on known issues and proven solutions.
Turning Insights Into Action
Learning from outages is only valuable if it drives concrete improvements. After reviewing major industry incidents, ask yourself:
Could our deployment process prevent a CrowdStrike-style global rollout failure? Do we have gradual rollout mechanisms and the ability to halt deployments when errors appear?
Would our monitoring catch a Cloudflare-style routing hijack before it propagated? Do we validate route announcements and configuration changes before accepting or applying them broadly?
How resilient are we to AWS-style cascading failures? Have we identified and reinforced our critical dependencies?
The engineering teams that consistently build reliable systems are not those that never make mistakes. They are teams that systematically learn from failures, whether those failures happened to them or to others, and use those lessons to build progressively more resilient infrastructure.
Major tech outages are expensive learning opportunities. By studying them carefully and extracting actionable lessons, you can improve your systems without paying the full cost of the education.
Sources and Further Reading
CrowdStrike July 2024 Outage:
- Preliminary Post-Incident Review - CrowdStrike official report
- 2024 CrowdStrike-related IT outages - Wikipedia comprehensive overview
- CrowdStrike outage: What caused it and what’s next - TechTarget analysis
Cloudflare BGP Incident June 2024:
- Cloudflare 1.1.1.1 incident on June 27, 2024 - Official Cloudflare postmortem
- BGP Hijacking incident analysis - Internet Society analysis
AWS US-East-1 Outages:
- AWS Post-Event Summaries - Official AWS incident reports
- AWS Outage Analysis: June 13, 2023 - ThousandEyes technical analysis
- How a tiny bug spiraled into a massive outage - CNN analysis of October 2025 race condition
- AWS US-EAST-1 Outage Oct 2025 Analysis - Technical deep dive
Turn Incidents Into Learning Opportunities
Track incidents, analyze patterns, and measure MTTR with Upstat
