What the November 2025 Cloudflare Outage Teaches Organisations About Resilience

Cloudflare experienced a major global outage on 18 November 2025, starting at 11:20 UTC. Many websites and applications became unreachable, and users encountered 5xx errors when attempting to access services that typically run without interruption. Although Cloudflare restored core traffic by 14:30 UTC and completed full recovery of related systems by 17:06 UTC, the incident has been described as the company’s most severe disruption since 2019.

It also revealed how a single mistake inside a large-scale platform can create immediate knock-on effects for organisations around the world.

What Caused the Outage

Cloudflare’s investigation found that the outage was triggered by a configuration change inside a database. That change unexpectedly caused a feature file used by the company’s Bot Management system to be generated with duplicate entries. The file became significantly larger than intended and was then distributed automatically across Cloudflare’s network.

Cloudflare’s traffic proxy software has a strict size limit for that file. Once the oversized version reached the machines handling live traffic, the proxy processes crashed. This interrupted core CDN activity, Cloudflare Access and various internal systems that support Cloudflare’s own dashboard and authentication flow.
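As an illustration only, and not Cloudflare’s actual implementation, the sketch below shows how a consumer of a distributed feature file might enforce a hard size limit while falling back to the last known-good copy instead of terminating. The file paths and the 20 MB limit are hypothetical.

    import os
    import shutil

    MAX_FEATURE_FILE_BYTES = 20 * 1024 * 1024  # hypothetical hard limit

    def load_feature_file(new_path: str, last_good_path: str) -> bytes:
        """Load a freshly distributed feature file, rejecting oversized versions."""
        if os.path.getsize(new_path) > MAX_FEATURE_FILE_BYTES:
            # The new file is abnormally large: keep serving the previous
            # known-good version rather than crashing the proxy process.
            with open(last_good_path, "rb") as f:
                return f.read()
        # The new file looks sane: promote it to known-good and use it.
        shutil.copyfile(new_path, last_good_path)
        with open(new_path, "rb") as f:
            return f.read()

A fallback of this kind degrades gracefully to slightly stale data instead of dropping live traffic, which is usually the preferable failure mode for a filtering layer.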

Although the disruption initially appeared to be an external attack, it was actually a simple internal logic error that originated from a routine permissions update.

Why This Matters for Any Business Using Cloud Services

This outage highlights the dependency of modern organisations on a small number of infrastructure providers. A single configuration change in a supplier’s internal environment can immediately affect thousands of businesses across different sectors. Even teams that design their services with care are exposed to sudden failure if the platforms beneath them falter.

Many companies run key systems through Cloudflare, including DNS, CDN, API gateways, security controls, identity services and Workers KV. When the outage occurred, these services either slowed or stopped entirely. That created disruption not only for customer-facing websites but also for internal tools that rely on Cloudflare for routing and security.

The recovery phase itself carried its own pressure. Once connectivity was restored, traffic surged as cached queues cleared and users retried their actions. Any system that is not built to handle these spikes risks compounding the original outage.

Lessons and Recommendations for Technical Teams and Decision-Makers

Treat internal configuration changes with the same caution as production changes

The outage began with a database permissions change - it wasn’t a release of new code or a global infrastructure change. This highlights the need for strong gating, peer review and testing for any modification that flows into core systems.
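A minimal sketch of what such a gate could look like in a deployment pipeline, assuming a hypothetical generated artifact with one entry per line and illustrative thresholds:

    import sys

    MAX_BYTES = 5 * 1024 * 1024  # illustrative size ceiling for the artifact
    MAX_ENTRIES = 200            # illustrative ceiling on the number of entries

    def validate_artifact(path: str) -> bool:
        """Block distribution of a generated configuration file that looks abnormal."""
        with open(path, "rb") as f:
            data = f.read()
        if len(data) > MAX_BYTES:
            print(f"FAIL: {len(data)} bytes exceeds {MAX_BYTES}")
            return False
        entries = data.decode("utf-8", errors="replace").splitlines()
        if len(entries) > MAX_ENTRIES:
            print(f"FAIL: {len(entries)} entries exceeds {MAX_ENTRIES}")
            return False
        if len(entries) != len(set(entries)):
            print("FAIL: duplicate entries detected")
            return False
        return True

    if __name__ == "__main__":
        sys.exit(0 if validate_artifact(sys.argv[1]) else 1)

Run as the final step before an artifact is published, a non-zero exit code stops the pipeline and keeps a malformed file away from production.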

Build resilience into your architecture

Avoid a design where one provider sits in the middle of all essential traffic. Introduce fallback paths where possible. Options include multi-provider DNS, a secondary CDN, diversified identity services or a failover path behind a separate filtering layer. These choices reduce the impact when a single supplier experiences issues.
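As a simplified illustration of a failover path, assuming two hypothetical health endpoints, one behind the primary provider and one behind an independent secondary route:

    import urllib.error
    import urllib.request

    PRIMARY = "https://primary.example.com/health"     # hypothetical primary route
    SECONDARY = "https://fallback.example.net/health"  # hypothetical secondary route

    def choose_endpoint(timeout: float = 2.0) -> str:
        """Return the first route whose health check responds with HTTP 200."""
        for url in (PRIMARY, SECONDARY):
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    if resp.status == 200:
                        return url
            except (urllib.error.URLError, TimeoutError):
                continue  # this route is unreachable, try the next one
        raise RuntimeError("no healthy route available")

The same pattern applies at the DNS or CDN layer, where the switch is usually driven by the provider’s health checks rather than application code.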

Monitor configuration pipelines as closely as you monitor traffic

Most organisations focus on response times, uptime and throughput. Incidents like this remind us that internal metadata and configuration files can be equally important. Track changes in automation pipelines, file generation systems and synchronisation processes.
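A minimal sketch of that idea, assuming a hypothetical watch list of generated files and a simple growth threshold:

    import os

    # Hypothetical generated artifacts and their last observed sizes in bytes.
    baseline = {"/etc/app/bot_features.json": 1_200_000}
    GROWTH_LIMIT = 1.5  # flag anything that grows by more than 50% in one cycle

    def check_config_artifacts() -> list[str]:
        """Return alert messages for generated files that changed abnormally."""
        alerts = []
        for path, last_size in baseline.items():
            size = os.path.getsize(path)
            if size > last_size * GROWTH_LIMIT:
                alerts.append(f"{path} grew from {last_size} to {size} bytes")
            baseline[path] = size  # roll the baseline forward for the next cycle
        return alerts

In practice these checks usually feed the same alerting system as traffic metrics, so a sudden change in a generated file is investigated with the same urgency as a latency spike.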

Prepare for the recovery surge

When a provider brings its services back online, the return to normality is rarely gradual. Logins spike. Queues empty at once. Caches rebuild rapidly. Systems should be designed to absorb this moment and recover without bottlenecks or secondary failures.
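One common way to stop client retries amplifying that surge is exponential backoff with jitter, sketched below around a placeholder call_service function:

    import random
    import time

    def call_with_backoff(call_service, max_attempts: int = 5):
        """Retry a failing call, spreading retries out so clients do not retry in lockstep."""
        for attempt in range(max_attempts):
            try:
                return call_service()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Wait a random time up to 2^attempt seconds so that retries from
                # many clients arrive spread out rather than all at once.
                time.sleep(random.uniform(0, 2 ** attempt))

Combined with sensible timeouts and queue limits, this keeps a recovering upstream provider from being overwhelmed the moment it comes back.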

Conduct thorough, transparent reviews

Cloudflare published a detailed incident report outlining what went wrong and how it will prevent similar issues in future. Clear post-incident reviews support stronger long-term reliability. Teams should apply the same discipline internally: capture the timeline, decisions, failure points and long-term improvements.

What This Means for Organisations Supported by Fifosys

Events like this show how quickly the stability of an external service can influence business continuity. Even when your own systems are well-managed, an upstream outage can disrupt websites, applications, remote access tools, or automated workflows.

This is why our guidance emphasises layered resilience planning. Review dependencies regularly. Confirm that alternative routes exist for critical functions. Ensure that your staff know what to expect during upstream incidents and how to communicate the impact clearly.

Outages will always occur in a complex global infrastructure - whether due to accident or malice. The goal is to limit how much any single failure affects your organisation. Careful planning, strong architecture and continuous review make a significant difference.

If you would like support reviewing your resilience strategy or understanding your reliance on external providers, the Fifosys team can help strengthen your organisation’s approach to continuity and risk.
