When the Cloud Sneezes, the Internet Catches a Cold

On Monday, 20th October, Amazon Web Services went down... and half the internet followed suit.

For a few hours, websites stalled, apps broke, and cloud dashboards everywhere lit up in red - far from an ideal start to a Monday morning, right?

From what AWS later confirmed, the issue stemmed from DNS resolution problems in their US-East-1 region (Virginia). In non-techy speak, the systems that translate service names into addresses stopped answering - and US-East-1 happens to be the region that so much of the internet leans on for identity, databases, and control-plane operations.
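To make “DNS resolution problems” a little more concrete, here’s a minimal Python sketch - the hostname and error handling are purely illustrative, not AWS’s actual tooling - of what a failed lookup looks like to an application: the service itself might be perfectly healthy, but nothing can find it by name.

```python
import socket

def resolve(hostname: str) -> str:
    """Translate a hostname into an IP address, the way every web request has to."""
    try:
        # getaddrinfo is the standard resolver call: it asks DNS
        # "what address does this name point to?"
        info = socket.getaddrinfo(hostname, 443)
        return info[0][4][0]  # first resolved IP address
    except socket.gaierror as exc:
        # Roughly what applications saw during the outage: the service may be
        # running fine, but nothing can locate it by name.
        raise RuntimeError(f"Could not resolve {hostname}: {exc}") from exc

if __name__ == "__main__":
    print(resolve("example.com"))  # works while DNS is healthy
```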

While AWS engineers worked through the chaos, thankfully, our Microsoft-based clients didn’t feel a flicker. Nothing slowed down. Nothing failed over. Business carried on as usual.

But watching the world’s biggest cloud falter was a valuable reminder for everyone else: resilience isn’t about brand names. Resilience is, in fact, about design.

The Hidden Fragility of “The Cloud”

The phrase “in the cloud” can sound like your data floats somewhere magical over our heads - you might even think it's in a little white fluffy cloud.

In reality, it's a lot less fantastical; it lives in racks, regions, and services built on layers of other dependencies.

When a foundational component, such as DNS, identity, or a control plane, encounters trouble, everything stacked on top of it wobbles.

Think of it like pulling one of the bottom, foundation blocks out of a Jenga tower, and you'll have a pretty good steer on what happened with AWS. One region hiccupped - or, to stick with the Jenga analogy, one load-bearing block was pulled out - and because so many workloads rely on it, the effect was global and the whole tower came tumbling down. From airports to streaming platforms, and even smart mattresses in some cases, things quietly broke.

So the question isn’t “How could this happen?” but “What would happen to us if it did?”

Single Provider, Single Point of Failure

Many businesses still run everything through a single cloud, single region, or even single service.

It’s easy. Until it isn’t.

At Fifosys, we're a Microsoft partner, which means we naturally steer towards building on Microsoft’s cloud. Yet we'd never assume Azure is immune to issues, so we don't put all our eggs in one basket or box ourselves into a single point of failure.

How do we do this? Well, our team of techies design things whilst asking questions such as:

  • If this region failed, what’s our route back to normal?

  • If that identity provider slows down, what does the user see?

  • If one dependency stops answering, do we degrade gracefully or just stop?

The answers shape real resilience, not just theoretical uptime.
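To illustrate that last question - degrade gracefully or just stop - here’s a minimal Python sketch of the pattern we aim for, using a hypothetical status endpoint and a made-up fallback payload: a tight timeout and a cached or reduced response, rather than an indefinite hang.

```python
import urllib.error
import urllib.request

# Made-up fallback payload: stale-but-useful beats a blank error page.
FALLBACK = {"status": "degraded", "detail": "Live data unavailable - showing cached view"}

def fetch_with_fallback(url: str, timeout: float = 2.0) -> dict:
    """Call a downstream dependency, but degrade gracefully if it stops answering."""
    try:
        # A tight timeout stops one slow dependency dragging the whole app down with it.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"status": "ok", "detail": resp.read().decode()}
    except (urllib.error.URLError, TimeoutError):
        # The dependency is down or slow: serve something useful instead of hanging.
        return FALLBACK

# Hypothetical dependency URL, purely for illustration.
print(fetch_with_fallback("https://dependency.example.com/status"))
```

The design choice is simple: a user seeing slightly stale data is an inconvenience; a user staring at a spinner is an outage.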

Control Planes Matter More Than You Think

When people picture “cloud downtime”, they think of servers going dark. In truth, the more dangerous failures are the invisible ones - identity not authenticating, DNS not resolving, message queues stalling - and that’s precisely what crippled AWS.

Most affected workloads didn’t lose data or compute; they simply couldn’t find or authenticate to the right place. It’s like every service in your office suddenly forgetting each other’s phone numbers.

That’s why our architects map every dependency chain. We don’t just plan for a server outage; we plan for when the glue fails.

Multi-Region Isn’t Overkill. It’s Business Continuity.

You wouldn’t store all your backups in the same cabinet, so why put all your workloads in the same region?

Multi-region setups, or at least cross-zonal resilience, are no longer nice-to-haves. They’re the fundamentals.

For our Azure clients, that might mean mirrored services between UK South and UK West, for example, or global workloads split across Europe. Even a basic secondary environment can keep operations running while the world waits for a fix elsewhere.
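As a rough sketch of that idea - and only a sketch, with hypothetical endpoints rather than any real client’s setup - the logic of a region failover can be as simple as “try the primary, then try the secondary”:

```python
import urllib.error
import urllib.request

# Hypothetical mirrored endpoints - illustrative names, not a real client's setup.
REGION_ENDPOINTS = [
    "https://app-uksouth.example.com/healthz",  # primary: UK South
    "https://app-ukwest.example.com/healthz",   # secondary: UK West
]

def first_healthy_endpoint(endpoints: list[str], timeout: float = 3.0) -> str:
    """Return the first region that answers; fail over to the next if one is down."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except (urllib.error.URLError, TimeoutError):
            continue  # this region is unreachable - try the next one
    raise RuntimeError("No region is currently reachable")

print(first_healthy_endpoint(REGION_ENDPOINTS))
```

In production that routing usually lives in a platform service such as Azure Traffic Manager or Front Door rather than in application code, but the principle is identical: know your second route before you need it.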

It’s not about cost, it’s about confidence.

Testing, Not Trusting

Outages are a great litmus test of who’s ready. The difference between panic and poise is whether you’ve tested your plan or just written it.

During AWS’s incident, some organisations discovered their “automatic failover” wasn’t so automatic after all. Queues backed up, dashboards froze, and logins failed.

We build regular failover testing into client environments because that’s when you find the missing permission, the expired key, or the process nobody remembered to update.

Failovers aren’t theory, they’re drills.
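What does a drill actually look like? At its simplest, something like the sketch below - the URL and time budget are invented for illustration - run on a schedule, so that an unreachable or sluggish secondary shows up in a report rather than in a real outage.

```python
import time
import urllib.error
import urllib.request

# Hypothetical secondary environment and time budget - illustrative values only.
SECONDARY_URL = "https://app-ukwest.example.com/healthz"
MAX_SECONDS = 5.0

def drill_secondary(url: str, budget: float) -> None:
    """A tiny 'drill': prove the standby environment answers, and answers quickly."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=budget) as resp:
            elapsed = time.monotonic() - start
            assert resp.status == 200, f"Secondary returned HTTP {resp.status}"
            print(f"Secondary healthy in {elapsed:.1f}s")
    except (urllib.error.URLError, TimeoutError) as exc:
        # Exactly the finding a drill is meant to surface before a real outage does.
        raise SystemExit(f"Drill failed: secondary unreachable ({exc})")

drill_secondary(SECONDARY_URL, MAX_SECONDS)
```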

Communication Builds Calm

As I touched on in another blog post (one I was halfway through writing on Monday morning), when half the internet breaks - especially given our current cyber threat landscape, which saw the largest cyber attack in UK history not long ago - rumours start spreading faster than DNS updates.

Online chatter immediately turned to “what if it’s a hack?” and “not another one!” - and given how many services went down, had it actually been a breach, the fallout could have been a complete catastrophe. Either way, alarm bells were ringing, and even companies that weren’t affected had customers asking, “Are you down too?”

That’s why communication is a key component of resilience.

After major incidents, we reach out. We don't do it to sell, but to reassure.

It’s the calm after the chaos: explaining what happened, confirming why clients were unaffected, and showing what we’ve learned from the incident.

In a crisis, silence breeds panic. So, we always abide by the mantra that clarity equals confidence. And confidence is what keeps your business steady when everyone else is scrambling.

The Real Lesson: Don’t Assume, Design

It might sound like a contradiction, but bear with me: the AWS outage isn’t an AWS problem. It’s a cloud reality check.

Every provider, even the biggest, has a weak spot. Every system has a dependency. Every business has a threshold. I mean, if even the Death Star couldn’t be built without a single point of failure, the rest of us should probably assume we have one or two as well.

We can’t stop outages from happening, but we can design systems that ride them out without missing a beat. Our clients didn't notice a thing on 20th October, and that's exactly how it should be.

Final Thought

Resilience doesn’t come from luck or logos. It comes from preparation, diversity, and testing. We're working 24/7/365 to ensure that the next time the cloud sneezes, your business doesn’t catch a cold.
