Amazon confirms data center power outage behind latest AWS hiccup

Humza

Staff member
A hot potato: The internet is bound to notice when the world’s biggest cloud service provider suffers from the slightest of disruptions. For the third time this month, Amazon had to deal with an AWS outage that affected services including Slack, Imgur, Epic Games, and Asana, among others.

It’s been a tough month for AWS and dependent businesses, following yet another disruption for Amazon’s cloud in December. While such outages tend to occur sporadically enough to be considered a minor inconvenience, having three outages in three weeks will likely raise a few eyebrows.

Dozens of major online platforms and services have been affected this month by AWS outages. The first occurred on Dec 7, when a networking issue in AWS' US-EAST-1 region left Amazon.com, Netflix, Disney Plus, Kindle, and Roku, among other services, inaccessible to users.

The second AWS outage hit US-West servers ten days later, briefly disrupting the likes of Twitch, Slack, DoorDash, Xbox Live, and PlayStation Network, while the most recent outage occurred once again in the US-EAST-1 region on Dec 22.

The third outage, according to Amazon, was due to “a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region.” Briefly described, an Availability Zone (AZ) is one or more discrete data centers within an AWS region, engineered to be isolated from failures in other AZs and equipped with redundant power, connectivity, and networking.
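The isolation model Amazon describes can be illustrated with a toy sketch (illustrative only; apart from USE1-AZ4 from the outage report, the zone names and logic are invented, not AWS internals):

```python
# Toy model of an AWS-style region: several isolated Availability Zones.
# Zone names are hypothetical except use1-az4 from Amazon's outage report.
ZONES = ["use1-az1", "use1-az2", "use1-az4"]

def surviving_zones(failed):
    """AZs still serving traffic after the given set has failed."""
    return [az for az in ZONES if az not in failed]

def region_available(failed):
    """The region stays up as long as at least one AZ survives --
    the point of engineering AZs to fail independently."""
    return len(surviving_zones(failed)) > 0

# A single-data-center power loss (like USE1-AZ4 on Dec 22) takes out one
# AZ, but the region as a whole keeps serving from the remaining zones:
print(surviving_zones({"use1-az4"}))   # ['use1-az1', 'use1-az2']
print(region_available({"use1-az4"}))  # True
```

The catch, as the Dec 22 incident showed, is that workloads pinned to the single failed AZ still go down even while the region as a whole stays "available."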

Businesses and platforms can always choose to host in multiple, geographically separate regions to avoid service disruption. However, doing so means absorbing the cost of running infrastructure in more than one region to keep downtime to a minimum. Amazon's latest cloud outage lasted about twelve hours, with AWS' health dashboard currently showing all services operating normally.
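Multi-region hosting boils down to a failover policy like the following sketch (region names are real AWS regions, but the health map is a stand-in for the DNS health checks, e.g. Route 53, that would drive this in practice):

```python
# Minimal active-passive failover between two regions. The `healthy` map
# is a stand-in for real DNS health checks (e.g. Route 53).
REGIONS = ["us-east-1", "us-west-2"]  # primary first, then the standby

def pick_region(healthy):
    """Route traffic to the first healthy region in preference order."""
    for region in REGIONS:
        if healthy.get(region, False):
            return region
    return None  # every region down: total outage

# Normal operation: traffic goes to the primary region.
print(pick_region({"us-east-1": True, "us-west-2": True}))   # us-east-1
# During a us-east-1 outage traffic shifts to the standby -- the price
# being that you pay to keep that second region running all along.
print(pick_region({"us-east-1": False, "us-west-2": True}))  # us-west-2
```

That standing cost of the warm standby is exactly the trade-off the article describes.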


 
You know, at my previous job, so relatively recently (2019), as part of a potential project our team, including my sup and his manager, sat down for probably a full week in total (not all at once, but within the space of a month or so) with AWS sales reps to discuss our potential cloud strategy for Business Intelligence and maybe even starting a Data Warehouse.

I distinctly remember the sales pitch putting very strong emphasis on how AWS was guaranteeing uptime, explaining how, in our area, they ran 3 different regional data centers, all interconnected to the wider national grid, to ensure high availability no matter what.

But then stuff like this happens, and I wonder which it is. Maybe just about everyone is actually not paying for the high availability, and in typical cloud provider fashion it's so obtuse and confusing to set up that you need a full-time consultant who specializes in just that cloud provider, AND you have to be prepared to pay A LOT MORE than their sales pitch claims. Or maybe it's just a bunch of BS: they use all that jargon, literally hundreds of software products, and thousands upon thousands of service variations as a way to basically blame the customers for the fact that, well, their super reliable high availability claims are mostly theoretical unless you pay 10x to 100x their normal rates and hire a full team of dedicated AWS employees to set it up and maintain it.

I thought clouds were supposed to save companies money and make them more reliable, but we often see these outages, and nothing of the sort seems to actually be happening IRL.
 
SNIP
I thought clouds were supposed to save companies money and make them more reliable, but we often see these outages, and nothing of the sort seems to actually be happening IRL.
Cloud computing can save companies millions. As for outages, are they really that "often"? Amazon has had a bad month, to be sure, but how many outages like this have happened? A quick check shows that since 2008 Amazon has had about 17 outages. Since 2016 they have only had one per year. I didn't look up the downtime for each outage, but one outage a year in a subset of their cloud doesn't sound particularly horrible. It's not great, mind you, but not the end of the world.

Companies running their own data centers have outages all the time and, perhaps, longer outages since they generally don't have the backup systems and infrastructure that a cloud provider has.
 

One of the design principles is to assume everything fails and to plan accordingly. Unfortunately some things in AWS are not designed that way, and thus the outage a few weeks ago in us-east-1 had a global impact. Any company that implements a resilient architecture, though, can handle when a single AZ, let alone a single data center, goes down. In theory. In practice it often isn't quite as fault tolerant as we would like.
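"Assume everything fails" in practice usually means never trusting a single endpoint; here's a minimal sketch of the idea (endpoint names are made up, and a real client would use real network calls and backoff):

```python
# "Everything fails, all the time": try each AZ-local endpoint in turn
# instead of assuming the first one answers. Endpoints are hypothetical.
ENDPOINTS = ["use1-az4.example.internal",
             "use1-az1.example.internal",
             "use1-az2.example.internal"]

def call_with_fallback(request_fn):
    """Attempt each endpoint in order; return the first success."""
    errors = []
    for endpoint in ENDPOINTS:
        try:
            return request_fn(endpoint)
        except ConnectionError as exc:
            errors.append((endpoint, exc))  # AZ down; try the next one
    raise RuntimeError(f"all endpoints failed: {errors}")

# Simulate the first AZ being down (power loss) while the rest are fine:
def fake_request(endpoint):
    if endpoint.startswith("use1-az4"):
        raise ConnectionError("data center power loss")
    return f"200 OK from {endpoint}"

print(call_with_fallback(fake_request))  # served by use1-az1 instead
```

Of course, this only helps if the service behind those endpoints was deployed to more than one AZ in the first place, which is exactly where real deployments fall short of the theory.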
 
I've been in the computing biz for some years now. I've worked with large and small companies building out data centers, disaster recovery sites and more. I've seen lots of "redundant" systems fail, usually due to one small component that was overlooked and not made redundant.

Of course there's always the, Oh ****, we never thought of that moments. Like when Hurricane Katrina ravaged NOLA. That's when people learned that putting your generator on the ground in an area that can flood wasn't the best idea. Or, the time they had an ice storm up in WA State and power was out for 10 days. No one ever thought they would need more than a week's worth of propane for the generator. DOH!
 