You probably noticed that a large number of websites and internet-connected services went offline earlier this week, the result of an outage in Amazon Web Services' (AWS) S3 service. Now, the online retail giant and cloud service provider has explained how it happened. The short answer: a simple typo.
Amazon apologized for the disruption on the AWS services page. It writes that a Simple Storage Service (S3) engineer was debugging an issue causing the S3 billing service to run slowly. They "executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process."
“Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region.”
While S3's subsystems are designed to keep working even when some of their capacity fails, Amazon hadn’t restarted the index and placement subsystems for many years. S3 has also experienced huge growth over the last several years. Together, these two factors meant “the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.” Four hours and 17 minutes, to be exact.
Sites and services that rely on Amazon’s Northern Virginia data center region were affected by the outage. It’s estimated that up to 100,000 websites were down because of the error, including Business Insider and Medium. Automation service IFTTT, Trello, and websites created with Wix were also taken offline.
The company said it had added safeguards to prevent the same thing from happening in the future. “We will do everything we can to learn from this event and use it to improve our availability even further,” Amazon wrote.
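Amazon's statement, as quoted here, does not spell out how those safeguards work. As a rough illustration only, one common form such a check can take is a guard that refuses any removal that would leave a subsystem below a minimum number of servers. The sketch below is hypothetical: the function name, the server names, and the threshold are invented for the example and are not Amazon's tooling.

```python
class CapacityError(Exception):
    """Raised when a requested removal is rejected by the guard."""


def plan_removal(active, to_remove, min_required):
    """Return the servers left in service, or raise if the removal is unsafe.

    active       -- set of server names currently in service (hypothetical)
    to_remove    -- set of server names the operator asked to remove
    min_required -- smallest number of servers the subsystem needs to run
    """
    active = set(active)
    to_remove = set(to_remove)

    # Reject names that are not in service; a mistyped input often shows up here.
    unknown = to_remove - active
    if unknown:
        raise CapacityError(f"unknown servers: {sorted(unknown)}")

    # Reject the whole request if it would drop below the minimum capacity.
    remaining = active - to_remove
    if len(remaining) < min_required:
        raise CapacityError(
            f"removal leaves {len(remaining)} servers, "
            f"but at least {min_required} are required"
        )
    return remaining


if __name__ == "__main__":
    fleet = {"s3-a", "s3-b", "s3-c", "s3-d"}
    print(sorted(plan_removal(fleet, {"s3-a"}, min_required=3)))
    try:
        plan_removal(fleet, {"s3-a", "s3-b", "s3-c"}, min_required=3)
    except CapacityError as err:
        print("refused:", err)
```

With a guard like this, a mistyped input that selects too many servers is rejected before anything is taken out of service, rather than discovered afterward.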