Human error revealed to be behind S3 outage as Amazon promises to make changes and learn from the event
Earlier this week, Amazon’s Simple Storage Service (S3) experienced an outage on the East Coast of the US, affecting several high-profile sites including Adobe, Slack, Splitwise and the US Securities and Exchange Commission.
At the time Amazon declined to disclose the reason for the outage, but the company has now published a summary of the incident and revealed exactly what caused the disruption.
The outage was ultimately down to human error. While trying to fix a bug in the S3 billing system, a team member entered an incorrect command that removed a larger-than-intended set of servers, inadvertently impacting other S3 subsystems.
Specifically, the servers that were removed supported the ‘index’ and ‘placement’ subsystems, causing both of them to require a full restart. During this period S3 was unable to service requests and other AWS services that rely on S3 for storage were also impacted while the S3 APIs were unavailable.
“S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact,” Amazon explains. “We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes.
“While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.”
Amazon goes on to say that, due to AWS’ recent growth, the process of running the safety checks and restarting the servers took “longer than expected”.
The company also said it will be making “several changes” as a result of what it calls this “operational event”. The main issue was that the capacity removal tool allowed too much capacity to be removed too quickly, so Amazon has added safeguards which “prevent capacity from being removed when it will take any subsystem below its minimum required capacity level”.
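To illustrate the kind of safeguard Amazon describes, here is a minimal sketch in Python. It is not Amazon's tooling: the `Subsystem` class, the `remove_capacity` function, the `CapacityRemovalError` exception and the `max_per_request` cap are all hypothetical names and values invented for this example. The idea is simply that a removal request is rejected before anything changes if it would leave a subsystem under its minimum capacity, with a per-request cap standing in for the "too quickly" part of the problem.

```python
from dataclasses import dataclass


@dataclass
class Subsystem:
    """Hypothetical model of a subsystem and its server pool."""
    name: str
    active_servers: int
    min_required: int  # assumed minimum capacity floor for this subsystem


class CapacityRemovalError(Exception):
    """Raised when a removal request fails a safety check."""


def remove_capacity(subsystem: Subsystem, count: int, max_per_request: int = 5) -> int:
    """Remove `count` servers only if every safety check passes.

    Returns the number of servers still active after the removal.
    `max_per_request` is an assumed cap, not a documented AWS setting.
    """
    if count <= 0:
        raise ValueError("count must be a positive number of servers")

    # Check 1: limit how much can be removed in a single request.
    if count > max_per_request:
        raise CapacityRemovalError(
            f"refusing to remove {count} servers from {subsystem.name}: "
            f"limit is {max_per_request} per request"
        )

    # Check 2: never drop below the subsystem's minimum required capacity.
    remaining = subsystem.active_servers - count
    if remaining < subsystem.min_required:
        raise CapacityRemovalError(
            f"refusing to remove {count} servers from {subsystem.name}: "
            f"{remaining} would remain, minimum is {subsystem.min_required}"
        )

    # Both checks passed, so apply the change.
    subsystem.active_servers = remaining
    return remaining
```

In this sketch a mistyped request fails with an error and leaves the server count untouched, rather than being carried out and forcing a restart.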
Furthermore, changes will be made to improve the recovery time of S3 subsystems, primarily through the further partitioning of services into small portions called ‘cells’ which allow engineering teams to assess and test recovery processes.
Finally, there was an apology: “While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.”
AWS is not the only service provider to have experienced an outage this week. On Monday HSBC customers were left frustrated as an outage affected the bank’s Business Internet Banking service in the UK for around four to five hours.
This was followed by an outage at internet domain registrar and web hosting company GoDaddy, affecting its infrastructure services relating to the provision of its domain name and website services.