Amazon Reveals Human Error Was Behind S3 Service Outage

Earlier this week, Amazon’s Simple Storage Service (S3) experienced an outage on the East Coast of the US, affecting several high-profile sites including Adobe, Slack, Splitwise and the US Securities and Exchange Commission.

At the time Amazon declined to disclose the reason for the outage, but the company has now published a summary of the incident and revealed exactly what caused the disruption.

The outage was ultimately down to human error. While trying to fix a bug in the S3 billing system, a team member entered an incorrect command which ended up removing a larger-than-intended set of servers, inadvertently impacting other S3 subsystems.

S3 outage

Specifically, the servers that were removed supported the ‘index’ and ‘placement’ subsystems, causing both of them to require a full restart. During this period S3 was unable to service requests and other AWS services that rely on S3 for storage were also impacted while the S3 APIs were unavailable.

“S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact,” Amazon explains. “We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes.

“While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.”

Amazon goes on to say that, due to AWS’ recent growth, the process of running the safety checks and restarting the servers took “longer than expected”.

The company also said it will be making “several changes” as a result of what it calls this “operational event”. The main issue was that the capacity removal tool allowed too much capacity to be removed too quickly, so Amazon has added safeguards which “prevent capacity from being removed when it will take any subsystem below its minimum required capacity level”.
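To illustrate the kind of safeguard Amazon describes, here is a minimal Python sketch of a minimum-capacity check. It is not Amazon's tooling: the `Subsystem` class, the `remove_capacity` function and all of the numbers are hypothetical, chosen only to show how a removal request could be rejected before it takes any subsystem below its floor.

```python
from dataclasses import dataclass


@dataclass
class Subsystem:
    """Hypothetical model of a subsystem and its capacity, counted in servers."""
    name: str
    active_servers: int
    min_required: int


class CapacityRemovalError(Exception):
    """Raised when a removal would breach a subsystem's minimum capacity."""


def remove_capacity(subsystems, requested):
    """Remove servers only if every affected subsystem stays at or above its minimum.

    `requested` maps a subsystem name to the number of servers to remove.
    All checks run before any change is made, so a rejected request
    leaves every subsystem's capacity untouched.
    """
    by_name = {s.name: s for s in subsystems}

    # Validate the whole request first.
    for name, count in requested.items():
        if count < 0:
            raise ValueError(f"cannot remove a negative number of servers from {name}")
        subsystem = by_name[name]
        remaining = subsystem.active_servers - count
        if remaining < subsystem.min_required:
            raise CapacityRemovalError(
                f"{name}: removing {count} would leave {remaining} servers, "
                f"below the minimum of {subsystem.min_required}"
            )

    # Only apply the removal once every check has passed.
    for name, count in requested.items():
        by_name[name].active_servers -= count


if __name__ == "__main__":
    index = Subsystem("index", active_servers=100, min_required=80)
    placement = Subsystem("placement", active_servers=50, min_required=40)

    remove_capacity([index, placement], {"index": 10})  # allowed
    try:
        remove_capacity([index, placement], {"index": 5, "placement": 20})
    except CapacityRemovalError as err:
        print(f"Rejected: {err}")
```

In this sketch a mistyped command that asks for too many servers is refused outright rather than partially applied. It covers only the minimum-capacity check quoted above and does not model how quickly capacity is removed.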

Furthermore, changes will be made to improve the recovery time of S3 subsystems, primarily through the further partitioning of services into small portions called ‘cells’ which allow engineering teams to assess and test recovery processes.

Finally, there was an apology: “While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.”

Not alone

AWS is not the only service provider to have experienced an outage this week. On Monday HSBC customers were left frustrated as an outage affected the bank’s Business Internet Banking service in the UK for around four to five hours.

This was followed by an outage at internet domain registrar and web hosting company GoDaddy, affecting its infrastructure services relating to the provision of its domain name and website services.

The impact of these outages highlights the dependence businesses and consumers have on web and cloud services and emphasises the importance of disaster recovery in an ever-more connected world.

Sam Pudwell

Sam Pudwell joined Silicon UK as a reporter in December 2016. As well as being the resident Cloud aficionado, he covers areas such as cyber security, government IT and sports technology, with the aim of going to as many events as possible.
