Amazon Reveals Human Error Was Behind S3 Service Outage

Earlier this week, Amazon’s Simple Storage Service (S3) experienced an outage on the East Coast of the US, affecting several high-profile sites including Adobe, Slack, Splitwise and the US Securities and Exchange Commission.

At the time Amazon declined to disclose the reason for the outage, but the company has now published a summary of the incident and revealed exactly what caused the disruption.

The outage was ultimately down to human error. While trying to fix a bug in the S3 billing system, a team member entered an incorrect command which ended up removing a larger-than-intended set of servers, inadvertently impacting other S3 subsystems.

S3 outage

Specifically, the servers that were removed supported the ‘index’ and ‘placement’ subsystems, causing both of them to require a full restart. During this period S3 was unable to service requests and other AWS services that rely on S3 for storage were also impacted while the S3 APIs were unavailable.

“S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact,” Amazon explains. “We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes.

“While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.”

Amazon goes on to say that, due to AWS’ recent growth, the process of running the safety checks and restarting the servers took “longer than expected”.

The company also said it will be making “several changes” as a result of what it calls this “operational event”. The main issue was that the tool used to remove capacity allowed too much of it to be removed too quickly, so Amazon has added safeguards which “prevent capacity from being removed when it will take any subsystem below its minimum required capacity level”.
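
To illustrate the kind of safeguard Amazon describes, here is a minimal Python sketch of a pre-removal check that refuses any request which would leave a subsystem under its minimum capacity. It is not Amazon's actual tooling: the `Subsystem` class, the `check_removal` function, the `CapacityFloorError` exception and the server counts are all hypothetical names and figures chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Subsystem:
    name: str
    active_servers: int  # servers currently in service (illustrative figure)
    min_required: int    # hypothetical minimum required capacity level


class CapacityFloorError(Exception):
    """Raised when a removal would take a subsystem below its minimum."""


def check_removal(subsystems, removals):
    """Validate a whole removal request before any server is taken out.

    `removals` maps a subsystem name to the number of servers to remove.
    Raises CapacityFloorError if any subsystem would drop below its floor.
    """
    by_name = {s.name: s for s in subsystems}
    for name, count in removals.items():
        subsystem = by_name[name]
        remaining = subsystem.active_servers - count
        if remaining < subsystem.min_required:
            raise CapacityFloorError(
                f"{name}: removing {count} would leave {remaining}, "
                f"below the minimum of {subsystem.min_required}"
            )
    return True


if __name__ == "__main__":
    fleet = [Subsystem("index", 100, 80), Subsystem("placement", 50, 40)]
    check_removal(fleet, {"index": 10})  # passes: 90 servers remain
    try:
        check_removal(fleet, {"index": 10, "placement": 15})
    except CapacityFloorError as err:
        print(f"Removal blocked: {err}")
```

Because the whole request is checked before anything is removed, a mistyped command that names too many servers is rejected outright rather than partly carried out. Limiting how quickly capacity can be removed, the other half of the problem Amazon identified, would need a separate control and is not shown here.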

Furthermore, changes will be made to improve the recovery time of S3 subsystems, primarily through the further partitioning of services into small portions called ‘cells’ which allow engineering teams to assess and test recovery processes.

Finally, there was an apology: “While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.”

Not alone

AWS is not the only service provider to have experienced an outage this week. On Monday HSBC customers were left frustrated as an outage affected the bank’s Business Internet Banking service in the UK for around four to five hours.

This was followed by an outage at internet domain registrar and web hosting company GoDaddy, affecting the infrastructure behind its domain name and website services.

The impact of these outages highlights the dependence businesses and consumers have on web and cloud services and emphasises the importance of disaster recovery in an ever more connected world.

Sam Pudwell

Sam Pudwell joined Silicon UK as a reporter in December 2016. As well as being the resident Cloud aficionado, he covers areas such as cyber security, government IT and sports technology, with the aim of going to as many events as possible.
