Typo Set In Motion Chain Of Events That Shut Down AWS S3 Cloud

ANALYSIS: The causes of the massive Amazon Web Services outage that knocked as many as 150,000 websites offline went well beyond a typo in a server command.

While the typo may indeed have triggered the outage, by itself it should not have caused the disruption it did. Another important factor was that the S3 storage service had grown far faster than Amazon had expected.

This meant that the S3 service had not been scaled appropriately for its user load. Because it had grown larger than anticipated, a full restart took much longer than planned, which in turn kept critical S3 subsystems from being restored as quickly as expected.

Essentially, S3's growth had outpaced Amazon's ability to partition the system into smaller segments that could be restarted quickly. The storage service failures were compounded by the fact that some internal systems, such as the AWS Service Health Dashboard, also depended on the S3 services that were down.
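To see why that dependency mattered, here is a minimal sketch in Python of a status dashboard that keeps its health data in the very storage service it monitors. The names and structure are hypothetical, not Amazon's actual code; the point is simply that when the backend goes down, the dashboard can only serve its last cached, healthy-looking state.

    import time

    class StorageService:
        """Stand-in for a storage backend like S3 (hypothetical)."""
        def __init__(self):
            self.available = True
            self.status_object = {"s3": "operating normally", "updated": time.time()}

        def get(self, key):
            if not self.available:
                raise ConnectionError("storage service unavailable")
            return self.status_object

    class HealthDashboard:
        """Dashboard that stores its own status data in the service it monitors."""
        def __init__(self, backend: StorageService):
            self.backend = backend
            self.last_known = {"s3": "operating normally"}  # cached copy

        def current_status(self):
            try:
                self.last_known = self.backend.get("service-health")
            except ConnectionError:
                # Fall back to the stale cache; the dashboard keeps
                # reporting the last healthy state it saw.
                pass
            return self.last_known

    backend = StorageService()
    dashboard = HealthDashboard(backend)

    backend.available = False          # the outage begins
    print(dashboard.current_status())  # still reports "operating normally"

The obvious remedy is to keep the dashboard's own data path independent of the service it is reporting on.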

As a result, the dashboard was telling AWS customers the system was running normally even as their business-critical web applications crashed and were inaccessible. The typo in a single command during a debugging attempt initiated a cascading series of failures that knocked the S3 services offline for hours.

If S3's underlying configuration problems hadn't existed, the typo would have been a minor occurrence, probably one that would have passed unnoticed. But that's not what happened.
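To make that concrete, here is a minimal sketch, again in Python with hypothetical names and numbers rather than Amazon's actual tooling, of the kind of safeguard that turns a mistyped capacity-removal command into a rejected request instead of an outage: the tool refuses to take a subsystem below the minimum number of servers it needs to operate.

    MIN_REQUIRED = {"index": 40, "placement": 25}   # hypothetical minimum server counts
    CURRENT      = {"index": 50, "placement": 30}   # hypothetical current fleet sizes

    def remove_capacity(subsystem: str, count: int) -> int:
        """Remove `count` servers from `subsystem`, refusing any request that
        would drop it below its minimum required capacity."""
        remaining = CURRENT[subsystem] - count
        if remaining < MIN_REQUIRED[subsystem]:
            raise ValueError(
                f"refusing to remove {count} servers from {subsystem}: "
                f"{remaining} would remain, minimum is {MIN_REQUIRED[subsystem]}"
            )
        CURRENT[subsystem] = remaining
        return remaining

    # Intended command: remove 3 servers. A typo that asks for 30 is
    # rejected instead of silently gutting the subsystem.
    remove_capacity("index", 3)
    try:
        remove_capacity("index", 30)
    except ValueError as err:
        print(err)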

But the failure had one other effect that was equally remarkable. Amazon conducted a detailed investigation to determine what caused the outage, which should provide valuable lessons on how the company can avoid similar failures in the future even as its cloud service continues to grow at its current breakneck pace.

What was even more remarkable was the way Amazon was transparent about the investigation and the causes of the system failure. And finally, as should be done following an investigation into a serious accident, Amazon turned its findings into a series of steps to try to ensure this specific failure would never happen again.

That final step of fixing all the things that made the accident possible is not necessarily quick or easy. In many industries, including the airlines, where a single faulty part or a single action can cause catastrophic loss of life, problems often go unfixed for years while companies dither and regulators ponder.

It's not just the airlines. Positive Train Control, for example, is nowhere near universal on U.S. railways despite a number of fatal railway accidents over the past few years and increasingly urgent recommendations from the NTSB.

So if Amazon can claim a victory out of this very expensive cloud system crash, it's that the company quickly determined the entire chain of events leading to the accident and immediately started fixing every link in it.

This is not to suggest that there will never be another Amazon Web Services outage. Any system this complex will eventually develop a new set of problems. But it's safe to say that the same sequence of events won't happen again.

Originally published on eWeek
