ANALYSIS: The causes of the massive Amazon Web Services outages that knocked as many as 150,000 websites offline went way beyond a typo in a server command
You hear about accident investigations on a regular basis. When an airliner goes down, a train comes off the rails or any other serious accident occurs, an investigation starts along with the grim task of recovering the dead and injured.
Usually, there will be a briefing by the investigating authority at the start and then you won’t hear anything for months. Few people know what the investigators are even looking for.
That’s because it can take months for the investigators to go through every detail before determining what caused the accident.
Inside AWS outage
The investigations are elaborate because there’s rarely a single cause of a serious accident. Eventually the investigation will show that a sequence of events occurred and that the accident might have been prevented if any one of those events had gone differently.
Investigations of this type actually happen for accidents of all sorts, not just transportation catastrophes. Companies and regulators follow similar procedures for a wide variety of unplanned events.
In fact, companies will launch such an investigation when an accident causes a major loss. The outage that took out Amazon Web Services and its S3 storage services on February 28 was just such an event, which explains why the company undertook one.
I observed this first-hand in the late spring of 1971, when I was sent up a mountain near Roanoke, Virginia, to cover an airplane crash for the television station where I’d just started working. On that fog-shrouded mountain, World War II hero and Hollywood actor Audie Murphy and five others had died when the airplane in which they were riding slammed into the summit.
Around me as I climbed the side of the mountain with the rest of the news crew were representatives from the National Transportation Safety Board, already taking photos and making measurements of the crash site. Later, they would take all the components they could find of the shattered aircraft to a hangar for examination and further investigation.
To me, as I reported from that mountainside, the reason for the crash seemed obvious. The pilot must have been lost in the fog, and failed to see the mountain. But the truth was much more complicated than that.
The investigators had to learn why the pilot had been lost like that near a major airport. Why hadn’t he performed an instrument landing at that airport after the weather had turned bad? The questions were eventually answered, and ultimately a lesson was learned.
Fortunately, not every accident results in tragic deaths. But every serious accident must be investigated to learn how it happened and how it can be prevented from happening again.
This was the case with the Feb. 28 event when Amazon Web Services’ S3 storage services shut down for hours. This time the losses were measured not in lives, but in millions of dollars lost by Amazon and its clients because of the downtime. Clearly an investigation was in order.
But as Amazon explained in a report it released on March 2 along with an apology to its customers, it was a chain of events that started with the smallest of errors, a typo in a server update command.
Originally published on eWeek