Typo Set In Motion Chain Of Events That Shut Down AWS S3 Cloud

You hear about accident investigations on a regular basis. When an airliner goes down, a train comes off the rails or any other serious accident occurs, an investigation starts along with the grim task of recovering the dead and injured.

Usually, there will be a briefing by the investigating authority at the start and then you won’t hear anything for months. Few people know what the investigators are even looking for.

That’s because it can take months for the investigators to go through every detail before determining what caused the accident.

Inside AWS outage

The investigations are elaborate because there’s rarely a single cause to a serious accident. Eventually the investigation will show that a sequence of events occurred and it’s possible that the accident could have been prevented if any one of those events had changed.

Investigations of this type actually happen for accidents of all sorts, not just transportation catastrophes. Companies and regulators follow similar procedures for a wide variety of unplanned events.

In fact, companies will launch such an investigation whenever an accident causes a major loss. The outage that took out Amazon Web Services and its S3 storage services on February 28 was one such accident, which explains why the company undertook an investigation of its own.

I observed this first-hand in the late spring of 1971, when I was sent up a mountain near Roanoke, Virginia, to cover an airplane crash for the television station where I’d just started working. On that mountain, World War II hero and Hollywood actor Audie Murphy and five others had died when the airplane in which they were riding slammed into the top of a fog-shrouded mountain.

Around me as I climbed the side of the mountain with the rest of the news crew were representatives from the National Transportation Safety Board, already taking photos and making measurements of the crash site. Later, they would take all the components they could find of the shattered aircraft to a hangar for examination and further investigation.

Investigation

To me, as I reported from that mountainside, the reason for the crash seemed obvious. The pilot must have been lost in the fog, and failed to see the mountain. But the truth was much more complicated than that.

The investigators had to learn why the pilot had been lost like that near a major airport. Why hadn’t he performed an instrument landing at the major airport nearby after the weather had turned bad? The questions were eventually answered, and ultimately a lesson was learned.

Fortunately, not every accident results in tragic deaths. But every serious accident must be investigated to learn how it happened and how it can be prevented from happening again.

This was the case with the Feb. 28 event when Amazon Web Services’ S3 storage services shut down for hours. This time the losses were measured not in lives, but in millions of dollars lost by Amazon and its clients because of the downtime. Clearly an investigation was in order.

But as Amazon explained in a report it released on March 2 along with an apology to its customers, it was a chain of events that started with the smallest of errors, a typo in a server update command.

Originally published on eWeek

Wayne Rash

Wayne Rash is senior correspondent for eWEEK and a writer with 30 years of experience. His career includes IT work for the US Air Force.
