Numerous AWS-hosted websites are back online following the outage late last week at an Amazon data centre
Most of the websites that were taken offline by last week’s outage at an Amazon data centre are now back up and running.
The outages began on 21 April at roughly 9.40am BST, after problems at an Amazon Web Services (AWS) data centre in Northern Virginia disrupted its EC2 cloud hosting service, which in turn was said to have knocked thousands of websites offline.
These included the social news website Reddit; the Twitter toolbox Hootsuite; the Q&A website Quora; and the location-based social networking website Foursquare.
Whilst the problems were said to have persisted for some websites for a number of days, others came back online relatively quickly.
At the time of writing there was no mention on the Foursquare blog about how long the outage affected them, but Hootsuite was more open with its customers and said that its sites had been down for approximately 15 hours.
Amazon reported on 24 April on its AWS service health dashboard that the vast majority of affected volumes had been recovered. It has also promised to provide a “detailed post-mortem” on the root causes of the prolonged outage of its cloud services.
On 25 April, it provided an update but warned of some ongoing issues.
“We have completed our remaining recovery efforts and though we’ve recovered nearly all of the stuck volumes, we’ve determined that a small number of volumes (0.07% of the volumes in our US-East Region) will not be fully recoverable. We’re in the process of contacting these customers.”
Amazon’s EC2 cloud service is proving popular with startups and SMBs as it offers them a relatively cheap way to get their portal online, and doesn’t require them to run their own infrastructure. Last October, for example, Amazon offered potential users the option to run a free Amazon EC2 instance for a year.
Of course, the downside is that it makes these startups almost totally reliant on an outside provider to keep their services running.
Last October Novell’s Director of Data Centre Management, Benjamin Grubin, warned that too much enterprise IT was simply not ready to be moved outside the perimeter, i.e. outside to public clouds.
“Public clouds were going to take some amount of time to be mature. They weren’t there yet,” he said.
Gartner, however, has dismissed concerns about the resilience of the cloud. It said Amazon was still within its SLA (service level agreement), but that the incident should encourage companies to think about their system architecture.
“My belief is that this doesn’t do anything to the adoption curve – but I do believe that customers who rely upon Amazon to run their businesses will, and should, think hard about the resiliency of their architectures,” said research VP Lydia Leong in a blog posting.
“Will some Amazon customers pack up and leave? Will some of them swear off the cloud? Probably. But realistically, we’re talking about data centres, and infrastructure, here. They can and do fail,” wrote Leong.
“You have to architect your app to have continuous availability across multiple data centres, if it can never ever go down. Whether you’re running your own data centre, running in managed hosting, or running in the cloud, you’re going to face this issue,” she warned.
This sentiment was echoed by others.
“Proponents of cloud computing aren’t going to like the fact that Amazon had issues that resulted in outages among its customers’ sites, but the fact is that most insurers have their own outages when they host applications internally, in some cases with more frequency and severity than we’re seeing here with Amazon,” said Craig Weber, a senior VP of the Insurance Group at Celent, a Boston-based financial research and consulting firm. “This outage should focus the discussion on the relative reliability of various approaches, and the tradeoffs between them.”
“Of course, there are also lessons about being aware of the capabilities of your business partners,” Weber added. “Engaging with a SaaS vendor requires understanding things like their architecture, their disaster recovery capability, and similar issues because worst case scenarios always seem to emerge eventually.”
This is not the first time a failure has affected the Amazon Cloud. Back in May 2010, Amazon’s EC2 service suffered a power outage after one of its data centres failed to cope with a power switch-over following a car crash, which triggered a local blackout.
That failure occurred when a car reportedly crashed into a utility pole near one of Amazon’s data centres on the east coast of America. The Amazon data centre apparently went offline after a transfer switch failed to properly manage the move from utility power to the facility’s own generators. It resulted in some Amazon customers in its US East Region losing service for about an hour.
And previously, in June 2009, one of Amazon’s data centres was struck by lightning.