ANALYSIS: AWS Outage demonstrates importance of building redundancy into your critical cloud apps even when your service provider is reliable
If there was ever any question about Amazon Web Services’ critical role in keeping commercial web sites running smoothly, that question was answered definitively on Feb. 28 when part of the company’s S3 storage service went down. That outage took out dozens of web services operated by companies ranging from Apple to Zendesk.
What frustrated many users is that Amazon’s AWS dashboard, which is supposed to report the operational condition of its web services, was reporting that everything was operating normally even when it clearly wasn’t. The reason is that the dashboard itself relies on Amazon’s S3 storage and was unable to receive updated information about the outage.
AWS acknowledged that there was a problem and promised to keep customers updated. But the updates stopped coming in mid-afternoon. The last Tweet from the AWS team was, “For S3, we believe we understand root cause and are working hard at repairing. Future updates across all services will be on dashboard.” Earlier, the company had promised updates on Twitter.
However, once the company got its S3 services running again at its Northern Virginia data center, the Service Health Dashboard began reporting conditions accurately.
At that point, status reports for the services located in that data center indicated the problem was fixed. AWS reported at 2:19 p.m. that “between 9:37 AM and 1:57 p.m. PST we experienced elevated error rates for API Gateway requests in the US-EAST-1 Region when communicating with other AWS services. Deploying new APIs or modifications to existing APIs was also affected. The issue has been resolved and the service is operating normally.”
A close examination of the dashboard indicates that some services at Amazon’s Northern Virginia location may still be marginal, but the location otherwise appears to be operating normally.
So what actually happened to the Amazon S3 services? The company hasn’t been very forthcoming, but its comments about elevated error rates for API Gateway requests suggest that the problem is infrastructure-related, meaning it’s probably a router problem.
Of course, that’s just a guess. Still, many of the recent mass outages of services such as airline reservation systems seem to boil down to router problems, so it’s a reasonable assumption. In addition, router updates are frequently the root cause of such problems. Amazon hasn’t said what the actual cause is, so it could be anything from a hacking attempt to a configuration problem. We just don’t know.
Plan for everything
One thing we do know is that AWS and its S3 service are part of the problem, but not because they’re unreliable. In fact, Amazon’s services have been so reliable that its customers have grown to depend on AWS probably more than they should. From the viewpoint of most customers, AWS simply never goes down, so they don’t feel a need to plan for an outage.
Except, of course, when it does. Then, as we saw, customers are left hanging with few updates and fewer explanations. But as annoying as the lack of explanations might be, what customers really need is to get back to work. That requires some planning.
The first stage in that planning has to be finding an alternate storage location for the items that you’re keeping in the S3 storage service. This could mean keeping backups in S3 storage in another region, or it could mean using another storage service entirely. That way, if the S3 storage goes down, you can seamlessly switch to the other service.
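As a rough illustration of that kind of switch, here is a minimal Python sketch of a wrapper that writes each item to two storage locations and reads from whichever one answers first. Everything in it is an assumption made for illustration: InMemoryStore is a hypothetical stand-in for a real storage client (an S3 bucket in another region, or another provider entirely), and the names RedundantStore and StoreUnavailable are invented here, not part of any AWS API.

```python
class StoreUnavailable(Exception):
    """Raised when a storage backend cannot serve a request."""


class InMemoryStore:
    """Hypothetical stand-in for a real storage client."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available  # flip to False to simulate an outage
        self._objects = {}

    def put(self, key, data):
        if not self.available:
            raise StoreUnavailable(self.name)
        self._objects[key] = data

    def get(self, key):
        if not self.available:
            raise StoreUnavailable(self.name)
        return self._objects[key]  # KeyError if this store lacks the key


class RedundantStore:
    """Writes to every reachable store and reads from the first that answers."""

    def __init__(self, stores):
        self.stores = list(stores)  # order matters: primary first

    def put(self, key, data):
        written = []
        for store in self.stores:
            try:
                store.put(key, data)
                written.append(store.name)
            except StoreUnavailable:
                continue  # keep going; another store may still accept it
        if not written:
            raise StoreUnavailable("no store accepted the write")
        return written

    def get(self, key):
        some_store_answered = False
        for store in self.stores:
            try:
                return store.get(key), store.name
            except StoreUnavailable:
                continue  # this store is down, try the next one
            except KeyError:
                some_store_answered = True  # store is up but lacks the key
        if some_store_answered:
            raise KeyError(key)
        raise StoreUnavailable("no store could serve the read")


primary = InMemoryStore("primary-region")
secondary = InMemoryStore("secondary-provider")
store = RedundantStore([primary, secondary])

store.put("logo.png", b"image-bytes")
primary.available = False  # simulate the primary going down
data, served_by = store.get("logo.png")
print(served_by)
```

In this sketch the read succeeds from the second location even though the first is down. A real deployment would also need to handle items written while one location was unavailable, which this example only detects rather than repairs.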
Ideally, Amazon could offer redundant storage as part of its S3 offering, so that if the service goes down as it did on Feb. 28, data requests would be automatically routed to another site. A potential problem with that plan arises if the redundancy depends on information also stored in AWS, so that when the region goes down, so does the redundancy.
But assuming that Amazon can avoid making that mistake, and I’m sure the company can, it has a good way to protect its customers from the mistake of assuming that Amazon won’t ever go down.
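The trap of a backup that quietly depends on the thing it is backing up, which is also what silenced the status dashboard, can be checked for on paper. The sketch below is a hypothetical Python example: the component names and the DEPENDS_ON map are invented for illustration and do not describe Amazon’s actual architecture. It walks each path’s dependencies and reports anything the primary and the fallback have in common.

```python
# Hypothetical dependency map: each component lists what it needs to work.
DEPENDS_ON = {
    "storage-primary": ["region-a"],
    "storage-fallback": ["region-b", "failover-config"],
    "failover-config": ["region-a"],  # the trap: fallback config lives in region-a
}


def failure_domains(component, depends_on):
    """Collect everything a component depends on, directly or indirectly."""
    seen = set()
    stack = [component]
    while stack:
        current = stack.pop()
        for dep in depends_on.get(current, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def shared_domains(primary, fallback, depends_on):
    """Return the dependencies that would take down both paths at once."""
    return failure_domains(primary, depends_on) & failure_domains(fallback, depends_on)


overlap = shared_domains("storage-primary", "storage-fallback", DEPENDS_ON)
print(overlap)
```

With the map as written, the overlap is region-a, because the fallback’s configuration is stored in the same region as the primary. Moving that configuration to a location outside region-a would leave the overlap empty.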
An even better approach is to assume that AWS and all of your other cloud services will go down and then plan how to handle that. In reality, such an assumption is a good security practice. Redundancy is important in making sure that your data is always available without fail.
This is why state-of-the-art data centers have redundant servers, redundant network routers and redundant power. It’s also why they have more generators available to keep the data center running than they actually need.
Some data centers go beyond that in their quest for reliability, even to the extent of having redundant chilled water reservoirs so that a loss of system coolant is unlikely. Having redundant data repositories is just part of making sure you can deliver the information your customers need.
With AWS and its high level of reliability, it’s easy to forget such lessons, but they remain important.
Originally published on eWeek