Lessons Must Be Learned From Azure’s Leap Year Glitch

Chris Preimesberger

Enterprises need to manage their own systems as if they were all on-premises, says Chris Preimesberger

Nothing besmirches the reputation of cloud services more than a major outage like the one Amazon EC2 suffered last year, or the one a red-faced Microsoft endured on Leap Year Day, 29 February.

Bad guys hacking into a system can happen to anybody, cloud or no cloud; you secure against that as best you can. But a total outage that is the fault of a cloud application provider is another thing entirely.

Cloud-System Domino Effect

Microsoft confirmed late on 29 February that the outage affecting its Azure cloud computing service was caused by a Leap Year bug. The outage apparently was triggered by a key server in Ireland housing a certificate that expired at midnight on 28 February.

That electronic control document hadn't taken into account the extra day the Western calendar adds to February every four years. It was simple human error, the single most common cause of computer failures.

When the clocks struck midnight, things went haywire, and a cloud-system domino effect took hold. A large number of Western Hemisphere sites and the UK government’s G-Cloud CloudStore were among the many stopped cold by the outage. Microsoft has been retracing its steps to find out exactly what happened and hasn’t said very much yet, although it did report in an Azure team blog that the problem has “mostly” been fixed.

“The issue was quickly triaged, and it was determined to be caused by a software bug,” Bill Laing, corporate vice president of Microsoft’s Server and Cloud Division, wrote in a 29 February posting on the Windows Azure Team Blog. “While final root-cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year.”
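What that looks like in practice: a classic mistake is to compute an expiry or anniversary date by simply incrementing the year, which produces a nonexistent date when the starting point is 29 February. The Python sketch below is a generic illustration of that pitfall and one common workaround, not Microsoft's actual code.

```python
from datetime import date

def naive_one_year_later(d: date) -> date:
    # The classic trap: just bump the year. Starting from 29 February,
    # this points at a date that does not exist in the following (non-leap) year.
    return d.replace(year=d.year + 1)

def safe_one_year_later(d: date) -> date:
    # Fall back to 28 February when the anniversary date does not exist.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)

issued = date(2012, 2, 29)

try:
    naive_one_year_later(issued)
except ValueError as exc:
    print("naive calculation fails:", exc)   # day is out of range for month

print("safe calculation:", safe_one_year_later(issued))  # 2013-02-28
```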

Human Error

Microsoft engineers created a workaround while still dealing with issues affecting some sub-regions and customers. According to the Windows Azure Service Dashboard, virtually all regions were back up and running by 1 March, with the exception of an alert for Windows Azure Compute in the South-Central US region; that alert, posted on the morning of 29 February, suggested some issue with incoming traffic.

“This is a classic computer science problem,” Andres Rodriguez, CEO and founder of cloud gateway provider Nasuni, told eWEEK. Nasuni, a cloud storage front end, uses Azure, Amazon S3, Rackspace and other cloud storage providers as targets for its clients.

“It was a Leap Year problem. The dates were misadjusted. They did not factor in the Leap Year day. When things start in Ireland, they’re starting at GMT zero, and for the 29th of February, they were pointing at it like crazy. There was probably smoke coming out of that hall, like crazy.”

Rodriguez reminded eWEEK readers that only the compute layer of the Azure cloud crashed, and that the storage service portion—of which Nasuni itself is a customer—was not affected. Nasuni’s storage service is redundant across multiple cloud systems, so if one goes down, data is not affected.

In fact, Rodriguez said, IT managers might be remiss if they don't consider replicating their critical business data across at least two cloud service providers, for the very reason Azure illustrated on 29 February.
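As a rough illustration of that idea, the sketch below writes each object to two backends and reads from whichever one still answers. The MemoryBackend class and both client names are hypothetical stand-ins for real provider SDKs; this is not Nasuni's implementation.

```python
class MemoryBackend:
    """Hypothetical stand-in for a cloud object store client."""
    def __init__(self, name: str, available: bool = True):
        self.name = name
        self.available = available
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        if not self.available:
            raise ConnectionError(f"{self.name} is down")
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        if not self.available:
            raise ConnectionError(f"{self.name} is down")
        return self._objects[key]


def redundant_put(backends, key, data):
    # Write the object to every provider; fail only if all writes fail.
    errors = []
    for backend in backends:
        try:
            backend.put(key, data)
        except ConnectionError as exc:
            errors.append(exc)
    if len(errors) == len(backends):
        raise RuntimeError(f"all providers failed: {errors}")


def redundant_get(backends, key):
    # Read from the first provider that responds.
    for backend in backends:
        try:
            return backend.get(key)
        except ConnectionError:
            continue
    raise RuntimeError("no provider could serve the object")


azure = MemoryBackend("azure")
s3 = MemoryBackend("s3")

redundant_put([azure, s3], "critical/report.csv", b"quarterly numbers")
azure.available = False                                    # simulate a provider outage
print(redundant_get([azure, s3], "critical/report.csv"))   # b'quarterly numbers'
```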

A Reason to Revisit the Big Picture

Soon, Microsoft will be fully back up and running, and the world that runs on Azure will get back to work. But there is cause to stop and consider the bigger picture.

We enjoy innumerable benefits of IT in this digital device-crazy world. But we also need to remember that data systems have many Achilles heels that can be directly affected by hackers, environmental events, power outages, sunspots, human error; the list is a long one.

As time moves on, we’re getting better at finding those holes and plugging them. But the fact is, we probably will never completely solve even one-quarter of all the security risks inherent in IT systems because there are simply too many variables, and humans, involved.

The bottom line here is very simple, but it's taking a while for many people to learn it: each enterprise needs to manage its own system as if it were all on-premises, including all VPN networks, remote offices and devices, and any clouds or cloud services within it.

“The first thing to understand [about events like this] is that this changes nothing,” Andi Mann, a longtime storage industry analyst currently serving as chief cloud strategy guru at CA Technologies, told eWEEK after the Amazon outage in April 2011. The same applies to Microsoft’s boo-boo.

“Cloud will have downtime—it’s a fundamental issue. But you need to be ready for downtime, whether it’s your own infrastructure or cloud infrastructure. You need to understand what the risk is. It’s all just about risk management.”

Rodriguez said that “these cloud providers have humongous data centres, but your own application in that tremendous data centre still has to be written to handle a collapse of the compute layer in that data centre. You cannot hope that the cloud provider is going to do that for you.”
