Microsoft Offers Refund For Azure’s Leap Year Outage

Microsoft has provided details of the Leap Day Azure cloud outage and offered a 33 percent rebate to all customers

Microsoft has explained the leap year bug that brought down its Windows Azure cloud service, and has promised a 33 percent credit to all customers, whether or not they lost service.

Bill Laing, corporate vice president of server and cloud at Microsoft, made the admission in a blog post, which revealed that the root cause was an issue with certificates. Although the problem was understood quickly, a cascade of interactions between different operations contributed to the scale of the disruption.

Unhealthy servers

Laing explained that Azure’s servers are organised into “clusters” of about 1,000 servers, whose health is monitored by platform software called the Fabric Controller (FC), in an effort to isolate certain classes of error.

Azure’s Platform as a Service (PaaS) functionality requires the FC to integrate tightly with the applications that run in virtual machines (VMs), through a guest agent (GA) that it deploys into each VM. Each server also has a host agent (HA), which the FC uses to deploy application secrets such as SSL certificates to the GA, and to exchange heartbeats with the GA to determine whether the VM is healthy or whether the FC should take recovery action.
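In rough terms, that monitoring relationship looks something like the sketch below. The class names and methods are illustrative assumptions, not Azure’s real interfaces: the FC polls each server’s host agent, which relies on the guest agent inside the VM to say whether the application is healthy.

```python
# Illustrative sketch only; names and interfaces are assumptions.
class GuestAgent:
    """Runs inside the VM; reports whether the hosted application is healthy."""
    def is_healthy(self) -> bool:
        return True  # stand-in for real application-level checks

class HostAgent:
    """Runs on the physical server; relays guest health to the fabric controller."""
    def __init__(self, guest: GuestAgent):
        self.guest = guest

    def heartbeat(self) -> bool:
        return self.guest.is_healthy()

class FabricController:
    """Monitors a cluster of servers and recovers VMs that stop responding."""
    def __init__(self, host_agents):
        self.host_agents = host_agents

    def monitor_once(self) -> None:
        for ha in self.host_agents:
            if not ha.heartbeat():
                self.recover(ha)

    def recover(self, host_agent: HostAgent) -> None:
        """Placeholder: the real FC would reincarnate the VM on a healthy server."""

# One monitoring pass over a (tiny) cluster.
FabricController([HostAgent(GuestAgent()) for _ in range(3)]).monitor_once()
```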

These secrets are sent to the GA in encrypted form, so the GA creates a “transfer certificate” when it initialises, which is used to protect them. New transfer certificates are issued when a new VM is created, when a deployment scales out, or when a deployment updates its VM operating system. A new certificate is also issued when the FC reincarnates a VM on a healthy server after deeming the old server “unhealthy”.
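As a rough illustration of that handshake (not Microsoft’s actual implementation), the Python sketch below uses the third-party cryptography package to show how a guest-side key pair, standing in for the transfer certificate, lets the host wrap secrets that only the guest agent can unwrap.

```python
# Illustrative only: a guest-side RSA key pair stands in for the transfer
# certificate; the real exchange uses X.509 certificates managed by the FC.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Guest agent side: generate a key pair at initialisation and publish the
# public half (the "transfer certificate") to the host agent.
guest_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
transfer_public_key = guest_key.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Host agent side: encrypt an application secret (e.g. an SSL private key)
# so that only the guest agent inside the VM can read it.
wrapped = transfer_public_key.encrypt(b"application secret", oaep)

# Guest agent side: unwrap the secret with its private key.
assert guest_key.decrypt(wrapped, oaep) == b"application secret"
```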

Cluster bomb

However, when a certificate is issued, it is given a one-year validity range starting from midnight UTC. The Azure code generated the end date simply by adding one to the year, so any certificate issued on leap day (29 February) had an invalid end date of 29 February 2013.
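The flaw is easy to reproduce in any environment that validates calendar dates. The sketch below is an illustration of the logic described, not Azure’s actual code: naively adding one to the year fails on 29 February, while a date-aware fallback does not.

```python
from datetime import datetime

def naive_valid_to(valid_from: datetime) -> datetime:
    """The buggy pattern: add one to the year and hope the date exists."""
    return valid_from.replace(year=valid_from.year + 1)

def safer_valid_to(valid_from: datetime) -> datetime:
    """Fall back to 28 February when the anniversary does not exist."""
    try:
        return valid_from.replace(year=valid_from.year + 1)
    except ValueError:
        return valid_from.replace(year=valid_from.year + 1, day=28)

leap_day = datetime(2012, 2, 29)
print(safer_valid_to(leap_day))        # 2013-02-28 00:00:00
try:
    naive_valid_to(leap_day)
except ValueError as err:
    print("certificate creation fails:", err)  # day is out of range for month
```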

If a certificate cannot be issued, the GA restarts the VM. On leap day, of course, the same failure recurred after each restart, producing a cycle of restarts. If a VM restarts three times in quick succession, the HA concludes there is a hardware problem and reports the server for ‘Human Investigate’ (HI) status. With those servers taken out of service, the FC attempted to reincarnate the VMs they hosted on other servers, where the same certificate bug struck again, thus spreading the problem.

If a threshold of HI servers is reached, the FC moves the entire cluster to HI status so that human operators can investigate and repair it, something which happened shortly after the bug was triggered at 4:00pm PST on 28 February (00:00 UTC on 29 February).
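A simplified model of that two-stage escalation is sketched below. Only the three-restart limit comes from Laing’s account; the data structures and the cluster-wide threshold value are illustrative assumptions.

```python
MAX_QUICK_RESTARTS = 3        # limit described in the blog post
CLUSTER_HI_FRACTION = 0.3     # hypothetical value; the post gives no figure

def record_restart(server: dict, vm_id: str) -> bool:
    """Host-agent view: count rapid restarts and flag the server as
    'Human Investigate' (HI) once the limit is reached."""
    server["restarts"][vm_id] = server["restarts"].get(vm_id, 0) + 1
    if server["restarts"][vm_id] >= MAX_QUICK_RESTARTS:
        server["hi"] = True
    return server["hi"]

def cluster_needs_humans(servers: list) -> bool:
    """Fabric-controller view: hand the whole cluster to operators when
    too many servers are already in HI, halting automated recovery."""
    flagged = sum(1 for s in servers if s["hi"])
    return flagged / len(servers) >= CLUSTER_HI_FRACTION

# Ten servers, each with a VM stuck in the leap-day restart loop.
cluster = [{"restarts": {}, "hi": False} for _ in range(10)]
for server in cluster:
    for _ in range(MAX_QUICK_RESTARTS):
        record_restart(server, "vm-0")
print(cluster_needs_humans(cluster))   # True: operators take over
```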

Many clusters were in the middle of a rollout of a new version of the FC, HA and GA, which meant the bug hit them immediately. The bug was identified at 6:28pm PST and Microsoft disabled service management functionality in all clusters by 6:55pm PST, the first time it had ever taken such action.

Improved transparency

Microsoft was able to restore service management to all clusters by 2:11am PST on 29 February, but a number of servers which were updating when the bug struck were incompatible with the fix, so another solution was required for these; all services were healthy by 2:15am PST on 1 March.

“We know that many of our customers were impacted by this event and we want to be transparent about what happened, what issues we found, how we plan to address these issues, and how we are learning from the incident to prevent a similar occurrence in the future,” commented Laing.

“Due to the extraordinary nature of this event, we have decided to provide a 33 percent credit to all customers of Windows Azure Compute, Access Control, Service Bus and Caching for the entire affected billing month(s) for these services, regardless of whether their service was impacted,” he added. “These credits will be applied proactively and will be reflected on a billing period subsequent to the affected billing period.  Customers who have additional questions can contact support for more information.”
