Azure Fail: Microsoft Says Security Certificates Were Updated, Sort Of

Microsoft renewed SSL certificate, but Azure storage failed to use the new one

Microsoft has published a detailed explanation for the major failure of its cloud service, Windows Azure, just over a week ago, when users lost access to the storage service, apparently because a security certificate wasn’t renewed. Microsoft has assured users that the certificate was in fact updated – more or less.

On 22 February, Windows Azure’s secure storage failed, and users saw an error message saying the SSL (secure sockets layer) security certificate had expired, so secure HTTPS communications would not work. Users were critical of such an apparently crass error, especially coming a year after the service broke because it failed to take into account 2012’s Leap Year.

Half an update isn’t an update

Microsoft has published a full account of the failure on the Windows Azure Team Blog, which explains that the certificate was in fact updated, but other errors meant the new certificate was not in use everywhere it should be before the old one ceased to work.

“While the expiration of the certificates caused the direct impact to customers, a breakdown in our procedures for maintaining and monitoring these certificates was the root cause,” says Microsoft’s general manager for Windows Azure Mike Neil in the posting. “Additionally, since the certificates were the same across regions and were temporally close to each other, they were a single point of failure for the storage system.”

Windows Azure uses several SSL certificates and scans them automatically every week, flagging any that are within 180 days of expiry and sending repeated notifications to the teams managing the affected service. When a certificate is updated, the team creates a new “build” of the service and deploys it, and updates the certificate in a central Azure “secret store”, so all the users are on the new certificate.
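The weekly scan described above can be pictured with a small Python sketch. This is not Microsoft's code: the function name, the inventory layout and the "other" certificate are all hypothetical, and the expiry dates are only illustrative. It simply shows the idea of flagging anything that expires within a 180-day warning window.

```python
from datetime import date, timedelta

# Warning window taken from the article: flag certificates 180 days ahead.
WARNING_WINDOW = timedelta(days=180)


def certificates_needing_renewal(expiry_by_name, today):
    """Return names of certificates expiring within the window, soonest first."""
    cutoff = today + WARNING_WINDOW
    due = [(expiry, name) for name, expiry in expiry_by_name.items()
           if expiry <= cutoff]
    return [name for expiry, name in sorted(due)]


# Hypothetical inventory; names and dates are assumptions for illustration.
inventory = {
    "blobs": date(2013, 2, 22),
    "tables": date(2013, 2, 22),
    "other": date(2014, 6, 1),
}

flagged = certificates_needing_renewal(inventory, today=date(2012, 9, 1))
print(flagged)
```

A scan like this only knows about the dates it is given. If it reads them from a central store rather than from the running service, it cannot tell whether a renewed certificate has actually been deployed.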

Everything went according to plan for the Azure storage service – except for one thing. On 7 January, the team renewed the three certificates, for Windows Azure Storage Blobs, Tables and Queues, and created a new build of the services, ready for deployment.

However, they forgot to flag that updated build as an urgent release which contained certificate updates.

That meant the update was delayed behind updates which were flagged as more time critical. On 22 February, when the certificates expired, the new build had still not reached production.

To make matters worse, there was no warning of impending doom. Although the storage service was running with an expiring certificate, no alarm bells rang, because the certificate had actually been updated, and it all looked fine in the central secret store.

“Because the certificate had already been updated in the Secret Store, no additional alerts were presented to the team, which was a gap in our alerting system,” says Neil.

What is Microsoft doing about it?

To prevent this error from happening again, Azure will now monitor the status of certificates at the endpoints of the service, instead of just in the secret store, so certificates do not actually expire while in production. “Any production certificate that has less than 3 months until the expiration date will create an operational incident and will be treated and tracked as if it were a Service Impacting Event,” says Neil.
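The difference between checking the store and checking the endpoint can be shown in a few lines of Python. This is a minimal sketch under stated assumptions: the function name and data layout are hypothetical, "3 months" is approximated as 90 days, and the renewed certificate's expiry date is invented for the example.

```python
from datetime import date, timedelta

# Assumption: "less than 3 months" is approximated here as 90 days.
INCIDENT_WINDOW = timedelta(days=90)


def expiring_soon(expiry_by_service, today):
    """Return services whose certificate expires inside the incident window."""
    cutoff = today + INCIDENT_WINDOW
    return sorted(name for name, expiry in expiry_by_service.items()
                  if expiry < cutoff)


# The store already holds the renewed certificate (hypothetical expiry date),
# but the running endpoint is still serving the old one.
secret_store = {"blobs": date(2014, 1, 7)}
endpoints = {"blobs": date(2013, 2, 22)}
today = date(2013, 2, 1)

store_view = expiring_soon(secret_store, today)   # looks healthy
endpoint_view = expiring_soon(endpoints, today)   # about to expire
print(store_view, endpoint_view)
```

The same check gives opposite answers depending on where it looks, which is the alerting gap Neil describes.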

Microsoft is also automating service updates, so it will no longer be possible to “forget” to flag the important ones. “We will also automate any associated manual processes so that builds of services that contain certificate updates are tracked and prioritized correctly,” says Neil. “In the interim, all manual processes involving certificates have been reviewed with the teams.”

Microsoft is also going to address that “single point of failure”. It is going to look for ways to “partition the certificates across a service, across regions and across time so an uncaught expiration does not create a widespread, simultaneous event,” Neil promised.
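Partitioning certificates “across time” could mean something like the following Python sketch. Microsoft has not said how it will do this, so everything here is an assumption: the region names, the 30-day spacing and both helper functions are hypothetical. The point is only that staggered expiry dates shrink the number of regions that can fail on the same day.

```python
from datetime import date, timedelta


def staggered_expiries(regions, first_expiry, spacing_days):
    """Give each region its own expiry date, spaced evenly apart."""
    return {
        region: first_expiry + timedelta(days=i * spacing_days)
        for i, region in enumerate(regions)
    }


def largest_same_day_group(expiry_by_region):
    """Count how many regions would expire together on the worst single day."""
    counts = {}
    for expiry in expiry_by_region.values():
        counts[expiry] = counts.get(expiry, 0) + 1
    return max(counts.values())


# Hypothetical regions; spacing is an arbitrary choice for illustration.
regions = ["region-a", "region-b", "region-c"]
shared = {region: date(2013, 2, 22) for region in regions}
staggered = staggered_expiries(regions, date(2013, 2, 22), spacing_days=30)

print(largest_same_day_group(shared), largest_same_day_group(staggered))
```

With a shared date, one missed renewal takes out every region at once. With staggered dates, the first expiry affects a single region and serves as a warning for the rest.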

All in all, a thorough report, and a workable response – though troubled users of Azure might be wondering if there are any other procedural gremlins still to emerge from the service.
