Compute Engine bug rerouted all incoming global traffic for 18 minutes
Less than a month after Google was bigging up its cloud prowess and announcing a significant data centre expansion, the company has had to publicly apologise to its customers for a global cloud outage.
Google Cloud’s compute service spluttered offline in all regions for 18 minutes on Monday evening, Pacific Time.
At 19:09pm, all inbound traffic from customers using Google’s Compute Engine instances was routed incorrectly, resulting in dropped connections and no way to reconnect.
“We recognise the severity of this outage, and we apologise to all of our customers for allowing it to occur,” Google said in a statement.
“As of this writing, the root cause of the outage is fully understood and GCE is not at risk of a recurrence.”
Google has taken the incident very seriously, and has offered 10-25 percent discounts of affected customers’ monthly cloud bills, more discount than it contractually offers through its service level agreements.
Other Google cloud applications, such as Google App Engine and Google Cloud Storage were not affected.
Importantly, this kind of outage is costly for Google. As more customers move to the cloud, companies like Google have to be seen to be working to prevent these risky occurrences.
To that end, Google said it has already learned lessons from the slipup, and has put into place an action plan to make sure it doesn’t happen again.
“It is our intent to enumerate all the lessons we can learn from this event, and then to implement all of the changes which appear useful,” said Google.
“As of the time of this writing in the evening of 12 April, there are already 14 distinct engineering changes planned spanning prevention, detection and mitigation, and that number will increase as our engineering teams review the incident with other senior engineers across Google in the coming week.”