Mimecast blames a ‘network infrastructure failure’ for outages that have dented its 100 percent SLA guarantee
There were red faces at Mimecast this week, after the software-as-a-service company admitted that a network infrastructure problem had caused email outages.
The company specialises in offering cloud-based email management services to businesses, and the embarrassing failure has left its business continuity pledge looking decidedly tarnished.
Data centre glitch
Problems began at 11am on Thursday, when customers were unable to send or receive email. The fault lasted three hours, until 2pm, but some issues persisted into Friday morning as the company worked to address “some residual mail queues where a large backlog existed overnight.”
When the outage occurred, Mimecast responded quickly and admitted there was an issue caused by a network hardware failure at its data centre in Woking, Surrey.
“First, an apology,” wrote Orlando Scott-Cowley at Mimecast on Thursday in a blog posting. “Today Mimecast UK customers have experienced problems with our email services, caused by a network hardware failure at our Woking data centre. Our infrastructure teams have identified and isolated the problem, and are bringing all affected customer systems back online now.”
The seriousness of the situation was not lost on Mimecast’s management, however, as the company has not been shy in promoting its 100 percent SLA (service level agreement) for customers. “Mimecast has all the redundancy and replication in its geographically dispersed data centers in the cloud, backed by 100 percent service availability SLA,” the company has previously said.
The embarrassing lapse prompted Mimecast’s co-founder and CEO Peter Bauer to quickly respond.
“I wanted to take this opportunity to say sorry personally and on behalf of Mimecast to our customers and partners affected by this issue today,” Bauer wrote. “For three hours today we did not live up to our availability promise. We are very sorry.”
Bauer explained that over the last ten years the company has not had any significant outages because of its “infrastructure and because of the constant scenario planning we conduct to ensure we’re mitigating against any points of failure.”
“As a cloud vendor, our platform infrastructure works in an active-active model, where communications are handled by all sides of our grid,” he wrote. “If there is any unavailability in a component another part of the grid can take over. Failing over an entire data centre happens extremely rarely and we deliberately do it manually as an automatic failover of this scale brings significant risks. The plans we had in place underestimated the time it would take to complete the task. We aim for under 30 minutes, however this one took us over 2 hours.”
Bauer admitted that the company would be reviewing this procedure, and would make sure it could do it ‘much faster’ if it ever happened again.
“In terms of next steps, we will of course honour our SLA obligations, and we’ll be in touch proactively with all affected customers on this issue in the coming days,” said Bauer. “We appreciate the patience that many of our customers have shown during this tough day and we will be working extra hard to ensure it doesn’t happen ever again.”
Mimecast uses a massively-parallel grid infrastructure for email storage and processing through geographically dispersed data centres, which it says enables it to guarantee 100 percent uptime on all services.
But the outage raises serious questions about whether a cloud service provider can ever fully deliver on a 100 percent SLA promise. It should be noted that Mimecast is not alone in touting a cloud-based 100 percent SLA guarantee.
“Service level management will be crucial,” John Manley, director of cloud services at HP Labs in Bristol, UK, told TechWeekEurope last year, speaking about the demands on cloud service providers by customers.