Google’s Gmail Outage Caused by Upgrade Error

Ironically, the problem started during an upgrade designed to stop Gmail overloading, says Google

Google’s Gmail application was down, for most users, for nearly two hours on Tuesday 1 September, due to human error, the company said last night after its engineers fixed the issue.

Google took a small fraction of the Gmail servers offline to perform routine upgrades, routing traffic to other locations as it regularly does. But things went wrong at that point, according to an explanation posted by Ben Treynor, vice president of engineering and site reliability czar for Google:

“We had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers—servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system “stop sending us traffic, we’re too slow!” This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn’t access Gmail via the Web interface because their requests couldn’t be routed to a Gmail server.”
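The chain reaction Treynor describes can be illustrated with a toy model. The sketch below is not Google's routing logic: the function name `simulate_cascade`, the even split of traffic, and the capacity figures are all assumptions made for illustration. Each round, the total load is divided equally among the routers still in service, and any router whose share exceeds its capacity drops out, pushing its traffic onto the rest.

```python
def simulate_cascade(capacities, total_load):
    """Toy model of cascading overload (hypothetical, for illustration only).

    capacities: requests/sec each router can handle (assumed figures).
    total_load: total requests/sec, split evenly across active routers.
    Returns the number of active routers after each round of load shedding.
    """
    active = list(capacities)
    history = [len(active)]
    while active:
        share = total_load / len(active)
        # A router stays in service only if its share fits within its capacity.
        survivors = [c for c in active if c >= share]
        if len(survivors) == len(active):
            break  # stable: nobody dropped out this round
        active = survivors
        history.append(len(active))
    return history


# Assumed capacities for ten routers; combined capacity is 1,080 requests/sec.
routers = [90, 95, 100, 100, 105, 110, 110, 120, 120, 130]

print(simulate_cascade(routers, 850))   # load fits: no router drops out
print(simulate_cascade(routers, 1000))  # two weak routers drop, then the rest follow
```

In this model a load of 1,000 requests per second is below the combined capacity of 1,080, yet the pool still collapses, because the two weakest routers drop out first and their traffic overloads the others in turn.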

Treynor said internal monitors alerted the Gmail engineering team to the failures within seconds. The team brought several additional request routers online to make up for the shortfall in capacity and redistributed the traffic across them. Gmail came back online around 2:30 p.m. PDT.

This lack of server capacity is ironic, considering that Google powers the world’s most popular search engine with more than 1 million servers. To ensure it doesn’t happen again, Google has boosted request router capacity well beyond peak demand, giving the application headroom when it needs it.

Treynor also said Google is improving failure isolation in the routers, so that a problem in one data centre won’t affect servers in another facility. Moreover, he said Google is taking steps to ensure that when many request routers are overloaded at the same time, they simply get slower instead of refusing traffic and shifting their load to another data centre.
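The "get slower instead of refusing traffic" approach can be sketched in a few lines. This is a simplified model rather than Google's implementation: the function name `slowdown_factors`, the even split of traffic, and the assumption that response time grows in proportion to overload are all hypothetical.

```python
def slowdown_factors(capacities, total_load):
    """Toy model of graceful degradation (hypothetical, for illustration only).

    Every router keeps accepting its even share of the traffic. A router
    pushed past its capacity is assumed to respond more slowly in proportion
    to the overload, rather than dropping out and shedding load onto others.
    Returns each router's response-time multiplier (1.0 means normal speed).
    """
    share = total_load / len(capacities)
    return [max(1.0, share / capacity) for capacity in capacities]


# Assumed figures: three routers sharing 300 requests/sec, 100 each.
print(slowdown_factors([50, 100, 200], 300))
# The 50-capacity router takes twice as long; the others are unaffected.
```

The point of the design is that an overloaded router stays in the pool, so the overload stays local and users see a slower service rather than none at all.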

It’s also worth noting that when Gmail did go down, Google urged users to access it via the IMAP and POP mail protocols. Mail processing by these routes continued to work normally because these requests don’t use the same routers at Google.

“We know how many people rely on Gmail for personal and professional communications, and we take it very seriously when there’s a problem with the service,” Treynor added. “Thus, right up front, I’d like to apologise to all of you — today’s outage was a Big Deal, and we’re treating it as such.”

So are the Gmail users who rely on the service for their businesses. One user, called Donald, told the eWEEK blog Google Watch: “I use G-Mail to run my CPA practice. This is a serious (huge) problem.”

Another user, Sergei, added: “This is a huge problem and an outrage. I demand immediate Gmail access. What is with those people?”

More than 1.75 million businesses pay Google $50 per user, per year for Gmail, which is the backbone of the Google Apps collaboration platform. Google has argued that its apps are more secure and reliable than running similar apps in-house. But users have little patience for a service that conks out on them, particularly when they are paying for the extra reliability and security, and some have called for the ability to run Gmail on their own servers.

The latest issue follows an outage in May, one in March, and a big outage in February, when Gmail went down for two and a half hours due to “unexpected side effects of some new code”. That event caused some to raise questions over cloud apps. But these last two issues were nothing compared with the August 2008 outage that took Gmail down for nearly 15 hours.

Additional material by Peter Judge.