Facebook Crashes Twice In A Comedy Of Errors

Facebook went down on Thursday for two and a half hours because an error condition was mishandled in the social network’s system.

Web performance management company AlertSite logged that the site availability dropped to 38.46 per cent yesterday evening. Robert Johnson, director of software engineering at Facebook, wrote an apology to the affected users and detailed the problem.

Errors Flagging Errors

Basically, a routine used to handle invalid data found during error-checking was itself interpreted as being in error. This caused the system to try to replace it, but the only replacement code available was identical to the flagged routine. The error-checker, unsurprisingly, found that to be in error too, and so an infinite loop began. On top of that, the checker was still receiving routine calls from the rest of the system, which ground the whole system to a halt.

From the user’s viewpoint, their only friend on Facebook was a message saying that there was a “DNS error”. For Facebook’s IT team, it meant a few red faces in their new green data centre.

It was a classic case of a developer not thinking outside the box, with a literal comedy of errors resulting from it.
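The loop described above can be illustrated with a minimal sketch. This is not Facebook's code: the names `is_valid`, `repair`, `broken_source` and `healthy_source`, the validity rule and the attempt cap are all hypothetical, chosen only to show how a repair routine that keeps fetching an equally invalid replacement would never finish unless it is given a limit.

```python
class RepairFailed(Exception):
    """Raised when every replacement fails the same check."""


def is_valid(value):
    # Hypothetical validity rule: only non-negative integers are accepted.
    return isinstance(value, int) and value >= 0


def repair(value, fetch_replacement, max_attempts=3):
    """Return a valid value, giving up after max_attempts replacements.

    Without the attempt cap, a source that keeps returning the same
    invalid value would keep this loop running forever.
    """
    attempts = 0
    while not is_valid(value):
        if attempts >= max_attempts:
            raise RepairFailed(f"gave up after {attempts} replacement attempts")
        value = fetch_replacement()
        attempts += 1
    return value


def broken_source():
    # Simulates the situation in the article: the only available
    # replacement is just as invalid as the value being replaced.
    return -1


def healthy_source():
    # A source that supplies a value the checker accepts.
    return 42
```

Calling `repair(-1, healthy_source)` returns a usable value after one replacement, while `repair(-1, broken_source)` raises `RepairFailed` instead of spinning indefinitely. Capping the attempts is only one possible safeguard; it is shown here because it is the simplest way to break the cycle.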

“The way to stop the feedback cycle was quite painful,” Johnson wrote. “We had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.”

Facebook engineers have yet to provide a permanent fix for the condition. In the meantime, the reconfiguration module has been switched off. Presumably, Facebook executives have crossed their fingers that this will not adversely affect the system again.

Johnson’s missive ends: “We apologise again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.”

It is the worst outage that Facebook has had in the past four years, but it is also the second in two days. The previous day’s problem was a lot shorter, affected fewer people and was put down to issues at a third-party networking provider.


Eric Doyle, ChannelBiz

Eric is a veteran British tech journalist, currently editing ChannelBiz for NetMediaEurope, with expertise in security, the channel, and Britain's startup culture through his TechBritannia initiative.
