Facebook services are now operating as normal, after the platform blamed a “faulty configuration change” for a six-hour outage on Monday afternoon and evening.
The outage was so severe that it prevented the platform’s 3.5 billion users from accessing its social media and messaging services, including WhatsApp, Instagram and Messenger. The outage is the largest ever tracked by Downdetector.
Problems began soon after 4pm BST, and continued through to 11pm BST on Monday evening.
At the time, there was speculation that the outage could be down to an error with DNS (domain name system) for Facebook sites.
The other possibility raised was a very serious problem at its data centre facilities.
The outage comes just days after Facebook chief technology officer (CTO) Mike Schroepfer announced his intention to step down from the role.
But Facebook has now revealed in a blog post that a “configuration change” was to blame.
However, the firm did not reveal who executed the configuration change, or whether it was planned.
“To all the people and businesses around the world who depend on us, we are sorry for the inconvenience caused by today’s outage across our platforms,” blogged Facebook. “The underlying cause of this outage also impacted many of the internal tools and systems we use in our day-to-day operations, complicating our attempts to quickly diagnose and resolve the problem.”
“Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centres caused issues that interrupted this communication,” said Facebook. “This disruption to network traffic had a cascading effect on the way our data centres communicate, bringing our services to a halt.”
“Our services are now back online and we’re actively working to fully return them to regular operations,” said the firm. “We want to make clear at this time we believe the root cause of this outage was a faulty configuration change. We also have no evidence that user data was compromised as a result of this downtime.”
Facebook apologised to all its users, who, for a time at least, had to go ‘cold turkey’ and use other social platforms such as Twitter or YouTube.
Several unidentified Facebook staffers told Reuters that they believed that the outage was caused by an internal mistake in how internet traffic is routed to its systems.
The severe outage at Facebook has drawn reaction from industry figures.
“On October 4th, between approximately 15:40 UTC – 22:45 UTC, Facebook suffered one of the most severe and prolonged outages on record for a major application provider in terms of breadth and duration as Facebook, Instagram, and WhatsApp were offline and unavailable globally for more than seven hours,” noted ThousandEyes, now part of Cisco, in a blog post.
“While the DNS failures could have caused the apps to go offline, Facebook’s large-scale BGP route withdrawals precipitating the incident, along with other signals, point to the possibility that the issue impacted Facebook more broadly.”
Another expert said that while the outage was not down to a cybersecurity incident, it highlights how human error or a software bug can bring down a global online operation.
“Outages are increasing in volume and can often point towards a cyber-attack, but this can add to the confusion early on when we are diagnosing the causes,” said Jake Moore, the former Head of Digital Forensics at Dorset Police and now cybersecurity specialist at ESET.
“As we saw with Fastly in the summer, web-blackouts more often originate from an undiscovered software bug or even human error,” said Moore. “Although these are increasing in frequency and require more failsafes in place, predicting these issues is increasingly more difficult as it was never thought possible before.”
Another expert pointed out that even huge organisations such as Facebook can be undone by not focusing on increasing resilience.
“Last night’s Facebook outage shows us no company’s IT infrastructure is too advanced to fail – even those providing services to over 3.5 billion people worldwide,” explained Ross Gray, CEO, Cloudsoft. “When a company goes down, a quick recovery is vital, and Facebook’s experience revealed that having a single service provider can make that difficult.
“This ultimately serves as a reminder as to why the EU is developing new regulations around Digital Operational Resilience – commonly known as DORA – and why the UK is taking similar steps, too,” said Gray. “The resulting chaos will no doubt be weighing on the minds of many involved in writing these regulations and those set to be affected by them.”
“The outage has also thrown into sharp relief the complex network of functions and services reliant on the availability and resilience of a single service provider,” said Gray. “Many users, for example, reported being unable to access internet-connected smart devices like smart TVs and thermostats – services not provided by Facebook, but accessed via its credentials, which poses wider concerns to other brands and online service providers.”
“Given how ingrained platforms like Instagram and Facebook are in the fabric of our economies, the operational resilience and reputation of a firm depends on the uptime of its IT systems,” Gray concluded. “It’s important, too, from a business perspective: Facebook stock closed nearly 5% lower yesterday, wiping $47 billion off the company’s value.”