Fastly Blames Global Outages On Software Bug


Tuesday's global outage of major websites was caused when a valid customer configuration change triggered an undiscovered software bug.

US cloud services provider Fastly has provided a brief explanation for yesterday's major worldwide outage of leading websites and online services.

On Tuesday, a large portion of the Internet was knocked offline for about an hour after an issue at Fastly, which provides key services to many websites.

Indeed, the problem was so serious that it took major websites such as Amazon, the UK Government, CNN, Reddit and the Guardian offline.

The affected websites refused to load, and users were instead confronted with a range of error messages, usually “503 Service Unavailable.”


Fastly explanation

Nick Rockwell, senior VP of engineering and infrastructure, explained in a blog post what went wrong and caused the major outage.

“We experienced a global outage due to an undiscovered software bug that surfaced on 8 June when it was triggered by a valid customer configuration change,” Rockwell wrote. “We detected the disruption within one minute, then identified and isolated the cause, and disabled the configuration. Within 49 minutes, 95 percent of our network was operating as normal.”

“This outage was broad and severe, and we’re truly sorry for the impact to our customers and everyone who relies on them,” he added.

But what exactly happened?

“On May 12, we began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances,” Rockwell wrote.

“Early 8 June, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85 percent of our network to return errors,” he added.

He provided a timeline of the outage, with the initial onset of global disruption registered at 09:47 UTC and deployment of the bug fix beginning at 17:25.

“Once the immediate effects were mitigated, we turned our attention to fixing the bug and communicating with our customers,” said Rockwell. “We created a permanent fix for the bug and began deploying it at 17:25.”

Full post mortem

Rockwell said Fastly is deploying the bug fix across its network as quickly and safely as possible.

Fastly will also conduct “a complete post mortem of the processes and practices we followed during this incident,” he wrote. “We’ll figure out why we didn’t detect the bug during our software quality assurance and testing processes.”

“Even though there were specific conditions that triggered this outage, we should have anticipated it,” admitted Rockwell. “We provide mission critical services, and we treat any action that can cause service issues with the utmost sensitivity and priority.”

“We apologise to our customers and those who rely on them for the outage and sincerely thank the community for its support,” he concluded.

CDN concerns

Despite the apology, the outage, which affected large swathes of the Internet, will likely prompt questions and follow-up actions as firms look to improve their resilience.

Some experts, such as Toby Stephenson, CTO at Neuways, said the incident highlights the reliance of many of the world’s biggest websites on content delivery networks (CDNs) such as Fastly.

“As there are so few of these CDN services, these outages can occur from time-to-time,” said Stephenson. “By using these CDNs to push content to readers, these websites are usually fast and responsive, but on this occasion they have been left with egg on their collective faces.”

Simple mistakes

Another expert noted that criminals will have taken note of how an issue with CDNs can have such a significant impact on critical web infrastructure.

“With so many websites funneling through just a small number of content delivery networks, CDNs, it highlights the sheer scale of what they signify in terms of internet infrastructure and the pressure on them to withstand an outage or attack,” said Jake Moore, cybersecurity specialist at ESET.

“The impact from the Fastly situation will hopefully make procedures and restoring functions more streamlined and positively more proactive,” said Moore.

“Information security professionals are well prepared to expect the unexpected but even the most simple of mistakes can have huge consequences,” said Moore. “Simulations help relieve the pressure in a live situation but even with protocol lined up it would have been a long hour reconfiguring the mishap.”

“Time is money and never so much as on the internet,” said Moore. “The financial impact will have been catastrophic every single minute and exponentially creeping up so insurance claims are now a distinct possibility so it is likely to have gained attention from malicious actors wanting to capitalise on any potential vulnerabilities on offer going forward.”

“CDNs are part of the internet’s critical infrastructure and if threat actors hadn’t already cottoned on to this as a direct attack vector to bring down the internet, they will now after monitoring yesterday’s misfortunate events,” warned Moore.