ANALYSIS: BA’s IT outage could have been caused by more than just a power surge
British Airways still appears to be struggling to get its operations back in order after a power outage caused an IT failure on Saturday.
Despite having two data centres and a significant IT infrastructure, BA said that a surge in electricity knocked its data centre near Heathrow offline and disrupted its check-in and operating systems. It noted that the restoration of power damaged the servers, further prolonging the outage.
“There was a total loss of power at the data centre. The power then returned in an uncontrolled way causing physical damage to the IT servers,” BA said in a statement. “It was not an IT issue, it was a power issue.”
However, other than blaming a power surge and adamantly declaring that the outage and the length of time the systems have been down was not due to outsourcing IT support from the UK to India to save costs, there has been little information from BA as to why its IT outage was so severe.
And the narrative is further muddied by the GMB union’s claim that outsourcing has led to a brain drain in the IT expertise needed to handle such an outage and execute a robust disaster recovery plan.
Then there are reports noting that local electricity providers to the Heathrow area denied that there had been any power surges, despite BA’s insistence that a power surge was to blame for knocking out its main data centre and backup systems.
Comments from data centre experts, along with our own knowledge of the protection modern data centres have against power surges and outages, make us question how robust BA’s data centre architecture and design is, given that its backup failed. Such protection typically includes backup power supplies, surge protection and several uninterruptible power supplies (UPS) that provide multiple levels of redundancy should one UPS fail.
A spokesperson from BA explained to Silicon that it is currently investigating why its backup did not spin up and mitigate a significant chunk of the chaos the outage has caused. However, the answer may take some time to come to light, as the airline is currently focusing on getting its operations back to normal and its customers to their booked destinations.
Speculation for an explanation
Silicon’s Roland Moore-Colyer went on BBC Click Radio to discuss the BA outage, and while no clear conclusions were reached, there was some solid speculation that could explain why BA has ended up in a rather costly and embarrassing situation.
The first point is that BA simply may not have had the right technical skills or managerial nous to put a strong and reliable disaster recovery plan into action. While many cloud and managed services providers offer such services, they can be expensive, and BA has experience in running complex IT infrastructure, so it perhaps thought it could get by on its own in-house technical knowledge.
The second point is that BA has huge amounts of IT infrastructure, some of it older than the rest, meaning that when multiple systems are knocked offline, ensuring data is correctly synced and up to date is a harder task than it is for companies using more modern, cloud-based infrastructure.
Another point that has been raised is whether BA would have coped better with such an IT outage, or whether it would have suffered one at all, if it had been using more cloud-based systems.
Given that Google, Amazon Web Services and Microsoft Azure invest billions of dollars in building robust cloud services, Moore-Colyer suggested that perhaps BA would have been better off moving whatever infrastructure and data it feasibly could, given budgets and data regulations, onto a major cloud platform.
However, it was pointed out that this is easier said than done. Despite major technology companies touting technology to facilitate the much-lauded digital transformation, moving to the cloud is still a relatively novel concept on the whole.
For BA, the IT outage has been a rather large disaster, with the company likely to lose more in reputation than it will in compensation for passengers. It serves as a lesson to other major organisations to ensure that they have enough systems in place and a robust strategy to handle power surges and IT outages, no matter how unlikely they may seem.