VMware Engineer Slip Caused Cloud Foundry Outage

VMware has blamed human error for an outage that affected its Cloud Foundry platform-as-a-service offering on 26 April, saying that the outage was the result of a mistake in engineers’ preparations to avoid future service interruptions.

The 26 April outage followed a previous interruption on 25 April, which resulted from a partial outage of a power supply in a storage cabinet, according to VMware official Dekel Tankel.

Cloud bugs

The outages followed highly publicised problems with Amazon Web Services (AWS) that lasted several days and affected a number of websites. VMware’s offering is much newer than EC2, having been announced on 12 April, and is still at the beta-testing stage.

The first Cloud Foundry outage occurred at 5:45am and lasted through to the afternoon, according to Tankel.

“Existing applications were not impacted by this event and continued to operate normally,” he wrote. “The folks most impacted by this event were the developers who received their access credentials the night before. They could not log in until 3:30pm when the system health and storage connectivity was fully restored to 100 percent availability.”

While the outage is not a “normal event”, Tankel said it is “something that can and will happen from time to time.

“In this case, our software, our monitoring systems, and our operational practices were not in synch… the net result is that the Cloud Controller declared a loss of connectivity to a piece of storage that it needs in order to process many control operations,” Tankel continued. “Once the system had entered this state, it took us several hours to validate that we had no loss of data and that the storage cabinet was operating correctly and at full reliability and redundancy.”

As a result of the first outage, VMware decided to develop a procedure for detecting, preventing and recovering from such events in the future – an “operational playbook”.

“This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed,” Tankel said. “Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry.”

Full outage

Tankel said that during this second outage, all applications and system components continued to run.

“However, with the front-end network down, we were the only ones that knew that the system was up,” Tankel wrote. The front-end system was fully restored by 11:30am, he wrote.

Amazon said last week that most of the websites that were taken offline by the AWS outages were now back up and running.

The AWS outages began on 21 April at roughly 9:40am BST, when problems at an Amazon Web Services data centre in Northern Virginia disrupted the company's EC2 cloud hosting service, which in turn was said to have knocked thousands of websites offline.

These included the social news website Reddit; the Twitter toolbox Hootsuite; the Q&A website Quora; and the location-based social networking website Foursquare.

Whilst the problems were said to have persisted for some websites for a number of days, others came back online relatively quickly.

Matthew Broersma

Matt Broersma is a long-standing tech freelance who has worked for Ziff-Davis, ZDNet and other leading publications.
