In storm-tossed Washington, Wayne Rash’s disaster recovery plan brought his data centre back to life – but there was no one to talk to!
Calamity descended from the skies around Washington on June 29 in the form of a derecho, a type of weather system so rare most people have never even heard of it. This unusual complex of extremely severe weather had never been known to cross a range of mountains such as the Alleghenies. But this time it happened, and disaster planning went out the window.
Amazon’s huge data center near Dulles International Airport, fully redundant in itself and served by redundant backup power, redundant power grids and redundant network access, went down under the combined onslaught of massive power outages, massive Internet outages, phone line outages and cell system outages. Not only did everything go down, but nobody could call for backup. And, of course, even if the staff had known that this event was happening, they couldn’t have traveled there anyway. Most of the roads were blocked.
You can’t prepare for everything
While we often preach the gospel of preparedness, there are disasters for which no one could prepare. When weather this violent appears out of nowhere, with no warning and no forecasts, there is only so much that anyone or any institution can do. The fact that Amazon was able to get back online and have all of its affected customers fully restored by the next morning was remarkable.
But Amazon was one of the few that managed this. For smaller organisations with fewer resources this calamitous blow simply took them out. Many of those companies remained down as this was written on 2 July – and some will never recover.
Of course, some of those smaller organisations didn’t have disaster plans and were simply left hanging. Some did have plans, but they weren’t tested and, when push came to shove, didn’t work. And some had plans that were in place, tested and should have been enough, but just as with Amazon, the planners couldn’t plan for everything.
In my own company, which houses the test lab that produces those eWEEK reviews you see from time to time, I thought I’d planned for anything short of the Mayan Apocalypse or a slightly more probable world-ending asteroid strike. I’d even tested the lab using the backup generators, communicated using the backup Wi-Fi hotspot and made plans for the air conditioning to be out.
But in the case of the lab, configuration changes had crept in since the last time I calculated the electrical loads and I’d never tested the latest configuration. Worse, I’d assumed that the T-Mobile cell near the lab would keep running for at least a few days after losing power, since it had always done so in the past.
So the derecho came in the dark of night. The first hint was the flicker of lightning off to the northwest. Then a storm more violent than anything I’d ever seen before slammed the area. This was worse than the hurricanes I’d experienced, including one off the West Coast of Africa that was my previous high point when it came to weather-related anxiety. In 45 minutes, it was gone and so was the power, the Internet service, the phone service and the previously reliable cell tower.
But I got the generators started and began bringing up the lab infrastructure. One by one, the switches and servers came alive, the whir of the fans and the flickering of the lights reassuring me that all was well. Then I started up the HP server that handles the Domain Name System (DNS), the Dynamic Host Configuration Protocol (DHCP) and directory services. The low-voltage alarms started going off one by one. I didn’t have enough capacity to run the lab, despite my previous tests.
So I shut down the servers and the other computers, and finished bringing up the infrastructure. I had capacity for that and everything ran, but I was approaching the total capacity of the generators, and that’s never a good thing. But that’s when I found that it didn’t matter. My lab might be operational, but it couldn’t communicate with the outside world because nothing else was operational. Being able to run when the rest of the world is down doesn’t really help much – especially when you realise that you’re going to have to buy another generator and set up load sharing.
Actually, I’ll have to buy two more generators for full N+1 capability. In the meantime, I’ll also have to remember to test the entire system more frequently, especially after I add more servers, new switches or network management equipment. I hadn’t gotten around to that, and it cost me.
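For readers who want to run the same sums, the N+1 arithmetic can be sketched in a few lines of Python. Every figure here is an assumption made for illustration: the equipment wattages, the 2,000 W generator rating and the 80 percent loading limit are not the lab's real numbers, and the function name is hypothetical.

```python
import math


def generators_for_n_plus_1(load_watts, rated_watts, max_load_fraction=0.8):
    """Return (n, n_plus_1): generators needed to carry the load, and with one spare.

    max_load_fraction is an assumed safety margin so that no generator
    runs close to its rated capacity.
    """
    if load_watts < 0 or rated_watts <= 0 or not 0 < max_load_fraction <= 1:
        raise ValueError("load must be >= 0, rating > 0, fraction in (0, 1]")
    usable_watts = rated_watts * max_load_fraction
    n = math.ceil(load_watts / usable_watts)  # generators needed just to run
    return n, n + 1                           # plus one spare for N+1


# Hypothetical load inventory in watts; re-measure after every configuration change.
loads = {
    "switches": 600,
    "dns_dhcp_server": 900,
    "test_servers": 2400,
    "cooling_fans": 300,
}

total = sum(loads.values())
n, n_plus_1 = generators_for_n_plus_1(total, rated_watts=2000)
print(f"Total load {total} W: {n} generators to run, {n_plus_1} for N+1")
```

The useful habit is less the formula than the inventory: if the load list is updated whenever a server or switch is added, a shortfall shows up on paper rather than as a low-voltage alarm in the dark.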
But in this case, all of the planning wouldn’t have made any difference. As I looked out at the hazy heat that had brought all of this about, the one thing that kept popping into my head (right after the desire for a nice cold beer) was the words of Scottish poet Robert Burns:
“The best-laid schemes o’ mice an’ men
Gang aft agley,
An’ lea’e us nought but grief an’ pain,
For promis’d joy!”
Thanks for the reminder, old Rabbie. Tonight, I’ll have a wee dram of Scotland’s best in your memory and to remind me that we can’t plan for everything.