We can all learn from how NASA plans for failures, says Sean Michael Kerner
Disasters and equipment failures can happen at any time, anywhere, and enterprise IT administrators need to properly prepare for them. Over the Christmas period, NASA fixed an equipment failure aboard the International Space Station (ISS), and while the station operates in a very different environment from data centres here on Earth, its operations can serve as a guide to terrestrial best practices.
NASA astronauts Rick Mastracchio and Mike Hopkins exited the ISS on 21 December for a five-and-a-half-hour spacewalk to remove a faulty ammonia pump. On 24 December, the two astronauts took another spacewalk, this time installing a new ammonia pump to restore the ISS to full operations.
Spare parts in space?
What’s interesting to note here is that the new ammonia pump was already aboard the ISS as a spare part. In the hostile environment that is space, redundancy isn’t optional, and spare parts aren’t easily sourced from a remote location. In the case of the spare ammonia pump, there’s also the question of how NASA and its ISS partners could have ferried a new ammonia pump to the station. Much of the ISS, including the ammonia pumps, was originally carried to space by way of the NASA shuttle fleet, which was decommissioned in 2011 with the final flight of the Shuttle Atlantis.
From a disaster recovery and redundancy perspective, NASA and its ISS partners had to plan from the beginning to have lots of options for repair and replacement of station components. Simply put, without the on-board ability to deal with certain types of equipment failure, the ISS would not be the success it is today and lives would be at risk.
Bringing the same message down to Earth, data centres and even branch IT and small offices can learn from NASA’s example. While humans on Earth likely don’t need to keep an extra ammonia pump onsite, it does make sense to have other types of spare equipment on premises.
Mission-critical servers and networking components can and should have redundant power supplies and fans for cooling. Power supplies and fans do break down and, even here on Earth where an extra power supply or fan can easily be sourced, it still takes time, which a mission-critical environment likely can’t afford.
Automatic failover is another commonly deployed feature in enterprise IT today. Clustered and mirrored server deployments that automatically take over for a failed component are a must-have in modern data centres.
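For readers who want to see the idea in miniature, the following is a rough sketch of failover selection, not a description of any particular clustering product. The function name `pick_active`, the node names and the health check are all hypothetical; a real deployment would probe the service itself and handle far more edge cases.

```python
from typing import Callable, Optional, Sequence


def pick_active(nodes: Sequence[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy node in priority order, or None if all are down."""
    for node in nodes:
        if is_healthy(node):
            return node
    return None


# Hypothetical node names; the set below stands in for a real health probe.
cluster = ["primary", "mirror-1", "mirror-2"]
down = {"primary"}

active = pick_active(cluster, lambda node: node not in down)
print(active)  # mirror-1
```

Traffic simply moves to the next node in the list when the one ahead of it fails, which is the same principle as having the spare pump already on board.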
Actually keeping extra equipment on hand, like NASA does, might seem like a luxury, but it also makes sense. For smaller branch and office IT environments, simply keeping an extra (perhaps older) Wi-Fi access point or router on hand for emergencies isn’t a bad idea. In the modern era, where the cloud exists for backup and application delivery, it’s important to remember that you still need access to the cloud and you still require some form of on-site or mobile equipment to do that.
Planning for failure means that you have options. Without redundancy and spare parts, an equipment failure is more likely than not to leave you with none.
Sean Michael Kerner is a senior editor at eWEEK and InternetNews.com. Follow him on Twitter @TechJournalist.
Originally published on eWeek.