Fault Tolerance (Re)Discovered

Fault tolerant virtualisation technologies can be hardware or software based but they don’t necessarily offer the same level of protection for business-critical processes.


Like virtualisation, fault tolerant technology has been around for decades. Also like virtualisation, fault tolerance has recently been reinvented for the demands of modern business computing: industry-standard platforms, business continuity, application availability, server consolidation, end-to-end business process reliability, flexibility and rapid response. What’s old is new again.

Another interesting aspect is that, even though they developed along separate paths, virtualisation and fault tolerance are made for each other. Full-function fault tolerance makes continuous availability (CA) of (and within) virtualised IT environments possible. CA is becoming a critical concern among users, particularly as virtualisation matures and is seriously considered for critical data centre applications.

Fault tolerance or high availability?

Fault tolerant technology has always had its proponents and buyers. But, suddenly, it’s in the spotlight at industry conferences and getting media and analyst attention. It has been … well, discovered. New products coming to market have sparked discussion and raised awareness among technology consumers. As often happens when vendors realise they need to jump on a passing bandwagon, definitions, features and functions get stretched out of shape to mask the shortcomings of marketers’ claims.

For example, fault tolerance is not a scaled measurement of one’s tolerance for downtime. Fault tolerance is the best time-tested technology for achieving continuous availability, or near-perfect 99.999 percent uptime (a.k.a. five-nines availability), in continuous, round-the-clock processing.
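To put the five-nines figure in concrete terms, the short sketch below converts an availability percentage into the downtime it permits per year. The function name and the 365.25-day year are assumptions made for illustration, not part of any vendor’s definition.

```python
# Minutes in an average year (assumption: 365.25 days, to allow for leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime per year, in minutes, permitted by an availability level."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for level in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(level)
        print(f"{level}% availability allows about {minutes:.2f} minutes of downtime a year")
```

On these assumptions, five-nines availability leaves room for a little over five minutes of downtime a year, while 99.9 percent allows more than eight hours.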

How does this compare to high availability (HA)? Clustering is one approach to achieving high availability. HA technology is for recovering from failure by failing over (switching over to a standby system) and restarting applications on another server. HA can never be fault tolerant, so it ranks a category below FT in the availability stack. “Failover” and “restart” have no relevance in a discussion about FT, except to say they do not apply.

Hardware or software fault tolerance?

So in today’s parlance, what is fault tolerance? The answer depends on the type of fault tolerance … hardware or software. In some ways they are similar, but they have significant differences that need to be understood in order to choose the best approach for a particular need.

At its most basic, “hardware” fault tolerance is designed to prevent unplanned downtime and data loss. All components are duplicated, not just power supplies or fans, and run in complete synchronisation so they appear as one logical server to the operating system and the application. Sophisticated logic and diagnostic software cross-check every operation. If something is amiss within the server, the diagnostics will identify the problem and, if necessary, remove the broken part from service while the rest of the server and the application continue to run completely unaffected. Hardware fault tolerance is often knocked for being pricey, yet entry-level Intel-based servers can be purchased for less than $13,000 (USD).
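The behaviour described above can be illustrated with a toy model. This is only a sketch: real fault tolerant servers do this in hardware and firmware, and the names Replica, LockstepServer and self_test are hypothetical, invented here for illustration. It assumes operations return integers and that a faulty component can be identified by a diagnostic check.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Replica:
    """One half of a duplicated component pair (hypothetical model)."""
    name: str
    healthy: bool = True      # set to False to simulate a hardware fault
    in_service: bool = True   # False once diagnostics have removed it

    def execute(self, op: Callable[[int], int], x: int) -> int:
        result = op(x)
        # A faulty replica returns a corrupted value (simulated fault).
        return result if self.healthy else result + 1

    def self_test(self) -> bool:
        # Stand-in for the built-in diagnostics the article mentions.
        return self.healthy


class LockstepServer:
    """Runs every operation on all in-service replicas and cross-checks results."""

    def __init__(self) -> None:
        self.replicas: List[Replica] = [Replica("A"), Replica("B")]

    def run(self, op: Callable[[int], int], x: int) -> int:
        active = [r for r in self.replicas if r.in_service]
        if not active:
            raise RuntimeError("no replica in service")
        results = [r.execute(op, x) for r in active]
        if len(set(results)) > 1:
            # Mismatch: run diagnostics and take any failed replica out of service.
            for r in active:
                if not r.self_test():
                    r.in_service = False
            survivors = [r for r in active if r.in_service]
            if not survivors:
                raise RuntimeError("no healthy replica left")
            # The caller still gets an answer: no failover, no restart.
            return survivors[0].execute(op, x)
        return results[0]


if __name__ == "__main__":
    server = LockstepServer()
    double = lambda v: v * 2
    print(server.run(double, 21))        # both replicas agree
    server.replicas[1].healthy = False   # inject a fault into replica B
    print(server.run(double, 21))        # B is removed; the result is unchanged
    print([r.name for r in server.replicas if r.in_service])
```

The point of the sketch is that the caller sees the same result before and after the fault; the failed part is removed from service without the application being restarted or moved to another server.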

New FT products today are coming from companies with solutions in software, not hardware. This gets a bit tricky. Rather than ask whether they can be defined as fault tolerant, which they can, the question is what workloads they are capable of tackling, and whether they are able to harness the true power inherent in virtualisation and processor technologies, which is doubtful.