Fault Tolerance (Re)Discovered

Fault-tolerant virtualisation technologies can be hardware- or software-based, but they do not necessarily offer the same level of protection for business-critical processes.

Some observers will point to Hewlett-Packard’s NonStop systems as the epitome of software-based fault tolerance. No argument there. NonStops are excellent machines, living on as a proprietary answer to the needs of an existing installed base. New software-based FT solutions on x86 platforms come nowhere close to the HP system’s sophistication, complexity, or high cost. Nor are they up to the task of providing continuous availability for most of the business- and mission-critical data centre applications that people bet their businesses on every day.

The state of the art in software fault tolerance today is to link two industry-standard x86 servers with cable and software (or to mirror virtual machines in software across two, preferably three, identical x86 servers) so that they run in virtual lockstep, much as FT hardware does, and deliver five-nines uptime. Applications and operating systems must be licensed on each physical server.
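For context, a rough back-of-the-envelope calculation (an illustration, not a figure from the article) shows what a five-nines availability target actually permits:

    # "Five nines" = 99.999% uptime
    minutes_per_year = 365 * 24 * 60                     # 525,600 minutes
    allowed_downtime = minutes_per_year * (1 - 0.99999)  # unavailability budget
    print(round(allowed_downtime, 2), "minutes of downtime per year")  # about 5.26

In other words, roughly five minutes of unplanned downtime a year.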

Software fault tolerance: the good and bad

There are some advantages to the software approach. Depending on the solution chosen, an end-user can limit the number of vendors it does business with. Also, because of its record/replay origins, virtual FT is able to work through so-called Heisenbugs: the programmer’s term for software bugs that alter their behaviour when you try to isolate or examine them.
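As a purely hypothetical illustration (not drawn from any particular FT product), the Python sketch below shows the kind of timing-dependent defect meant here: two threads race on a shared counter, and inserting a print statement or a debugger pause in the middle of the loop changes the timing enough that the lost updates often stop appearing.

    import threading

    counter = 0

    def worker(iterations):
        global counter
        for _ in range(iterations):
            current = counter      # read the shared value
            # Adding a print() or breakpoint here alters thread timing,
            # and the lost-update bug often vanishes: a classic Heisenbug.
            counter = current + 1  # write back; not atomic with the read

    threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(counter)  # frequently less than the expected 200,000

Because record/replay captures the exact interleaving of events, a replaying system can reproduce such a failure deterministically rather than watching it vanish under observation.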

Unlike hardware, however, software fault tolerance has limitations that, in the world of critical business computing, may trouble users. For example, transient hardware errors can crash a system, and there is nothing to prevent the error from propagating to other servers or across the network. When an outage does occur, determining the root cause so the problem does not recur is not an option. Many high-value applications are also sensitive to latency, such as that introduced when processing moves from one side of the pipe to the other.

Most important, though, software-based FT lacks symmetric multi-processing (SMP), which means applications cannot scale beyond a single core per server. So, in a two-socket server powered by quad-core processors, an application running in FT mode is restricted to the compute power of just one of the eight server cores. Further, processor manufacturers are engineering virtualisation capabilities into powerful new products that will be grossly under-utilised in this scenario. Despite assertions that all applications will run in a software-fault-tolerant environment, physical or virtual, many true business-critical and mission-critical applications are simply too demanding to function properly, if at all.

This can hardly be dubbed full-function fault tolerance. Supporting only the lightweight workloads that software can handle is better described as “fault tolerance lite”. It is very unlikely these technological shortcomings can be overcome any time soon.

Yes, by the narrowest of definitions, these software solutions are fault-tolerant, just as Hurricane Katrina could have been described as inclement weather; both statements are correct but neither captures the true nature of the situation. Delivering continuous availability – mission-critical application availability – requires more than saying you have fault tolerance. Continuous availability demands a combination of fault-tolerant hardware, software and, just as important, customer support that doesn’t quibble over who owns a problem.

Denny Lane is Stratus Technologies’ director of product management and marketing.