Most system designers wouldn't think of using a standard desktop PC as the foundation for an effective high availability (HA) system. Apart from the reliability issues arising from the hardware itself, the underlying software isn't meant for continuous operation. When desktop operating systems and applications need to be patched or upgraded, most users expect to reboot their machines. Unfortunately, they might also have become accustomed to rebooting as part of their daily operations!
But in an HA system, various software components may need to be upgraded on a live system. Individual modules should be readily accessible for analysis and repair, without jeopardizing the availability of the system itself.
In our view, effective HA systems must address the main problem, software faults, through a modular approach to system design and implementation. Based on a microkernel architecture, the QNX Neutrino RTOS not only helps isolate problem areas throughout the system, but also ensures complete independence of system components. Each component enjoys full MMU-based memory protection. And system-level modules such as device drivers benefit from the same isolation and protection as any other process. You can start and stop a driver, networking protocol, filesystem, etc., without touching the kernel. A microkernel RTOS inherently keeps the number of single points of failure (SPOFs) as low as possible.
The QNX Neutrino High Availability Framework provides a reliable software infrastructure on which to build highly effective HA systems. In addition to support for hardware-oriented HA solutions (e.g., CompactPCI as well as custom hardware), you also have the tools to isolate and even repair software faults before they spread throughout your entire system.
For example, suppose a device driver crashes because it tried to write to memory that was allocated to another process. The MMU will alert the microkernel, which in turn will alert the High Availability Manager (HAM). The HAM can then restart the driver. In addition, a dump file can be generated for postmortem analysis.
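The detect-and-restart idea can be sketched in portable C. This is not the HAM API: it is a minimal illustration that uses only the POSIX calls fork() and waitpid(), and the names supervise, component_fn, and flaky_driver are hypothetical. It assumes the component runs as its own process, treats "terminated by a signal" as the crash condition, and does not produce a dump file.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical component entry point; receives the 0-based attempt number. */
typedef void (*component_fn)(int attempt);

/*
 * Run `component` in a child process and restart it whenever it is killed
 * by a signal (for example SIGSEGV after an illegal memory access).
 * Returns the number of restarts that were needed, or -1 if fork/waitpid
 * fails or the component still crashes after `max_restarts` restarts.
 * A component that exits on its own (any exit status) is not restarted.
 */
int supervise(component_fn component, int max_restarts)
{
    assert(component != NULL && max_restarts >= 0);

    for (int restarts = 0; ; restarts++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;                 /* could not create the process */

        if (pid == 0) {                /* child: run the component */
            component(restarts);
            _exit(EXIT_SUCCESS);
        }

        int status;
        if (waitpid(pid, &status, 0) < 0)
            return -1;

        if (!WIFSIGNALED(status))
            return restarts;           /* component exited by itself */

        if (restarts == max_restarts)
            return -1;                 /* restart budget exhausted */
    }
}

/* Demo component (an assumption for illustration): faults on the first run only. */
void flaky_driver(int attempt)
{
    if (attempt == 0)
        raise(SIGSEGV);                /* simulate the memory fault */
}
```

Calling supervise(flaky_driver, 3) runs the demo component, observes the simulated fault, and starts it a second time. Because the component lives in its own process, the fault never touches the supervisor's address space, which is the property the MMU-based protection described above provides system-wide.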
Viewing this dump file, you can immediately determine which line of code is the culprit and then prepare a fix that you can download to all other units in the field before they run into the same bug. With a conventional OS, a rogue driver may run for days before the system becomes corrupted enough to fail — and then it's too late to identify the problem, let alone dynamically install an upgraded driver!
A HAM can perform a multistage recovery, executing several actions in a certain order. This technique is useful whenever strict dependencies exist between various actions in a sequence, so that the system can restore itself to the state it was in before a failure.
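As a rough illustration of ordered, dependent recovery steps, the following C sketch runs a list of actions in sequence and stops at the first one that fails, so a later step never runs without its prerequisites. It is not the HAM API; the types and functions (recovery_stage, run_recovery, restart_driver, remount_filesystem, notify_clients, and the two plans) are hypothetical, and each action only records that it ran.

```c
#include <assert.h>
#include <stddef.h>

/* A recovery action returns 0 on success and nonzero on failure. */
typedef int (*recovery_action)(void *ctx);

struct recovery_stage {
    const char     *name;   /* label for diagnostics */
    recovery_action run;
};

/*
 * Execute the stages strictly in order. Stop at the first failure, since
 * later stages depend on earlier ones. Returns how many stages completed.
 */
size_t run_recovery(const struct recovery_stage *stages, size_t count, void *ctx)
{
    assert(stages != NULL || count == 0);

    size_t done = 0;
    while (done < count && stages[done].run(ctx) == 0)
        done++;
    return done;
}

/* Demo context: records the order in which actions ran. */
struct recovery_log {
    char   order[16];
    size_t len;
};

int record(void *ctx, char tag)
{
    struct recovery_log *log = ctx;
    if (log->len + 1 >= sizeof log->order)
        return -1;                      /* log full: report failure */
    log->order[log->len++] = tag;
    log->order[log->len] = '\0';
    return 0;
}

/* Hypothetical actions; real ones would do the actual recovery work. */
int restart_driver(void *ctx)     { return record(ctx, 'R'); }
int remount_filesystem(void *ctx) { return record(ctx, 'M'); }
int notify_clients(void *ctx)     { return record(ctx, 'N'); }
int failing_remount(void *ctx)    { (void)ctx; return -1; }

/* The filesystem needs the driver, and clients need the filesystem. */
const struct recovery_stage demo_plan[] = {
    { "restart driver",     restart_driver },
    { "remount filesystem", remount_filesystem },
    { "notify clients",     notify_clients },
};

/* Same plan with a second stage that fails, to show the early stop. */
const struct recovery_stage failing_plan[] = {
    { "restart driver",     restart_driver },
    { "remount filesystem", failing_remount },
    { "notify clients",     notify_clients },
};
```

Passing demo_plan and a zeroed recovery_log to run_recovery executes all three actions in the listed order. With failing_plan, only the first action runs, and the return value tells the caller where the sequence stopped so it can escalate or retry from that point.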
Equipped with the QNX Neutrino RTOS itself, as well as the special tools and API in the High Availability Framework, you should be able to anticipate the kinds of problems that are likely to happen, isolate them, and then plan accordingly. In other words, assuming that failure will occur, you can now design for it and build systems that can recover intelligently.