Designing a Fault Resilient Phone System

by Brian McConnell

One of the greatest benefits of owning a PC-based telephony server (besides the extra features it offers) is the freedom to decide how to plan for fault tolerance and how much to spend on it. Your telephone system is obviously one of the most important parts of your business. Yet the vast majority of business owners have little or no idea how their phone systems work, or how to fix them when something goes wrong.

PC-based systems, if designed and managed properly, can be made as reliable as, if not more reliable than, the most expensive proprietary phone systems on the market. Why? For a number of reasons:

  • PC technology has advanced considerably in recent years, in both operating systems (such as Windows NT) and fault tolerant components (such as Redundant Array of Inexpensive Disks, or RAID).
  • PC-based telephony servers are largely composed of generic components which can be quickly and cheaply replaced.
  • These systems can often be managed by somebody in house, meaning the response time when a component fails is the length of time it takes this person to walk across the office.

When you look at the overall picture, including the reliability of the components, availability of low-cost spare parts, and ease of use and maintenance, these systems give traditional PBX systems a serious run for their money.

This article explains how, using some common sense rules and fault tolerant technologies, you can create your own telephony server which will be as reliable as the most expensive proprietary system on the market. This article can be considered two separate articles, one on fault tolerance (i.e., designing a system so it won't fail), the other on disaster preparedness (i.e., mapping a strategy for responding quickly to a system failure).

The first, and most important, thing to understand is that no system is 100% reliable. This is a simple fact of human existence. Something is always screwed up somewhere in any reasonably complex system. If you believe any vendor who promises that their system will "never fail," I'd be glad to sell you the Golden Gate Bridge for $50,000 (offer expires April 1st).

Fault tolerance (designing systems to fail gracefully)
Your body is an excellent example of fault tolerant design: you'll find redundancy throughout it. Redundancy is the key to fault tolerance, and the basic idea behind it is a simple one. For any system which is critical, always have one or more backups which can take over if it fails.
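
To put rough numbers on the idea, here is a minimal Python sketch. It assumes the backups fail independently of one another, and the 5% failure figure is purely illustrative, not a measured value. The function name is invented for this example.

```python
def all_fail_probability(p_fail: float, copies: int) -> float:
    """Chance that every one of `copies` independent components fails.

    Assumption: failures are independent, so the probabilities multiply.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p_fail ** copies


# Illustrative only: a component with a 5% chance of failing in some period.
for copies in (1, 2, 3):
    chance = all_fail_probability(0.05, copies)
    print(f"{copies} unit(s): chance of total outage = {chance:.6f}")
```

Under these assumptions, each extra backup multiplies the odds of a total outage by the same small fraction. In practice failures are not always independent (a power surge can take out both units at once), so treat the result as a best case.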

PCs are built using many different low-cost components. Some of these components are not critical (i.e., if they fail, the system will continue to function). A floppy drive is a good example of this type of component. Others (e.g., the CPU, hard drive, and power supply) are critical to the continued operation of a server.

PC components can be divided into two general categories: components with moving parts, and components without moving parts. Solid state components such as memory, the CPU and other circuitry are statistically much less likely to fail (provided the server is properly protected from power supply fluctuations) than moving parts such as disk drives and power supply fans.

Of all the components in a typical server, there are a few which are most likely to fail: the power supply, hard disk drives, and CPU fans. Solid state components can still fail, but if the server is protected from external power surges and extreme environmental conditions, this is not likely to occur.

Fault tolerant storage media (RAID arrays)
A hard disk crash is every system administrator's worst nightmare. Restoring hundreds of megabytes of data from tape is a time-consuming task. Better to avoid a crash altogether.

RAID arrays provide an excellent fault tolerant storage medium for small and large servers. A RAID array operates on a simple principle. Rather than saving data in one location on one disk, a RAID array spreads data, along with redundant copies or parity information, across many disks. So, when a drive in a RAID array consisting of 10 disks fails, the array as a whole continues functioning. The faulty unit can be swapped out with no disruption to users, and so a formerly nightmarish occurrence becomes an uneventful maintenance task.
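
The principle can be illustrated with a short Python sketch of one common redundancy scheme, parity. This is an assumption for illustration only: the article does not say which RAID level is in use, and a real array does this work in the disk controller or operating system, not in application code. The disk contents and the helper name xor_blocks are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data disks, each holding one equal-sized block.
data_disks = [b"CALL", b"LOGS", b"MAIL"]

# A fourth disk stores the XOR of the data blocks (the parity block).
parity_disk = xor_blocks(data_disks)

# Simulate losing one data disk.
failed = 1
survivors = [block for i, block in enumerate(data_disks) if i != failed]

# XOR of the surviving data blocks and the parity block yields the lost block.
rebuilt = xor_blocks(survivors + [parity_disk])
print("rebuilt block matches original:", rebuilt == data_disks[failed])
```

The same arithmetic works whichever single disk is lost, which is why the array can keep serving data while the faulty drive is replaced. Losing two disks at once is a different matter: single parity cannot recover from that.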

NOTE: Most of the high-end PC operating systems such as Windows NT, OS/2, NetWare, and UNIX support RAID arrays. The best type of RAID array to use, particularly for Windows NT servers, is a SCSI compatible array. A SCSI compatible array appears to your server as one single hard drive from which you can boot your entire system. Other RAID solutions come with their own disk controllers, and work in a similar way, but require that you install special drivers in order for the array to be visible to the operating system.

Fault tolerant power supplies
Power supplies are another common cause of problems, for two reasons: 1) they contain moving parts, and 2) they are exposed to the outside power network (often the source of power spikes and surges).

When the PC power supply fails, so does the entire PC. However, unless the power supply was struck by a voltage spike or surge, the PC will generally function normally once the power supply is replaced.

To work around this source of potential trouble, buy a server with two hot swappable power supplies. If one of the power supplies fails, the other will most likely continue functioning while you replace the faulty unit. As with RAID arrays, the system continues functioning while you swap out the flaky component.

Fault tolerant operating systems and CPUs
Just as PC components have become more reliable, so too have PC operating systems. Windows NT is an excellent example of how operating systems have matured. Windows, once derided as a basket case of an operating system, has evolved to become a much more resilient platform. Besides isolating applications from each other (to prevent one application from freezing the entire system), Windows NT directly supports many fault tolerant technologies (e.g., RAID arrays and redundant power supplies).

Tips on building a fault tolerant PC
A fault tolerant PC is different from a regular desktop PC in several key respects:

  • Passive backplane design - In passive backplane PCs, the CPU and motherboard are inserted just like a PC expansion card. This makes it easy to swap the CPU and motherboard in seconds.
  • Redundant power supply - Hot swappable, redundant power supplies ensure that the system continues functioning if one power supply fails.
  • Support for RAID arrays - Add a SCSI compatible RAID array for fault tolerant storage media.