Designing a Fault Resilient Phone System
by Brian McConnell
One of the greatest benefits of owning a PC-based telephony server
(besides the extra features it offers) is the freedom to decide how to
plan for and how much to spend on fault tolerance. Your telephone system
is obviously one of the most important parts of your business. Yet, the
vast majority of business owners have little or no idea how their phone
systems work, or how to fix them when something goes wrong.
PC-based systems, if designed and managed properly, can be made as reliable as,
if not more reliable than, the most expensive proprietary phone systems on the
market. Why? For a number of reasons:
When you look at the overall picture, including the reliability of the
components, availability of low-cost spare parts, and ease of use and maintenance,
these systems give traditional PBX systems a serious run for their money.
PC technology has advanced considerably in recent years, with advances
in both operating systems (such as Windows NT) and fault tolerant components
(such as Redundant Array of Inexpensive Disks, or RAID).
PC-based telephony servers are largely composed of generic components which
can be quickly and cheaply replaced.
These systems can often be managed by somebody in house, meaning the response
time when a component fails is the length of time it takes this person
to walk across the office.
This article explains how, using some common sense rules and fault tolerant
technologies, you can create your own telephony server which will be as
reliable as the most expensive proprietary system on the market. This article
can be considered two separate articles, one on fault tolerance (i.e., designing
a system so it won't fail), the other on disaster preparedness (i.e., mapping
a strategy for responding quickly to a system failure).
The first and most important thing to understand is that no system
is 100% reliable. This is a simple fact of human existence. Something is
always screwed up somewhere in any reasonably complex system. If you believe
any vendor who promises that their system will "never fail," I'd be glad
to sell you the Golden Gate Bridge for $50,000 (offer expires April 1st).
Fault tolerance (designing systems to fail gracefully)
Your body is an excellent example of fault tolerant design. Throughout
your body, you'll find redundancy everywhere. Redundancy is the key to
fault tolerance. The basic idea behind redundancy is a simple one. For
any system which is critical, always have one or more backups which can
take over if it fails.
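The redundancy rule above can be sketched in a few lines of Python. This is only an illustration, not code from any real telephony product: the names call_with_failover, primary, and backup are hypothetical, and each "unit" is simply a function that either returns a result or raises an error.

```python
from typing import Callable, Sequence


def call_with_failover(units: Sequence[Callable[[], str]]) -> str:
    """Try each redundant unit in order and return the first successful result."""
    errors = []
    for unit in units:
        try:
            return unit()
        except Exception as exc:
            # This unit failed: remember why, then fall through to the next backup.
            errors.append(exc)
    raise RuntimeError(f"all {len(units)} redundant units failed: {errors}")


def primary() -> str:
    # Hypothetical primary unit that is currently broken.
    raise IOError("primary unit is down")


def backup() -> str:
    # Hypothetical backup unit that takes over.
    return "served by backup"


result = call_with_failover([primary, backup])
print(result)
```

The caller never needs to know which unit answered. It only sees a failure when every unit in the list has failed, which is the situation redundancy is meant to make rare.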
PCs are built using many different low-cost components. Some of these
components are not critical (i.e., if they fail, the system will continue
to function). A floppy drive is a good example of this type of component.
Others (e.g., the CPU, hard drive, and power supply) are critical to the continued
operation of a server.
PC components can be divided into two general categories: components
with moving parts, and components without moving parts. Solid state components
such as memory, the CPU and other circuitry are statistically much less
likely to fail (provided the server is properly protected from power supply
fluctuations) than moving parts such as disk drives and power supply fans.
Of all the components in a typical server, there are a few which are
most likely to fail: the power supply, hard disk drives, and CPU fans.
Solid state components can still fail, but if the server is protected from
external power surges and extreme environmental conditions, this is not
likely to occur.
Fault tolerant storage media (RAID arrays)
A hard disk crash is every system administrator's worst nightmare. Restoring
hundreds of megabytes of data from tape is a time-consuming task. Better
to avoid a crash altogether.
RAID arrays provide an excellent fault tolerant storage medium for small and large
servers. A RAID array operates on a simple principle. Rather than saving
data in one location on one disk, a redundant RAID array spreads data, along
with mirrored copies or parity information, across many disks. So, when a
drive in a RAID array consisting of 10 disks fails, the array as a whole
continues functioning. The faulty unit can be swapped out with no disruption
to users, and so a formerly nightmarish occurrence becomes an uneventful
maintenance task.
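To see how an array can lose a disk without losing data, here is a minimal Python sketch of one common technique, XOR parity. The article does not say which RAID level is in use, so treat this as an assumption: it models three data disks plus one dedicated parity disk, and the helper name xor_blocks and the sample block contents are made up for illustration. Real controllers work on much larger blocks and often rotate parity across the disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data disks, each holding one four-byte block (illustrative contents).
data_disks = [b"CALL", b"LOGS", b"MAIL"]

# The parity disk stores the XOR of all the data blocks.
parity_disk = xor_blocks(data_disks)

# Simulate the second disk failing: only the other two disks and parity survive.
survivors = [data_disks[0], data_disks[2], parity_disk]

# XOR-ing the survivors cancels out everything except the missing block.
rebuilt = xor_blocks(survivors)
print(rebuilt)
```

Because XOR-ing a value with itself cancels it out, combining the surviving blocks with the parity block leaves exactly the block that was lost. The same trick works no matter which single disk fails, but not if two fail at once.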
NOTE: Most of the high-end PC operating systems, such as Windows NT,
OS/2, NetWare, and UNIX, support RAID arrays. The best type of RAID array
to use, particularly for Windows NT servers, is a SCSI compatible array.
A SCSI compatible array appears to your server as one single hard drive
from which you can boot your entire system. Other RAID solutions come with
their own disk controllers, and work in a similar way, but require that
you install special drivers in order for the array to be visible to the
Fault tolerant power supplies
Power supplies are another common cause of problems, for two reasons:
1) they contain moving parts, and 2) they are exposed to the outside power
network (often the source of power spikes and surges).
When the PC power supply fails, so does the entire PC. However, unless
the power supply was struck by a voltage spike or surge, the PC will generally
function normally once the power supply is replaced.
To work around this source of potential trouble, buy a server with 2
swappable power supplies. If one of the power supplies fails, the other
will most likely continue functioning while you replace the faulty unit.
As with RAID arrays, the system continues functioning while you swap out
the flaky component.
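A little arithmetic shows why the second power supply is worth the money. The sketch below is a back-of-the-envelope estimate, not a measurement: the 5% yearly failure chance is an invented figure, the function name chance_all_fail is hypothetical, and the calculation assumes the two supplies fail independently of each other.

```python
def chance_all_fail(p_single: float, units: int) -> float:
    """Chance that every one of `units` independent supplies fails in the same period."""
    return p_single ** units


# Assumed figure for illustration only: a 5% chance that one supply fails in a year.
single = 0.05
both = chance_all_fail(single, 2)

print(f"one supply:   {single:.2%}")
print(f"two supplies: {both:.2%}")
```

The independence assumption is the weak spot. A surge on the outside power network can take out both supplies at once, so redundant power supplies complement surge protection rather than replace it.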
Fault tolerant operating systems and CPUs
Just as PC components have become more reliable, so too have PC operating
systems. Windows NT is an excellent example of how operating systems have
matured. Windows, once derided as a basket-case operating system, has evolved
to become a much more resilient platform. Besides isolating applications
from each other (to prevent one application from freezing the entire system),
Windows NT directly supports many fault tolerant technologies (e.g., RAID
arrays and redundant power supplies).
Tips on building a fault tolerant PC
A fault tolerant PC is different from a regular desktop PC in several ways:
Passive backplane design - In passive backplane PCs, the CPU and motherboard
are inserted just like a PC expansion card. This makes it easy to swap
the CPU and motherboard in seconds.
Redundant power supply - Hot swappable, redundant power supplies ensure
that the system continues functioning if one power supply fails.
Support for RAID arrays - Add a SCSI compatible RAID array for fault tolerant