An Introduction to Polycom Technology

by Jeffrey Rodman

Over the last 10 years, Polycom has developed the most advanced speakerphone technology available today. The systems available in 1992 were unstable and hard to use, only approximated full-duplex performance, and carried price tags from $2,500 to $10,000. Today Polycom offers a wide range of products incorporating the Clarity by Polycom technology, matched to a variety of applications, at prices starting at a tenth of that. But even today, the question is often voiced: "What is so hard about building a speakerphone, of all things?"

Well, this is a reasonable question, given how simple the function seems: you talk, they hear; they talk, you hear. What could be easier? This paper is intended to explain some of the challenges involved in developing a premier audioconferencing system, and how Polycom has approached these challenges to produce Clarity by Polycom: a technology, or rather, a family of interlocking technologies, that exploit a high order of internal intricacy to produce the effect of external simplicity and transparency.

What is a speakerphone? It is something that acoustically links one or more users, through the open air, to an electronic communication medium, with no need to hold a handset, or wear a headset, next to the ear or mouth. A speakerphone contains something to hear with, something to make noises with, and something to control things with; jobs that are most commonly performed by a microphone, a speaker, and a keypad. The simplest way to use these elements is to take the microphone and the speaker and connect them directly to the telephone or communications line. If this is done properly, then when the far end talks, the sound comes out of the near-end speaker; when the near end talks, the microphone picks up the sound and sends it to the far end.

This sounds too simple to be true, but it does, in fact, work. It works well, up to a point: this is exactly what happens in a normal telephone handset. The problem comes when the loudspeaker gets loud enough to be heard from farther than an inch or two away: the microphone then hears not only the person talking, but also its own loudspeaker. The result is much like that in a badly adjusted auditorium PA system: feedback and howling. And it occurs for the same reason: the microphone is hearing too much of the loudspeaker, and the signal just feeds back on itself. The person at the far end of this kind of phone call also hears echo, their own voice coming back after a fraction of a second's delay. This is a very unsettling effect in telephony systems.
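The runaway nature of feedback can be illustrated with a small sketch. It assumes, purely for illustration, that the entire path from loudspeaker to microphone and back again can be summarized by one hypothetical number, the loop gain: the factor by which a sound is scaled on each trip around the loop. The function name and the gain values are invented for this example and do not describe any real product.

```python
def loop_levels(loop_gain, passes, start_level=1.0):
    """Return the signal level after each trip around the speaker-to-microphone loop."""
    levels = []
    level = start_level
    for _ in range(passes):
        level *= loop_gain          # each trip scales the sound by the loop gain
        levels.append(level)
    return levels

# Hypothetical loop gains, chosen only to show the two regimes.
quiet = loop_levels(0.5, 10)   # handset-like: little speaker sound reaches the microphone
howl = loop_levels(1.5, 10)    # loud speaker: each trip comes back stronger than the last

print("loop gain 0.5 after 10 trips:", quiet[-1])
print("loop gain 1.5 after 10 trips:", howl[-1])
```

With a loop gain below one the sound dies away; above one it grows on every trip, which is the howl.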

To overcome these problems, conventional speakerphones have for years added a selector switch that allows only the microphone or the loudspeaker, but not both, to be connected at the same time. The switch is controlled automatically. Whichever end is talking louder gets control, an approach that solves the howling and echo problems. But it now introduces the problem of “clipping.” Because of this "loudest noise wins" strategy for setting the switch, either end can now block out the other completely; coughs and dropped pencils cut out important parts of the conversation. To compensate for this clipping, we learn to shout at the speakerphone when we want to be heard, to move stealthily and with great caution when we don't, and to ask, “what did you say?” and "can you back up? I missed that" a lot. The whole meeting is conducted in an uncomfortable, stilted fashion. Some companies have even institutionalized the use of the "mute" button commonly found on speakerphones, assigning to one attendee at each end of a speakerphone meeting the task of pushing this button each time it is the far end's turn to talk.
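The "loudest noise wins" switch can be sketched in a few lines. This is a deliberately minimal illustration, assuming each moment of the call is reduced to a pair of hypothetical loudness numbers (near end, far end); real voice-switched speakerphones add smoothing and hold times that are not shown here.

```python
def voice_switch(near_level, far_level):
    """Return which path is connected: 'send' (microphone) or 'receive' (loudspeaker)."""
    return "send" if near_level > far_level else "receive"

# Hypothetical (near level, far level) pairs. The far end is talking throughout;
# the third frame is a pencil dropped next to the near-end speakerphone.
frames = [(0.1, 0.8), (0.1, 0.9), (1.0, 0.9), (0.1, 0.8)]

directions = [voice_switch(near, far) for near, far in frames]

# What the near end actually hears of the far-end talker: nothing while
# the switch is pointing the other way.
far_heard = [far if d == "receive" else 0.0
             for (near, far), d in zip(frames, directions)]

print(directions)
print(far_heard)
```

The far-end talker never stopped, yet one frame of speech is lost to the pencil: this is the clipping described above.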

The optimal strategy for this problem has only become available during the past 10 years. It is built around a technique called echo cancelation, which works by using a very fast, specialized computer to analyze the acoustics of a room, monitor the sound coming out of its own speaker, and then predict the echoes that will result, in order to eliminate them from the microphone signal. When done perfectly, this yields “full-duplex” operation, allowing the microphone and speaker to remain on all the time and conversation to proceed easily and naturally. Both sides can talk, interrupt, drop those pencils or cough those coughs, without impeding the flow of information. Because conversation is more natural, time spent is not nearly as fatiguing, and meetings can go as long as required. This effect has begun to change the way in which businesses hold meetings. Many businesses are moving from the short, uncomfortable “speakerphone call” to audioconferences in which conversation flows naturally, people are free to move about, and work gets done nearly as well as it does when talking in person.
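The predict-and-subtract idea can be sketched as follows, under a strong simplifying assumption: that the room's effect on the loudspeaker signal is already known exactly and can be written as a short list of numbers (here called room_response, a hypothetical model with one direct path and one reflection). All names and values are illustrative.

```python
def predict_echo(speaker, room_response):
    """Predict the echo at the microphone by convolving the loudspeaker
    signal with a model of the room."""
    echo = [0.0] * len(speaker)
    for n in range(len(speaker)):
        for k, h in enumerate(room_response):
            if n - k >= 0:
                echo[n] += h * speaker[n - k]
    return echo

# Hypothetical room model: a direct path one sample later, a reflection three samples later.
room_response = [0.0, 0.6, 0.0, 0.3]

speaker = [1.0, 0.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0]   # far-end sound leaving the loudspeaker
talker = [0.0, 0.2, 0.0, 0.0, -0.1, 0.0, 0.0, 0.0]    # near-end person talking

# The microphone hears the talker plus the room's echo of the loudspeaker.
mic = [t + e for t, e in zip(talker, predict_echo(speaker, room_response))]

# The canceler subtracts its predicted echo, leaving (ideally) only the talker.
predicted = predict_echo(speaker, room_response)
cleaned = [m - p for m, p in zip(mic, predicted)]

print(cleaned)
```

Because the model here matches the simulated room exactly, the cleaned signal equals the talker's signal up to rounding. The hard part in practice is obtaining that model, and keeping it correct.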

But as the reader has probably begun to suspect, there is more to a state-of-the-art audioconferencing system than meets the eye. Let us look at some of the challenges inherent in the functions described above.

Room echo simulation
The problem is not just the sound that travels directly from the loudspeaker to the microphone, but also the reflections within the room. With walls, furniture, doors, and people in different places, every room has a different echo pattern and must be analyzed independently. Even the difference between a door open and closed can result in feedback and howling if not detected and compensated for. In addition, room environments change continually during meetings as people lean back and turn in their chairs, sip from coffee cups, push things around, open and close notebooks (excellent planar reflectors!), and so forth. Although these changes can seem small, they often create a big difference in the pattern of reflections, just as a tiny chip of mirror can reflect a lot of sunlight. So these changes must also be captured and compensated for.

Earlier systems that measured the room response only once with a burst of noise at the beginning of the call quickly lost track of the room environment as people moved around, and became unstable. So while it is much more difficult to continually update this model than to do it just once, this kind of continuous updating, called "adaptive echo cancelation," is essential for trouble-free operation.
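One textbook way to update a room model continuously is the normalized least-mean-squares (NLMS) adaptive filter, sketched below. This paper does not say which algorithm Polycom uses, so NLMS is an assumption made only for illustration, as are the function name, the tap count, the step size, and the simulated room.

```python
import random

def nlms_echo_canceler(speaker, mic, taps=8, step=0.5, eps=1e-6):
    """Adapt a room model sample by sample; return (residual, weights)."""
    weights = [0.0] * taps          # current estimate of the room response
    history = [0.0] * taps          # most recent loudspeaker samples, newest first
    residual = []
    for x, d in zip(speaker, mic):
        history = [x] + history[:-1]
        predicted = sum(w * h for w, h in zip(weights, history))
        error = d - predicted       # what remains after removing the predicted echo
        power = sum(h * h for h in history) + eps
        # Nudge the model in the direction that would have reduced this error.
        weights = [w + step * error * h / power for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights

random.seed(0)
true_room = [0.0, 0.6, 0.0, 0.3, 0.0, -0.2, 0.0, 0.1]   # hypothetical room response
speaker = [random.uniform(-1.0, 1.0) for _ in range(4000)]

# Simulated microphone signal: echo only, with no near-end talker and no noise.
mic = [sum(h * speaker[n - k] for k, h in enumerate(true_room) if n - k >= 0)
       for n in range(len(speaker))]

residual, weights = nlms_echo_canceler(speaker, mic)
late_error = max(abs(e) for e in residual[-100:])
print("largest residual echo in the last 100 samples:", late_error)
```

The model is corrected on every sample, so it can follow a room that changes during the call. A practical canceler must also cope with near-end speech and noise while adapting, which this sketch omits.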

Clear microphone pickup
Although the human ear is not highly directional by itself, the brain works in conjunction with the two ears to separate sounds from room reverberation. This is why a person talking from across a room can sound much better than if heard on a conventional speakerphone: the ear may not be that good, but the brain, using techniques that are still not well understood, cleans up the signal. An electronic system, however, must depend instead on clever acoustical design to be able to send the clearest sound to the far end of a call. Microphone frequency response, orientation with respect to users and table, noise level, consistency, sensitivity, and a variety of other factors all play a part in crafting the clearest sound, and each demands careful attention.

In some cases, multiple microphones are used in combination. The SoundStation Premier®, for example, uses 3 independent highly directional hypercardioid microphones, each with a full independent echo canceler. This allows the system to select sound only from the direction of the talker, which markedly cuts down on room reverberation. Microphones in Polycom's conferencing systems are also mounted in a pressure-zone-microphone configuration, which reinforces sensitivity in the direction of the talker while eliminating almost half of the ambient noise.
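A simple way to picture selecting sound from the direction of the talker is to compare how much energy each directional microphone is picking up and pass along the strongest one. The sketch below assumes exactly that rule; the paper does not describe the actual selection logic, and the function name and sample values are hypothetical.

```python
def select_microphone(frames):
    """Return the index of the microphone whose current frame has the most energy."""
    energies = [sum(s * s for s in frame) for frame in frames]
    return max(range(len(frames)), key=lambda i: energies[i])

# One short frame of samples from each of three directional microphones (made-up values).
frames = [
    [0.01, -0.02, 0.01, 0.00],    # microphone 0: facing away from the talker
    [0.40, -0.35, 0.30, -0.25],   # microphone 1: facing the talker
    [0.05, -0.04, 0.03, -0.02],   # microphone 2: picks up some reverberation
]

selected = select_microphone(frames)
print("selected microphone:", selected)
```

Only the selected microphone's signal would be sent on, so reverberation arriving at the other two is left out.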

Because the conventional telephone signal has very limited bandwidth, from about 300 to 3300 Hz, it is essential to convey and reproduce all of this signal as transparently as possible. High-end loudspeaker designs, although still uncommon in most telephony systems, make a big difference in producing clear, pleasant sound that is not tiring to listen to. In addition to incorporating advanced acoustic suspension loudspeaker system designs, most of Polycom’s audioconferencing systems contain custom loudspeaker drivers optimized for wide bandwidth, high efficiency, broad dispersion, and plenty of loudness without distortion for conferencing applications. These are all factors that add extra cost, but are essential for transparent performance in an audioconferencing system.

Fast adaptation
Because echo cancelers, as described above, must create a model of the room before they are fully functional, the speed with which they can deduce an accurate acoustic model is an important factor. A well-designed echo canceler must be optimized for speed as well as accuracy, a requirement that places heavy demands on the computational engine. In Polycom's new systems, the audio environment is analyzed every 1/8000 of a second; each of these analyses averages more than 10,000 computations.
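The figures above imply a processing load that is easy to work out. The short calculation below uses only the two numbers quoted in the text; because each analysis averages "more than" 10,000 computations, the result is a lower bound. The variable names are illustrative.

```python
analyses_per_second = 8000            # one analysis every 1/8000 of a second
computations_per_analysis = 10_000    # "more than 10,000", so treat this as a minimum

# Minimum sustained processing load.
computations_per_second = analyses_per_second * computations_per_analysis

# Time available to finish each analysis before the next one is due, in microseconds.
time_budget_us = 1_000_000 / analyses_per_second

print("computations per second (at least):", computations_per_second)
print("time budget per analysis (microseconds):", time_budget_us)
```

That is at least 80 million computations per second, with 125 microseconds available for each analysis.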

Redundancy and reliability
As we have all experienced, a meeting that is halted by problems with speakerphones is an expensive and embarrassing disaster. Just as airplanes and spacecraft must operate redundant systems to assure continued operation, so must a system as complex as a high-performance audioconferencing unit. An integral strength of the Clarity by Polycom technology lies in its control and supervision processing, in which the basic elements of the system, as described above, are linked with numerous levels of management and coordination software to ensure that changing conditions are compensated for, that impending problems are detected and quickly corrected, and that the audioconferencing unit will thus continue to perform in a manner as transparent and invisible as possible.

So here we have the fundamental irony behind the best audioconferencing systems: that it takes a dense and carefully tuned concoction of acoustics, algorithms, electronics, mechanics, and user interface design to produce a simple-to-use, transparent channel for continued productive audio communications. This is the strength of the Clarity by Polycom technology: not one algorithm, but a suite of techniques and processes that are melded to create the optimal, focused, communication system.