How to transmit and receive information using vocal cords, throat, tongue and ears


Table of Contents

Introduction
Robustness: transformations the signal must survive
Things we can use to produce speech
Things we can use to process speech
A first shot at a protocol

Introduction

In the series “the world as seen by a computer geek” (see the earlier installment “Amazing DNA”), we present an article on speech as an information encoding protocol.

First we set out the preconditions, in other words, what kind of transformations our signal must be able to undergo while still staying intelligible. Then we look at how speech can be modulated. We finish by examining what equipment is used to receive the signal. In the end we synthesise all this into a suggested protocol.

It should be noted that speech transmits three kinds of information: the actual verbal content; non-verbal meta-data that enhances the meaning of the words sent; and an authentication signature, in other words, a way to recognize the speaker.

We'll be focussing on an anonymous neutral signal that only conveys words.

Robustness: transformations the signal must survive

To be of use, our encoding should be robust in the face of typical daily-life distortions (two of which are simulated in a short sketch after the list). These include:

  • Echo - our signal is highly likely to have a number of weaker repeats superimposed on it

  • Complete removal of signal below 350 Hz and above 3500 Hz - like on a phone line

  • Staggering amounts of additive wide band noise - like wind, or cars, or machinery

  • Spurious but similar signals - different speech signals in the background

  • Strong sinusoid interference, like a train whistle

  • Highly variable filtering - when somebody is talking in another direction, the relative strength of the different frequencies changes massively compared to when the sound is transmitted directly

  • Removal of the carrier. This is where it gets odd. Speaking without a carrier is called 'whispering' - which may be an important clue to how speech works.
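
To make a couple of these distortions concrete, here is a minimal sketch in Python (using NumPy; the sample rate and signal-to-noise ratio are arbitrary choices, not figures from this article) that applies the phone-line band limit and some additive wide-band noise to a test signal:

    import numpy as np

    RATE = 16000  # samples per second (assumed)

    def phone_bandlimit(signal, low=350.0, high=3500.0, rate=RATE):
        """Crude brick-wall band-pass: zero every FFT bin outside [low, high] Hz."""
        spectrum = np.fft.rfft(signal)
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / rate)
        spectrum[(freqs < low) | (freqs > high)] = 0.0
        return np.fft.irfft(spectrum, n=len(signal))

    def add_noise(signal, snr_db=10.0):
        """Add white noise at a chosen signal-to-noise ratio (in dB)."""
        signal_power = np.mean(signal ** 2)
        noise_power = signal_power / (10 ** (snr_db / 10.0))
        return signal + np.random.normal(0.0, np.sqrt(noise_power), len(signal))

    # Example: a 1000 Hz test tone survives, a 100 Hz component is removed entirely.
    t = np.arange(RATE) / RATE
    test = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 100 * t)
    degraded = add_noise(phone_bandlimit(test), snr_db=10.0)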

Then there are some non-goals, which nonetheless appear to be met by human speech encoding:

  • Changes in speed (and hence, pitch) have little influence on intelligibility, over a range from 30% slower to 130% faster (more or less).

  • Changes in phase for different frequencies, as caused by electrical filters (such as those found in phones, equalizers and the like).

  • Inserted repeats. To explain: in modern music, speech is often pitch-shifted, which entails speeding up the signal while keeping the actual temporal length equal. To do this, speech is chopped into short segments that are each repeated to make up for the faster playback; a minimal sketch of this follows the list.
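
Here is a minimal sketch of that chop-and-repeat idea in Python/NumPy. It is not the algorithm actually used in music production - real pitch shifters overlap and cross-fade the segments - and the segment length and speed factor are arbitrary assumptions:

    import numpy as np

    def naive_pitch_shift(signal, factor=1.25, segment=400):
        """Raise pitch by `factor` while keeping roughly the same duration.

        Step 1: 'play back faster' by linear resampling - higher pitch, shorter.
        Step 2: chop the shortened signal into `segment`-sample pieces and read
        them out with start positions that advance more slowly than they are
        written, so pieces are effectively repeated and the original length
        is restored."""
        # Step 1: faster playback.
        n_fast = int(len(signal) / factor)
        fast = np.interp(np.linspace(0, len(signal) - 1, n_fast),
                         np.arange(len(signal)), signal)
        # Step 2: time-stretch by repeating (parts of) short segments.
        hop_out = segment               # how far we advance in the output
        hop_in = int(segment / factor)  # how far we advance in the input
        out = np.zeros(len(signal) + segment)
        in_pos, out_pos = 0, 0
        while in_pos + segment <= len(fast):
            out[out_pos:out_pos + segment] = fast[in_pos:in_pos + segment]
            in_pos += hop_in
            out_pos += hop_out
        return out[:len(signal)]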

This all seems nearly impossible, but we are saved by one thing: the very low data rate of speech. It has been estimated that each word of speech encodes 5 bits of content, which would put the effective data rate of speech in the order of 50 bits/second - give or take a few factors.

This may seem very low, but there is some evidence that in many cases the receiving station ('the listener') employs large amounts of context to properly decode speech - in other words, the listener cannot discern random words as well as sentences that make sense.

Things we can use to produce speech

The vocal cords are the source of our carrier, which, as noted above, is optional for intelligible speech. The carrier probably does, however, play a major role in defeating noise and in separating the proper signal from competing speakers. When the vocal cords vibrate at a certain frequency, they generate a large number of harmonics at integral multiples of that fundamental.

These harmonics are then either amplified or attenuated by the shape of the throat, tongue, cheeks and whatnot - in effect, the spectrum of the signal is multiplied by the frequency response of the vocal tract.

In addition, other parts of the breathing, eating and drinking apparatus have been appropriated by the speech process to add various rasping, clicking and noise components to the sound produced. These form the consonants.
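
To make this 'source plus filter' picture concrete, here is a minimal Python/NumPy sketch: a harmonic-rich pulse train as the carrier, a frequency-domain multiplication standing in for the vocal tract, and a burst of noise standing in for a consonant. The fundamental, the formant frequencies and all the other constants are illustrative assumptions, not measured values:

    import numpy as np

    RATE = 16000
    n = RATE // 2                      # half a second of samples

    # Source: the 'carrier' - a 120 Hz glottal pulse train, rich in harmonics
    # at integral multiples of the fundamental.
    f0 = 120
    source = (np.arange(n) % (RATE // f0) == 0).astype(float)

    # Filter: the vocal tract amplifies some harmonics and attenuates others.
    # Here two made-up formant bumps (around 800 and 1200 Hz) are multiplied
    # onto the spectrum to fake a vowel.
    freqs = np.fft.rfftfreq(n, d=1.0 / RATE)
    envelope = (np.exp(-((freqs - 800) / 150) ** 2) +
                np.exp(-((freqs - 1200) / 150) ** 2))
    vowel = np.fft.irfft(np.fft.rfft(source) * envelope, n=n)

    # Consonant: an unvoiced burst of plain noise in front (a rough hiss).
    hiss = np.random.normal(0, 0.05, RATE // 10)
    utterance = np.concatenate([hiss, vowel / np.abs(vowel).max()])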

Things we can use to process speech

Our ears are little marvels. Whereas most of the components of our body, with the exception of the brain, pale in comparison with those of more specialised animals, our ears are top notch. Our sight and sense of smell are puny when put next to those of an eagle or a dog, but our ears are nothing to be ashamed of.

In fact, if our hearing were slightly more acute, we would hear individual hydrogen molecules bounce off our eardrums. Bats and dolphins are in a league of their own of course, but they use their ears as sonar installations. Their ears aren't a lot more sensitive, though; the difference is mostly in the processing.

Within our ears are so-called hair cells, which vibrate with the sounds that have been received by a complex apparatus - so complex that, as yet, we have only a weak understanding of how it all works.

Part of it is what electrical engineers would call 'impedance matching'. The eardrum converts sounds in air into vibrations in a fluid. This fluid has a vastly different acoustic impedance (a different density and speed of sound), which would normally cause most of the energy to be reflected. Not so in our ear: the smallest bones in our body form a 'gear shifting installation' which translates the movements so that nearly 100% of the energy is accepted by the fluid.
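
To put a rough number on how much energy would be lost without that gear shifting, here is a back-of-the-envelope calculation in Python (treating the inner-ear fluid as plain water and using textbook impedance values - both simplifications):

    # Power transmission across a boundary: T = 4*Z1*Z2 / (Z1 + Z2)^2
    Z_AIR = 413.0      # acoustic impedance of air, kg/(m^2 s), around 20 C
    Z_WATER = 1.48e6   # acoustic impedance of water, kg/(m^2 s)

    transmitted = 4 * Z_AIR * Z_WATER / (Z_AIR + Z_WATER) ** 2
    print(f"fraction of energy transmitted: {transmitted:.4f}")  # about 0.001
    print(f"fraction reflected: {1 - transmitted:.4f}")          # about 99.9%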

In this fluid is a membrane to which hairs are attached. These hairs also have little muscles, which are most likely involved in compressing the staggering range of energies to which we are sensitive into something the brain can deal with. Radio engineers know this problem all too well: the natural dynamics of sound are too much for FM transmission and need to be compressed into a smaller range. This is a lot like that.
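
As an illustration of this kind of compression - not a claim about how the hair cells actually do it - here is the mu-law companding curve from digital telephony, sketched in Python/NumPy:

    import numpy as np

    def mu_law_compress(x, mu=255.0):
        """Mu-law companding: quiet samples are boosted, loud samples squashed,
        so a huge dynamic range fits into a much smaller one."""
        return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

    # An input range of 1000:1 comes out as roughly 25:1.
    print(mu_law_compress(np.array([0.001, 1.0])))   # approx [0.041, 1.0]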

This membrane to which the hairs are attached has differing properties along its length, making hairs near the entrance respond to higher frequencies than those near the far end - in effect performing a Fourier transform. But before we feel too good about understanding this: indications are that the ear's frequency analysis beats our technology's best efforts in both time and frequency resolution.

This signal is then transported over thousands of nerve fibers to the brain, each fiber representing a frequency range.
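
A crude software analogue of this arrangement, sketched in Python/NumPy; the real cochlea uses roughly logarithmically spaced bands, thousands of them, and far better time/frequency resolution than this:

    import numpy as np

    RATE = 16000

    def crude_filter_bank(signal, n_bands=24, frame=512, rate=RATE):
        """Chop the signal into short frames, Fourier-transform each frame and
        report the energy in a handful of frequency bands ('nerve fibers')
        per frame - one row per time frame, one column per band."""
        edges = np.linspace(0, rate / 2, n_bands + 1)
        freqs = np.fft.rfftfreq(frame, d=1.0 / rate)
        n_frames = len(signal) // frame
        energies = np.zeros((n_frames, n_bands))
        for i in range(n_frames):
            spectrum = np.abs(np.fft.rfft(signal[i * frame:(i + 1) * frame])) ** 2
            for b in range(n_bands):
                band = (freqs >= edges[b]) & (freqs < edges[b + 1])
                energies[i, b] = spectrum[band].sum()
        return energies

    # Example: one second of noise -> 31 frames x 24 'fibers'.
    bands = crude_filter_bank(np.random.normal(0, 1, RATE))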

I do want to stress that the above is highly sketchy - both because I don't know any better and, often, because nobody knows any better. Among the unexplained things is the fact that the ear produces noise of its own accord, and that it generates an echo when fed a click - but too late to be a natural echo; it is actively generated.

And to go completely overboard, if you feed the ear two tones differing slightly in frequency, it returns with a tone of yet a third frequency, which relates to the frequency difference of the original tones. This is in fact used to diagnose certain hearing disorders in infants. Nobody knows why this works, but theories abound. This fascinating phenomenon is called “distortion product oto-acoustic emission”.

Summarising: our ear performs a frequency analysis of sound, which is fed to the brain.

A first shot at a protocol

Given what we know so far, let's try to design a protocol that meets all these design goals. It will be more limited than real speech in many ways, but it retains the essence.

The idea is to transmit 2 bits per 'token'; if we restrict ourselves to 1 token per second, that amounts to a 1 baud signal carrying 2 bits/second.

From what we've seen so far it is more than fair to operate in the frequency domain, so we'll do just that. The simplest analogue is of course DTMF, the protocol used to encode digits dialled on the phone. This fails, however, in that it does not survive even the slightest tinkering with its time base.
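
To see what such a frequency-domain token scheme might look like, here is a minimal Python/NumPy sketch: 2 bits per token, one token per second, each token a single pure tone. The four tone frequencies are arbitrary choices (real DTMF uses a pair of simultaneous tones per digit, drawn from a 4x4 grid):

    import numpy as np

    RATE = 8000
    TONES = [500.0, 700.0, 900.0, 1100.0]   # one tone per 2-bit symbol (assumed)

    def encode(symbols, rate=RATE):
        """One token per second: each 2-bit symbol becomes a second of pure tone."""
        t = np.arange(rate) / rate
        return np.concatenate([np.sin(2 * np.pi * TONES[s] * t) for s in symbols])

    def decode(signal, rate=RATE):
        """For each one-second slice, find the strongest frequency and map it
        back to the nearest of the four known tones."""
        out = []
        for i in range(len(signal) // rate):
            spectrum = np.abs(np.fft.rfft(signal[i * rate:(i + 1) * rate]))
            peak_hz = np.fft.rfftfreq(rate, d=1.0 / rate)[spectrum.argmax()]
            out.append(int(np.argmin([abs(peak_hz - f) for f in TONES])))
        return out

    print(decode(encode([0, 3, 1, 2])))   # [0, 3, 1, 2]
    # A real DTMF decoder only accepts tones within a few percent of their
    # nominal frequencies, so speeding the recording up or down by more than
    # that breaks decoding - the time-base fragility mentioned above.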

... to be continued